Core Concepts
This page defines the key terms and concepts referenced throughout the Stronghold documentation.
Decision
Every scan returns a decision — the final verdict on whether the scanned content is safe.
| Decision | Meaning |
|---|---|
| `ALLOW` | No threats detected. Content is safe to process. |
| `WARN` | Elevated risk detected, but confidence is not high enough to block. Content is passed through with warning metadata. |
| `BLOCK` | High-confidence threat detected. Content should not be processed. The transparent proxy withholds blocked content from the agent entirely. |
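As a minimal sketch of how a caller might act on the three decisions, the helper below maps each one to a handling outcome. The function name `handle_scan` and its return shape are illustrative assumptions, not part of the Stronghold API.

```python
from enum import Enum
from typing import Optional, Tuple


class Decision(str, Enum):
    ALLOW = "ALLOW"
    WARN = "WARN"
    BLOCK = "BLOCK"


def handle_scan(decision: str, content: str) -> Tuple[Optional[str], bool]:
    """Return (content_to_process, warned) for a scan decision.

    Hypothetical helper: the return shape is an assumption for illustration.
    """
    verdict = Decision(decision)  # raises ValueError on an unknown decision
    if verdict is Decision.BLOCK:
        return None, False  # withhold blocked content entirely
    if verdict is Decision.WARN:
        return content, True  # pass through, but flag it for the caller
    return content, False  # ALLOW: safe to process
```

Treating an unknown decision string as an error, rather than silently allowing it, keeps the caller on the safe side if new decision values are introduced.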
Scores
Each scanning layer produces a score between 0 and 1, where higher values indicate greater threat confidence. The score keys differ between content scanning and output scanning.
Content scan scores (/v1/scan/content)
| Score | Layer | Description |
|---|---|---|
| `combined` | All layers | Weighted combination of all detection layers. This is the primary score used for the final decision. |
| `heuristic` | Heuristic | Pattern-matching score based on known injection signatures. |
| `semantic` | Semantic Similarity | Cosine similarity to known attack embeddings. |
| `ml_confidence` | ML/LLM Classification | Confidence from the LLM analysis layer. 0.0 when the LLM layer is disabled. |
The `combined` key is only present when the hybrid detector is active (semantic or LLM layers enabled). In heuristic-only mode, the primary score is `heuristic`.
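A client that needs a single number can apply the rule above directly. This is a minimal sketch; it assumes the scores arrive as a dictionary keyed by the names in the table, which is an assumption about the response shape rather than a documented schema.

```python
def primary_score(scores: dict) -> float:
    """Pick the score that drives the decision.

    Assumes `scores` is a dict keyed by the score names in the table above
    (hypothetical response shape).
    """
    if "combined" in scores:
        return float(scores["combined"])  # hybrid detector active
    return float(scores["heuristic"])  # heuristic-only mode
```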
Output scan scores (/v1/scan/output)
| Score | Layer | Description |
|---|---|---|
| `credential_score` | Credential Detection | Confidence that the output contains leaked credentials or secrets (0 to 1). |
| `findings_count` | Credential Detection | Number of distinct credential/secret patterns found in the output. |
The API response includes all individual scores alongside the decision, so you can inspect which layers contributed to the verdict.
Scanning Layers
Stronghold uses a 4-layer scanning pipeline. Each layer adds detection capability at the cost of additional latency:
| Layer | Method | Typical Latency | Description |
|---|---|---|---|
| Heuristic | Pattern matching | <1ms | Regex and signature-based detection of known injection patterns. Fastest layer, catches obvious attacks. |
| ML Classification | Citadel/Hugot | ~5ms | Neural network trained on prompt injection datasets. Catches attacks that evade simple patterns. |
| Semantic Similarity | Embedding comparison | ~10ms | Computes embeddings and measures cosine similarity against a database of known attack vectors. Catches novel phrasings of known attack types. |
| LLM Analysis (optional) | OpenRouter LLM reasoning | ~500ms | Any OpenRouter-compatible LLM reasons about whether the content is an attack. Most capable but slowest. Disabled by default. |
Layers run in order. If an early layer produces a high-confidence BLOCK, later layers may be skipped for performance.
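The ordered, short-circuiting behavior can be sketched as a loop over layer functions. The threshold value and the toy layers below are hypothetical and chosen only for illustration; Stronghold's actual thresholds and skip logic are not specified here.

```python
from typing import Callable, Dict, List, Tuple

# A layer is a (name, scoring function) pair; both are illustrative stand-ins.
Layer = Tuple[str, Callable[[str], float]]

BLOCK_THRESHOLD = 0.9  # hypothetical value, not taken from Stronghold


def run_layers(
    content: str,
    layers: List[Layer],
    block_threshold: float = BLOCK_THRESHOLD,
) -> Dict[str, float]:
    """Run layers in order and stop once one is confident enough to block."""
    scores: Dict[str, float] = {}
    for name, score_fn in layers:
        scores[name] = score_fn(content)
        if scores[name] >= block_threshold:
            break  # skip the slower layers that follow
    return scores


# Toy layers: a cheap pattern check first, then a stand-in for a slower layer.
toy_layers: List[Layer] = [
    ("heuristic", lambda t: 0.95 if "ignore previous instructions" in t.lower() else 0.1),
    ("semantic", lambda t: 0.2),
]
```

With these toy layers, an obvious injection string is scored by the heuristic layer only, while benign text passes through both.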
Threat Categories
Stronghold classifies detected threats into the following categories:
| Category | Description |
|---|---|
| `instruction_override` | Attempts to replace or override the agent’s system prompt or instructions. |
| `system_extraction` | Attempts to extract the agent’s system prompt, internal instructions, or configuration. |
| `context_manipulation` | Attempts to alter the agent’s understanding of its context, conversation history, or role. |
| `jailbreak` | Attempts to remove safety constraints or behavioral guardrails from the agent. |
| `roleplay_attack` | Uses fictional scenarios or role assignment to bypass the agent’s restrictions. |
| `data_exfil` | Attempts to exfiltrate data by encoding it in URLs, requests, or other output channels. |
| `credential_leak` | Agent response contains API keys, passwords, tokens, or other secrets. |
| `obfuscation` | Uses encoding, Unicode tricks, or other techniques to disguise an attack payload. |
| `multiturn_attack` | A coordinated attack spread across multiple turns of conversation to gradually shift agent behavior. |
The `reason` field in scan responses references these categories when a threat is detected.
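For logging or routing, a client might pull category names out of the `reason` text. The sketch below assumes the category identifier appears verbatim somewhere in the `reason` string; the exact format of that field is not specified on this page, so treat this as illustrative only.

```python
THREAT_CATEGORIES = (
    "instruction_override",
    "system_extraction",
    "context_manipulation",
    "jailbreak",
    "roleplay_attack",
    "data_exfil",
    "credential_leak",
    "obfuscation",
    "multiturn_attack",
)


def categories_in_reason(reason: str) -> list:
    """Return the known categories mentioned in a reason string.

    Assumption: category identifiers appear verbatim in `reason`.
    """
    return [c for c in THREAT_CATEGORIES if c in reason]
```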
x402
x402 is an HTTP-native payment protocol. Instead of requiring API keys or subscriptions, Stronghold uses x402 for per-request payment:
- The client sends a request to a paid endpoint.
- The server responds with 402 Payment Required and a JSON body specifying the price, token, network, and recipient address.
- The client signs an EIP-712 `TransferWithAuthorization` message (for EVM networks) or an equivalent authorization (for Solana), authorizing the exact payment amount.
- The client retries the original request with the signed payment in the `X-PAYMENT` header.
- The server verifies the authorization and processes the request.
Client libraries like x402-fetch handle this flow automatically. See x402 Protocol for the full specification.
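The request, 402, sign, retry sequence can be sketched without any network or wallet code by passing the transport and the signer in as functions. Both `send` and `sign` below are hypothetical callables supplied by the caller. This is not the x402-fetch API, and real signing requires an EIP-712 capable wallet library.

```python
from typing import Callable, Dict, Tuple

# (status_code, json_body) as returned by a hypothetical transport function.
Response = Tuple[int, dict]
Send = Callable[[Dict[str, str]], Response]  # takes request headers
Sign = Callable[[dict], str]  # takes the 402 body, returns a signed payment


def request_with_payment(send: Send, sign: Sign) -> Response:
    """Follow the x402 flow once: request, sign on 402, retry with payment."""
    status, body = send({})  # initial request carries no payment
    if status != 402:
        return status, body  # endpoint did not ask for payment
    payment = sign(body)  # authorize exactly what the 402 body quotes
    return send({"X-PAYMENT": payment})  # retry with the signed payment
```

Keeping the signer separate from the transport means the same control flow applies to EVM and Solana authorizations; only the `sign` function changes.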
microUSDC
All money amounts in Stronghold are represented as microUSDC — string-encoded integers where 1 microUSDC equals 0.000001 USDC.
| microUSDC value | USDC equivalent | USD equivalent |
|---|---|---|
| `"1"` | 0.000001 USDC | $0.000001 |
| `"1000"` | 0.001 USDC | $0.001 |
| `"1000000"` | 1.0 USDC | $1.00 |
microUSDC values are always transmitted as strings, not numbers, to avoid floating-point precision issues. This is the canonical format for all money fields in the API, CLI output, and configuration.
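Converting between the string form and a display amount is safest with exact decimal arithmetic. The helpers below are a minimal sketch using Python's standard `decimal` module; the function names are illustrative and not part of any Stronghold client.

```python
from decimal import Decimal

MICRO_PER_USDC = 1_000_000  # 1 USDC = 1,000,000 microUSDC


def micro_to_usdc(micro: str) -> Decimal:
    """Convert a microUSDC string such as "1000" to an exact USDC amount."""
    return Decimal(int(micro)) / MICRO_PER_USDC


def usdc_to_micro(usdc: str) -> str:
    """Convert a USDC decimal string to a microUSDC string."""
    micro = Decimal(usdc) * MICRO_PER_USDC
    if micro != micro.to_integral_value():
        raise ValueError("amount has more than 6 decimal places")
    return str(int(micro))
```

Parsing through `int` and `Decimal` rather than `float` avoids the rounding issues that the string encoding is meant to prevent.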
Content Scanning vs. Output Scanning
Stronghold provides two distinct scan types for bidirectional protection:
Content Scanning (/v1/scan/content) scans incoming content for prompt injection attacks. This is what the transparent proxy does automatically — it scans every HTTP response before the agent reads it.
Output Scanning (/v1/scan/output) scans outgoing agent responses for credential leaks — API keys, passwords, tokens, connection strings, and other secrets that the agent might inadvertently include in its output.
Transparent Proxy
The transparent proxy is a network-level interceptor that sits between the agent and the internet. It uses operating system firewall rules (iptables/nftables on Linux, pf on macOS) to redirect all HTTP/HTTPS traffic from the agent’s dedicated system user through the Stronghold scanning pipeline.
Key properties:
- Operates outside the agent’s cognition — cannot be bypassed by prompt injection
- Scans content before the agent receives it
- Requires no code changes to the agent
- Works with any agent framework or language
- Adds `X-Stronghold-*` response headers with scan metadata
See Why Network-Level Scanning for the motivation behind this approach and Proxy Architecture for implementation details.
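Code running behind the proxy can read the scan metadata by filtering response headers on the `X-Stronghold-` prefix. The sketch below assumes the headers are available as a plain dictionary; the specific header names are not listed on this page, so none are hard-coded.

```python
def stronghold_headers(headers: dict) -> dict:
    """Return only the X-Stronghold-* headers from a response header mapping.

    HTTP header names are case-insensitive, so the prefix match ignores case.
    """
    prefix = "x-stronghold-"
    return {k: v for k, v in headers.items() if k.lower().startswith(prefix)}
```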