
Core Concepts

This page defines the key terms and concepts referenced throughout the Stronghold documentation.

Decision

Every scan returns a decision — the final verdict on whether the scanned content is safe.

| Decision | Meaning |
|----------|---------|
| ALLOW | No threats detected. Content is safe to process. |
| WARN | Elevated risk detected but confidence is not high enough to block. Content is passed through with warning metadata. |
| BLOCK | High-confidence threat detected. Content should not be processed. The transparent proxy withholds blocked content from the agent entirely. |
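
As an illustration of how a client might act on the verdict, here is a minimal sketch. It assumes the decision arrives as one of the plain strings `ALLOW`, `WARN`, or `BLOCK`; the `Decision` enum and the `should_process` helper are hypothetical names, not part of the Stronghold API.

```python
from enum import Enum


class Decision(str, Enum):
    """The three verdicts listed in the table above (string encoding assumed)."""

    ALLOW = "ALLOW"
    WARN = "WARN"
    BLOCK = "BLOCK"


def should_process(decision: str) -> bool:
    """Return True when content may be passed on to the agent."""
    verdict = Decision(decision)  # raises ValueError for an unknown verdict
    # ALLOW and WARN both pass content through; only BLOCK withholds it.
    return verdict is not Decision.BLOCK
```

Rejecting unknown verdicts with an exception, rather than treating them as safe, keeps the helper from silently allowing content if the set of decisions ever changes.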

Scores

Each scanning layer produces a score between 0 and 1, where higher values indicate greater threat confidence. The score keys differ between content scanning and output scanning, and output scanning additionally reports an integer count of findings.

Content scan scores (/v1/scan/content)

| Score | Layer | Description |
|-------|-------|-------------|
| combined | All layers | Weighted combination of all detection layers. This is the primary score used for the final decision. |
| heuristic | Heuristic | Pattern-matching score based on known injection signatures. |
| semantic | Semantic Similarity | Cosine similarity to known attack embeddings. |
| ml_confidence | ML/LLM Classification | Confidence from the LLM analysis layer. 0.0 when the LLM layer is disabled. |

The combined key is only present when the hybrid detector is active (semantic or LLM layers enabled). In heuristic-only mode, the primary score is heuristic.
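
Because `combined` can be absent, client code should not index it unconditionally. The sketch below shows one way to pick the primary score; it assumes the scores have already been parsed into a dictionary keyed by the names in the table above, and `primary_score` is a hypothetical helper name.

```python
from typing import Mapping


def primary_score(scores: Mapping[str, float]) -> float:
    """Return the score that drives the decision for a content scan."""
    # Hybrid mode exposes "combined"; heuristic-only mode does not.
    if "combined" in scores:
        return float(scores["combined"])
    return float(scores["heuristic"])
```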

Output scan scores (/v1/scan/output)

| Score | Layer | Description |
|-------|-------|-------------|
| credential_score | Credential Detection | Confidence that the output contains leaked credentials or secrets (0 to 1). |
| findings_count | Credential Detection | Number of distinct credential/secret patterns found in the output. |

The API response includes all individual scores alongside the decision, so you can inspect which layers contributed to the verdict.

Scanning Layers

Stronghold uses a 4-layer scanning pipeline. Each layer adds detection capability at the cost of additional latency:

| Layer | Method | Typical Latency | Description |
|-------|--------|-----------------|-------------|
| Heuristic | Pattern matching | <1ms | Regex and signature-based detection of known injection patterns. Fastest layer, catches obvious attacks. |
| ML Classification | Citadel/Hugot | ~5ms | Neural network trained on prompt injection datasets. Catches attacks that evade simple patterns. |
| Semantic Similarity | Embedding comparison | ~10ms | Computes embeddings and measures cosine similarity against a database of known attack vectors. Catches novel phrasings of known attack types. |
| LLM Analysis (optional) | OpenRouter LLM reasoning | ~500ms | Any OpenRouter-compatible LLM reasons about whether the content is an attack. Most capable but slowest. Disabled by default. |

Layers run in order. If an early layer produces a high-confidence BLOCK, later layers may be skipped for performance.
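
The early-exit behavior can be pictured with a toy pipeline. This is a sketch of the general idea only, not Stronghold's implementation: the `run_layers` function, the scorer callables, and the `0.9` threshold are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# A layer is a (name, scorer) pair; the scorer maps content to a 0..1 score.
Layer = Tuple[str, Callable[[str], float]]

BLOCK_THRESHOLD = 0.9  # hypothetical cutoff for a high-confidence BLOCK


def run_layers(
    content: str,
    layers: List[Layer],
    block_threshold: float = BLOCK_THRESHOLD,
) -> Dict[str, float]:
    """Run layers in order, stopping once one is confident enough to block."""
    scores: Dict[str, float] = {}
    for name, scorer in layers:
        score = scorer(content)
        scores[name] = score
        if score >= block_threshold:
            break  # skip the slower layers that follow
    return scores
```

Ordering the layers from cheapest to most expensive means the common case of an obvious attack pays only for the sub-millisecond heuristic check.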

Threat Categories

Stronghold classifies detected threats into the following categories:

| Category | Description |
|----------|-------------|
| instruction_override | Attempts to replace or override the agent’s system prompt or instructions. |
| system_extraction | Attempts to extract the agent’s system prompt, internal instructions, or configuration. |
| context_manipulation | Attempts to alter the agent’s understanding of its context, conversation history, or role. |
| jailbreak | Attempts to remove safety constraints or behavioral guardrails from the agent. |
| roleplay_attack | Uses fictional scenarios or role assignment to bypass the agent’s restrictions. |
| data_exfil | Attempts to exfiltrate data by encoding it in URLs, requests, or other output channels. |
| credential_leak | Agent response contains API keys, passwords, tokens, or other secrets. |
| obfuscation | Uses encoding, Unicode tricks, or other techniques to disguise an attack payload. |
| multiturn_attack | A coordinated attack spread across multiple turns of conversation to gradually shift agent behavior. |

The reason field in scan responses references these categories when a threat is detected.

x402

x402 is an HTTP-native payment protocol. Instead of requiring API keys or subscriptions, Stronghold uses x402 for per-request payment:

  1. The client sends a request to a paid endpoint.
  2. The server responds with 402 Payment Required and a JSON body specifying the price, token, network, and recipient address.
  3. The client signs an EIP-712 TransferWithAuthorization message (for EVM networks) or an equivalent authorization (for Solana), authorizing the exact payment amount.
  4. The client retries the original request with the signed payment in the X-PAYMENT header.
  5. The server verifies the authorization and processes the request.

Client libraries like x402-fetch handle this flow automatically. See x402 Protocol for the full specification.
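
For readers who want to see the shape of the five steps, here is a minimal sketch of the retry loop. It is not a working payment client: the `send` and `sign` callables are hypothetical stand-ins for an HTTP transport and a wallet signer, the real signing step is omitted, and the exact encoding of the `X-PAYMENT` value is left to the signer.

```python
import json
from typing import Callable, Dict, Tuple

# (status_code, body_text) as returned by a hypothetical transport function.
Response = Tuple[int, str]
Send = Callable[[Dict[str, str]], Response]  # takes extra request headers
Sign = Callable[[dict], str]                 # takes payment requirements


def request_with_payment(send: Send, sign: Sign) -> Response:
    """Send a request, paying once if the server answers 402."""
    status, body = send({})              # step 1: initial, unpaid request
    if status != 402:
        return status, body              # endpoint was free or request failed
    requirements = json.loads(body)      # step 2: price, token, network, recipient
    payment = sign(requirements)         # step 3: signed authorization (stubbed)
    return send({"X-PAYMENT": payment})  # step 4: retry with the payment header
```

Step 5, verification, happens on the server, so the client only sees the final response. Passing the transport and signer in as arguments keeps the flow testable without a network or a wallet.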

microUSDC

All monetary amounts in Stronghold are represented as microUSDC: string-encoded integers where 1 microUSDC equals 0.000001 USDC.

| microUSDC value | USDC equivalent | USD equivalent |
|-----------------|-----------------|----------------|
| "1" | 0.000001 USDC | $0.000001 |
| "1000" | 0.001 USDC | $0.001 |
| "1000000" | 1.0 USDC | $1.00 |

microUSDC values are always transmitted as strings, not numbers, to avoid floating-point precision issues. This is the canonical format for all money fields in the API, CLI output, and configuration.
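
A conversion between microUSDC strings and USDC amounts can stay exact by using integer and decimal arithmetic instead of floats. The helpers below are a sketch under that assumption; the names `micro_to_usdc` and `usdc_to_micro` are hypothetical and not part of any Stronghold library.

```python
from decimal import Decimal

MICRO_PER_USDC = 1_000_000  # 1 USDC = 1,000,000 microUSDC


def micro_to_usdc(value: str) -> Decimal:
    """Convert a microUSDC string such as "1000" to an exact USDC amount."""
    return Decimal(int(value)) / MICRO_PER_USDC


def usdc_to_micro(amount: str) -> str:
    """Convert a USDC amount such as "0.001" to a microUSDC string."""
    micro = Decimal(amount) * MICRO_PER_USDC
    if micro != micro.to_integral_value():
        raise ValueError("amount has more than 6 decimal places")
    return str(int(micro))
```

Taking the USDC amount as a string rather than a float avoids introducing a rounding error before the conversion even starts.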

Content Scanning vs. Output Scanning

Stronghold provides two distinct scan types for bidirectional protection:

Content Scanning (/v1/scan/content) scans incoming content for prompt injection attacks. This is what the transparent proxy does automatically — it scans every HTTP response before the agent reads it.

Output Scanning (/v1/scan/output) scans outgoing agent responses for credential leaks — API keys, passwords, tokens, connection strings, and other secrets that the agent might inadvertently include in its output.

Transparent Proxy

The transparent proxy is a network-level interceptor that sits between the agent and the internet. It uses operating system firewall rules (iptables/nftables on Linux, pf on macOS) to redirect all HTTP/HTTPS traffic from the agent’s dedicated system user through the Stronghold scanning pipeline.

Key properties:

  • Operates outside the agent’s cognition — cannot be bypassed by prompt injection
  • Scans content before the agent receives it
  • Requires no code changes to the agent
  • Works with any agent framework or language
  • Adds X-Stronghold-* response headers with scan metadata
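
Since the scan metadata travels in response headers, an agent or its wrapper can read it without calling the API. The sketch below collects every header with the `X-Stronghold-` prefix; the helper name and the specific header used in the usage example (`X-Stronghold-Decision`) are assumptions for illustration, so check the proxy documentation for the actual header names.

```python
from typing import Dict, Mapping

PREFIX = "x-stronghold-"


def stronghold_metadata(headers: Mapping[str, str]) -> Dict[str, str]:
    """Return X-Stronghold-* headers keyed by their lowercase suffix."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    return {
        name.lower()[len(PREFIX):]: value
        for name, value in headers.items()
        if name.lower().startswith(PREFIX)
    }
```

For example, `stronghold_metadata({"Content-Type": "text/html", "X-Stronghold-Decision": "WARN"})` returns `{"decision": "WARN"}`.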

See Why Network-Level Scanning for the motivation behind this approach and Proxy Architecture for implementation details.