Detection Layers
Stronghold scans content through a four-layer pipeline, ordered from fastest to slowest. Each layer adds a different detection capability, and their scores combine into a final decision.
Pipeline Overview
| Layer | Method | Speed | Description |
|---|---|---|---|
| Heuristic | Pattern matching, keyword weights | <1ms | Detects known injection patterns, DAN jailbreaks, system prompt extraction |
| ML | Citadel via Hugot ONNX backend | ~5ms | BERT-based binary classification (INJECTION vs BENIGN) |
| Semantic | Embedding similarity search | ~10ms | Compares content against known attack embeddings |
| LLM | Optional analysis via OpenRouter | ~500ms | Final classification for ambiguous cases (requires config) |
Scoring and Thresholds
Each layer produces a score between 0.0 and 1.0. These scores are combined into a final decision using the following thresholds:
| Decision | Condition | Default Threshold |
|---|---|---|
| BLOCK | Combined score >= block threshold | 0.55 |
| WARN | Combined score >= warn threshold and < block threshold | 0.35 |
| ALLOW | Combined score < warn threshold | < 0.35 |
Thresholds are configurable via environment variables:
- STRONGHOLD_BLOCK_THRESHOLD: score at or above which content is blocked (default: 0.55)
- STRONGHOLD_WARN_THRESHOLD: score at or above which a warning is issued (default: 0.35)
These thresholds are applied directly in heuristic-only mode. When the hybrid detector is active, Citadel makes its own BLOCK/WARN/ALLOW decision internally using these thresholds as configuration.
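The threshold rules above can be sketched in a few lines. This is only an illustration of the documented comparison, not Stronghold's or Citadel's actual code: the function names `load_thresholds` and `decide` are hypothetical, and the sanity check on the threshold ordering is an added assumption.

```python
import os


def load_thresholds(env=os.environ):
    """Read both thresholds, falling back to the documented defaults."""
    block = float(env.get("STRONGHOLD_BLOCK_THRESHOLD", "0.55"))
    warn = float(env.get("STRONGHOLD_WARN_THRESHOLD", "0.35"))
    # Assumption: a warn threshold above the block threshold is a config error.
    if not 0.0 <= warn <= block <= 1.0:
        raise ValueError("expected 0.0 <= warn <= block <= 1.0")
    return block, warn


def decide(score, block=0.55, warn=0.35):
    """Map a combined score in [0.0, 1.0] to BLOCK, WARN, or ALLOW."""
    if score >= block:
        return "BLOCK"
    if score >= warn:
        return "WARN"
    return "ALLOW"
```

With the defaults, a score of exactly 0.55 is blocked and a score of exactly 0.35 produces a warning, because both comparisons are inclusive.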
Layer 1: Heuristic
The heuristic layer uses Citadel’s ThreatScorer with weighted keyword matching. It runs in under a millisecond and catches well-known attack patterns:
- “Ignore previous instructions” and variants
- DAN jailbreak patterns
- “Act as” / “pretend to be” roleplay triggers
- System prompt extraction phrases (“repeat your instructions”, “what is your system prompt”)
- Markdown image exfiltration patterns
- High-entropy payloads (potential encoded attacks)
This layer is fast enough to run on every request with negligible latency impact.
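As a rough illustration of weighted keyword matching, the sketch below sums the weights of every pattern that matches and caps the result at 1.0. The patterns, the weights, and the names `PATTERNS`, `heuristic_score`, and `shannon_entropy` are all hypothetical; they are not taken from Citadel's ThreatScorer, and how the real scorer folds entropy into its score is not specified here.

```python
import math
import re
from collections import Counter

# Hypothetical patterns and weights, chosen only for illustration.
PATTERNS = [
    (re.compile(r"ignore (all )?(previous|prior) instructions", re.I), 0.6),
    (re.compile(r"\b(act as|pretend to be)\b", re.I), 0.3),
    (re.compile(r"repeat your instructions|what is your system prompt", re.I), 0.5),
    (re.compile(r"\bDAN\b"), 0.4),
]


def heuristic_score(text):
    """Sum the weights of all matching patterns, capped at 1.0."""
    total = sum(weight for pattern, weight in PATTERNS if pattern.search(text))
    return min(total, 1.0)


def shannon_entropy(text):
    """Shannon entropy in bits per character; high values suggest encoded data."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

Precompiling the regular expressions once at import time keeps the per-request cost to a handful of linear scans, which is consistent with the sub-millisecond budget described above.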
Layer 2: ML Classification
The ML layer uses Citadel’s HybridDetector, which runs ONNX inference locally via the Hugot backend. Hugot is the Go ONNX runtime that Citadel uses under the hood to run a BERT-based model trained on prompt injection datasets. It classifies content as either INJECTION or BENIGN with a confidence score.
Key characteristics:
- Runs entirely locally — no external API calls
- ~5ms per classification
- Catches attacks that do not match known keyword patterns
- The ML classification layer runs automatically as part of Citadel’s hybrid detector. It cannot be independently toggled.
Layer 3: Semantic Similarity
The semantic layer computes embedding similarity between the input content and a database of known attack vectors. This catches paraphrased attacks that keyword matching misses — for example, an attacker who rephrases “ignore your instructions” as “disregard the directives you were given”.
Key characteristics:
- ~10ms per comparison
- Effective against novel phrasings of known attack types
- Enabled by default (STRONGHOLD_ENABLE_SEMANTICS=true)
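A minimal sketch of the similarity search, assuming cosine similarity over plain float vectors and taking the best match as the layer score. The embedding model, the similarity metric Stronghold actually uses, and the names `cosine` and `semantic_score` are assumptions made for illustration; the embeddings are expected to be computed elsewhere.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_score(embedding, attack_embeddings):
    """Highest similarity to any known attack embedding, clamped to [0.0, 1.0]."""
    best = max((cosine(embedding, v) for v in attack_embeddings), default=0.0)
    return max(0.0, min(best, 1.0))
```

Because a paraphrase such as "disregard the directives you were given" lands near the embedding of "ignore your instructions", its best-match similarity is high even though no keyword pattern fires.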
Layer 4: LLM Classification (Optional)
The LLM layer sends ambiguous content to an external language model via OpenRouter for nuanced classification. This is the most accurate but also the slowest and most expensive layer.
This layer is disabled by default and requires explicit configuration:
- STRONGHOLD_LLM_PROVIDER: the OpenRouter model to use (e.g., anthropic/claude-sonnet-4, openai/gpt-4o)
- STRONGHOLD_LLM_API_KEY: your OpenRouter API key
When enabled, this layer only runs for content that earlier layers flagged as ambiguous (scores in the WARN range). Clear BLOCK or ALLOW decisions from earlier layers skip the LLM entirely.
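The gating rule can be sketched as follows, assuming the default thresholds. The function `classify` and its `llm_classify` callback are hypothetical stand-ins; the real OpenRouter request and the way the LLM verdict is merged back into the result are not shown in this document.

```python
def classify(score, llm_classify=None, block=0.55, warn=0.35):
    """Return a decision, calling the LLM only for scores in the WARN range.

    llm_classify is a hypothetical zero-argument callable that returns
    "BLOCK", "WARN", or "ALLOW"; None means the LLM layer is not configured.
    """
    if score >= block:
        return "BLOCK"  # clear decision, LLM skipped
    if score < warn:
        return "ALLOW"  # clear decision, LLM skipped
    if llm_classify is None:
        return "WARN"   # ambiguous, but the LLM layer is disabled
    return llm_classify()
```

Restricting the call to the WARN range means the ~500ms round trip is paid only for the small share of requests that the faster layers could not settle.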
When this layer is disabled (the default), the ml_confidence score in the response will be 0.0. This score only reflects the LLM layer’s confidence and is distinct from the ONNX/Hugot ML classification in Layer 2.
Output Scanning
Output scanning (credential detection) uses a separate pipeline based on Citadel’s OutputScanner. It uses pattern-based detection for:
- API keys and tokens (AWS, GCP, GitHub, Stripe, etc.)
- Passwords and secrets
- Database connection strings
- Private keys (SSH, PGP, cryptocurrency)
- Environment variable dumps
Output scanning does not use the four-layer pipeline described above — it relies on purpose-built pattern matching optimized for secret detection.
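To make the pattern-based approach concrete, here is a small sketch covering a few of the categories listed above. The regular expressions reflect widely known public formats (AWS access key IDs starting with AKIA, GitHub personal access tokens starting with ghp_, PEM-style private key headers), but the pattern set and the names `SECRET_PATTERNS` and `scan_output` are illustrative assumptions, not Citadel's OutputScanner rules.

```python
import re

# Illustrative patterns only; a production scanner needs a far larger set.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "private_key": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH |PGP )?PRIVATE KEY(?: BLOCK)?-----"
    ),
    "db_connection_string": re.compile(
        r"\b(?:postgres(?:ql)?|mysql|mongodb)://[^\s:@/]+:[^\s@/]+@[^\s/]+"
    ),
}


def scan_output(text):
    """Return the sorted names of all secret categories found in text."""
    return sorted(
        name for name, pattern in SECRET_PATTERNS.items() if pattern.search(text)
    )
```

Calling `scan_output` on a model response returns an empty list when nothing matches, so a caller can treat any non-empty result as a finding.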