Threat Model

Stronghold protects AI agents at the network level by scanning both incoming content (what the agent reads) and outgoing responses (what the agent produces). This page describes the specific threats it detects and the boundaries of that protection.

Incoming Threats (Content Scanning)

The /v1/scan/content endpoint and the transparent proxy detect the following attack categories in content that an agent is about to consume:

  • Prompt injection: instructions embedded in web pages, emails, or API responses that are designed to hijack the agent’s behavior (e.g., “ignore previous instructions and…”)
  • DAN jailbreaks: “Do Anything Now” and similar patterns that attempt to remove the agent’s safety constraints
  • System prompt extraction: attacks that trick the agent into revealing its system prompt or internal instructions
  • Instruction override / context manipulation: attempts to redefine the agent’s role or goals, or to manipulate its context window
  • Roleplay attacks: “act as”, “pretend to be”, and “you are now” patterns that shift the agent into an unsafe persona
  • Data exfiltration via markdown image injection: hidden markdown images whose URLs encode sensitive data, so that the data leaks when the image is rendered
  • Obfuscated attacks: malicious instructions hidden with Base64, ROT13, Unicode tricks, or other encoding and obfuscation techniques
  • Multi-turn attack sequences: coordinated attacks spread across multiple messages or interactions
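
To make the markdown image exfiltration category more concrete, here is a minimal sketch of the kind of heuristic a scanner could apply. It is an illustration only, not Stronghold's detection logic: the function name `suspicious_image_urls`, the regular expression, and the 32-character threshold are all assumptions chosen for the example.

```python
import re
from urllib.parse import urlparse, parse_qsl

# Matches inline markdown images such as ![alt](https://host/path?x=y)
# and captures the URL. Illustrative pattern, not an exhaustive parser.
MD_IMAGE = re.compile(r"!\[[^\]]*\]\(\s*<?([^)\s>]+)>?[^)]*\)")


def suspicious_image_urls(text: str, max_value_len: int = 32) -> list[str]:
    """Return image URLs whose query string carries an unusually long value.

    A long query value is only a crude hint that data is being smuggled
    out through the URL; the threshold is a hypothetical default.
    """
    flagged = []
    for match in MD_IMAGE.finditer(text):
        url = match.group(1)
        query = urlparse(url).query
        if any(len(value) > max_value_len for _, value in parse_qsl(query)):
            flagged.append(url)
    return flagged
```

A real scanner would need far more than a length check (for example, path-encoded payloads and allow-listed image hosts), so treat this as a picture of the attack shape rather than a defense.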

Outgoing Threats (Output Scanning)

The /v1/scan/output endpoint detects credentials and secrets that an agent might inadvertently include in its responses:

  • API keys and tokens
  • Passwords and secrets
  • Database connection strings
  • Private keys (SSH, PGP, crypto wallet keys)
  • AWS/cloud credentials (access keys, secret keys, session tokens)
  • Environment variable dumps
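
As a rough illustration of what output scanning for a few of these secret types can look like, the sketch below uses regular expressions for three well-known formats: AWS access key IDs (which start with `AKIA` or `ASIA`), PEM-style private key headers, and database URLs with an embedded password. The pattern set and the name `find_secrets` are assumptions for this example and say nothing about how the /v1/scan/output endpoint is implemented.

```python
import re

# Hypothetical pattern set covering three of the secret types listed above.
PATTERNS = {
    # AWS access key IDs: 4-letter prefix followed by 16 uppercase alphanumerics.
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    # PEM / PGP armor headers such as "-----BEGIN OPENSSH PRIVATE KEY-----".
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY(?: BLOCK)?-----"),
    # Database URLs that embed "user:password@host".
    "db_connection_string": re.compile(
        r"\b(?:postgres(?:ql)?|mysql|mongodb(?:\+srv)?)://[^\s:/@]+:[^\s@]+@\S+"
    ),
}


def find_secrets(text: str) -> list[str]:
    """Return the sorted names of all patterns that match somewhere in text."""
    return sorted(name for name, pattern in PATTERNS.items() if pattern.search(text))
```

Pattern matching of this kind catches secrets with a recognizable shape; free-form passwords and environment variable dumps generally need additional signals such as key names or entropy checks.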

Threat Categories

Each detection is classified into one of the following categories:

| Category | Direction | Description |
|---|---|---|
| instruction_override | Incoming | Attempts to override the agent’s instructions |
| system_extraction | Incoming | Attempts to extract the system prompt |
| context_manipulation | Incoming | Attempts to manipulate the agent’s context |
| jailbreak | Incoming | DAN and similar jailbreak patterns |
| roleplay_attack | Incoming | Persona-shifting attacks |
| data_exfil | Incoming | Data exfiltration via image injection or similar |
| obfuscation | Incoming | Encoded or obfuscated attack payloads |
| multiturn_attack | Incoming | Multi-turn coordinated attacks |
| credential_leak | Outgoing | Secrets, keys, or credentials in output |
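
Client code that consumes scan results may want to branch on these categories. The sketch below encodes the category-to-direction mapping listed above as a lookup table. The category strings come from this page; the `Direction` enum, the `CATEGORY_DIRECTION` dictionary, and the `direction_of` helper are hypothetical names, and the shape of the API response that would carry a category is not specified here.

```python
from enum import Enum


class Direction(Enum):
    INCOMING = "incoming"
    OUTGOING = "outgoing"


# Category names as documented above, mapped to the direction they apply to.
CATEGORY_DIRECTION = {
    "instruction_override": Direction.INCOMING,
    "system_extraction": Direction.INCOMING,
    "context_manipulation": Direction.INCOMING,
    "jailbreak": Direction.INCOMING,
    "roleplay_attack": Direction.INCOMING,
    "data_exfil": Direction.INCOMING,
    "obfuscation": Direction.INCOMING,
    "multiturn_attack": Direction.INCOMING,
    "credential_leak": Direction.OUTGOING,
}


def direction_of(category: str) -> Direction:
    """Look up the direction for a category, failing loudly on unknown names."""
    try:
        return CATEGORY_DIRECTION[category]
    except KeyError:
        raise ValueError(f"unknown threat category: {category!r}") from None
```

Raising on unknown names, rather than defaulting to one direction, keeps a client from silently mishandling a category added in a later release.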

Explicit Limitations

Stronghold is not a silver bullet. The following limitations are important to understand:

  • Novel zero-day techniques: Stronghold cannot detect prompt injection techniques that are not represented in its training data or heuristic rules. New attack methods may evade detection until the models and rules are updated.
  • Binary content: Images, PDFs, audio, video, and other binary formats are not scanned. Only text-based content is analyzed.
  • Content size limit: The API rejects content larger than 500 KB with an error. The transparent proxy uses a 1 MB threshold and silently passes oversized content through without scanning it.
  • Network-level protection only: The transparent proxy protects network traffic. It does not protect against attacks delivered via local files, in-process memory, or other non-network channels.
  • API endpoint timing: The /v1/scan/content API endpoint does not provide the same level of protection as the proxy. When the API is called directly, the agent has already read the content by the time the scan happens. The proxy, by contrast, intercepts content before it reaches the agent.
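
The two size limits behave differently, which is easy to miss. The sketch below models the documented outcomes for a payload of a given size on each path. The function names and outcome strings are hypothetical, and two details are assumptions: that "500KB" and "1MB" mean decimal units, and that the proxy threshold is exclusive (content of exactly 1 MB is still scanned).

```python
# Assumption: decimal units. The page does not say whether KB/MB are
# decimal (1000-based) or binary (1024-based).
API_LIMIT_BYTES = 500 * 1000
PROXY_LIMIT_BYTES = 1000 * 1000


def api_outcome(size_bytes: int) -> str:
    """Direct API call: content over the limit is rejected with an error."""
    return "rejected_with_error" if size_bytes > API_LIMIT_BYTES else "scanned"


def proxy_outcome(size_bytes: int) -> str:
    """Transparent proxy: content over the threshold passes through unscanned."""
    return "passed_unscanned" if size_bytes > PROXY_LIMIT_BYTES else "scanned"
```

For example, under these assumptions a 600,000-byte response is scanned by the proxy but rejected by the API, while a 2,000,000-byte response reaches the agent through the proxy without any scan. Callers that need a hard guarantee may want to enforce their own size cap upstream.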