AI Red Teaming Playbook (2026) — Methodology, Tools, Deliverables

AI red teaming is not 'jailbreak the chatbot for fun'. It is a structured assurance exercise with a scope document, a threat model, a measurable attack plan, and a deliverable a CISO can sign. This is the playbook used by labs and serious consultancies in 2026.

Step 1 — Scope the engagement properly

Asset: is it the base model, the fine-tune, the wrapping application, or the agent + tools surface?
Threat actors: external user, malicious tenant, insider, supply-chain compromise.
Harm categories in scope: CBRN, CSAM, fraud, self-harm, privacy, IP leakage, security misuse.
Out-of-scope: brand-safety, hallucination quality, latency.
Rules of engagement: rate limits, allowed tooling, kill switch, contact tree.

Step 2 — Build a threat model with MITRE ATLAS

MITRE ATLAS gives you a tactic/technique vocabulary for ML systems. Map each in-scope asset to relevant tactics: ML Model Access, Initial Access, Execution, Persistence, Defense Evasion, Exfiltration.

Skip OWASP LLM Top 10 for threat modelling — it is a finding taxonomy, not an attack-chain language.

Step 3 — Execute attacks across four planes

Prompt plane: direct jailbreaks, role-play, encoding, AFL, many-shot.
Context plane: indirect injection through retrieved docs, tool outputs, files, images.
Tool plane: argument injection, confused deputy, tool-rug-pull, MCP exploits.
Training-data plane: backdoor probing, data-extraction membership inference (where in scope).

# Automated coverage starters
garak --model_type openai --model_name gpt-5 --probes encoding,promptinject
pyrit run --orchestrator red_team --target claude-3-5-sonnet
# Pair with the 300+ prompt corpus at /prompts

Step 4 — Score with a real rubric

Use HarmBench or AILuminate categories for harm severity, and a binary 'attack succeeded' label per probe. Report Attack Success Rate (ASR) per category, per attack technique, and per defence layer (raw model, system prompt, classifier, output filter).

Step 5 — Write a deliverable that drives change

Executive summary with ASR table and top three risks.
Threat model diagram with attack paths annotated.
Per-finding write-ups: technique, reproducer, evidence, severity, recommended fix.
Defence recommendations mapped to OWASP LLM Top 10 controls.
Re-test plan with regression prompts the team can run in CI.

FAQ

How long does a real AI red team take?

Two to six weeks for a single product. Frontier-model evaluations run 8–12 weeks with multiple specialists.

Do I need a HarmBench license?

HarmBench is open. AILuminate is open with a registration. Both are free for non-commercial benchmarking.

The AI Red Teaming Playbook: Methodology, Tools, and Deliverables