The AI Red Teaming Playbook: Methodology, Tools, and Deliverables
AI red teaming is not 'jailbreak the chatbot for fun'. It is a structured assurance exercise with a scope document, a threat model, a measurable attack plan, and a deliverable a CISO can sign. This is the playbook used by labs and serious consultancies in 2026.
Step 1 — Scope the engagement properly
- Asset: is it the base model, the fine-tune, the wrapping application, or the agent + tools surface?
- Threat actors: external user, malicious tenant, insider, supply-chain compromise.
- Harm categories in scope: CBRN, CSAM, fraud, self-harm, privacy, IP leakage, security misuse.
- Out-of-scope: brand-safety, hallucination quality, latency.
- Rules of engagement: rate limits, allowed tooling, kill switch, contact tree.
Step 2 — Build a threat model with MITRE ATLAS
MITRE ATLAS gives you a tactic/technique vocabulary for ML systems. Map each in-scope asset to relevant tactics: ML Model Access, Initial Access, Execution, Persistence, Defense Evasion, Exfiltration.
Step 3 — Execute attacks across four planes
- Prompt plane: direct jailbreaks, role-play, encoding, AFL, many-shot.
- Context plane: indirect injection through retrieved docs, tool outputs, files, images.
- Tool plane: argument injection, confused deputy, tool-rug-pull, MCP exploits.
- Training-data plane: backdoor probing, data-extraction membership inference (where in scope).
# Automated coverage starters
garak --model_type openai --model_name gpt-5 --probes encoding,promptinject
pyrit run --orchestrator red_team --target claude-3-5-sonnet
# Pair with the 300+ prompt corpus at /promptsStep 4 — Score with a real rubric
Use HarmBench or AILuminate categories for harm severity, and a binary 'attack succeeded' label per probe. Report Attack Success Rate (ASR) per category, per attack technique, and per defence layer (raw model, system prompt, classifier, output filter).
Step 5 — Write a deliverable that drives change
- Executive summary with ASR table and top three risks.
- Threat model diagram with attack paths annotated.
- Per-finding write-ups: technique, reproducer, evidence, severity, recommended fix.
- Defence recommendations mapped to OWASP LLM Top 10 controls.
- Re-test plan with regression prompts the team can run in CI.
FAQ
How long does a real AI red team take?
Two to six weeks for a single product. Frontier-model evaluations run 8–12 weeks with multiple specialists.
Do I need a HarmBench license?
HarmBench is open. AILuminate is open with a registration. Both are free for non-commercial benchmarking.
Browse 300+ cybersecurity prompts, 40+ Claude-compatible tools, and daily AI-security intel.