// the scoreboard

Frontier Model Security Benchmarks

Where Claude and other frontier models stand on the benchmarks that matter for offensive and defensive security. Every number links to the primary source — no cherry-picking, no marketing claims.

last reviewed: 2026-05-30

Cybench

40 professional-grade CTF tasks (web, crypto, rev, pwn, forensics) curated from HackTheBox, Sekai, Glacier, HKCert. The standard academic test for autonomous offensive security.

higher is better

Claude 3.5 Sonnet · 17.5%
GPT-4o · 12.5%
Claude 3 Opus · 10%
Llama 3.1 405B · 7.5%
Gemini 1.5 Pro · 7.5%
Mixtral 8x22B · 5%

source: Cybench paper · Stanford CRFM (arXiv:2408.08926)
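
With only 40 tasks, each score above maps to a small whole number of solved tasks. The sketch below makes that arithmetic explicit, assuming each percentage is the share of the 40 tasks a model solved (the page does not state the metric, so this is an assumption). Under it, one task is worth 2.5 percentage points, and the gaps between neighbouring models come down to one or two tasks. The names `TOTAL_TASKS`, `scores_pct`, and `tasks_solved` are illustrative only.

```python
# Assumption: each listed score is the percentage of the 40 Cybench tasks solved.
TOTAL_TASKS = 40

# Scores copied from the list above.
scores_pct = {
    "Claude 3.5 Sonnet": 17.5,
    "GPT-4o": 12.5,
    "Claude 3 Opus": 10.0,
    "Llama 3.1 405B": 7.5,
    "Gemini 1.5 Pro": 7.5,
    "Mixtral 8x22B": 5.0,
}


def tasks_solved(pct: float, total: int = TOTAL_TASKS) -> int:
    """Convert a percentage score into a whole number of solved tasks."""
    # round() absorbs tiny floating-point error from the division.
    return round(pct / 100 * total)


solved = {model: tasks_solved(pct) for model, pct in scores_pct.items()}

for model, n in solved.items():
    print(f"{model}: {n}/{TOTAL_TASKS} tasks")
```

Read this way, the top entry corresponds to 7 of 40 tasks and the bottom entry to 2 of 40, so small differences in rank should be interpreted with that sample size in mind.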

// the launch log

Every AI Model Shipped to Market

Frontier and notable open-weight releases from every major lab — Anthropic, OpenAI, Google, Meta, xAI, DeepSeek, Mistral, Alibaba, Cohere, NVIDIA and more. Updated as new models drop.

57 of 57 models
2026 · 14 releases
2025 · 23 releases
Claude Sonnet 4.5
Anthropic · 2025-09
multimodal + computer use · ctx 200K

Best on SWE-bench Verified (77.2%)

Claude Opus 4.1
Anthropic · 2025-08
multimodal · ctx 200K
Claude Opus 4 / Sonnet 4
Anthropic · 2025-05
multimodal · ctx 200K
Claude 3.7 Sonnet
Anthropic · 2025-02
extended thinking · ctx 200K
GPT-5 / GPT-5 Mini / Nano
OpenAI · 2025-08
multimodal + reasoning · ctx 400K
GPT-4.1 / 4.1 Mini / Nano
OpenAI · 2025-04
multimodal · ctx 1M
o3 / o3-mini / o4-mini
OpenAI · 2025-01
reasoning · ctx 200K
Gemini 2.5 Pro / Flash / Flash-Lite
Google DeepMind · 2025-03
multimodal + reasoning · ctx 1M
Gemini 2.5 Flash Image (Nano Banana)
Google DeepMind · 2025-08
image gen + edit · ctx 1M
Gemini 2.0 Flash / Pro
Google DeepMind · 2025-02
multimodal + tools · ctx 1M
Grok 3 / Grok 3 Reasoning
xAI · 2025-02
multimodal · ctx 1M
DeepSeek V3.1 / V3.2
DeepSeek · 2025-08
MoE · open weights · ctx 128K
DeepSeek R1
DeepSeek · 2025-01
reasoning · open weights · ctx 128K

Disrupted reasoning-model pricing

Qwen3 / Qwen3-Coder / VL
Alibaba · 2025-05
open weights · ctx 256K
Llama 3.3 70B / Llama 4
Meta · 2025-04
MoE · open weights · ctx 128K–10M
Mistral Large 2.1 / Codestral 25
Mistral AI · 2025-03
code + text · ctx 128K
Hermes 4 (405B)
Nous Research · 2025-07
uncensored · open weights · ctx 131K
Phi-4 / Phi-4-mini
Microsoft · 2025-01
small reasoning · open weights · ctx 128K
Command A
Cohere · 2025-03
agentic · ctx 256K
Reka Flash 3
Reka AI · 2025-03
multimodal · open weights · ctx 128K
Kimi K2
Moonshot AI · 2025-07
agentic · open weights · ctx 128K
GLM-4.5 / GLM-4.6
Zhipu AI · 2025-07
open weights · ctx 128K
Falcon 3
TII · 2025-01
open weights · ctx 32K
2024 · 15 releases
2023 · 5 releases

// fresh research · auto-refreshed daily

Latest research & evals

Recent papers, benchmark updates and red-team write-ups relevant to frontier-model security.

updated 00:00 IST