At a Glance — Who Leads Where
Reasoning & Knowledge
ARC-AGI-2
Abstract reasoning, novel patterns — human avg ~60%.
Top scores verified by ARC Prize Foundation. † = not yet independently verified (GPT-5.4 Pro released Mar 5).
GPQA Diamond
PhD-level science Q&A — human expert ~65%, non-expert ~34%.
Epoch AI confirms provider-reported scores are accurate.
† = self-reported, not yet independently verified. Benchmark approaching saturation.
HLE
Humanity's Last Exam — 2,500 expert questions (multimodal).
Sources vary: AA = Artificial Analysis standardized eval; SEAL = Scale AI; † = provider self-reported.
With tools (provider-reported): Opus 4.6 53.0%, Gemini 3.1 Pro 51.4%.
AIME 2025
Competition math (no tools). Effectively saturated — multiple models at or near 100%.
† = provider self-reported (Opus 4.6 score flagged for potential data contamination).
Gemini 3.1 Pro: 91.2% without code execution, 100% with.
Software Engineering & Coding
SWE-bench Verified
Fixing real GitHub issues — independently verified via standardized submission to swebench.com.
OpenAI declined to submit GPT-5.4 citing data contamination concerns.
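For context on what a standardized submission involves, here is a minimal sketch of assembling a swebench.com-style predictions file. The field names and the instance ID follow the format documented in the SWE-bench repository, but treat them as assumptions here and check the current submission guidelines before use.

```python
import json

# Hypothetical model outputs: one unified-diff patch per SWE-bench instance ID.
predictions = [
    {
        "instance_id": "astropy__astropy-12907",      # illustrative task identifier from the dataset
        "model_name_or_path": "my-model-2026-03",     # label for the submitting system (placeholder)
        "model_patch": "diff --git a/astropy/...\n",  # the generated fix as a unified diff (truncated here)
    },
]

# The harness expects one JSON object per line (JSONL).
with open("all_preds.jsonl", "w") as f:
    for pred in predictions:
        f.write(json.dumps(pred) + "\n")
```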
Agentic Tasks & Computer Use
BrowseComp
Hard-to-find web info retrieval — OpenAI-authored benchmark.
GPT-5.4 "Pro" is the high-compute variant (base GPT-5.4: 82.7%). Non-OpenAI scores from provider
system cards.
Professional & Knowledge Work
MMMU-Pro
Visual & multimodal reasoning — 3,460 questions across 6 disciplines.
Scores from provider system cards and technical reports.
Frontier Science & Composite Index
Artificial Analysis Intelligence Index
Independent composite of 10 evaluations (v4.0): GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, HLE, GPQA Diamond, CritPt.
All scores independently evaluated by Artificial Analysis.
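For illustration, the sketch below shows one way such a composite could be computed: an equal-weighted mean of per-eval scores already normalized to a 0–100 scale. The equal weighting and the placeholder numbers are assumptions for this example only, not Artificial Analysis's published methodology or data.

```python
# Illustrative composite index: equal-weighted mean of per-eval scores,
# each assumed to be normalized to a 0-100 scale. Weights and scores are
# placeholders, not Artificial Analysis's actual methodology or results.
EVALS = [
    "GDPval-AA", "tau2-Bench Telecom", "Terminal-Bench Hard", "SciCode",
    "AA-LCR", "AA-Omniscience", "IFBench", "HLE", "GPQA Diamond", "CritPt",
]

def composite_index(scores: dict[str, float]) -> float:
    """Equal-weighted mean over the 10 component evals (0-100 scale)."""
    missing = [e for e in EVALS if e not in scores]
    if missing:
        raise ValueError(f"missing component scores: {missing}")
    return sum(scores[e] for e in EVALS) / len(EVALS)

# Hypothetical example: identical placeholder scores just to show the call shape.
example = {e: 57.0 for e in EVALS}
print(round(composite_index(example)))  # -> 57
```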
API Pricing & Speed
| Model | Input | Output | Speed | Source |
|---|---|---|---|---|
Prices per 1M tokens via direct provider API. Speed = median output tokens/sec from Artificial Analysis where available.
* Gemini Deep Think requires Google AI Ultra ($249.99/mo). Yuan3.0-Ultra pricing pending independent verification.
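To make the per-1M-token billing concrete, here is a minimal cost-arithmetic sketch; the prices, token counts, and speed figure are hypothetical placeholders, not quotes from any provider's price list.

```python
# Cost of a single request given per-1M-token prices (hypothetical numbers).
INPUT_PRICE_PER_M = 3.00    # USD per 1M input tokens (placeholder)
OUTPUT_PRICE_PER_M = 15.00  # USD per 1M output tokens (placeholder)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return USD cost: tokens are billed pro rata against the per-1M rate."""
    return (
        (input_tokens / 1_000_000) * INPUT_PRICE_PER_M
        + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M
    )

# Example: 20k-token prompt, 2k-token answer.
print(f"${request_cost(20_000, 2_000):.4f}")  # -> $0.0900

# Median output speed (tokens/sec) gives a rough generation-time estimate.
tokens_per_sec = 80.0  # placeholder median, Artificial Analysis-style measurement
print(f"~{2_000 / tokens_per_sec:.0f}s to generate 2k output tokens")  # -> ~25s
```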
A Note about Benchmark Selection
Certain benchmarks are omitted from the visual comparison, either because the evaluating labs use proprietary test sets that prevent open, independent reproduction, or because robust scores are not yet available for all surveyed frontier models. We prioritize fully verifiable metrics with universal participation.
Proprietary / Closed Datasets:
- GDPval: GPT-5.4 currently holds the lead (83.0% pass-rate).
- OSWorld: GPT-5.4 leads (75.0%), surpassing the human baseline.
- MCP Atlas: Gemini 3.1 Pro currently leads (69.2%).
- Terminal-Bench: Gemini 3.1 Pro leads (68.5%).
- CritPt: Gemini 3.1 Pro leads (17.7%), pending GPT-5.4 results.
Incomplete / Unverified Scores:
- Grok 4.20 (xAI, Feb 17, 2026): Multi-agent architecture model. Community-sourced benchmark figures are circulating (GPQA ~87.5%, AIME reportedly perfect), but xAI has not published an official system card or evaluation report as of March 7, 2026. Held pending confirmation.
Detailed Bibliography & System Cards
[1] Anthropic. "Introducing Claude Opus 4.6." Feb 5, 2026. anthropic.com/news/claude-opus-4-6.
System Card (213 pp.): anthropic.com/claude-opus-4-6-system-card.
— SWE-bench Verified 80.8%, ARC-AGI-2 68.8%, GPQA Diamond 91.3%, HLE 40.0% (no tools), HLE 53.0%
(with tools), Terminal-Bench 2.0 65.4%, OSWorld 72.7%, BrowseComp 84.0%, GDPval-AA 1606 Elo.
The SEAL leaderboard independently measures HLE at 34.44% (claude-opus-4-6-thinking-max).
[2] Google. "Gemini 3 Deep Think: Advancing science, research and engineering." Feb 12, 2026. blog.google/.../gemini-3-deep-think.
— ARC-AGI-2 84.6% (verified by ARC Prize Foundation), HLE no-tools 48.4%, GPQA Diamond 93.8%, AIME 2025
~95%, Codeforces Elo 3455, gold-medal-level IMO 2025.
[3] Google DeepMind. "Gemini 3.1 Pro — Model Card." Feb 19, 2026. deepmind.google/models/model-cards/gemini-3-1-pro.
— GPQA Diamond 94.3%, ARC-AGI-2 77.1%, SWE-bench Verified 80.6%, SWE-bench Pro 54.2%, Terminal-Bench 2.0
68.5%, BrowseComp 85.9%, MMMU-Pro 81.0%, APEX-Agents 33.5%, MCP Atlas 69.2%, HLE 37.52% (SEAL:
gemini-3-pro-preview).
[4] OpenAI. "Introducing GPT-5.4." Mar 5, 2026. openai.com/index/introducing-gpt-5-4.
System Card: openai.com/gpt-5-4-system-card.
— AIME 2025 100%, OSWorld 75.0% (surpasses human 72.4%), BrowseComp 89.3% (Pro), SWE-bench Pro 57.7%,
MMMU-Pro 81.2%, GPQA Diamond 92.8%, GDPval pass-rate 83%.
Note: OpenAI declined to release SWE-bench Verified scores citing data contamination concerns.
BrowseComp 89.3% is the GPT-5.4 Pro score; base GPT-5.4 scores 82.7%.
[5] Artificial Analysis. Intelligence Index v4.0 & CritPt.
Index: artificialanalysis.ai/.../intelligence-index |
CritPt leaderboard: artificialanalysis.ai/evaluations/critpt |
Paper: arXiv:2509.26574 |
GitHub: github.com/CritPt-Benchmark/CritPt |
CritPt announcement: @ArtificialAnlys tweet.
— CritPt (Complex Research using Integrated Thinking – Physics Test): 71 unpublished research-level
problems across 11 subfields, created by 60+ researchers from 30+ institutions. Gemini 3.1 Pro leads at
17.7%; Opus 4.6 12.6%. GPT-5.4 evaluation pending.
— Intelligence Index v4.0 combines 10 evals: GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, HLE, GPQA Diamond, CritPt. GPT-5.4 and Gemini 3.1 Pro are tied at 57; Opus 4.6 scores 53.
[6] Independent benchmarks. SWE-bench: swebench.com |
ARC Prize: arcprize.org |
HLE: last-exam.ai |
GPQA: github.com/idavidrein/gpqa |
Terminal-Bench: tbench.ai |
APEX-Agents: mercor.com/apex (arXiv:2601.14242) |
MCP Atlas: scale.com/leaderboard/mcp_atlas (arXiv:2602.00933) |
MMMU-Pro: mmmu-benchmark.github.io |
OSWorld: osworld.ai |
BrowseComp: browsecomp.github.io |
AIME 2025: artificialanalysis.ai/evaluations/aime-2025
[7] Pricing sources. Anthropic API: anthropic.com/pricing |
Google AI subscriptions: gemini.google/subscriptions |
OpenAI API: developers.openai.com/api/docs/pricing
† AIME 2025 notes: Scores vary by evaluation conditions (with/without code execution
tools). Opus 4.6 reports 99.8% — nearly matching GPT-5.4's 100%, but multiple sources flag potential
data contamination for this benchmark on Opus. Gemini 3.1 Pro scores 91.2% without code execution (100%
with). Deep Think ~95% without tools. See IntuitionLabs AIME 2025 analysis.
Note on METR: OpenAI reports GPT-5.4 results using METR's time-horizon methodology, but METR evaluates frontier models independently. As of March 6, 2026, METR has not published an official leaderboard update confirming GPT-5.4 as leading with a 6.6-hour time horizon.
