
Frontier Models
Benchmark Comparison

REPORT #1: March 7, 2026
TL;DR:

The AI intelligence frontier remains intensely competitive: efficient open-source models from Chinese labs are rapidly closing the gap with previous-generation proprietary leaders at a fraction of the cost.

The Artificial Analysis composite index has GPT-5.4 and Gemini 3.1 Pro tied for the highest average intelligence score.

GPQA and AIME are increasingly uninformative as benchmarks because they are saturated: frontier models exceed expert human performance and score near-perfect.

Yuan3.0-Ultra Assessment: YuanLabAI's open-source MoE (released Mar 4, 2026) — scores omitted pending independent verification; release benchmarks (Docmatix, ChatRAG, SummEval, BFCL V3) are self-reported domain metrics, not universal structural evals.
At a Glance — Who Leads Where
Gemini Deep Think — 2 confirmed leads
Gemini 3.1 Pro & Flash — 2 confirmed leads
Reasoning & Knowledge
ARC-AGI-2
Abstract reasoning, novel patterns — human avg ~60%. Top scores verified by ARC Prize Foundation. † = not yet independently verified (GPT-5.4 Pro released Mar 5).
GPQA Diamond
PhD-level science Q&A — human expert ~65%, non-expert ~34%. Epoch AI confirms provider-reported scores are accurate. † = self-reported, not yet independently verified. Benchmark approaching saturation.
HLE
Humanity's Last Exam — 2,500 expert questions (multimodal). Sources vary: AA = Artificial Analysis standardized eval; SEAL = Scale AI; † = provider self-reported. With tools (provider-reported): Opus 4.6 53.0%, Gemini 3.1 Pro 51.4%.
AIME 2025
Competition math (no tools). Effectively saturated — multiple models at or near 100%. † = provider self-reported (Opus 4.6 score flagged for potential data contamination). Gemini 3.1 Pro: 91.2% without code execution, 100% with.
Software Engineering & Coding
SWE-bench Verified
Fixing real GitHub issues — independently verified via standardized submission to swebench.com. OpenAI declined to submit GPT-5.4 citing data contamination concerns.
Agentic Tasks & Computer Use
BrowseComp
Hard-to-find web info retrieval — OpenAI-authored benchmark. GPT-5.4 "Pro" is the high-compute variant (base GPT-5.4: 82.7%). Non-OpenAI scores from provider system cards.
Professional & Knowledge Work
MMMU-Pro
Visual & multimodal reasoning — 3,460 questions across 6 disciplines. Scores from provider system cards and technical reports.
Frontier Science & Composite Index
Artificial Analysis Intelligence Index
Independent composite of 10 evaluations (v4.0): GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, HLE, GPQA Diamond, CritPt. All scores independently evaluated by Artificial Analysis.
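A composite index like the one above can be sketched as an aggregation of per-evaluation scores. The equal weighting below and all of the score values are illustrative assumptions for the sketch, not Artificial Analysis's published methodology or actual data:

```python
# Minimal sketch of a composite intelligence index as an equal-weight mean
# of 0-100 benchmark scores. Equal weighting is an assumption; the numbers
# are placeholders, not real model results.

def composite_index(scores: dict[str, float]) -> float:
    """Average a set of 0-100 benchmark scores into one index value."""
    return sum(scores.values()) / len(scores)

# Hypothetical per-eval scores for one model across the ten v4.0 evals.
model_scores = {
    "GDPval-AA": 62.0, "tau2-Bench Telecom": 55.0, "Terminal-Bench Hard": 48.0,
    "SciCode": 51.0, "AA-LCR": 60.0, "AA-Omniscience": 58.0,
    "IFBench": 70.0, "HLE": 40.0, "GPQA Diamond": 92.0, "CritPt": 17.7,
}

print(round(composite_index(model_scores), 1))
```

One design consequence of equal weighting: a near-saturated eval (GPQA Diamond here) and a hard frontier eval (CritPt) move the index by the same amount per point, which is one reason composite providers periodically retire saturated benchmarks.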
API Pricing & Speed
Model | Input | Output | Speed | Source
Prices per 1M tokens via direct provider API. Speed = median output tokens/sec from Artificial Analysis where available.
* Gemini Deep Think requires Google AI Ultra ($249.99/mo). Yuan3.0-Ultra pricing pending independent verification.
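Per-1M-token pricing translates to request cost as follows. The price and token figures in this sketch are hypothetical placeholders (the per-model rows of the table are not reproduced above); check each provider's pricing page for current rates:

```python
# Cost of a single API request given per-1M-token input/output prices.
# All numeric values below are hypothetical examples, not quoted rates.

def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Return the USD cost of one request at the given per-1M-token rates."""
    return (input_tokens * input_price_per_m +
            output_tokens * output_price_per_m) / 1_000_000

# Hypothetical: $2.50/M input, $10.00/M output, 12k prompt / 1.5k completion.
print(f"${request_cost(12_000, 1_500, 2.50, 10.00):.4f}")
```

Note that output tokens typically cost several times more than input tokens, so long reasoning traces dominate the bill for thinking-mode models.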
A Note about Benchmark Selection
Certain benchmarks have been omitted from the visual comparison. These exclusions were made either because the evaluating labs use proprietary testing datasets that prevent open, independent reproduction, or because robust scores are not yet available across all surveyed frontier models. We prioritize fully verifiable metrics with universal participation.

Proprietary / Closed Datasets:
Pending Official Disclosure:
Incomplete Cross-Model Evaluation:

Detailed Bibliography & System Cards

[1] Anthropic. "Introducing Claude Opus 4.6." Feb 5, 2026. anthropic.com/news/claude-opus-4-6. System Card (213 pp.): anthropic.com/claude-opus-4-6-system-card. — SWE-bench Verified 80.8%, ARC-AGI-2 68.8%, GPQA Diamond 91.3%, HLE 40.0% (no tools), HLE 53.0% (with tools), Terminal-Bench 2.0 65.4%, OSWorld 72.7%, BrowseComp 84.0%, GDPval-AA 1606 Elo. SEAL leaderboard independently measures 34.44% (claude-opus-4-6-thinking-max).
[2] Google. "Gemini 3 Deep Think: Advancing science, research and engineering." Feb 12, 2026. blog.google/.../gemini-3-deep-think. — ARC-AGI-2 84.6% (verified by ARC Prize Foundation), HLE no-tools 48.4%, GPQA Diamond 93.8%, AIME 2025 ~95%, Codeforces Elo 3455, gold-medal-level IMO 2025.
[3] Google DeepMind. "Gemini 3.1 Pro — Model Card." Feb 19, 2026. deepmind.google/models/model-cards/gemini-3-1-pro. PDF: Model Card PDF. — GPQA Diamond 94.3%, ARC-AGI-2 77.1%, SWE-bench Verified 80.6%, SWE-bench Pro 54.2%, Terminal-Bench 2.0 68.5%, BrowseComp 85.9%, MMMU-Pro 81.0%, APEX-Agents 33.5%, MCP Atlas 69.2%, HLE 37.52% (SEAL: gemini-3-pro-preview).
[4] OpenAI. "Introducing GPT-5.4." Mar 5, 2026. openai.com/index/introducing-gpt-5-4. System Card: openai.com/gpt-5-4-system-card. — AIME 2025 100%, OSWorld 75.0% (surpasses human 72.4%), BrowseComp 89.3% (Pro), SWE-bench Pro 57.7%, MMMU-Pro 81.2%, GPQA Diamond 92.8%, GDPval pass-rate 83%. Note: OpenAI declined to release SWE-bench Verified scores citing data contamination concerns. BrowseComp 89.3% is the GPT-5.4 Pro score; base GPT-5.4 scores 82.7%.
[5] Artificial Analysis. Intelligence Index v4.0 & CritPt. Index: artificialanalysis.ai/.../intelligence-index | CritPt leaderboard: artificialanalysis.ai/evaluations/critpt | Paper: arXiv:2509.26574 | GitHub: github.com/CritPt-Benchmark/CritPt | CritPt announcement: @ArtificialAnlys tweet. — CritPt (Complex Research using Integrated Thinking – Physics Test): 71 unpublished research-level problems across 11 subfields, created by 60+ researchers from 30+ institutions. Gemini 3.1 Pro leads at 17.7%; Opus 4.6 12.6%. GPT-5.4 evaluation pending. — Intelligence Index v4.0 composites 10 evals: GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, HLE, GPQA Diamond, CritPt. GPT-5.4 and Gemini 3.1 Pro tied at 57; Opus 4.6 at 53.
[6] Independent benchmarks. SWE-bench: swebench.com | ARC Prize: arcprize.org | HLE: last-exam.ai | GPQA: github.com/idavidrein/gpqa | Terminal-Bench: tbench.ai | APEX-Agents: mercor.com/apex (arXiv:2601.14242) | MCP Atlas: scale.com/leaderboard/mcp_atlas (arXiv:2602.00933) | MMMU-Pro: mmmu-benchmark.github.io | OSWorld: osworld.ai | BrowseComp: browsecomp.github.io | AIME 2025: artificialanalysis.ai/evaluations/aime-2025
[7] Pricing sources. Anthropic API: anthropic.com/pricing | Google AI subscriptions: gemini.google/subscriptions | OpenAI API: developers.openai.com/api/docs/pricing
† AIME 2025 notes: Scores vary by evaluation conditions (with/without code execution tools). Opus 4.6 reports 99.8% — nearly matching GPT-5.4's 100%, but multiple sources flag potential data contamination for this benchmark on Opus. Gemini 3.1 Pro scores 91.2% without code execution (100% with). Deep Think ~95% without tools. See IntuitionLabs AIME 2025 analysis.
Note on METR: While OpenAI evaluates GPT-5.4 against METR's methodology internally, METR evaluates frontier models independently. As of March 6, 2026, METR has not published an official leaderboard update confirming GPT-5.4 as leading the 6.6-hour Time Horizon.