
Frontier Models
Benchmark Comparison

REPORT #1: March 7, 2026
TL;DR:

The AI intelligence frontier remains intensely competitive: efficient open-source models from Chinese labs are rapidly closing the gap with previous-generation proprietary leaders at a fraction of the cost.

The Artificial Analysis composite index has GPT-5.4 and Gemini 3.1 Pro tied for the highest average intelligence score.

GPQA and AIME are increasingly uninformative as benchmarks because they are saturated: frontier models exceed expert human performance and score near-perfect.

Yuan3.0-Ultra Assessment: YuanLabAI's open-source MoE (released Mar 4, 2026) — scores omitted pending independent verification; release benchmarks (Docmatix, ChatRAG, SummEval, BFCL V3) are self-reported domain metrics, not universal structural evals.
At a Glance — Who Leads Where
Gemini Deep Think — 2 confirmed leads
Gemini 3.1 Pro & Flash — 2 confirmed leads
Reasoning & Knowledge
ARC-AGI-2
Abstract reasoning, novel patterns — human avg ~60%. Top scores verified by ARC Prize Foundation. † = not yet independently verified (GPT-5.4 Pro released Mar 5).
GPQA Diamond
PhD-level science Q&A — human expert ~65%, non-expert ~34%. Epoch AI confirms provider-reported scores are accurate. † = self-reported, not yet independently verified. Benchmark approaching saturation.
HLE
Humanity's Last Exam — 2,500 expert questions (multimodal). Sources vary: AA = Artificial Analysis standardized eval; SEAL = Scale AI; † = provider self-reported. With tools (provider-reported): Opus 4.6 53.0%, Gemini 3.1 Pro 51.4%.
AIME 2025
Competition math (no tools). Effectively saturated — multiple models at or near 100%. † = provider self-reported (Opus 4.6 score flagged for potential data contamination). Gemini 3.1 Pro: 91.2% without code execution, 100% with.
Software Engineering & Coding
SWE-bench Verified
Fixing real GitHub issues — independently verified via standardized submission to swebench.com. OpenAI declined to submit GPT-5.4 citing data contamination concerns.
Agentic Tasks & Computer Use
BrowseComp
Hard-to-find web info retrieval — OpenAI-authored benchmark. GPT-5.4 "Pro" is the high-compute variant (base GPT-5.4: 82.7%). Non-OpenAI scores from provider system cards.
Professional & Knowledge Work
MMMU-Pro
Visual & multimodal reasoning — 3,460 questions across 6 disciplines. Scores from provider system cards and technical reports.
Frontier Science & Composite Index
Artificial Analysis Intelligence Index
Independent composite of 10 evaluations (v4.0): GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, HLE, GPQA Diamond, CritPt. All scores independently evaluated by Artificial Analysis.
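A composite index like the one above can be sketched as an aggregation of per-evaluation scores. The equal weighting below and all of the score values are illustrative assumptions for the sketch, not Artificial Analysis's published methodology or actual data:

```python
# Minimal sketch of a composite intelligence index as an equal-weight mean
# of 0-100 benchmark scores. Equal weighting is an assumption; the numbers
# are placeholders, not real model results.

def composite_index(scores: dict[str, float]) -> float:
    """Average a set of 0-100 benchmark scores into one index value."""
    return sum(scores.values()) / len(scores)

# Hypothetical per-eval scores for one model across the ten v4.0 evals.
model_scores = {
    "GDPval-AA": 62.0, "tau2-Bench Telecom": 55.0, "Terminal-Bench Hard": 48.0,
    "SciCode": 51.0, "AA-LCR": 60.0, "AA-Omniscience": 58.0,
    "IFBench": 70.0, "HLE": 40.0, "GPQA Diamond": 92.0, "CritPt": 17.7,
}

print(round(composite_index(model_scores), 1))
```

One design consequence of equal weighting: a near-saturated eval (GPQA Diamond here) and a hard frontier eval (CritPt) move the index by the same amount per point, which is one reason composite providers periodically retire saturated benchmarks.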
API Pricing & Speed
Model | Input | Output | Speed | Source
Prices per 1M tokens via direct provider API. Speed = median output tokens/sec from Artificial Analysis where available.
* Gemini Deep Think requires Google AI Ultra ($249.99/mo). Yuan3.0-Ultra pricing pending independent verification.
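Per-1M-token pricing translates to request cost as follows. The price and token figures in this sketch are hypothetical placeholders (the per-model rows of the table are not reproduced above); check each provider's pricing page for current rates:

```python
# Cost of a single API request given per-1M-token input/output prices.
# All numeric values below are hypothetical examples, not quoted rates.

def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Return the USD cost of one request at the given per-1M-token rates."""
    return (input_tokens * input_price_per_m +
            output_tokens * output_price_per_m) / 1_000_000

# Hypothetical: $2.50/M input, $10.00/M output, 12k prompt / 1.5k completion.
print(f"${request_cost(12_000, 1_500, 2.50, 10.00):.4f}")
```

Note that output tokens typically cost several times more than input tokens, so long reasoning traces dominate the bill for thinking-mode models.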
A Note about Benchmark Selection
Certain benchmarks have been omitted from the visual comparison. These exclusions were made either because the evaluating labs use proprietary testing datasets that prevent open, independent reproduction, or because robust scores are not yet available across all surveyed frontier models. We prioritize fully verifiable metrics with universal participation.

Proprietary / Closed Datasets:
Pending Official Disclosure:
Incomplete Cross-Model Evaluation:

Detailed Bibliography & System Cards

[1] Anthropic. "Introducing Claude Opus 4.6." Feb 5, 2026. anthropic.com/news/claude-opus-4-6. System Card (213 pp.): anthropic.com/claude-opus-4-6-system-card. — SWE-bench Verified 80.8%, ARC-AGI-2 68.8%, GPQA Diamond 91.3%, HLE 40.0% (no tools), HLE 53.0% (with tools), Terminal-Bench 2.0 65.4%, OSWorld 72.7%, BrowseComp 84.0%, GDPval-AA 1606 Elo. SEAL leaderboard independently measures 34.44% (claude-opus-4-6-thinking-max).
[2] Google. "Gemini 3 Deep Think: Advancing science, research and engineering." Feb 12, 2026. blog.google/.../gemini-3-deep-think. — ARC-AGI-2 84.6% (verified by ARC Prize Foundation), HLE no-tools 48.4%, GPQA Diamond 93.8%, AIME 2025 ~95%, Codeforces Elo 3455, gold-medal-level IMO 2025.
[3] Google DeepMind. "Gemini 3.1 Pro — Model Card." Feb 19, 2026. deepmind.google/models/model-cards/gemini-3-1-pro. PDF: Model Card PDF. — GPQA Diamond 94.3%, ARC-AGI-2 77.1%, SWE-bench Verified 80.6%, SWE-bench Pro 54.2%, Terminal-Bench 2.0 68.5%, BrowseComp 85.9%, MMMU-Pro 81.0%, APEX-Agents 33.5%, MCP Atlas 69.2%, HLE 37.52% (SEAL: gemini-3-pro-preview).
[4] OpenAI. "Introducing GPT-5.4." Mar 5, 2026. openai.com/index/introducing-gpt-5-4. System Card: openai.com/gpt-5-4-system-card. — AIME 2025 100%, OSWorld 75.0% (surpasses human 72.4%), BrowseComp 89.3% (Pro), SWE-bench Pro 57.7%, MMMU-Pro 81.2%, GPQA Diamond 92.8%, GDPval pass-rate 83%. Note: OpenAI declined to release SWE-bench Verified scores citing data contamination concerns. BrowseComp 89.3% is the GPT-5.4 Pro score; base GPT-5.4 scores 82.7%.
[5] Artificial Analysis. Intelligence Index v4.0 & CritPt. Index: artificialanalysis.ai/.../intelligence-index | CritPt leaderboard: artificialanalysis.ai/evaluations/critpt | Paper: arXiv:2509.26574 | GitHub: github.com/CritPt-Benchmark/CritPt | CritPt announcement: @ArtificialAnlys tweet. — CritPt (Complex Research using Integrated Thinking – Physics Test): 71 unpublished research-level problems across 11 subfields, created by 60+ researchers from 30+ institutions. Gemini 3.1 Pro leads at 17.7%; Opus 4.6 12.6%. GPT-5.4 evaluation pending. — Intelligence Index v4.0 composites 10 evals: GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, HLE, GPQA Diamond, CritPt. GPT-5.4 and Gemini 3.1 Pro tied at 57; Opus 4.6 at 53.
[6] Independent benchmarks. SWE-bench: swebench.com | ARC Prize: arcprize.org | HLE: last-exam.ai | GPQA: github.com/idavidrein/gpqa | Terminal-Bench: tbench.ai | APEX-Agents: mercor.com/apex (arXiv:2601.14242) | MCP Atlas: scale.com/leaderboard/mcp_atlas (arXiv:2602.00933) | MMMU-Pro: mmmu-benchmark.github.io | OSWorld: osworld.ai | BrowseComp: browsecomp.github.io | AIME 2025: artificialanalysis.ai/evaluations/aime-2025
[7] Pricing sources. Anthropic API: anthropic.com/pricing | Google AI subscriptions: gemini.google/subscriptions | OpenAI API: developers.openai.com/api/docs/pricing
† AIME 2025 notes: Scores vary by evaluation conditions (with/without code execution tools). Opus 4.6 reports 99.8% — nearly matching GPT-5.4's 100%, but multiple sources flag potential data contamination for this benchmark on Opus. Gemini 3.1 Pro scores 91.2% without code execution (100% with). Deep Think ~95% without tools. See IntuitionLabs AIME 2025 analysis.
Note on METR: While OpenAI evaluates GPT-5.4 against METR's methodology internally, METR evaluates frontier models independently. As of March 6, 2026, METR has not published an official leaderboard update confirming GPT-5.4 as leading the 6.6-hour Time Horizon.