Frontier Benchmark Matrix

Source-backed May 2026 benchmark results taken from the Gemini 3.5 Flash evaluation sheet and linked public leaderboards where available.

Gemini 3.5 Flash leads MCP Atlas among the compared models at 83.6%.

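Gemini 3.5 Flash materially improves over Gemini 3 Flash across coding, UI control, expert tasks, and reasoning: for example, Terminal-Bench 2.1 rises from 58.0% to 76.2%, OSWorld-Verified from 65.1% to 78.4%, Finance Agent v2 from 42.6% to 57.9%, and ARC-AGI-2 from 33.6% to 72.1%.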

GPT-5.5 still leads several hard benchmarks including Terminal-Bench 2.1, GDPval-AA, Blueprint-Bench 2, MRCR v2 (128k), and ARC-AGI-2.

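| Benchmark | Unit | Source | Area | Gemini 3.5 Flash | Gemini 3 Flash | Gemini 3.1 Pro | Claude Sonnet 4.6 | Claude Opus 4.7 | GPT-5.5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Terminal-Bench 2.1 | % | Public benchmark page | Coding | 76.2% | 58.0% | 70.3% | - | 66.1% | 78.2% |
| SWE-Bench Pro (Public) | % | Google eval sheet | Coding | 53.9% | 48.4% | 54.2% | 53.0% | 64.3% | 58.6% |
| MCP Atlas | % | Public benchmark page | Agentic | 83.6% | 62.0% | 78.2% | 69.5% | 79.1% | 75.3% |
| Toolathon | % | Google eval sheet | Agentic | 56.5% | 49.4% | - | - | - | 55.6% |
| OSWorld-Verified | % | Google eval sheet | UI Control | 78.4% | 65.1% | 76.2% | 72.5% | 78.0% | 78.7% |
| Finance Agent v2 | % | Public benchmark page | Expert Tasks | 57.9% | 42.6% | 43.0% | 51.0% | 51.5% | 51.8% |
| GDPval-AA | Elo | Public benchmark page | Expert Tasks | 1656 | 1204 | 1314 | 1674 | 1753 | 1773 |
| CharXiv Reasoning | % | Google eval sheet | Multimodal | 84.2% | 80.3% | 83.3% | 70.5% | 82.1% | 84.1% |
| MMMU-Pro | % | Google eval sheet | Multimodal | 83.6% | 81.2% | 80.5% | 74.5% | 75.2% | 81.2% |
| Blueprint-Bench 2 | % | Public benchmark page | Multimodal | 33.6% | 0.0% | 26.5% | 6.7% | 24.5% | 36.2% |
| MRCR v2 (128k avg) | % | Google eval sheet | Long Context | 77.3% | 67.2% | 84.9% | 84.9% | 59.3% | 94.8% |
| MRCR v2 (1M pointwise) | % | Google eval sheet | Long Context | 26.6% | 22.1% | 26.3% | Not supported | Not supported | Not supported |
| Humanity's Last Exam | % | Google eval sheet | Reasoning | 40.2% | 33.7% | 44.4% | 33.2% | 46.9% | 41.4% |
| ARC-AGI-2 | % | Public benchmark page | Reasoning | 72.1% | 33.6% | 77.1% | 58.3% | 75.8% | 85.0% |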
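
The summary claims at the top of this page can be rechecked directly from the matrix. The following is a minimal Python sketch, not part of the original evaluation sheet: the names `MODELS`, `SCORES`, `leader`, and `flash_gain` are hypothetical helpers introduced here for illustration, the scores are transcribed from the matrix above, and it assumes higher is better on every benchmark (percentages and Elo alike). Entries shown as "-" or "Not supported" are stored as `None` and skipped.

```python
# Column order matches the matrix above.
MODELS = [
    "Gemini 3.5 Flash", "Gemini 3 Flash", "Gemini 3.1 Pro",
    "Claude Sonnet 4.6", "Claude Opus 4.7", "GPT-5.5",
]

# Scores transcribed from the matrix; None marks "-" or "Not supported".
SCORES = {
    "Terminal-Bench 2.1": [76.2, 58.0, 70.3, None, 66.1, 78.2],
    "SWE-Bench Pro (Public)": [53.9, 48.4, 54.2, 53.0, 64.3, 58.6],
    "MCP Atlas": [83.6, 62.0, 78.2, 69.5, 79.1, 75.3],
    "Toolathon": [56.5, 49.4, None, None, None, 55.6],
    "OSWorld-Verified": [78.4, 65.1, 76.2, 72.5, 78.0, 78.7],
    "Finance Agent v2": [57.9, 42.6, 43.0, 51.0, 51.5, 51.8],
    "GDPval-AA": [1656, 1204, 1314, 1674, 1753, 1773],
    "CharXiv Reasoning": [84.2, 80.3, 83.3, 70.5, 82.1, 84.1],
    "MMMU-Pro": [83.6, 81.2, 80.5, 74.5, 75.2, 81.2],
    "Blueprint-Bench 2": [33.6, 0.0, 26.5, 6.7, 24.5, 36.2],
    "MRCR v2 (128k avg)": [77.3, 67.2, 84.9, 84.9, 59.3, 94.8],
    "MRCR v2 (1M pointwise)": [26.6, 22.1, 26.3, None, None, None],
    "Humanity's Last Exam": [40.2, 33.7, 44.4, 33.2, 46.9, 41.4],
    "ARC-AGI-2": [72.1, 33.6, 77.1, 58.3, 75.8, 85.0],
}


def leader(benchmark: str) -> str:
    """Return the model with the highest reported score on a benchmark."""
    reported = [
        (score, model)
        for model, score in zip(MODELS, SCORES[benchmark])
        if score is not None
    ]
    # Assumes higher is better; ties resolve to the leftmost column.
    return max(reported, key=lambda pair: pair[0])[1]


def flash_gain(benchmark: str) -> float:
    """Gemini 3.5 Flash score minus Gemini 3 Flash score, rounded to 0.1."""
    new, old = SCORES[benchmark][0], SCORES[benchmark][1]
    return round(new - old, 1)


if __name__ == "__main__":
    for name in SCORES:
        print(f"{name}: leader={leader(name)}, Flash gain={flash_gain(name)}")
```

Run as a script, this prints one line per benchmark with the leading model and the Gemini 3.5 Flash gain over Gemini 3 Flash. Note that the gain for GDPval-AA is in Elo points, not percentage points.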

Methodology Notes

This page intentionally stays separate from the main LLM Leaderboard. The main leaderboard aggregates only the sources it explicitly claims to use. This matrix is a benchmark-by-benchmark reference built from the Gemini 3.5 Flash evaluation PDF, plus direct links to the public benchmark pages cited inside that PDF.