Frontier Benchmark Matrix
Source-backed May 2026 benchmark results taken from the Gemini 3.5 Flash evaluation sheet and linked public leaderboards where available.
Gemini 3.5 Flash leads MCP Atlas among the compared models at 83.6%.
Gemini 3.5 Flash improves on Gemini 3 Flash in every row of the matrix, including coding, UI control (OSWorld-Verified: 78.4% vs. 65.1%), expert tasks, and reasoning (Humanity's Last Exam: 40.2% vs. 33.7%).
GPT-5.5 still leads several hard benchmarks including Terminal-Bench 2.1, GDPval-AA, Blueprint-Bench 2, MRCR v2 (128k), and ARC-AGI-2.
| Benchmark | Area | Gemini 3.5 Flash | Gemini 3 Flash | Gemini 3.1 Pro | Claude Sonnet 4.6 | Claude Opus 4.7 | GPT-5.5 | Source |
|---|---|---|---|---|---|---|---|---|
| Terminal-Bench 2.1 | Coding | 76.2% | 58.0% | 70.3% | - | 66.1% | 78.2% | |
| SWE-Bench Pro (Public) | Coding | 53.9% | 48.4% | 54.2% | 53.0% | 64.3% | 58.6% | Google eval sheet |
| MCP Atlas | Agentic | 83.6% | 62.0% | 78.2% | 69.5% | 79.1% | 75.3% | |
| Toolathon | Agentic | 56.5% | 49.4% | - | - | - | 55.6% | Google eval sheet |
| OSWorld-Verified | UI Control | 78.4% | 65.1% | 76.2% | 72.5% | 78.0% | 78.7% | Google eval sheet |
| | Expert Tasks | 57.9% | 42.6% | 43.0% | 51.0% | 51.5% | 51.8% | |
| GDPval-AA | Expert Tasks | 1656 | 1204 | 1314 | 1674 | 1753 | 1773 | |
| CharXiv Reasoning | Multimodal | 84.2% | 80.3% | 83.3% | 70.5% | 82.1% | 84.1% | Google eval sheet |
| MMMU-Pro | Multimodal | 83.6% | 81.2% | 80.5% | 74.5% | 75.2% | 81.2% | Google eval sheet |
| Blueprint-Bench 2 | Multimodal | 33.6% | 0.0% | 26.5% | 6.7% | 24.5% | 36.2% | |
| MRCR v2 (128k avg) | Long Context | 77.3% | 67.2% | 84.9% | 84.9% | 59.3% | 94.8% | Google eval sheet |
| MRCR v2 (1M pointwise) | Long Context | 26.6% | 22.1% | 26.3% | Not supported | Not supported | Not supported | Google eval sheet |
| Humanity's Last Exam | Reasoning | 40.2% | 33.7% | 44.4% | 33.2% | 46.9% | 41.4% | Google eval sheet |
| ARC-AGI-2 | Reasoning | 72.1% | 33.6% | 77.1% | 58.3% | 75.8% | 85.0% | |
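The headline claims above can be checked against the matrix with a few lines of arithmetic. The sketch below is illustrative only: it copies a handful of rows from the table by hand, and the names `MODELS`, `ROWS`, `leader`, and `flash_gain` are hypothetical helpers, not part of any published tooling. A missing score (`-` in the table) is represented as `None`.

```python
from typing import Optional

# Column order matches the matrix header.
MODELS = [
    "Gemini 3.5 Flash",
    "Gemini 3 Flash",
    "Gemini 3.1 Pro",
    "Claude Sonnet 4.6",
    "Claude Opus 4.7",
    "GPT-5.5",
]

# A few rows transcribed from the table (percentages); None marks "-".
ROWS: dict[str, list[Optional[float]]] = {
    "SWE-Bench Pro (Public)": [53.9, 48.4, 54.2, 53.0, 64.3, 58.6],
    "Toolathon": [56.5, 49.4, None, None, None, 55.6],
    "OSWorld-Verified": [78.4, 65.1, 76.2, 72.5, 78.0, 78.7],
    "MRCR v2 (128k avg)": [77.3, 67.2, 84.9, 84.9, 59.3, 94.8],
    "Humanity's Last Exam": [40.2, 33.7, 44.4, 33.2, 46.9, 41.4],
}


def leader(scores: list[Optional[float]]) -> tuple[list[str], float]:
    """Return the model(s) with the highest reported score, ignoring gaps."""
    best = max(s for s in scores if s is not None)
    return [m for m, s in zip(MODELS, scores) if s == best], best


def flash_gain(scores: list[Optional[float]]) -> float:
    """Percentage-point gain of Gemini 3.5 Flash over Gemini 3 Flash."""
    return round(scores[0] - scores[1], 1)


if __name__ == "__main__":
    for name, scores in ROWS.items():
        models, best = leader(scores)
        print(f"{name}: leader={', '.join(models)} ({best}%), "
              f"Flash gain={flash_gain(scores):+.1f} pts")
```

Differences are reported in percentage points rather than relative percent, which keeps them directly comparable with the figures in the table.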
Methodology Notes
This page intentionally stays separate from the main LLM Leaderboard. The main leaderboard aggregates only the sources it explicitly claims to use. This matrix is a benchmark-by-benchmark reference built from the Gemini 3.5 Flash evaluation PDF, plus direct links to the public benchmark pages cited inside that PDF.