Gemini 3.5 Flash leads MCP Atlas among the compared models at 83.6%.

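Gemini 3.5 Flash improves on Gemini 3 Flash on every benchmark listed, spanning coding, UI control, expert tasks, and reasoning. The largest gains include Terminal-Bench 2.1 (58.0% to 76.2%) and ARC-AGI-2 (33.6% to 72.1%).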

GPT-5.5 still leads several hard benchmarks including Terminal-Bench 2.1, GDPval-AA, Blueprint-Bench 2, MRCR v2 (128k), and ARC-AGI-2.

Benchmark	Unit	Reference	Area	Gemini 3.5 Flash	Gemini 3 Flash	Gemini 3.1 Pro	Claude Sonnet 4.6	Claude Opus 4.7	GPT-5.5
Terminal-Bench 2.1	%	Source	Coding	76.2%	58.0%	70.3%	-	66.1%	78.2%
SWE-Bench Pro (Public)	%	Google eval sheet	Coding	53.9%	48.4%	54.2%	53.0%	64.3%	58.6%
MCP Atlas	%	Source	Agentic	83.6%	62.0%	78.2%	69.5%	79.1%	75.3%
Toolathon	%	Google eval sheet	Agentic	56.5%	49.4%	-	-	-	55.6%
OSWorld-Verified	%	Google eval sheet	UI Control	78.4%	65.1%	76.2%	72.5%	78.0%	78.7%
Finance Agent v2	%	Source	Expert Tasks	57.9%	42.6%	43.0%	51.0%	51.5%	51.8%
GDPval-AA	Elo	Source	Expert Tasks	1656	1204	1314	1674	1753	1773
CharXiv Reasoning	%	Google eval sheet	Multimodal	84.2%	80.3%	83.3%	70.5%	82.1%	84.1%
MMMU-Pro	%	Google eval sheet	Multimodal	83.6%	81.2%	80.5%	74.5%	75.2%	81.2%
Blueprint-Bench 2	%	Source	Multimodal	33.6%	0.0%	26.5%	6.7%	24.5%	36.2%
MRCR v2 (128k avg)	%	Google eval sheet	Long Context	77.3%	67.2%	84.9%	84.9%	59.3%	94.8%
MRCR v2 (1M pointwise)	%	Google eval sheet	Long Context	26.6%	22.1%	26.3%	Not supported	Not supported	Not supported
Humanity's Last Exam	%	Google eval sheet	Reasoning	40.2%	33.7%	44.4%	33.2%	46.9%	41.4%
ARC-AGI-2	%	Source	Reasoning	72.1%	33.6%	77.1%	58.3%	75.8%	85.0%
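
The headline claims can be rechecked directly against the matrix above. The following is a minimal Python sketch, not part of the original evaluation: the names `MODELS`, `SCORES`, `leader`, and `gain_over_gemini_3_flash` are hypothetical, and the numbers are transcribed by hand from the table, with `None` standing in for "-" and "Not supported".

```python
# Column order matches the matrix: one score per model, per benchmark.
MODELS = [
    "Gemini 3.5 Flash", "Gemini 3 Flash", "Gemini 3.1 Pro",
    "Claude Sonnet 4.6", "Claude Opus 4.7", "GPT-5.5",
]

# Scores transcribed from the table; None marks "-" or "Not supported".
# GDPval-AA is an Elo rating; every other row is a percentage.
SCORES = {
    "Terminal-Bench 2.1": [76.2, 58.0, 70.3, None, 66.1, 78.2],
    "SWE-Bench Pro (Public)": [53.9, 48.4, 54.2, 53.0, 64.3, 58.6],
    "MCP Atlas": [83.6, 62.0, 78.2, 69.5, 79.1, 75.3],
    "Toolathon": [56.5, 49.4, None, None, None, 55.6],
    "OSWorld-Verified": [78.4, 65.1, 76.2, 72.5, 78.0, 78.7],
    "Finance Agent v2": [57.9, 42.6, 43.0, 51.0, 51.5, 51.8],
    "GDPval-AA": [1656, 1204, 1314, 1674, 1753, 1773],
    "CharXiv Reasoning": [84.2, 80.3, 83.3, 70.5, 82.1, 84.1],
    "MMMU-Pro": [83.6, 81.2, 80.5, 74.5, 75.2, 81.2],
    "Blueprint-Bench 2": [33.6, 0.0, 26.5, 6.7, 24.5, 36.2],
    "MRCR v2 (128k avg)": [77.3, 67.2, 84.9, 84.9, 59.3, 94.8],
    "MRCR v2 (1M pointwise)": [26.6, 22.1, 26.3, None, None, None],
    "Humanity's Last Exam": [40.2, 33.7, 44.4, 33.2, 46.9, 41.4],
    "ARC-AGI-2": [72.1, 33.6, 77.1, 58.3, 75.8, 85.0],
}


def leader(benchmark):
    """Return (model, score) for the highest reported score on a benchmark."""
    scored = [
        (score, model)
        for model, score in zip(MODELS, SCORES[benchmark])
        if score is not None  # skip models with no reported result
    ]
    score, model = max(scored, key=lambda pair: pair[0])
    return model, score


def gain_over_gemini_3_flash(benchmark):
    """Return Gemini 3.5 Flash's score minus Gemini 3 Flash's score."""
    row = SCORES[benchmark]
    return round(row[0] - row[1], 1)


if __name__ == "__main__":
    for name in SCORES:
        model, score = leader(name)
        gain = gain_over_gemini_3_flash(name)
        print(f"{name}: leader={model} ({score}), 3.5 Flash gain={gain:+}")
```

Models without a reported score are excluded from the comparison rather than treated as zero, so a missing result never counts against a model. Ties resolve to the first model in column order.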

Methodology Notes

This page intentionally stays separate from the main LLM Leaderboard. The main leaderboard aggregates only the sources it explicitly claims to use. This matrix is a benchmark-by-benchmark reference built from the Gemini 3.5 Flash evaluation PDF, plus direct links to the public benchmark pages cited inside that PDF.