LLM Benchmark Comparison

Updated: January 2025

Context Window (tokens)

Maximum number of input tokens the model can process

API Pricing ($ per 1M tokens)

Input token cost. Output tokens typically cost 2-4x more than input tokens.
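
As a rough illustration of how the per-1M-token pricing translates into the cost of a single request, here is a minimal sketch. The function name `estimate_cost_usd`, the example price, and the default output multiplier of 3.0 (the midpoint of the 2-4x range above) are assumptions for illustration only, not values from any provider's price list.

```python
def estimate_cost_usd(input_tokens, output_tokens, input_price_per_m,
                      output_multiplier=3.0):
    """Estimate the cost of one request in US dollars.

    input_price_per_m is the input price in $ per 1M tokens.
    output_multiplier is an assumed ratio of output price to input
    price (hypothetical default of 3.0; check the provider's pricing).
    """
    output_price_per_m = input_price_per_m * output_multiplier
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# Hypothetical request: 200k input tokens, 50k output tokens,
# at an assumed input price of $2.00 per 1M tokens.
cost = estimate_cost_usd(200_000, 50_000, input_price_per_m=2.00)
print(f"${cost:.2f}")  # prints "$0.70"
```

Because output tokens are priced higher, the 50k output tokens here account for almost as much of the total as the 200k input tokens, so comparing models on input price alone can understate the cost of generation-heavy workloads.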

Reasoning Score (GPQA Diamond)

Performance on a complex reasoning benchmark

Coding Score (LiveCodeBench)

Performance on coding challenges and code generation

Math Score (AIME 2025)

Performance on mathematical reasoning problems

Provider Colors