Charts

The homepage and track pages carry the main benchmark story. Use this page for deeper comparison: multi-model selection, reasoning-specific views, per-game analysis, and, once Game 09 or later is in the benchmark, economics. Bookmark or share this page's URL (the full address in the address bar) to keep the same model selection across visits.
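
As a rough illustration of how a model selection can travel with the address, the sketch below stores the chosen models in a query string and reads them back. The parameter name `models`, the comma-separated encoding, and the example URL are assumptions made for illustration; the page's actual URL scheme is not documented here.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def selection_to_url(base_url, models):
    """Append the selected model names to the URL.

    Uses a hypothetical `models` query parameter (an assumption,
    not the page's confirmed format).
    """
    if not models:
        return base_url
    return f"{base_url}?{urlencode({'models': ','.join(models)})}"


def selection_from_url(url):
    """Recover the list of selected model names from a shared URL."""
    query = parse_qs(urlparse(url).query)
    raw = query.get("models", [""])[0]
    # Drop empty fragments so a URL without a selection yields [].
    return [name for name in raw.split(",") if name]


shared = selection_to_url("https://example.com/charts", ["model-a", "model-b"])
restored = selection_from_url(shared)
```

Anyone opening `shared` would get the same list back in `restored`, which is the behaviour the bookmark tip relies on.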

Which models appear

Filter, toggle checkboxes, or use Top 12 / All / Clear. The same selection updates every chart (relative per-game scores only).

Applies to Scores, Reasoning levels, Head-to-head, Per game heatmap, and Economics (when available).

Overall min–max range and mean; each small dot is one game (averaged across reasoning levels where scores exist).
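
The summary described above can be sketched as follows: average each game's score across the reasoning levels that have a score, then take the minimum, maximum, and mean of those per-game values. The data layout (a dict of games mapping to per-level scores, with `None` for a missing score) and the function names are hypothetical, and taking the overall mean over per-game values is an assumption about how the chart is computed.

```python
from statistics import mean


def per_game_scores(scores_by_level):
    """Average each game's score across reasoning levels that have a score.

    scores_by_level: {game: {level: score or None}} (hypothetical layout).
    Games with no scores at all are skipped.
    """
    result = {}
    for game, levels in scores_by_level.items():
        present = [s for s in levels.values() if s is not None]
        if present:
            result[game] = mean(present)
    return result


def overall_summary(scores_by_level):
    """Return the min, max, and mean of the per-game values, plus the dots."""
    dots = per_game_scores(scores_by_level)
    if not dots:
        return None
    values = list(dots.values())
    return {"min": min(values), "max": max(values), "mean": mean(values), "dots": dots}


example = {
    "Game 01": {"Highest": 0.75, "Medium": 0.25, "None": None},
    "Game 02": {"Highest": 1.0, "Medium": None, "None": None},
}
summary = overall_summary(example)
```

Here "Game 01" contributes one dot averaged over its two available levels, and "Game 02" contributes its single score unchanged.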
Reasoning levels
Summed across Highest, Medium, and None match pools (disjoint runs).

Off-diagonal cells use win rate for the row model against the column opponent (W–L–D in the tooltip). Use Reasoning view to show one pool.
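
A minimal sketch of one head-to-head cell, assuming each pool stores a (wins, losses, draws) record for the row model against the column opponent. Because the pools are disjoint runs, the records can simply be added. The win-rate formula used here (wins divided by all games, so draws count against the rate) is an assumption; the page does not state how draws are weighted.

```python
def pooled_record(pools):
    """Sum W-L-D records across disjoint match pools.

    pools: {pool_name: (wins, losses, draws)} (hypothetical layout).
    """
    wins = sum(record[0] for record in pools.values())
    losses = sum(record[1] for record in pools.values())
    draws = sum(record[2] for record in pools.values())
    return wins, losses, draws


def win_rate(wins, losses, draws):
    """Wins over all games played; None when no games exist.

    Assumption: draws count as non-wins. Another convention would
    award half a win per draw.
    """
    total = wins + losses + draws
    return wins / total if total else None


row_vs_column = {
    "Highest": (3, 1, 0),
    "Medium": (2, 2, 1),
    "None": (1, 0, 0),
}
record = pooled_record(row_vs_column)   # combined W-L-D for the tooltip
rate = win_rate(*record)                # value that would colour the cell
one_pool_rate = win_rate(*row_vs_column["Medium"])  # single-pool view
```

Restricting the calculation to a single entry of `row_vs_column`, as in `one_pool_rate`, corresponds to showing one pool rather than the summed view.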

Strength heatmap (per-game scores averaged across reasoning levels)
Coming soon

Estimated cost versus score will come with benchmarks of Game 09 or later.