Model set
Which models appear
Filter the list, toggle individual checkboxes, or use Top 12 / All / Clear. The same selection updates every chart (charts show relative per-game scores only).
Reasoning view
Applies to Scores, Reasoning levels, Head-to-head, Per game heatmap, and Economics (when available).
Scores
Overall min–max range and mean; each small dot is one game (averaged across the reasoning levels that have scores).
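A minimal sketch of how the Scores summary could be derived, assuming scores are stored as nested dicts keyed by model, game, and reasoning level (a hypothetical layout; `per_game_scores` and `overall_summary` are illustrative names, not the dashboard's code). Each game's dot averages only the levels that actually have a score, and the min, max, and mean are taken over those dots.

```python
from statistics import mean


def per_game_scores(model_scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """One dot per game: mean over the reasoning levels that have a score.

    model_scores maps game -> {level: relative score}; a level with no
    score is simply absent (hypothetical data layout).
    """
    return {
        game: mean(levels.values())
        for game, levels in model_scores.items()
        if levels  # skip games where no level was scored
    }


def overall_summary(model_scores: dict[str, dict[str, float]]) -> tuple[float, float, float]:
    """Return (min, max, mean) over the per-game dots."""
    dots = list(per_game_scores(model_scores).values())
    return min(dots), max(dots), mean(dots)
```

For example, a model scoring 0.75 (Highest) and 0.25 (Medium) on Game 01 and 1.0 (None) on Game 02 gets dots of 0.5 and 1.0, so its summary is (0.5, 1.0, 0.75).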
Reasoning levels
Head-to-head
Summed across the Highest, Medium, and None match pools (disjoint runs). Each off-diagonal cell shows the win rate of the row model against the column opponent; the tooltip gives the W–L–D record. Use Reasoning view to show a single pool.
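A sketch of how one head-to-head cell could be computed, assuming each pool stores a W–L–D record per model pair. Because the pools are disjoint runs, their records can be added directly. How draws enter the win rate is not stated here, so `draw_weight` is an assumption (default 0, meaning a draw is not a win); `Record` and `cell_record` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Record:
    """W–L–D record of the row model against one column opponent."""
    wins: int = 0
    losses: int = 0
    draws: int = 0

    def __add__(self, other: "Record") -> "Record":
        # Pools are disjoint runs, so their records can simply be summed.
        return Record(self.wins + other.wins,
                      self.losses + other.losses,
                      self.draws + other.draws)

    def win_rate(self, draw_weight: float = 0.0) -> Optional[float]:
        """Wins over games played; a draw counts as `draw_weight` of a win (assumed)."""
        games = self.wins + self.losses + self.draws
        if games == 0:
            return None  # no matches: leave the cell empty
        return (self.wins + draw_weight * self.draws) / games


def cell_record(pools: Dict[str, Record], only: Optional[str] = None) -> Record:
    """Sum the Highest/Medium/None pools, or keep a single pool if `only` is set."""
    if only is not None:
        return pools.get(only, Record())
    total = Record()
    for record in pools.values():
        total = total + record
    return total
```

With records of 3–1–0 (Highest), 2–2–0 (Medium), and 1–0–1 (None), the summed cell is 6–3–1, a win rate of 0.6 when draws count as non-wins. Passing `only="Medium"` mirrors choosing one pool in Reasoning view.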
Per game
Strength heatmap (per-game scores averaged across reasoning levels)
Economics
Coming soon. Estimated cost versus score will be added with the benchmarks for Game 09 or later.
Costs and scores cover only games in the economics segment (Game 09 and later). Estimated cost is based on token usage and the published list prices at the time of the benchmark release. “All” pools the official reasoning runs for those games; Highest, Medium, and None use the cost of runs at that level only, plotted against the mean score on those games at that level. Use the Reasoning view control above to switch modes. Not every model has a cost estimate, and repair attempts can increase totals.
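A sketch of how one Economics point could be estimated, under several assumptions not stated above: prices are in USD per million input and output tokens, a point's cost is the sum over the selected runs, token counts already include any repair attempts (which is why repairs raise totals), and the mean score is averaged within each game first and then across games. `Run`, `run_cost`, `economics_point`, and the price table are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Run:
    model: str
    game: int           # game number, e.g. 9 for "Game 09"
    level: str          # "Highest", "Medium", or "None"
    input_tokens: int   # assumed to include tokens spent on repair attempts
    output_tokens: int
    score: float


def run_cost(run: Run, prices: Dict[str, Tuple[float, float]]) -> Optional[float]:
    """Estimated USD cost: tokens times list price (USD per million tokens, assumed)."""
    if run.model not in prices:
        return None  # no published price, so no estimate
    in_price, out_price = prices[run.model]
    return (run.input_tokens * in_price + run.output_tokens * out_price) / 1_000_000


def economics_point(runs: List[Run], model: str,
                    prices: Dict[str, Tuple[float, float]],
                    level: Optional[str] = None) -> Optional[Tuple[float, float]]:
    """(total estimated cost, mean score) for one model on Game 09+ runs.

    level=None pools every level ("All"); otherwise only runs at that level count.
    """
    selected = [r for r in runs
                if r.model == model and r.game >= 9
                and (level is None or r.level == level)]
    costs = [run_cost(r, prices) for r in selected]
    if not selected or any(c is None for c in costs):
        return None  # this model has no estimate in this mode
    # Average within each game first, then across games.
    by_game: Dict[int, List[float]] = {}
    for r in selected:
        by_game.setdefault(r.game, []).append(r.score)
    mean_score = mean(mean(scores) for scores in by_game.values())
    return sum(costs), mean_score
```

Returning `None` when any run lacks a price keeps a model off the chart rather than plotting a partial total; a real implementation might instead choose to skip unpriced runs.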