About this version
Leaderboards not frozen yet
Benchmark games
What kinds of games
Games in active use today are mostly abstract strategy titles—structured, typically complete-information games that fit well with code-generation and match-play evaluation.
We plan to broaden the benchmark over time, adding a wider variety of game types as the pipeline matures, so future releases exercise models across more diverse rules and mechanics.
Scoring
How scores are computed
- Per-game: For each game, DuelLab computes Elo from match results, then derives a conservative score as rating − uncertainty. Scores are min–max normalized to 0–100 within that game's pool. The uncertainty column is a separate 0–100 index derived from each entrant's raw Elo uncertainty using a fixed scale (higher means less statistically certain); unlike scores, it does not depend on who else played that game and is not a min–max rank within the pool. For code matches, the default policy is fault-aware: clearly symmetric move-limit stalls count as draws, and all other fault outcomes are excluded from Elo.
- Overall: Each leaderboard row's overall score is the mean of its per-game normalized scores across games where it has at least one match. Its overall uncertainty column is the mean of those per-game uncertainty indices across the same games.
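As a rough sketch of the pipeline described above (function names and exact formulas are illustrative assumptions, not DuelLab's implementation):

```python
from statistics import mean

def conservative_scores(elo, sigma):
    """Per-game conservative rating: Elo minus its uncertainty."""
    return {name: elo[name] - sigma[name] for name in elo}

def normalize_0_100(raw):
    """Min-max normalize a game pool's conservative scores to 0-100."""
    lo, hi = min(raw.values()), max(raw.values())
    span = (hi - lo) or 1.0  # guard against a single-entrant pool
    return {name: 100.0 * (v - lo) / span for name, v in raw.items()}

def overall_score(per_game_scores):
    """Overall = mean of per-game normalized scores, counting only
    games where the row has at least one match (None = not played)."""
    played = [s for s in per_game_scores if s is not None]
    return mean(played) if played else None
```

The uncertainty index would bypass `normalize_0_100` entirely: it is scaled from each entrant's raw Elo uncertainty with a fixed (pool-independent) mapping.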
Aggregation
How model rows are grouped
The model table groups evaluated rows by model. Avg score is the mean overall score across that model's rows at a given reasoning level. Min–Max is the range from the lowest to the highest overall score across those rows. When a model has only one such row, a single value is shown instead of a range.
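The grouping above can be sketched as follows; the input shape (a list of `(model, overall_score)` rows at one reasoning level) is an assumption for illustration:

```python
from statistics import mean

def model_table(rows):
    """Group evaluated rows by model and compute Avg score and Min-Max.

    rows: iterable of (model, overall_score) pairs at one reasoning level.
    A single-row model gets a single value instead of a (min, max) range.
    """
    by_model = {}
    for model, score in rows:
        by_model.setdefault(model, []).append(score)
    return {
        model: {
            "avg": mean(scores),
            "min_max": scores[0] if len(scores) == 1
                       else (min(scores), max(scores)),
        }
        for model, scores in by_model.items()
    }
```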
Contributing counts on the public leaderboards are hierarchical. On official reasoning tracks, each per-game table includes a Contributing column showing how many raw entrants were merged into that cell for that game. (The column is omitted on Mixed per-game pages, where each row is one variant and the value would always be 1; those pages instead include a Reasoning column showing that variant's preset.) On a reasoning track, the model table's Entries column is the sum of those per-game counts for that model across the suite; for Mixed, each displayed row counts as one per game it appears in. On Overall, Entries is the sum of the three official track totals (highest, medium, none); the overall score itself is unchanged, remaining the mean of up to three per-track overall scores.
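A minimal sketch of that Entries hierarchy, with hypothetical helper names and input shapes:

```python
def track_entries(per_game_contributing):
    """Track-level Entries for one model: the sum of its per-game
    Contributing counts across the suite (game -> merged entrants)."""
    return sum(per_game_contributing.values())

def overall_entries(track_totals):
    """Overall Entries for one model: the sum of the three official
    track totals (highest, medium, none)."""
    return sum(track_totals)
```

For example, a model with Contributing counts of 2 and 3 on two games has a track total of 5; summing its three track totals gives its Overall Entries.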
A Mixed (cross-reasoning) view, when present, is a separate leaderboard: each row is one model at one reasoning-effort preset, matched against every other model-and-preset variant (all-vs-all cross-reasoning). Those matches use the same scoring rules, but they do not feed the single-effort boards or the Overall aggregate.
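The all-vs-all pairing over model-and-preset variants can be sketched like this (the function name and input shape are illustrative, and real scheduling details such as match counts and colors are omitted):

```python
from itertools import combinations

def mixed_pairings(variants):
    """All-vs-all schedule: every pair of distinct (model, preset)
    variants meets, including two presets of the same model."""
    return list(combinations(variants, 2))
```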
Reasoning
When levels look similar
For some models, scores or spreads across Highest, Medium, and None can be close together because the underlying reasoning-related controls do not differ much in practice for that provider or SKU, or because the task is already saturated at a given setting.
We are still adjusting per-model mappings and parameters tied to those levels. In a future release we plan to make the exact reasoning-related settings used for each evaluated model easier to find on the site.
Prompting and harness
Why draws and uncertainty can still be high
The benchmark's games are curated to favor titles that are typically less draw-prone under strong play, but what is measured is still model-generated programs run through a fixed prompt and execution harness. That stack is not perfect: implementation bugs, misread rules, overly safe heuristics, or repeated invalid moves can all produce long, symmetric play that ends in stalls counted as draws (see the scoring policy above), or simply many inconclusive outcomes.
When many matches tie or look similar from the rating system's perspective, uncertainty stays elevated even if the underlying rules are not especially drawish. The prompting and execution harness are improved on an ongoing basis: successive releases deliver clearer instructions, tighter validation, and fewer spurious outcomes, so residual draws and noisy uncertainty should be read against that steady progress rather than as a permanent ceiling.