About this version
Leaderboards not frozen yet
Benchmark games
What kinds of games
Games in active use today are mostly abstract strategy titles—structured, typically complete-information games that fit well with code-generation and match-play evaluation.
We plan to broaden the benchmark over time, adding a wider variety of game types as the pipeline matures, so future releases exercise models across more diverse rules and mechanics.
Scoring
How scores are computed
- Per-game: For each game, DuelLab computes Elo from match results, then derives a conservative score as rating − uncertainty. Scores are min–max normalized to 0–100 within that game's pool. The uncertainty column is a separate 0–100 index derived from each entrant's raw Elo uncertainty using a fixed scale (higher means less statistically certain); unlike scores, it does not depend on who else played that game and is not a min–max rank within the pool. For code matches, the default policy is fault-aware: clearly symmetric move-limit stalls count as draws, and all other fault outcomes are excluded from Elo.
- Overall: Each leaderboard row's overall score is the mean of its per-game normalized scores across games where it has at least one match. Its overall uncertainty column is the mean of those per-game uncertainty indices across the same games.
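As a rough sketch of the pipeline described above (function names and exact formulas are illustrative assumptions, not DuelLab's implementation):

```python
from statistics import mean

def conservative_scores(elo, sigma):
    """Per-game conservative rating: Elo minus its uncertainty."""
    return {name: elo[name] - sigma[name] for name in elo}

def normalize_0_100(raw):
    """Min-max normalize a game pool's conservative scores to 0-100."""
    lo, hi = min(raw.values()), max(raw.values())
    span = (hi - lo) or 1.0  # guard against a single-entrant pool
    return {name: 100.0 * (v - lo) / span for name, v in raw.items()}

def overall_score(per_game_scores):
    """Overall = mean of per-game normalized scores, counting only
    games where the row has at least one match (None = not played)."""
    played = [s for s in per_game_scores if s is not None]
    return mean(played) if played else None
```

The uncertainty index would bypass `normalize_0_100` entirely: it is scaled from each entrant's raw Elo uncertainty with a fixed (pool-independent) mapping.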
Aggregation
How model rows are grouped
The model table groups evaluated rows by model. Avg score is the mean overall score across that model's rows at a given reasoning level. Min–Max is the range from the lowest to the highest overall score across those rows. When a model has only one such row, a single value is shown instead of a range.
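The grouping above can be sketched as follows; the input shape (a list of `(model, overall_score)` rows at one reasoning level) is an assumption for illustration:

```python
from statistics import mean

def model_table(rows):
    """Group evaluated rows by model and compute Avg score and Min-Max.

    rows: iterable of (model, overall_score) pairs at one reasoning level.
    A single-row model gets a single value instead of a (min, max) range.
    """
    by_model = {}
    for model, score in rows:
        by_model.setdefault(model, []).append(score)
    return {
        model: {
            "avg": mean(scores),
            "min_max": scores[0] if len(scores) == 1
                       else (min(scores), max(scores)),
        }
        for model, scores in by_model.items()
    }
```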
Contributing counts on the public leaderboards are hierarchical. On official reasoning tracks, each per-game table includes a Contributing column showing how many raw entrants were merged into that cell for that game. (The column is omitted on Mixed per-game pages, where each row is one variant and the value would always be 1; those pages instead include a Reasoning column showing that variant's preset.) On a reasoning track, the model table's Entries column is the sum of those per-game counts for that model across the suite; for Mixed, each displayed row counts as one per game it appears in. On Overall, Entries is the sum of the three official track totals (highest, medium, none); the overall score itself is unchanged, remaining the mean of up to three per-track overall scores.
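A minimal sketch of that Entries hierarchy, with hypothetical helper names and input shapes:

```python
def track_entries(per_game_contributing):
    """Track-level Entries for one model: the sum of its per-game
    Contributing counts across the suite (game -> merged entrants)."""
    return sum(per_game_contributing.values())

def overall_entries(track_totals):
    """Overall Entries for one model: the sum of the three official
    track totals (highest, medium, none)."""
    return sum(track_totals)
```

For example, a model with Contributing counts of 2 and 3 on two games has a track total of 5; summing its three track totals gives its Overall Entries.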
A Mixed (cross-reasoning) view, when present, is a separate leaderboard: each row is one model at one reasoning-effort preset, matched against every other model-and-preset variant (all-vs-all cross-reasoning). Those matches use the same scoring rules, but they do not feed the single-effort boards or the Overall aggregate.
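The all-vs-all pairing over model-and-preset variants can be sketched like this (the function name and input shape are illustrative, and real scheduling details such as match counts and colors are omitted):

```python
from itertools import combinations

def mixed_pairings(variants):
    """All-vs-all schedule: every pair of distinct (model, preset)
    variants meets, including two presets of the same model."""
    return list(combinations(variants, 2))
```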
Reasoning
When levels look similar
For some models, scores or spreads across Highest, Medium, and None can be close together because the underlying reasoning-related controls do not differ much in practice for that provider or SKU, or because the task is already saturated at a given setting.
We are still adjusting per-model mappings and parameters tied to those levels. In a future release we plan to make the exact reasoning-related settings used for each evaluated model easier to find on the site.
Prompting and harness
Why draws and uncertainty can still be high
The benchmark's games are curated to favor titles that are typically less draw-prone under strong play, but what is measured is still model-generated programs run through a fixed prompt and execution harness. That stack is not perfect: implementation bugs, misread rules, overly safe heuristics, or repeated invalid moves can all produce long, symmetric play that ends in stalls counted as draws (see the scoring policy above), or simply many inconclusive outcomes.
When many matches tie or look similar from the rating system's perspective, uncertainty stays elevated even if the underlying rules are not especially drawish. The prompting and execution harness are improved on an ongoing basis: successive releases deliver clearer instructions, tighter validation, and fewer spurious outcomes, so residual draws and noisy uncertainty should be read against that steady progress rather than as a permanent ceiling.