Leaderboard scores (mean relative per-game score, 0–100)

[Chart: top 24 of the 50 benchmarked models, relative 0–100 scale. The full rankings appear in the table below.]

Model leaderboard

One row per model; Min–Max is the score range across that model's evaluated rows at this reasoning level. Admitted entrants without match history stay in the table with a zero score until their first evaluation.
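The aggregation rule above (per-model average, Min–Max range, entry count, and a zero score for entrants with no match history) can be sketched as follows. This is a minimal illustration, not the benchmark's actual implementation; the model names and score lists here are made up, and `leaderboard_row` is a hypothetical helper.

```python
def leaderboard_row(name, scores):
    """Aggregate one model's per-game relative scores (0-100) into a
    leaderboard row: (name, avg, min-max string, entries)."""
    if not scores:
        # Admitted entrant with no match history yet: zero score, no range.
        return (name, 0.0, "-", 0)
    avg = round(sum(scores) / len(scores), 1)
    lo, hi = min(scores), max(scores)
    return (name, avg, f"{lo:.1f} - {hi:.1f}", len(scores))

# Illustrative inputs only; not the benchmark's real per-game data.
results = {
    "Model A": [66.5, 100.0, 82.5],
    "Model B": [],  # admitted but not yet evaluated
}

# Rank by average score, descending, as in the table below.
table = sorted((leaderboard_row(n, s) for n, s in results.items()),
               key=lambda row: row[1], reverse=True)
```

With these inputs, Model A gets one row with its average and score range, while Model B stays in the table at 0.0 until its first evaluation.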

Reasoning level: Highest | Games: 8
Highest reasoning leaderboard for DuelLab Benchmark
Rank | Model | Avg score | Min–Max | Entries
1 | GPT-5.4 | 82.5 | 66.5 – 100.0 | 16
2 | Deepseek V4 Pro | 82.5 | 71.9 – 100.0 | 7
3 | Claude Opus 4.7 | 77.5 | 56.0 – 96.1 | 7
4 | Gemini 3.1 Pro Preview | 72.5 | 50.9 – 95.9 | 11
5 | Kimi K2.6 | 69.8 | 22.2 – 94.6 | 8
6 | GPT-5.2 | 68.3 | 34.0 – 82.4 | 14
7 | GLM-5.1 | 68.2 | 43.8 – 92.8 | 7
8 | Claude Opus 4.6 | 63.6 | 25.4 – 91.9 | 21
9 | GPT-5.4 Nano | 63.3 | 30.0 – 90.6 | 20
10 | GPT-5.5 | 63.3 | 37.7 – 94.0 | 16
11 | Hy3 Preview | 62.7 | 27.7 – 91.3 | 14
12 | GPT-5.3 Codex | 62.2 | 28.6 – 84.9 | 8
13 | GPT-5.4 Mini | 60.5 | 2.5 – 95.9 | 9
14 | Minimax M2.7 | 59.6 | 19.2 – 87.0 | 9
15 | Qwen3.6 Plus | 58.8 | 14.4 – 78.5 | 9
16 | Owl Alpha | 57.3 | 22.2 – 96.7 | 7
17 | Claude Sonnet 4.6 | 54.9 | 29.2 – 83.0 | 6
18 | Deepseek V4 Flash | 53.3 | 17.4 – 95.4 | 8
19 | MiMo-V2.5-Pro | 52.3 | 25.0 – 80.4 | 16
20 | Ling-2.6-1T | 51.8 | 22.4 – 95.4 | 7
21 | Ring 2.6 1T | 51.6 | 0.0 – 95.5 | 6
22 | Gemma 4 31B | 50.9 | 24.4 – 71.4 | 21
23 | Qwen3.6 Max Preview | 50.9 | 12.4 – 90.3 | 8
24 | Qwen3.6 Plus Preview | 50.1 | 18.2 – 100.0 | 8
25 | DeepSeek V3.2 | 49.3 | 9.7 – 96.7 | 7
26 | Step 3.5 Flash | 48.6 | 18.4 – 75.1 | 9
27 | Qwen3.5 122B A10B | 45.9 | 12.6 – 66.0 | 10
28 | GLM-5 | 45.6 | 9.8 – 84.7 | 7
29 | Kimi K2.5 | 45.4 | 22.6 – 66.0 | 15
30 | MiMo-V2-Pro | 45.0 | 26.0 – 98.6 | 15
31 | Minimax M2.5 | 43.8 | 0.0 – 72.0 | 7
32 | Gemini 3.1 Flash Lite Preview | 42.4 | 9.7 – 77.6 | 7
33 | MiMo-V2.5 | 42.3 | 9.9 – 74.1 | 15
34 | GPT-5 Mini | 42.2 | 17.6 – 61.9 | 8
35 | Gemini 2.5 Flash | 41.3 | 9.8 – 70.2 | 8
36 | Gemini 3 Flash Preview | 40.5 | 22.8 – 92.3 | 7
37 | Grok 4.20 | 40.1 | 11.1 – 77.7 | 16
38 | Gemma 4 26B A4B | 39.6 | 3.0 – 88.2 | 7
39 | Qwen3.6 35B A3B | 39.0 | 9.3 – 73.1 | 8
40 | Mistral Small 2603 | 38.3 | 0.0 – 100.0 | 8
41 | Qwen3 Max Thinking | 37.5 | 8.8 – 90.8 | 9
42 | Nemotron 3 Super | 35.3 | 15.5 – 72.0 | 6
43 | Qwen3.6 Flash | 33.4 | 18.8 – 55.2 | 8
44 | GPT-5 Nano | 32.8 | 15.6 – 63.4 | 8
45 | MiMo-V2-Omni | 29.0 | 8.4 – 47.6 | 7
46 | Ling-2.6-Flash | 28.8 | 7.0 – 74.9 | 7
47 | Cobuddy | 27.5 | 3.2 – 84.9 | 7
48 | Nemotron 3 Nano Omni 30B A3B Reasoning | 26.9 | 0.0 – 63.8 | 5
49 | Seed 2.0 Mini | 18.1 | 18.1 | 1
50 | Trinity Large Preview | 6.2 | 0.0 – 12.3 | 2