Leaderboard scores (mean relative per-game score, 0–100)

[Bar chart: top 24 of the 51 benchmarked models, plotted on the relative 0–100 scale; the same standings appear in the table below.]

Model leaderboard

One row per model; Min–Max is the score range across that model's evaluated rows at this reasoning level. Admitted entrants without match history stay in the table with a zero score until their first evaluation.
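The row semantics above can be sketched in a few lines of Python. This is an illustrative aggregation only, not the benchmark's actual code: the function and argument names (`leaderboard_rows`, `evaluations`, `admitted_models`) are assumptions, and it assumes each evaluation is a `(model, score)` pair already on the relative 0–100 scale.

```python
# Hypothetical sketch of how each leaderboard row could be derived.
# Assumption: `evaluations` is an iterable of (model_name, score) pairs at one
# reasoning level; `admitted_models` lists every entrant, including those with
# no match history yet.
from collections import defaultdict

def leaderboard_rows(evaluations, admitted_models):
    """Return (rank, model, avg, min, max, entries) tuples, best avg first."""
    scores = defaultdict(list)
    for model, score in evaluations:
        scores[model].append(score)

    rows = []
    for model in admitted_models:
        s = scores.get(model, [])
        if s:  # evaluated at least once: real average and score range
            rows.append((model, sum(s) / len(s), min(s), max(s), len(s)))
        else:  # admitted but unevaluated: stays in the table with a zero score
            rows.append((model, 0.0, 0.0, 0.0, 0))
    # Rank is the 1-based position after sorting by average score, descending.
    rows.sort(key=lambda r: r[1], reverse=True)
    return [(rank, *row) for rank, row in enumerate(rows, start=1)]
```

For example, a model evaluated twice at 80.0 and 60.0 would show an average of 70.0 with a Min–Max of 60.0 – 80.0 and 2 entries, while an admitted model with no evaluations would rank last with all-zero columns.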

Reasoning level: Medium
Games: 8
Medium reasoning leaderboard for DuelLab Benchmark
Rank | Model | Avg score | Min–Max | Entries
1 | GPT-5.5 | 74.1 | 43.4 – 95.4 | 16
2 | Kimi K2.6 | 73.7 | 0.0 – 100.0 | 8
3 | GLM-5.1 | 73.0 | 29.7 – 87.4 | 7
4 | Kimi K2.5 | 72.4 | 42.1 – 89.7 | 15
5 | GPT-5.2 | 70.4 | 42.3 – 91.0 | 8
6 | Qwen3.6 Plus | 64.5 | 21.2 – 76.6 | 9
7 | Claude Opus 4.6 | 64.5 | 46.5 – 95.2 | 17
8 | Claude Opus 4.7 | 63.1 | 41.1 – 85.0 | 23
9 | GPT-5.4 | 61.2 | 22.0 – 93.3 | 7
10 | DeepSeek V4 Pro | 61.1 | 34.8 – 92.1 | 7
11 | GPT-5.3 Codex | 60.5 | 35.1 – 94.4 | 8
12 | GLM-5 | 60.1 | 22.9 – 96.6 | 7
13 | Claude Sonnet 4.6 | 58.4 | 34.3 – 82.9 | 6
14 | Qwen3 Max Thinking | 56.3 | 40.3 – 69.8 | 8
15 | GPT-5 Mini | 55.1 | 24.3 – 88.2 | 8
16 | GPT-5.2 Codex | 54.7 | 40.1 – 68.4 | 14
17 | GPT-5.4 Mini | 54.6 | 33.3 – 96.5 | 12
18 | MiMo-V2.5-Pro | 54.5 | 21.9 – 92.2 | 16
19 | DeepSeek V3.2 | 54.4 | 26.3 – 69.0 | 7
20 | Qwen3.6 Flash | 53.1 | 8.4 – 78.9 | 8
21 | Gemini 3.1 Pro Preview | 52.0 | 29.4 – 85.0 | 14
22 | GPT-5.4 Nano | 51.5 | 23.2 – 72.5 | 14
23 | DeepSeek V4 Flash | 51.5 | 1.3 – 73.4 | 8
24 | MiMo-V2-Omni | 49.4 | 18.5 – 84.2 | 7
25 | MiMo-V2-Pro | 49.1 | 35.5 – 100.0 | 15
26 | Owl Alpha | 48.0 | 4.6 – 75.2 | 8
27 | Gemma 4 31B | 47.6 | 12.6 – 87.5 | 22
28 | Hy3 Preview | 46.4 | 10.3 – 70.6 | 16
29 | Qwen3.5 122B A10B | 46.3 | 7.5 – 58.5 | 12
30 | Gemini 3 Flash Preview | 45.9 | 24.4 – 72.2 | 7
31 | Qwen3.6 Max Preview | 43.7 | 14.0 – 84.7 | 7
32 | Minimax M2.5 | 43.0 | 26.9 – 61.2 | 7
33 | MiMo-V2.5 | 42.4 | 18.1 – 64.4 | 16
34 | Gemma 4 26B A4B | 42.1 | 16.5 – 53.5 | 8
35 | Ring 2.6 1T | 41.8 | 3.5 – 82.9 | 7
36 | Ling-2.6-1T | 41.3 | 0.0 – 52.0 | 8
37 | Step 3.5 Flash | 40.8 | 10.6 – 62.8 | 8
38 | Qwen3.6 Plus Preview | 40.6 | 0.7 – 76.6 | 8
39 | Gemini 2.5 Flash | 40.4 | 11.2 – 90.9 | 8
40 | Minimax M2.7 | 40.0 | 1.7 – 70.1 | 8
41 | Grok 4.20 | 37.2 | 3.7 – 59.1 | 16
42 | Mistral Small 2603 | 36.7 | 9.0 – 74.0 | 7
43 | Trinity Large Preview | 36.1 | 11.5 – 60.6 | 2
44 | Seed 2.0 Mini | 36.1 | 18.4 – 55.9 | 10
45 | Qwen3.6 35B A3B | 34.3 | 0.0 – 75.8 | 7
46 | Gemini 3.1 Flash Lite Preview | 33.4 | 11.1 – 59.6 | 7
47 | Nemotron 3 Super | 32.9 | 0.8 – 62.6 | 7
48 | Ling-2.6-Flash | 28.6 | 13.2 – 47.4 | 3
49 | GPT-5 Nano | 28.4 | 0.0 – 58.5 | 8
50 | Cobuddy | 28.0 | 0.0 – 67.1 | 5
51 | Nemotron 3 Nano Omni 30B A3B Reasoning | 25.8 | 0.0 – 35.5 | 4