Leaderboard scores (mean relative per-game score, 0–100)

[Chart: top 24 of 51 benchmarked models, ranked by mean relative per-game score on a relative 0–100 scale. The plotted values are the Avg score figures for ranks 1–24 in the overall leaderboard table.]

Overall leaderboard

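One row per model. Raw Elo is a pooled mean across public games. It is an advanced metric: compare Raw Elo values within a single game on the per-game pages, not across games. The By reasoning level column lists relative per-game scores for the highest, medium, and none reasoning levels; a dash (—) marks a level with no score shown.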
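To illustrate why a pooled Raw Elo is best read alongside the per-game pages, here is a minimal sketch. It assumes the pooled figure is an unweighted mean of per-game ratings, which the page does not confirm. The model names, game names, and ratings are hypothetical and are not DuelLab data.

```python
from statistics import mean

# Hypothetical per-game Elo ratings (illustrative only, not DuelLab data).
per_game_elo = {
    "model_a": {"game_x": 1700.0, "game_y": 1450.0},
    "model_b": {"game_x": 1550.0, "game_y": 1560.0},
}


def pooled_elo(ratings: dict[str, float]) -> float:
    """Unweighted mean of one model's per-game ratings (assumed pooling rule)."""
    return mean(ratings.values())


# Pooled rating per model.
pooled = {model: pooled_elo(ratings) for model, ratings in per_game_elo.items()}

# Leader within each game, which can disagree with the pooled ordering.
leaders = {
    game: max(per_game_elo, key=lambda m: per_game_elo[m][game])
    for game in ("game_x", "game_y")
}

print(pooled)
print(leaders)
```

In this made-up case model_a has the higher pooled rating, yet model_b leads game_y, so the pooled number alone would hide a per-game difference.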

Overall leaderboard for DuelLab Benchmark
Rank | Model | Avg score | Raw Elo | By reasoning level (highest / medium / none) | Entries
1 | Claude Opus 4.7 | 74.4 | 1685.3 | 77.5 / 63.1 / 82.8 | 54
2 | GPT-5.4 | 74.2 | 1694.9 | 82.5 / 61.2 / 78.9 | 35
3 | Kimi K2.6 | 69.9 | 1634.5 | 69.8 / 73.7 / 66.1 | 24
4 | GPT-5.5 | 66.5 | 1646.7 | 63.3 / 74.1 / 62.2 | 48
5 | GPT-5.2 | 66.1 | 1604.0 | 68.3 / 70.4 / 59.7 | 39
6 | Claude Opus 4.6 | 65.2 | 1639.2 | 63.6 / 64.5 / 67.6 | 61
7 | Deepseek V4 Pro | 65.0 | 1588.7 | 82.5 / 61.1 / 51.6 | 22
8 | GLM-5.1 | 63.1 | 1573.1 | 68.2 / 73.0 / 48.1 | 29
9 | GPT-5.3 Codex | 62.4 | 1581.7 | 62.2 / 60.5 / 64.6 | 39
10 | Gemini 3.1 Pro Preview | 62.3 | 1642.2 | 72.5 / 52.0 / — | 25
11 | Claude Sonnet 4.6 | 60.9 | 1591.4 | 54.9 / 58.4 / 69.4 | 27
12 | Qwen3.6 Plus | 58.6 | 1485.6 | 58.8 / 64.5 / 52.4 | 26
13 | Kimi K2.5 | 58.2 | 1531.7 | 45.4 / 72.4 / 56.7 | 53
14 | GPT-5.4 Nano | 55.6 | 1547.3 | 63.3 / 51.5 / 51.9 | 48
15 | GLM-5 | 55.5 | 1543.7 | 45.6 / 60.1 / 60.7 | 30
16 | MiMo-V2.5-Pro | 54.2 | 1522.2 | 52.3 / 54.5 / 55.7 | 48
17 | Owl Alpha | 53.3 | 1526.7 | 57.3 / 48.0 / 54.6 | 22
18 | Deepseek V4 Flash | 52.5 | 1516.8 | 53.3 / 51.5 / 52.6 | 24
19 | GPT-5.2 Codex | 51.1 | 1503.1 | — / 54.7 / 47.4 | 19
20 | MiMo-V2-Pro | 50.5 | 1499.7 | 45.0 / 49.1 / 57.4 | 48
21 | Minimax M2.7 | 49.8 | 1456.1 | 59.6 / 40.0 / — | 17
22 | GPT-5 Mini | 49.3 | 1494.4 | 42.2 / 55.1 / 50.7 | 35
23 | Hy3 Preview | 48.9 | 1467.0 | 62.7 / 46.4 / 37.7 | 46
24 | Gemini 3 Flash Preview | 48.7 | 1480.2 | 40.5 / 45.9 / 59.5 | 26
25 | DeepSeek V3.2 | 48.5 | 1460.6 | 49.3 / 54.4 / 41.7 | 28
26 | GPT-5.4 Mini | 48.5 | 1471.4 | 60.5 / 54.6 / 30.3 | 30
27 | Gemma 4 31B | 48.3 | 1464.7 | 50.9 / 47.6 / 46.3 | 63
28 | Ling-2.6-1T | 48.2 | 1497.9 | 51.8 / 41.3 / 51.5 | 23
29 | Qwen3.6 Max Preview | 47.6 | 1470.5 | 50.9 / 43.7 / 48.3 | 23
30 | Ring 2.6 1T | 46.7 | 1430.8 | 51.6 / 41.8 / — | 13
31 | Qwen3.6 Flash | 46.4 | 1440.6 | 33.4 / 53.1 / 52.8 | 22
32 | Qwen3.6 Plus Preview | 45.3 | 1421.6 | 50.1 / 40.6 / — | 16
33 | MiMo-V2.5 | 45.3 | 1438.9 | 42.3 / 42.4 / 51.2 | 47
34 | Qwen3 Max Thinking | 44.4 | 1488.6 | 37.5 / 56.3 / 39.3 | 27
35 | MiMo-V2-Omni | 43.1 | 1429.0 | 29.0 / 49.4 / 50.8 | 25
36 | Qwen3.5 122B A10B | 42.3 | 1379.7 | 45.9 / 46.3 / 34.9 | 32
37 | Step 3.5 Flash | 42.2 | 1461.8 | 48.6 / 40.8 / 37.1 | 24
38 | Gemini 2.5 Flash | 41.8 | 1403.5 | 41.3 / 40.4 / 43.6 | 23
39 | Grok 4.20 | 40.5 | 1390.1 | 40.1 / 37.2 / 44.1 | 46
40 | Qwen3.6 35B A3B | 40.4 | 1425.5 | 39.0 / 34.3 / 48.0 | 21
41 | Minimax M2.5 | 40.2 | 1423.8 | 43.8 / 43.0 / 33.9 | 18
42 | Nemotron 3 Super | 39.4 | 1418.3 | 35.3 / 32.9 / 50.0 | 25
43 | Gemini 3.1 Flash Lite Preview | 38.5 | 1370.5 | 42.4 / 33.4 / 39.8 | 24
44 | Mistral Small 2603 | 37.1 | 1431.5 | 38.3 / 36.7 / 36.3 | 22
45 | Gemma 4 26B A4B | 35.7 | 1400.4 | 39.6 / 42.1 / 25.5 | 22
46 | Seed 2.0 Mini | 32.2 | 1425.4 | 18.1 / 36.1 / 42.5 | 21
47 | GPT-5 Nano | 32.1 | 1359.5 | 32.8 / 28.4 / 35.1 | 38
48 | Cobuddy | 27.8 | 1317.1 | 27.5 / 28.0 / — | 12
49 | Ling-2.6-Flash | 27.7 | 1343.7 | 28.8 / 28.6 / 25.8 | 15
50 | Nemotron 3 Nano Omni 30B A3B Reasoning | 26.4 | 1309.4 | 26.9 / 25.8 / — | 9
51 | Trinity Large Preview | 21.4 | 1198.0 | 6.2 / 36.1 / 22.0 | 19
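
For readers who want to work with the By reasoning level values programmatically, the sketch below splits one cell into its three scores and reports which level scored highest. It assumes the values appear in the order highest, medium, none (as the column description suggests) and that a dash means no score is available for that level. The helper names `parse_levels` and `best_level` are hypothetical, and the sample cells are copied from rows 1, 10, and 19.

```python
from typing import Optional

# Assumed order of the three slash-separated values in each cell.
LEVELS = ("highest", "medium", "none")


def parse_levels(cell: str) -> dict[str, Optional[float]]:
    """Split a 'By reasoning level' cell into named scores; a dash becomes None."""
    parts = [p.strip() for p in cell.split("/")]
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} values, got {len(parts)}: {cell!r}")
    return {
        level: (None if part == "—" else float(part))
        for level, part in zip(LEVELS, parts)
    }


def best_level(cell: str) -> str:
    """Name of the reasoning level with the highest available score."""
    scores = {k: v for k, v in parse_levels(cell).items() if v is not None}
    return max(scores, key=scores.get)


rows = {
    "Claude Opus 4.7": "77.5 / 63.1 / 82.8",
    "Gemini 3.1 Pro Preview": "72.5 / 52.0 / —",
    "GPT-5.2 Codex": "— / 54.7 / 47.4",
}

for model, cell in rows.items():
    print(model, "->", best_level(cell))
```

Under these assumptions the three sample rows peak at different levels (none, highest, and medium respectively), which is the kind of spread the single Avg score column does not show.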