Leaderboard scores (mean relative per-game score, 0–100)

[Chart: top 24 of the 50 benchmarked models, relative 0–100 scale. The full rankings appear in the table below.]

Model leaderboard

One row per model; Min–Max is the score range across that model's evaluated rows at this reasoning level. Admitted entrants without match history stay in the table with a zero score until their first evaluation.
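The aggregation rule above (per-model average, Min–Max range, entry count, and a zero score for entrants with no match history) can be sketched as follows. This is a minimal illustration, not the benchmark's actual implementation; the model names and score lists here are made up, and `leaderboard_row` is a hypothetical helper.

```python
def leaderboard_row(name, scores):
    """Aggregate one model's per-game relative scores (0-100) into a
    leaderboard row: (name, avg, min-max string, entries)."""
    if not scores:
        # Admitted entrant with no match history yet: zero score, no range.
        return (name, 0.0, "-", 0)
    avg = round(sum(scores) / len(scores), 1)
    lo, hi = min(scores), max(scores)
    return (name, avg, f"{lo:.1f} - {hi:.1f}", len(scores))

# Illustrative inputs only; not the benchmark's real per-game data.
results = {
    "Model A": [66.5, 100.0, 82.5],
    "Model B": [],  # admitted but not yet evaluated
}

# Rank by average score, descending, as in the table below.
table = sorted((leaderboard_row(n, s) for n, s in results.items()),
               key=lambda row: row[1], reverse=True)
```

With these inputs, Model A gets one row with its average and score range, while Model B stays in the table at 0.0 until its first evaluation.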

Reasoning level: Highest | Games: 8
Highest reasoning leaderboard for DuelLab Benchmark
Rank | Model | Avg score | Min–Max | Entries
1 | GPT-5.4 | 82.5 | 66.5 – 100.0 | 16
2 | Deepseek V4 Pro | 82.5 | 71.9 – 100.0 | 7
3 | Claude Opus 4.7 | 77.5 | 56.0 – 96.1 | 7
4 | Gemini 3.1 Pro Preview | 72.5 | 50.9 – 95.9 | 11
5 | Kimi K2.6 | 69.8 | 22.2 – 94.6 | 8
6 | GPT-5.2 | 68.3 | 34.0 – 82.4 | 14
7 | GLM-5.1 | 68.2 | 43.8 – 92.8 | 7
8 | Claude Opus 4.6 | 63.6 | 25.4 – 91.9 | 21
9 | GPT-5.4 Nano | 63.3 | 30.0 – 90.6 | 20
10 | GPT-5.5 | 63.3 | 37.7 – 94.0 | 16
11 | Hy3 Preview | 62.7 | 27.7 – 91.3 | 14
12 | GPT-5.3 Codex | 62.2 | 28.6 – 84.9 | 8
13 | GPT-5.4 Mini | 60.5 | 2.5 – 95.9 | 9
14 | Minimax M2.7 | 59.6 | 19.2 – 87.0 | 9
15 | Qwen3.6 Plus | 58.8 | 14.4 – 78.5 | 9
16 | Owl Alpha | 57.3 | 22.2 – 96.7 | 7
17 | Claude Sonnet 4.6 | 54.9 | 29.2 – 83.0 | 6
18 | Deepseek V4 Flash | 53.3 | 17.4 – 95.4 | 8
19 | MiMo-V2.5-Pro | 52.3 | 25.0 – 80.4 | 16
20 | Ling-2.6-1T | 51.8 | 22.4 – 95.4 | 7
21 | Ring 2.6 1T | 51.6 | 0.0 – 95.5 | 6
22 | Gemma 4 31B | 50.9 | 24.4 – 71.4 | 21
23 | Qwen3.6 Max Preview | 50.9 | 12.4 – 90.3 | 8
24 | Qwen3.6 Plus Preview | 50.1 | 18.2 – 100.0 | 8
25 | DeepSeek V3.2 | 49.3 | 9.7 – 96.7 | 7
26 | Step 3.5 Flash | 48.6 | 18.4 – 75.1 | 9
27 | Qwen3.5 122B A10B | 45.9 | 12.6 – 66.0 | 10
28 | GLM-5 | 45.6 | 9.8 – 84.7 | 7
29 | Kimi K2.5 | 45.4 | 22.6 – 66.0 | 15
30 | MiMo-V2-Pro | 45.0 | 26.0 – 98.6 | 15
31 | Minimax M2.5 | 43.8 | 0.0 – 72.0 | 7
32 | Gemini 3.1 Flash Lite Preview | 42.4 | 9.7 – 77.6 | 7
33 | MiMo-V2.5 | 42.3 | 9.9 – 74.1 | 15
34 | GPT-5 Mini | 42.2 | 17.6 – 61.9 | 8
35 | Gemini 2.5 Flash | 41.3 | 9.8 – 70.2 | 8
36 | Gemini 3 Flash Preview | 40.5 | 22.8 – 92.3 | 7
37 | Grok 4.20 | 40.1 | 11.1 – 77.7 | 16
38 | Gemma 4 26B A4B | 39.6 | 3.0 – 88.2 | 7
39 | Qwen3.6 35B A3B | 39.0 | 9.3 – 73.1 | 8
40 | Mistral Small 2603 | 38.3 | 0.0 – 100.0 | 8
41 | Qwen3 Max Thinking | 37.5 | 8.8 – 90.8 | 9
42 | Nemotron 3 Super | 35.3 | 15.5 – 72.0 | 6
43 | Qwen3.6 Flash | 33.4 | 18.8 – 55.2 | 8
44 | GPT-5 Nano | 32.8 | 15.6 – 63.4 | 8
45 | MiMo-V2-Omni | 29.0 | 8.4 – 47.6 | 7
46 | Ling-2.6-Flash | 28.8 | 7.0 – 74.9 | 7
47 | Cobuddy | 27.5 | 3.2 – 84.9 | 7
48 | Nemotron 3 Nano Omni 30B A3B Reasoning | 26.9 | 0.0 – 63.8 | 5
49 | Seed 2.0 Mini | 18.1 | 18.1 | 1
50 | Trinity Large Preview | 6.2 | 0.0 – 12.3 | 2