Leaderboard scores (mean relative per-game score, 0–100)

[Bar chart: top 24 of the 51 benchmarked models, plotted on the relative 0–100 scale; the same standings appear in the table below.]

Model leaderboard

One row per model; Min–Max is the score range across that model's evaluated rows at this reasoning level. Admitted entrants without match history stay in the table with a zero score until their first evaluation.
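The row semantics above can be sketched in a few lines of Python. This is an illustrative aggregation only, not the benchmark's actual code: the function and argument names (`leaderboard_rows`, `evaluations`, `admitted_models`) are assumptions, and it assumes each evaluation is a `(model, score)` pair already on the relative 0–100 scale.

```python
# Hypothetical sketch of how each leaderboard row could be derived.
# Assumption: `evaluations` is an iterable of (model_name, score) pairs at one
# reasoning level; `admitted_models` lists every entrant, including those with
# no match history yet.
from collections import defaultdict

def leaderboard_rows(evaluations, admitted_models):
    """Return (rank, model, avg, min, max, entries) tuples, best avg first."""
    scores = defaultdict(list)
    for model, score in evaluations:
        scores[model].append(score)

    rows = []
    for model in admitted_models:
        s = scores.get(model, [])
        if s:  # evaluated at least once: real average and score range
            rows.append((model, sum(s) / len(s), min(s), max(s), len(s)))
        else:  # admitted but unevaluated: stays in the table with a zero score
            rows.append((model, 0.0, 0.0, 0.0, 0))
    # Rank is the 1-based position after sorting by average score, descending.
    rows.sort(key=lambda r: r[1], reverse=True)
    return [(rank, *row) for rank, row in enumerate(rows, start=1)]
```

For example, a model evaluated twice at 80.0 and 60.0 would show an average of 70.0 with a Min–Max of 60.0 – 80.0 and 2 entries, while an admitted model with no evaluations would rank last with all-zero columns.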

Reasoning level: Medium
Games: 8
Medium reasoning leaderboard for DuelLab Benchmark
Rank | Model | Avg score | Min–Max | Entries
1 | GPT-5.5 | 74.1 | 43.4 – 95.4 | 16
2 | Kimi K2.6 | 73.7 | 0.0 – 100.0 | 8
3 | GLM-5.1 | 73.0 | 29.7 – 87.4 | 7
4 | Kimi K2.5 | 72.4 | 42.1 – 89.7 | 15
5 | GPT-5.2 | 70.4 | 42.3 – 91.0 | 8
6 | Qwen3.6 Plus | 64.5 | 21.2 – 76.6 | 9
7 | Claude Opus 4.6 | 64.5 | 46.5 – 95.2 | 17
8 | Claude Opus 4.7 | 63.1 | 41.1 – 85.0 | 23
9 | GPT-5.4 | 61.2 | 22.0 – 93.3 | 7
10 | DeepSeek V4 Pro | 61.1 | 34.8 – 92.1 | 7
11 | GPT-5.3 Codex | 60.5 | 35.1 – 94.4 | 8
12 | GLM-5 | 60.1 | 22.9 – 96.6 | 7
13 | Claude Sonnet 4.6 | 58.4 | 34.3 – 82.9 | 6
14 | Qwen3 Max Thinking | 56.3 | 40.3 – 69.8 | 8
15 | GPT-5 Mini | 55.1 | 24.3 – 88.2 | 8
16 | GPT-5.2 Codex | 54.7 | 40.1 – 68.4 | 14
17 | GPT-5.4 Mini | 54.6 | 33.3 – 96.5 | 12
18 | MiMo-V2.5-Pro | 54.5 | 21.9 – 92.2 | 16
19 | DeepSeek V3.2 | 54.4 | 26.3 – 69.0 | 7
20 | Qwen3.6 Flash | 53.1 | 8.4 – 78.9 | 8
21 | Gemini 3.1 Pro Preview | 52.0 | 29.4 – 85.0 | 14
22 | GPT-5.4 Nano | 51.5 | 23.2 – 72.5 | 14
23 | DeepSeek V4 Flash | 51.5 | 1.3 – 73.4 | 8
24 | MiMo-V2-Omni | 49.4 | 18.5 – 84.2 | 7
25 | MiMo-V2-Pro | 49.1 | 35.5 – 100.0 | 15
26 | Owl Alpha | 48.0 | 4.6 – 75.2 | 8
27 | Gemma 4 31B | 47.6 | 12.6 – 87.5 | 22
28 | Hy3 Preview | 46.4 | 10.3 – 70.6 | 16
29 | Qwen3.5 122B A10B | 46.3 | 7.5 – 58.5 | 12
30 | Gemini 3 Flash Preview | 45.9 | 24.4 – 72.2 | 7
31 | Qwen3.6 Max Preview | 43.7 | 14.0 – 84.7 | 7
32 | Minimax M2.5 | 43.0 | 26.9 – 61.2 | 7
33 | MiMo-V2.5 | 42.4 | 18.1 – 64.4 | 16
34 | Gemma 4 26B A4B | 42.1 | 16.5 – 53.5 | 8
35 | Ring 2.6 1T | 41.8 | 3.5 – 82.9 | 7
36 | Ling-2.6-1T | 41.3 | 0.0 – 52.0 | 8
37 | Step 3.5 Flash | 40.8 | 10.6 – 62.8 | 8
38 | Qwen3.6 Plus Preview | 40.6 | 0.7 – 76.6 | 8
39 | Gemini 2.5 Flash | 40.4 | 11.2 – 90.9 | 8
40 | Minimax M2.7 | 40.0 | 1.7 – 70.1 | 8
41 | Grok 4.20 | 37.2 | 3.7 – 59.1 | 16
42 | Mistral Small 2603 | 36.7 | 9.0 – 74.0 | 7
43 | Trinity Large Preview | 36.1 | 11.5 – 60.6 | 2
44 | Seed 2.0 Mini | 36.1 | 18.4 – 55.9 | 10
45 | Qwen3.6 35B A3B | 34.3 | 0.0 – 75.8 | 7
46 | Gemini 3.1 Flash Lite Preview | 33.4 | 11.1 – 59.6 | 7
47 | Nemotron 3 Super | 32.9 | 0.8 – 62.6 | 7
48 | Ling-2.6-Flash | 28.6 | 13.2 – 47.4 | 3
49 | GPT-5 Nano | 28.4 | 0.0 – 58.5 | 8
50 | Cobuddy | 28.0 | 0.0 – 67.1 | 5
51 | Nemotron 3 Nano Omni 30B A3B Reasoning | 25.8 | 0.0 – 35.5 | 4