Leaderboard scores (mean relative per-game score, 0–100)

[Bar chart: average scores for the top 24 of 45 benchmarked models; the same figures appear in the full table below.]

Model leaderboard

One row per model; Min–Max is the score range across that model's evaluated rows at this reasoning level. Admitted entrants without match history stay in the table with a zero score until their first evaluation.
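The aggregation described above can be sketched in a few lines. This is an assumed reconstruction, not the benchmark's actual code: each evaluated row carries a relative 0–100 score, a model's leaderboard entry shows the mean, min, and max across its rows, and an admitted model with no rows yet keeps a zero average. The function name `summarize` and the input shape are illustrative.

```python
from statistics import mean

def summarize(rows_by_model):
    """rows_by_model: {model_name: [score, ...]} -> ranked leaderboard rows.

    Each output tuple is (model, avg, min, max, entries). Models with no
    evaluated rows yet get avg 0.0 and an empty range, per the rule above.
    """
    table = []
    for model, scores in rows_by_model.items():
        if scores:  # model has match history
            table.append((model, round(mean(scores), 1),
                          min(scores), max(scores), len(scores)))
        else:       # admitted entrant awaiting its first evaluation
            table.append((model, 0.0, None, None, 0))
    # rank by average score, descending
    table.sort(key=lambda row: row[1], reverse=True)
    return table
```

Under this sketch, `summarize({"A": [50.0, 90.0], "B": []})` ranks A first with an average of 70.0 and a 50.0 – 90.0 range, and keeps B in the table at 0.0.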

Reasoning level: None · Games: 8
Leaderboard at reasoning level "None" for the DuelLab Benchmark
Rank  Model                           Avg score  Min–Max        Entries
   1  Claude Opus 4.7                      82.8  52.6 – 94.5         24
   2  GPT-5.4                              78.9  64.2 – 100.0        12
   3  Claude Sonnet 4.6                    69.4  49.5 – 82.2         15
   4  Claude Opus 4.6                      67.6  27.4 – 87.5         23
   5  Kimi K2.6                            66.1  41.8 – 100.0         8
   6  GPT-5.3 Codex                        64.6  32.1 – 90.3         23
   7  GPT-5.5                              62.2  34.7 – 83.2         16
   8  GLM-5                                60.7  15.2 – 73.2         16
   9  GPT-5.2                              59.7  19.3 – 78.8         17
  10  Gemini 3 Flash Preview               59.5   2.8 – 83.7         12
  11  MiMo-V2-Pro                          57.4  21.3 – 86.9         18
  12  Kimi K2.5                            56.7  10.8 – 70.3         23
  13  MiMo-V2.5-Pro                        55.7  36.8 – 76.3         16
  14  Owl Alpha                            54.6  38.3 – 70.6          7
  15  Qwen3.6 Flash                        52.8  41.5 – 61.9          6
  16  Deepseek V4 Flash                    52.6   6.3 – 75.6          8
  17  Qwen3.6 Plus                         52.4   4.2 – 72.5          8
  18  GPT-5.4 Nano                         51.9  10.6 – 83.0         14
  19  Deepseek V4 Pro                      51.6  27.5 – 78.8          8
  20  Ling-2.6-1T                          51.5  22.1 – 69.0          8
  21  MiMo-V2.5                            51.2  17.8 – 77.2         16
  22  MiMo-V2-Omni                         50.8  27.5 – 64.0         11
  23  GPT-5 Mini                           50.7   0.0 – 100.0        19
  24  Nemotron 3 Super                     50.0  34.5 – 60.3         12
  25  Qwen3.6 Max Preview                  48.3   5.3 – 91.4          8
  26  GLM-5.1                              48.1  11.0 – 80.9         15
  27  Qwen3.6 35B A3B                      48.0  28.0 – 61.3          6
  28  GPT-5.2 Codex                        47.4  47.4                 5
  29  Gemma 4 31B                          46.3   3.0 – 81.0         20
  30  Grok 4.20                            44.1   9.2 – 65.3         14
  31  Gemini 2.5 Flash                     43.6  11.9 – 66.7          7
  32  Seed 2.0 Mini                        42.5   4.5 – 62.9         10
  33  DeepSeek V3.2                        41.7  25.5 – 82.5         14
  34  Gemini 3.1 Flash Lite Preview        39.8  22.4 – 78.0         10
  35  Qwen3 Max Thinking                   39.3  34.4 – 82.9         10
  36  Hy3 Preview                          37.7  11.1 – 69.7         16
  37  Step 3.5 Flash                       37.1  37.1                 7
  38  Mistral Small 2603                   36.3  11.1 – 68.6          7
  39  GPT-5 Nano                           35.1  22.2 – 57.7         22
  40  Qwen3.5 122B A10B                    34.9  31.6 – 35.3         10
  41  Minimax M2.5                         33.9  33.9                 4
  42  GPT-5.4 Mini                         30.3   0.0 – 63.1          9
  43  Ling-2.6-Flash                       25.8   6.7 – 42.4          5
  44  Gemma 4 26B A4B                      25.5   0.0 – 63.0          7
  45  Trinity Large Preview                22.0  20.7 – 39.5         15