# Models

## Model leaderboard
One row per model. Min–Max is the score range across that model's evaluated entries at this reasoning level. Admitted entrants with no match history remain in the table with a score of zero until their first evaluation.
| Model | Avg score | Min–Max | Entries |
|---|---|---|---|
| GPT-5.4 | 79.9 | 41.7 – 100.0 | 12 |
| Claude Sonnet 4.6 | 76.5 | 47.0 – 92.8 | 15 |
| Gemini 3.1 Pro Preview | 75.8 | 62.5 – 84.8 | 4 |
| Claude Opus 4.6 | 68.9 | 0.0 – 100.0 | 19 |
| MiMo-V2-Pro | 66.8 | 0.9 – 97.7 | 18 |
| GLM-5 | 66.1 | 28.1 – 88.9 | 16 |
| GPT-5.3 Codex | 65.4 | 23.1 – 100.0 | 23 |
| GPT-5.4 Nano | 64.6 | 13.6 – 94.0 | 8 |
| Kimi K2.5 | 63.3 | 19.1 – 89.8 | 15 |
| GPT-5.2 | 61.8 | 27.9 – 93.4 | 18 |
| MiMo-V2-Omni | 59.8 | 26.0 – 93.3 | 11 |
| Gemini 3 Flash Preview | 59.0 | 0.0 – 96.2 | 12 |
| Nemotron 3 Super | 56.4 | 41.2 – 73.6 | 11 |
| DeepSeek V3.2 | 53.8 | 6.3 – 100.0 | 15 |
| GPT-5 Mini | 52.1 | 30.4 – 100.0 | 19 |
| Gemini 2.5 Flash | 51.5 | 0.0 – 100.0 | 7 |
| Seed 2.0 Mini | 47.7 | 19.7 – 87.8 | 5 |
| Gemini 3.1 Flash Lite Preview | 42.5 | 17.7 – 87.8 | 10 |
| GPT-5.2 Codex | 41.6 | 23.7 – 54.6 | 5 |
| Qwen3 Max Thinking | 40.6 | 8.0 – 82.3 | 10 |
| GPT-5 Nano | 38.0 | 0.0 – 92.2 | 22 |
| GPT-5.4 Mini | 35.7 | 0.0 – 77.1 | 9 |
| Mistral Small 2603 | 34.1 | 0.0 – 83.5 | 7 |
| Step 3.5 Flash | 33.4 | 4.8 – 51.7 | 7 |
| Qwen3.5 122B A10B | 32.6 | 0.0 – 77.3 | 10 |
| Trinity Large Preview | 32.5 | 2.1 – 85.7 | 15 |
| Minimax M2.5 | 25.8 | 6.9 – 46.0 | 4 |
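The aggregation rule above can be sketched in a few lines. This is an illustrative assumption about how the rows are computed, not the site's actual pipeline: the `scores` mapping and `leaderboard_row` helper are hypothetical, and a model with no evaluations gets a zero score, matching the rule stated under the heading.

```python
from statistics import mean

# Hypothetical per-entry scores keyed by model name; an admitted model
# with no match history maps to an empty list.
scores = {
    "Example Model A": [41.7, 79.9, 100.0],
    "Example Model B": [],  # admitted, but no evaluations yet
}

def leaderboard_row(model, entries):
    """Build one leaderboard row: (model, avg score, (min, max), entry count).

    A model with no entries stays in the table with a zero score,
    as the leaderboard rule specifies.
    """
    if not entries:
        return (model, 0.0, (0.0, 0.0), 0)
    return (
        model,
        round(mean(entries), 1),
        (min(entries), max(entries)),
        len(entries),
    )

# Sort rows by average score, descending, as in the table above.
rows = sorted(
    (leaderboard_row(m, e) for m, e in scores.items()),
    key=lambda r: r[1],
    reverse=True,
)
```

Sorting on the averaged score reproduces the table's ordering; ties would need an explicit secondary key (e.g. entry count) if the real pipeline defines one.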