Models
Model family summary
Each row covers one model and reasoning preset pair on this mixed track. Avg score is the mean across that pair's entries, and Min–Max is the range of scores when the pair has more than one run. Cross-reasoning matches do not roll into Overall.
| Model | Reasoning | Avg score | Min–Max | Entries |
|---|---|---|---|---|
| Gemini 3.1 Pro Preview | Highest | 74.1 | 49.2 – 100.0 | 6 |
| GPT-5.2 | Highest | 72.7 | 50.2 – 95.3 | 2 |
| Qwen3.5 122B A10B | Medium | 69.6 | 69.6 | 1 |
| Kimi K2.5 | Medium | 65.2 | 20.5 – 81.6 | 7 |
| GPT-5.4 | Highest | 64.0 | 25.6 – 100.0 | 16 |
| Claude Opus 4.6 | Medium | 60.4 | 19.3 – 83.9 | 6 |
| Claude Opus 4.6 | Highest | 57.2 | 22.8 – 100.0 | 14 |
| Gemini 3.1 Pro Preview | None | 57.1 | 18.8 – 97.4 | 4 |
| DeepSeek V3.2 | Medium | 57.0 | 19.6 – 73.8 | 5 |
| GPT-5.4 Mini | Medium | 56.6 | 16.1 – 97.4 | 5 |
| Claude Sonnet 4.6 | Medium | 55.3 | 24.9 – 83.8 | 6 |
| GPT-5.4 | Medium | 55.2 | 14.5 – 87.2 | 7 |
| GLM-5 | Medium | 55.1 | 14.0 – 91.8 | 7 |
| GPT-5.3 Codex | Highest | 53.8 | 16.4 – 89.2 | 8 |
| Claude Sonnet 4.6 | Highest | 52.2 | 20.9 – 73.4 | 6 |
| GPT-5.2 | Medium | 51.5 | 16.7 – 89.6 | 8 |
| GPT-5.4 Nano | Highest | 49.6 | 3.9 – 100.0 | 13 |
| Step 3.5 Flash | Medium | 47.9 | 19.1 – 76.7 | 2 |
| GLM-5 | Highest | 47.3 | 18.1 – 84.6 | 7 |
| Qwen3 Max Thinking | Highest | 47.2 | 13.5 – 80.9 | 2 |
| GPT-5 Mini | Medium | 47.2 | 12.4 – 95.4 | 8 |
| GPT-5.4 | None | 46.8 | 13.5 – 100.0 | 12 |
| MiMo-V2-Pro | None | 46.5 | 4.0 – 85.5 | 18 |
| Minimax M2.5 | Medium | 45.8 | 11.0 – 70.5 | 7 |
| GPT-5.3 Codex | Medium | 45.0 | 20.6 – 74.2 | 8 |
| Gemini 3.1 Pro Preview | Medium | 44.8 | 10.8 – 79.3 | 7 |
| GPT-5.4 Nano | Medium | 44.7 | 4.5 – 79.0 | 8 |
| Kimi K2.5 | Highest | 44.6 | 18.6 – 75.8 | 7 |
| Minimax M2.7 | Highest | 44.6 | 11.6 – 67.3 | 9 |
| Claude Opus 4.6 | None | 44.3 | 14.6 – 85.0 | 14 |
| Minimax M2.5 | Highest | 43.8 | 6.7 – 83.0 | 7 |
| Trinity Large Preview | Medium | 42.6 | 14.8 – 70.4 | 2 |
| Claude Sonnet 4.6 | None | 42.2 | 9.6 – 87.5 | 15 |
| GPT-5.2 Codex | Medium | 42.1 | 8.8 – 87.5 | 3 |
| GPT-5.4 Nano | None | 41.9 | 0.0 – 78.1 | 8 |
| Gemini 3 Flash Preview | None | 41.8 | 7.1 – 100.0 | 12 |
| Gemini 3 Flash Preview | Medium | 41.5 | 16.2 – 83.7 | 7 |
| DeepSeek V3.2 | Highest | 40.9 | 18.7 – 71.3 | 7 |
| MiMo-V2-Pro | Medium | 39.6 | 0.6 – 98.8 | 15 |
| Gemini 3 Flash Preview | Highest | 39.3 | 19.0 – 85.0 | 7 |
| MiMo-V2-Omni | None | 39.1 | 5.1 – 66.0 | 11 |
| MiMo-V2-Pro | Highest | 38.7 | 0.2 – 74.7 | 15 |
| MiMo-V2-Omni | Medium | 38.5 | 14.9 – 100.0 | 7 |
| Nemotron 3 Super | None | 38.3 | 7.6 – 61.7 | 11 |
| Mistral Small 2603 | Highest | 38.0 | 0.6 – 74.2 | 8 |
| Gemini 2.5 Flash | None | 37.9 | 12.3 – 61.0 | 7 |
| Step 3.5 Flash | Highest | 37.4 | 12.8 – 68.0 | 3 |
| GPT-5 Mini | Highest | 35.5 | 15.1 – 71.8 | 8 |
| Gemini 3.1 Flash Lite Preview | Highest | 34.9 | 12.1 – 62.8 | 7 |
| Gemini 2.5 Flash | Medium | 33.9 | 16.2 – 79.4 | 8 |
| GPT-5.4 Mini | Highest | 33.4 | 18.4 – 48.5 | 2 |
| Gemini 2.5 Flash | Highest | 33.4 | 11.5 – 76.2 | 8 |
| Mistral Small 2603 | Medium | 33.4 | 0.0 – 81.3 | 7 |
| Nemotron 3 Super | Medium | 31.4 | 0.0 – 61.9 | 7 |
| DeepSeek V3.2 | None | 31.3 | 6.8 – 84.4 | 14 |
| Nemotron 3 Super | Highest | 30.6 | 11.4 – 61.0 | 6 |
| Minimax M2.7 | Medium | 29.9 | 0.6 – 69.1 | 8 |
| Qwen3.5 122B A10B | Highest | 29.8 | 18.5 – 41.1 | 2 |
| Gemini 3.1 Flash Lite Preview | None | 28.6 | 2.5 – 62.7 | 10 |
| Gemini 3.1 Flash Lite Preview | Medium | 27.8 | 11.8 – 44.5 | 7 |
| GLM-5 | None | 27.2 | 7.3 – 51.1 | 16 |
| GPT-5.2 | None | 26.9 | 11.3 – 78.8 | 18 |
| Mistral Small 2603 | None | 26.7 | 0.0 – 76.0 | 7 |
| GPT-5.4 Mini | None | 26.7 | 0.0 – 56.2 | 9 |
| GPT-5.3 Codex | None | 26.5 | 2.4 – 79.3 | 23 |
| MiMo-V2-Omni | Highest | 25.9 | 8.2 – 52.2 | 7 |
| GPT-5 Nano | None | 24.6 | 0.2 – 76.8 | 22 |
| Seed 2.0 Mini | Medium | 24.2 | 8.2 – 51.1 | 3 |
| GPT-5 Nano | Medium | 23.6 | 2.5 – 58.5 | 8 |
| GPT-5 Mini | None | 23.1 | 5.0 – 84.3 | 19 |
| Kimi K2.5 | None | 22.7 | 1.5 – 75.7 | 15 |
| GPT-5 Nano | Highest | 21.3 | 7.2 – 66.9 | 8 |
| Qwen3 Max Thinking | None | 17.1 | 2.5 – 64.3 | 10 |
| Qwen3.5 122B A10B | None | 14.5 | 6.4 – 22.2 | 10 |
| Seed 2.0 Mini | None | 13.9 | 7.1 – 27.3 | 4 |
| Trinity Large Preview | None | 13.1 | 2.1 – 39.8 | 15 |
| GPT-5.2 Codex | None | 11.8 | 0.0 – 16.4 | 5 |
| Step 3.5 Flash | None | 10.9 | 1.8 – 23.8 | 7 |
| Minimax M2.5 | None | 8.0 | 0.8 – 13.0 | 4 |
| Trinity Large Preview | Highest | 6.0 | 0.0 – 12.0 | 2 |
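
The aggregation behind the table can be sketched as a group-by over individual runs. This is a minimal illustration, not the leaderboard's actual pipeline: the function names `summarize` and `fmt_range`, the `(model, reasoning, score)` tuple shape, the one-decimal rounding, and the descending sort are all assumptions inferred from how the table reads. The sample scores come from rows with one or two entries, where the individual runs are recoverable from the Min–Max column.

```python
from collections import defaultdict
from statistics import mean


def summarize(runs):
    """Group runs by (model, reasoning) and compute mean, min, max, count.

    `runs` is an iterable of (model, reasoning, score) tuples.
    This input shape is hypothetical, chosen for illustration.
    """
    groups = defaultdict(list)
    for model, reasoning, score in runs:
        groups[(model, reasoning)].append(score)

    rows = []
    for (model, reasoning), scores in groups.items():
        rows.append({
            "model": model,
            "reasoning": reasoning,
            "avg": round(mean(scores), 1),  # assumed one-decimal rounding
            "min": min(scores),
            "max": max(scores),
            "entries": len(scores),
        })
    # Assumed ordering: highest average first, as the table appears to be.
    rows.sort(key=lambda r: r["avg"], reverse=True)
    return rows


def fmt_range(lo, hi):
    """Render the Min–Max cell; a single run shows one value."""
    if lo == hi:
        return f"{lo:.1f}"
    return f"{lo:.1f} – {hi:.1f}"


# Runs reconstructed from rows with one or two entries in the table.
sample_runs = [
    ("Qwen3.5 122B A10B", "Medium", 69.6),
    ("Qwen3.5 122B A10B", "Highest", 18.5),
    ("Qwen3.5 122B A10B", "Highest", 41.1),
    ("Trinity Large Preview", "Highest", 0.0),
    ("Trinity Large Preview", "Highest", 12.0),
]

if __name__ == "__main__":
    for r in summarize(sample_runs):
        print(
            f"| {r['model']} | {r['reasoning']} | {r['avg']:.1f} "
            f"| {fmt_range(r['min'], r['max'])} | {r['entries']} |"
        )
```

Keying the groups on the `(model, reasoning)` pair rather than on the model alone keeps each reasoning preset in its own row, which matches the table's one-row-per-pair layout.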