# Leaderboard
Per-cell results across the full benchmark run. Each row is one (agent, model) pair scored on the same 100 tasks. Scores are percentages (0–100); the combined score is the harmonic mean of the geometry-similarity and CAD/spec-consistency scores.
| Rank | Model | Agent | Agent Ver. | Geom Score | Spec Score | Combined | Tokens (mean) | Tokens (total) | Cost (mean) | Cost (total) | Count | Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | gpt-5.5 | codex | 0.130.0 | 80.8 | 93.9 | 83.2 | 1.12M | 112.39M | $1.700 | $170.00 | 100 | 2026-05-13 |
| 2 | google/gemini-3.1-pro-preview | mini-swe-agent | 2.2.8 | 74.2 | 95.4 | 79.0 | 589.37K | 58.94M | $0.708 | $70.82 | 100 | 2026-05-13 |
| 3 | gpt-5.5 | mini-swe-agent | 2.2.8 | 71.4 | 89.3 | 74.4 | 132.47K | 13.25M | $0.423 | $42.35 | 100 | 2026-05-13 |
| 4 | claude-opus-4-7 | mini-swe-agent | 2.2.8 | 69.3 | 89.0 | 73.4 | 240.7K | 24.07M | $0.298 | $29.84 | 100 | 2026-05-13 |
| 5 | gemini-3.1-pro-preview | gemini-cli | 0.42.0 | 68.9 | 83.9 | 72.9 | 741.76K | 74.18M | $0.513 | $51.28 | 100 | 2026-05-13 |
| 6 | claude-opus-4-7 | claude-code | 2.1.140 | 62.4 | 83.3 | 65.5 | 709.03K | 70.9M | $0.732 | $73.25 | 100 | 2026-05-13 |
| 7 | claude-sonnet-4-6 | claude-code | 2.1.140 | 47.8 | 68.3 | 51.8 | 1.17M | 117.13M | $1.014 | $96.34 | 100 | 2026-05-13 |
| 8 | claude-haiku-4-5 | claude-code | 2.1.140 | 23.3 | 54.1 | 28.4 | 2.09M | 209.02M | $0.247 | $24.67 | 100 | 2026-05-13 |
| 9 | google/gemini-3.1-flash-lite-preview | mini-swe-agent | 2.2.8 | 19.6 | 47.6 | 24.1 | 177.73K | 17.77M | $0.030 | $3.00 | 100 | 2026-05-13 |
| 10 | gemini-3.1-flash-lite-preview | gemini-cli | 0.42.0 | 10.0 | 24.2 | 11.9 | 1.4M | 139.84M | $0.076 | $7.63 | 100 | 2026-05-13 |
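The combined-score formula can be sketched as below. Note this is an illustrative helper, not the benchmark's actual scoring code: the published Combined column is likely the per-task harmonic mean averaged over tasks, so applying the formula to the row-level mean scores gives only an upper bound (e.g. 80.8 and 93.9 yield ~86.9, above the published 83.2).

```python
def combined_score(geom: float, spec: float) -> float:
    """Harmonic mean of the two sub-scores, each on a 0-100 scale.

    Hypothetical sketch of the leaderboard's combination rule; the
    harmonic mean is dominated by the weaker of the two sub-scores.
    """
    if geom <= 0 or spec <= 0:
        # Harmonic mean is undefined at zero; treat it as a zero score.
        return 0.0
    return 2 * geom * spec / (geom + spec)


# Row-level means from rank 1 (illustration only; per-task averaging
# means this will not reproduce the published 83.2):
print(round(combined_score(80.8, 93.9), 1))
```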