# Leaderboard
Per-cell results across the full benchmark run. Each row is one (agent, model) pair scored on the same 100 tasks. Scores are percentages (0–100); the combined score is the harmonic mean of the geometry-similarity and CAD/spec-consistency scores.
| Rank | Model | Agent | Agent Ver. | Geom Score | Spec Score | Combined | Tokens (mean) | Tokens (total) | Cost (mean) | Cost (total) | Count | Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | gpt-5.5 | codex | 0.130.0 | 80.8 | 93.9 | 83.2 | 1.12M | 112.39M | $1.700 | $170.00 | 100 | 2026-05-13 |
| 2 | google/gemini-3.1-pro-preview | mini-swe-agent | 2.2.8 | 74.2 | 95.4 | 79.0 | 589.37K | 58.94M | $0.708 | $70.82 | 100 | 2026-05-13 |
| 3 | gpt-5.5 | mini-swe-agent | 2.2.8 | 71.4 | 89.3 | 74.4 | 132.47K | 13.25M | $0.423 | $42.35 | 100 | 2026-05-13 |
| 4 | claude-opus-4-7 | mini-swe-agent | 2.2.8 | 69.3 | 89.0 | 73.4 | 240.7K | 24.07M | $0.298 | $29.84 | 100 | 2026-05-13 |
| 5 | gemini-3.1-pro-preview | gemini-cli | 0.42.0 | 68.9 | 83.9 | 72.9 | 741.76K | 74.18M | $0.513 | $51.28 | 100 | 2026-05-13 |
| 6 | claude-opus-4-7 | claude-code | 2.1.140 | 62.4 | 83.3 | 65.5 | 709.03K | 70.9M | $0.732 | $73.25 | 100 | 2026-05-13 |
| 7 | claude-sonnet-4-6 | claude-code | 2.1.140 | 47.8 | 68.3 | 51.8 | 1.17M | 117.13M | $1.014 | $96.34 | 100 | 2026-05-13 |
| 8 | claude-haiku-4-5 | claude-code | 2.1.140 | 23.3 | 54.1 | 28.4 | 2.09M | 209.02M | $0.247 | $24.67 | 100 | 2026-05-13 |
| 9 | google/gemini-3.1-flash-lite-preview | mini-swe-agent | 2.2.8 | 19.6 | 47.6 | 24.1 | 177.73K | 17.77M | $0.030 | $3.00 | 100 | 2026-05-13 |
| 10 | gemini-3.1-flash-lite-preview | gemini-cli | 0.42.0 | 10.0 | 24.2 | 11.9 | 1.4M | 139.84M | $0.076 | $7.63 | 100 | 2026-05-13 |
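The combined-score formula can be sketched as below. Note this is an illustrative helper, not the benchmark's actual scoring code: the published Combined column is likely the per-task harmonic mean averaged over tasks, so applying the formula to the row-level mean scores gives only an upper bound (e.g. 80.8 and 93.9 yield ~86.9, above the published 83.2).

```python
def combined_score(geom: float, spec: float) -> float:
    """Harmonic mean of the two sub-scores, each on a 0-100 scale.

    Hypothetical sketch of the leaderboard's combination rule; the
    harmonic mean is dominated by the weaker of the two sub-scores.
    """
    if geom <= 0 or spec <= 0:
        # Harmonic mean is undefined at zero; treat it as a zero score.
        return 0.0
    return 2 * geom * spec / (geom + spec)


# Row-level means from rank 1 (illustration only; per-task averaging
# means this will not reproduce the published 83.2):
print(round(combined_score(80.8, 93.9), 1))
```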