
May 13, 2026

Benchmark Release

Parametric CAD Bench: Can AI agents author editable CAD?

A benchmark for AI agents that design parametric 3D mechanical parts. The leaderboard lives at cadbench.ai; the artifacts behind it are the Harbor task suite gnucleus-ai/cad-bench and the Hugging Face results dataset gnucleus-ai/cad-gen-freecad-bench.

TL;DR

We introduce Parametric CAD Bench, a new evaluation for AI agents that measures the ability to author editable FreeCAD models from natural language. Unlike previous CAD-related benchmarks, we use a multi-step agentic loop and a strict "editability gate" (harmonic mean scoring) to ensure models produce functional engineering recipes, not just static 3D shapes. Early results show GPT-5.5 via Codex leading at 0.832, with a visible harness effect: swapping the driver while keeping the model fixed shifts scores by roughly 10% in either direction. Per-cell spend ranges from $3 to $170 across 100 trials each — the cost–quality frontier is wide enough that the top-scoring cell isn't always the best-value one.

What this measures

Modern AI coding agents — Claude Code, Codex, Gemini CLI, mini-swe-agent — can drive a Linux shell, edit files, run programs, and iterate on errors. We measure how well they can use those same tools to author an editable CAD design for a mechanical part described in plain English.

For each task in the suite, an agent receives:

  • A natural-language description of a part ("a round mounting flange with 4 bolt holes on an 87.3 mm bolt circle, 49.2 mm central bore, …")
  • A table of the part's key dimensions

The agent must then produce a script that, when executed by FreeCAD, creates a saved CAD document for that part. The document is graded against a held-back reference design for geometric correctness (does the resulting shape match the reference?) and for consistency between the generated CAD and the provided spec. It is also gated by feature-based editability: for example, whether a CAD engineer can open the file and change a parameter such as bolt_circle_diameter through a feature or sketch.

CAD background, in three sentences: CAD software (Computer-Aided Design) is what engineers use to design physical parts. Parametric CAD means the design isn't a static 3D shape but a recipe — a sequence of named operations (sketch a circle, extrude it into a disc, cut a hole through the center, pattern that hole around the axis) whose inputs can be changed later. Editability is the whole point of CAD; a parametric flange where you can sweep outer_diameter from 80 mm to 120 mm and watch the bolt pattern adapt is worth a lot more than a fixed solid that happens to be 107 mm wide today.

There is also a major difference between a full CAD system and a geometry kernel or kernel-level modeling tool: FreeCAD vs. OCCT, SolidWorks/NX/Onshape vs. Parasolid, or AutoCAD vs. ACIS. A CAD system adds a rich layer of persistent engineering meaning, including named features, editable operations, constraints, parameters, stable face/edge references, and topological naming / persistent naming mechanisms. A pure geometry kernel mainly provides low-level geometric operations, while most real engineering workflows happen inside CAD systems, not directly on top of geometry kernels.

Why FreeCAD

FreeCAD is open source, fully scriptable from Python, and has a stable native format (.FCStd) that preserves the full feature history. Because it runs entirely offline, it's also straightforward to drop into a sandboxed container for automated evaluation. And it's a real CAD system that engineers use — the operations, constraints, and conventions match production workflows.

FreeCAD has two solid-modeling styles: the Part Workbench (CSG — booleans on primitives) and the Part Design Workbench (feature-based — sketches plus a parametric feature tree). We use Part Design because it captures engineering intent — parameter-driven operations (Pad, Pocket, Loft, Sweep, Pattern…) on top of sketches — and is how professional CAD (CATIA, NX, Creo, SolidWorks, Onshape) actually works. That structure gives a much richer evaluation signal — right features used, parameters match, the part still rebuilds when a dimension changes — which CSG loses once the booleans are applied.
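
To make that concrete, here is a minimal sketch of the kind of script this bench expects (one sketch, one Pad, one named driven dimension), written against the FreeCAD Python API and run headlessly, e.g. via freecadcmd. The object names and the output path are illustrative, not taken from any task:

import FreeCAD as App
import Part
import Sketcher

# A tiny Part Design recipe: one sketch, one Pad, one named driven dimension.
doc = App.newDocument("flange")
body = doc.addObject("PartDesign::Body", "Body")

# Sketch a circle; with no explicit attachment it sits on the XY plane.
sketch = body.newObject("Sketcher::SketchObject", "Profile")
circle = sketch.addGeometry(
    Part.Circle(App.Vector(0, 0, 0), App.Vector(0, 0, 1), 50.0), False)

# Drive the circle with a named dimension instead of hard-coded geometry.
dim = sketch.addConstraint(Sketcher.Constraint("Diameter", circle, 100.0))
sketch.renameConstraint(dim, "outer_diameter")

# Pad the sketch into a solid; Length stays editable in the feature tree.
pad = body.newObject("PartDesign::Pad", "Pad")
pad.Profile = sketch
pad.Length = 10.0

doc.recompute()
doc.saveAs("/tmp/flange.FCStd")

Opening the saved file in FreeCAD shows Body → Profile → Pad in the model tree; change outer_diameter or Pad.Length, recompute, and the solid regenerates. That is exactly the editability the scorer gates on.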

Why this is different

Two CAD-AI benchmarks have appeared recently (more on both in How we compare to CadBench and BenchCAD). Both measure whether a single-shot prompt — text, image, or a 3D mesh — can get a model to output some kind of CAD program. They're impressive at scale — ~18,000 evaluation samples each — and they've quantified frontier models' shape-reproduction ability nicely.

We measure something different:

  • The agent has tools. A productive CAD-with-LLM session isn't a single prompt — it's a loop. Our agents run a shell, write the FreeCAD script, execute it, inspect errors, fix them, and try again, all inside a sandboxed container. The bench measures that loop end-to-end rather than just the first model output.
  • The output has to be editable, not just correct-shaped. Plenty of models can produce a 3D solid that matches a reference volume. Far fewer can produce a FreeCAD document whose internal structure (sketch → pad → pocket → polar pattern, with named driven dimensions) reflects how a CAD engineer would have built it. Our scorer rewards the latter — and because the two halves combine as a harmonic mean (see §"How scoring works"), a model that produces perfect geometry but zero named parameters scores 0, not 0.5.
  • The harness is a first-class variable. The 10 leaderboard cells cross four agent frameworks — claude-code, codex, gemini-cli, mini-swe-agent — with the major frontier models (Anthropic Opus / Sonnet / Haiku, OpenAI GPT-5.5, Google Gemini 3.1 Pro / Flash). Same model, different driver tells you whether the harness or the model is doing the work — a question this kind of leaderboard structure (borrowed from SWE-Bench and Terminal-Bench) is uniquely good at answering.

At a glance:

Feature | CadBench | BenchCAD | Parametric CAD Bench
Primary goal | Geometric accuracy | Industrial-standard coverage | Parametric / editable logic
Method | Single-shot (vision) | Single-shot (text / image) | Multi-step agentic loop
Model writes | CAD program | CadQuery (Python) script | FreeCAD-driving script
Graded artifact | STEP / mesh | CadQuery script (and the B-rep it produces) | Native .FCStd with feature tree
Is the output CAD parametric? | No | No — script has parameters, generated CAD doesn't | Yes — native feature tree + named dimensions
Key metric | Volumetric IoU | Code correctness | Harmonic mean of geometry_similarity and cad_spec_consistency

How scoring works

Both sub-scores below are computed by the open-source gNucleus-AI/freecad-validator (v0.1.0) — the same validator binary is baked into each task image at /opt/grader/, so leaderboard scores and any local re-grade agree by construction. To run it yourself against an .FCStd, install the validator package:

pip install gnucleus-freecad-validator

Each trial produces two sub-scores in [0, 1]:

Sub-score | What it checks
geometry_similarity | Does the shape match the reference design? Compared on solid count, volume, surface area, bounding box, and surface-type distribution.
cad_spec_consistency | For each key parameter in the part spec (e.g. bolt_circle_diameter), does the saved CAD file contain a corresponding named, driven dimension within tolerance? A bolt pattern that's geometrically perfect but has the holes hard-coded as fixed positions loses points here.
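
As a rough illustration of what the second check looks for (this is not the validator's actual code), a named driven dimension can be read back out of a saved document with a few lines of FreeCAD Python:

import FreeCAD as App

def find_named_dimension(path, name):
    """Return the value of a named sketch constraint in an .FCStd, or None.
    Illustrative only: the real validator also checks tolerance, units, and
    that the dimension actually drives the rebuilt geometry."""
    doc = App.openDocument(path)
    for obj in doc.Objects:
        if obj.TypeId == "Sketcher::SketchObject":
            for constraint in obj.Constraints:
                if constraint.Name == name:
                    return constraint.Value
    return None

print(find_named_dimension("/tmp/flange.FCStd", "bolt_circle_diameter"))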

The composite is the harmonic mean of the two:

Combined = 2 * (geometry_similarity * cad_spec_consistency) /
           (geometry_similarity + cad_spec_consistency)

This is the editability gate, expressed as arithmetic. A model that produces the correct shape but zero named parameters (cad_spec_consistency = 0) scores 0 combined, not 0.5 — there is no partial credit for "looks right but isn't editable." Symmetrically, a model that names every parameter but builds a wrong shape also scores 0. Both halves have to land for the trial to count.
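
In code, the gate is just the harmonic mean with a guard for the all-zero case (a minimal sketch, not the scorer itself):

def combined_score(geometry_similarity: float, cad_spec_consistency: float) -> float:
    """Harmonic mean of the two sub-scores; 0.0 if either half is 0.0."""
    total = geometry_similarity + cad_spec_consistency
    if total == 0.0:
        return 0.0
    return 2.0 * geometry_similarity * cad_spec_consistency / total

combined_score(1.0, 0.0)   # 0.0: perfect shape, no named parameters
combined_score(0.9, 0.7)   # ~0.79: both halves have to land
combined_score(0.5, 0.5)   # 0.5: equals the arithmetic mean only when the halves agree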

The grading is a blind test. The agent receives only the natural-language part description and the parameter table — never the reference CAD file. The held-back reference geometry and the validator's tolerances live inside the task image at root-only paths the agent process can't read, so the agent is designing against the spec, not translating a reference it's been shown.

The matrix

The 10 leaderboard cells:

Agent-model combo | Agent | Model | What it tests
claude-code-opus | claude-code | Claude Opus 4.7 | Anthropic frontier
claude-code-sonnet | claude-code | Claude Sonnet 4.6 | Mid-tier Anthropic
claude-code-haiku | claude-code | Claude Haiku 4.5 | Cheapest Anthropic
codex-gpt5.5 | codex | GPT-5.5 | OpenAI frontier
gemini-cli-pro | gemini-cli | Gemini 3.1 Pro | Google frontier
gemini-cli-flash | gemini-cli | Gemini 3.1 Flash | Google cost floor
mini-swe-claude-opus | mini-swe-agent | Claude Opus 4.7 | Harness effect on Opus
mini-swe-gemini-pro | mini-swe-agent | Gemini 3.1 Pro | Harness effect on Gemini Pro
mini-swe-gemini-flash | mini-swe-agent | Gemini 3.1 Flash | Harness effect on Gemini Flash
mini-swe-gpt5.5 | mini-swe-agent | GPT-5.5 | Harness effect on GPT-5.5

Reading the leaderboard:

  • Vendor frontier comparison — filter to claude-code-opus, codex-gpt5.5, gemini-cli-pro to see which vendor's flagship model + native CLI does best.
  • Cost vs. quality — every row reports per-task USD spend. Sort by cost ascending and plot against score to see the cost-quality knee.
  • Harness effect — pair claude-code-opus with mini-swe-claude-opus (same model, two drivers) and the same for Gemini Pro, Gemini Flash, GPT-5.5. Wide gaps within a pair say the harness matters more than the model on this task type; narrow gaps say the opposite.
Preliminary results (v1)

All 10 leaderboard cells, sorted by mean composite score:

# | Cell | Mean | Exceptions | Cost (USD, 100 trials)
1 | codex-gpt5.5 | 0.832 | 0 | $170.00
2 | mini-swe-gemini-pro | 0.790 | 0 | $70.82
3 | mini-swe-gpt5.5 | 0.744 | 0 | $42.35
4 | mini-swe-claude-opus | 0.734 | 0 | $29.84
5 | gemini-cli-pro | 0.729 | 0 | $51.28
6 | claude-code-opus | 0.655 | 1 | $73.25
7 | claude-code-sonnet | 0.518 | 22 | $96.34
8 | claude-code-haiku | 0.284 | 8 | $24.67
9 | mini-swe-gemini-flash | 0.241 | 0 | $3.00
10 | gemini-cli-flash | 0.119 | 0 | $7.63
Total | | | 31 | $569.18

Headline findings:

  • GPT-5.5 via Codex tops the leaderboard at 0.832. The same model on a vendor-neutral harness (mini-swe-gpt5.5) scores 0.744 — a 0.088 gap that's this bench's clean harness-effect measurement on a single model. When the harness is well-tuned to its model (Codex was built for GPT-5.5), it lifts the score meaningfully.
  • The harness effect runs in both directions. Codex is the only specialized vendor CLI that beats its generic counterpart; for the other three model families, the vendor-neutral mini-swe-agent driver wins:
    Model | Specialized CLI | mini-swe-agent | Δ
    GPT-5.5 | codex 0.832 | 0.744 | +0.088 to Codex
    Claude Opus 4.7 | claude-code 0.655 | 0.734 | +0.079 to mini-swe
    Gemini 3.1 Pro | gemini-cli 0.729 | 0.790 | +0.061 to mini-swe
    Gemini 3.1 Flash | gemini-cli 0.119 | 0.241 | +0.122 to mini-swe
  • Anthropic's three-tier line lands at 0.655 / 0.518 / 0.284 for Opus / Sonnet / Haiku — a clean monotonic ladder by model tier. Sonnet's 22 exceptions split between two failure modes: 14 trials hit Sonnet 4.6's 32K-token output cap — the model tries to emit the entire FreeCAD script in a single assistant message and the response is truncated mid-code before any tool call lands. (This is an output-token limit, not a context-window limit; Sonnet's input context is 200K.) The remaining 8 exceptions are genuine iteration loops or wall-clock timeouts mid-tool-use. Opus runs clean (1 exception); Haiku trades scope for cost and lands cheap-but-low.
  • Reliability skews toward mini-swe-agent. The four mini-swe rows account for 0 exceptions across 400 trials. The vendor CLIs account for 31 exceptions across 600 trials, concentrated in claude-code-sonnet (22). The simpler "edit a file, run a command, look at the output" loop is more robust than the richer vendor drivers in this run.
  • Cost transparency. Each row carries token counts and per-trial USD spend. The full leaderboard cost is $569.18 across 1,000 trials. The two mini-swe-agent Gemini cells (mini-swe-gemini-{pro,flash}) emit cost_usd: 0 natively — mini-swe-agent reaches Vertex AI Gemini through LiteLLM's OpenAI-compatible Vertex AI endpoint, which doesn't return cost in the response. We backfill those from token counts × Vertex AI pricing before any results upload, so the numbers above are apples-to-apples.
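
The backfill is a single multiply-and-sum per trial. A minimal sketch (the per-million-token rates are deliberately left as arguments; the actual prices come from the current Vertex AI price list and are not hard-coded here):

def backfill_cost_usd(input_tokens: int, output_tokens: int,
                      usd_per_mtok_in: float, usd_per_mtok_out: float) -> float:
    """Recompute cost_usd from token counts for trials where the provider
    returned cost_usd: 0 (the LiteLLM / Vertex AI path described above)."""
    return (input_tokens * usd_per_mtok_in +
            output_tokens * usd_per_mtok_out) / 1_000_000
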
Why the harness gap

Drilling into per-task deltas between same-model pairs reveals the +0.06-to-+0.09 mean gap isn't one failure mode — it's two, and mini-swe-agent's verification rhythm catches both.

claude-code-opus vs mini-swe-claude-opus (Δ = +0.079). Almost all of the gap is geometric (Δ geom = +0.068). On 11 of the 12 tasks where mini-swe wins by ≥0.57, claude-code lands at a flat geom_score ≈ 0.22 with a consistent signature: bbox matches (1.0), but volume_diff ≈ 30%, surface_area_diff ≈ 18%, surface_types ≈ 0.7. Reading the scripts, the cause is one line — claude-code declares a Pocket with Reversed = False and its sketch attached to the XY plane, so the pocket "cuts" downward into empty space below the part. The bore and bolt holes appear correctly in the parametric tree (driving spec_score to 1.0), but no material is actually removed. mini-swe-agent on the same model prints body.Shape.Volume after each recompute and either sets Reversed = True or attaches subsequent sketches to the top face of the prior feature.
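
The fix is mechanical once the result is actually checked. A hedged sketch of that verification step as a helper (illustrative, not taken from any agent's transcript; it assumes the headless FreeCAD setup shown earlier):

def pocket_with_volume_check(doc, body, profile_sketch, depth_mm):
    """Add a PartDesign Pocket and confirm it removed material; if the first
    recompute cut into empty space (the trap described above), flip Reversed."""
    vol_before = body.Shape.Volume
    pocket = body.newObject("PartDesign::Pocket", "Pocket")
    pocket.Profile = profile_sketch
    pocket.Length = depth_mm
    doc.recompute()
    if body.Shape.Volume >= vol_before:   # nothing was cut
        pocket.Reversed = True
        doc.recompute()
    assert body.Shape.Volume < vol_before, "pocket removed no material"
    return pocket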

gemini-cli-pro vs mini-swe-gemini-pro (Δ = +0.060). Mean geom is essentially tied (Δ = −0.001) — Gemini 3.1 Pro performs the same regardless of harness on graded trials, with 65% of tasks landing at ties. The +0.060 gap comes from 13 gemini-cli-pro trials that scored exactly 0.000: 7 produced no .FCStd at all (artifact_status = py-only or no-output — the script crashed mid-recompute before reaching doc.saveAs and the harness declared done anyway), and 6 produced an .FCStd where the validator's reason reads "all PartDesign::Body objects are empty (null shape or zero volume) — incomplete candidate." mini-swe-gemini-pro scored zero on 0 of its 100 trials.

The CAD-scripting trap is universal across models — the same task ids appear in losing-direction lists for both pairs. Task db1ef53f08 loses under claude-code-opus and under mini-swe-gemini-pro with the identical "pocket didn't cut" signature. What differs is how often each harness catches it.

Verification rhythm, not loop sophistication. mini-swe-agent's wins aren't from being smarter — they're from a mechanical "edit → run → print → check → iterate" cycle that catches "the output isn't what I expected" cheaply. The specialized vendor CLIs are optimized for software-engineering tasks where running tests and reading stderr is the natural verification step. CAD requires a different verification rhythm (inspect the saved artifact), and the specialized CLIs don't carry that habit by default. Same model, same task, different verification rhythm: +6 to +9 points of absolute composite score.

How we compare to CadBench and BenchCAD

The overarching goal is to push AI-CAD research past static geometry and mesh generation toward outputs an engineer can actually pick up and edit. Two large-scale CAD-AI benchmarks appeared in May 2026, and the technical differences come down to output format, scoring, and workflow.

Output format: what the model actually produces

CadBench (arxiv:2605.10873) — 18,000 evaluation samples across six families. Models output STEP files (a static B-rep format with no edit history) or meshes (STL, GLB). The authors explicitly note they "do not require recovery of the original designer's feature tree, sketch constraints, construction order, parameterization, or design intent." A model that produces the right final solid wins on this benchmark even if the file behind it has no operations history.

BenchCAD (arxiv:2605.10865) — 17,900 execution-verified CadQuery programs across 106 industrial part families, about half anchored to ISO/DIN/EN/ASME/IEC engineering standards. CadQuery is parametric at the script level only: the Python source has variables, functions, and parameters, but when you run it, what comes out the other end is a static B-rep (typically exported to STEP) with no feature tree, no sketch constraints, and no named driven dimensions. To change a dimension you go back to the Python and re-run — the produced CAD file itself cannot be opened in a CAD GUI and edited. The parametricity lives in the generator, not in the artifact the generator produces. BenchCAD accordingly grades the script (essential-op recall, exec rate) plus the resulting shape; it does not test whether the generated CAD is editable, because CadQuery's output format doesn't carry that information.
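
For contrast, a minimal CadQuery example (illustrative, not drawn from BenchCAD's corpus) shows where the parametricity lives. The variables exist only in the Python source; the exported STEP is a static B-rep that carries none of them:

import cadquery as cq

# Parameters live here, in the script...
outer_diameter = 100.0
thickness = 10.0
bore_diameter = 40.0

flange = (
    cq.Workplane("XY")
    .circle(outer_diameter / 2)
    .extrude(thickness)
    .faces(">Z").workplane()
    .hole(bore_diameter)
)

# ...but the exported file has no features, no sketches, no named dimensions.
cq.exporters.export(flange, "flange.step")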

Parametric CAD Bench — 100 hermetic-scoring tasks where agents produce native, editable FreeCAD .FCStd files. ("Hermetic scoring" means the reference geometry, validator, and scorer are sealed into each task image at root-only paths the agent can't read, so the evaluation is fully reproducible from the image alone. The agent itself has outbound internet — needed to install its CLI and reach its model API — matching the Terminal-Bench / SWE-Bench-Verified convention.) Unlike CadQuery, where the parametric structure lives in the generator and is lost once the script runs, an .FCStd file is the parametric structure persisted to disk: the full feature history (sketch → pad → pocket → polar pattern), the underlying 2D sketches with their constraints, and named driven dimensions all live inside the file the agent saves. An engineer can open it in FreeCAD, navigate the model tree, change bolt_circle_diameter, and watch the bolt pattern adapt — without ever seeing the agent's source. That's the output we grade, and the format is what makes the grading possible.

Scoring: editability as a gate, not a bonus

CadBench's primary metric is volumetric IoU (correct solid). BenchCAD mixes geometric metrics with code-correctness signals (essential-op recall, exec rate). Both reward producing the right shape; neither makes editability a pass/fail gate — a geometrically correct output with no named driven parameters wins on CadBench and partially wins on BenchCAD.

The harmonic-mean composite Parametric CAD Bench uses (see How scoring works) puts geometric correctness and named-parameter fidelity on equal footing: the same "perfect shape, no parameters" output scores 0, not 0.5. A model can't trade one half of editable CAD off against the other.

Workflow: agentic loop vs. single-shot synthesis

CadBench and BenchCAD evaluate single-shot synthesis — typically image-to-code or text-to-code generation, with no opportunity for the model to run its own output, observe errors, and iterate. Parametric CAD Bench evaluates the agentic loop instead: the agent has a shell, runs the FreeCAD script it just wrote, reads stderr, edits the script, and tries again, all inside a sandboxed container.

A consequence of this choice: harness matters. The 10 leaderboard cells cross four agent frameworks × the major frontier models, so we can directly measure whether the lift is coming from the model or the driver wrapped around it — a question single-shot benchmarks can't answer because they don't vary the driver.

A second consequence — visible only by varying the harness against a fixed model — is that agentic loops are not interchangeable: verification rhythm dominates loop sophistication on CAD tasks. The vendor CLIs are tuned for software-engineering verification (run tests, read stderr); CAD requires a different rhythm (inspect the saved artifact). See Why the harness gap above for the per-task evidence behind that claim.

Where each is strongest

  • CadBench — the right tool to ask: which input modality is hardest for vision-language models on CAD? Its mesh-vs-image comparisons across 18,000 samples are unmatched in this space.
  • BenchCAD — the right tool to ask: what kind of CAD reasoning does a model lack? Its four-axis capability decomposition (recognition → operation → parameter → spatial-code) breaks low scores into specific deficits.
  • Parametric CAD Bench — the right tool to ask: which agent + model combination produces editable CAD a human engineer can pick up and work with? The native-FCStd output, the harmonic-mean editability gate, and the harness-as-variable matrix all point at the same question.

The three suites are complementary, not competing.

How to submit your own results

The leaderboard accepts third-party (agent, model) submissions. The flow is three steps:

  1. Run the bench. harbor run -d gnucleus-ai/cad-bench@v1 -a <your-agent> -m <your-model> against the published task suite. Your model API costs are yours; the task suite itself is free.
  2. Push your run artifacts to a Hugging Face dataset you control. Use the same runs/<agent>/<model>/<task_id>/ layout the internal baseline uses (per-trial result.json, answer.FCStd, agent.log, trajectory.jsonl, …). The published gnucleus-ai/cad-gen-freecad-bench dataset is the canonical example of every required file.
  3. Open a PR at github.com/gNucleus-AI/cad-bench-submission adding one manifest YAML under submissions/. The manifest is a lightweight pointer that names your HF dataset, the exact commit OID to pin, and the declared summary metrics. The full schema and required fields live in submissions/_schema/manifest.schema.json; start from submissions/_template/example.yaml.

A maintainer reviews each PR by hand: schema check, spot-check re-grade against your declared reward.json with gnucleus-freecad-validator, trajectory sanity, cost re-derivation from token counts. Once merged, the manifest is the authoritative leaderboard entry. See CONTRIBUTING.md for the full contract.

Links