A small, opinionated benchmark for local agentic LLMs in engineering
domains. Pluggable adapters, deterministic scoring, a single aggregate
loss in [0, 1]. Works out of the box with Ollama; ~50 LOC to add a new
provider.
Status: alpha (v0.1.0). Released because we needed a way to compare Gemma 4, Qwen 2.5, Gemma 3, and friends on an aerospace-flavoured agent workload — and decided to make it reusable for anyone wiring LLMs to engineering tools.
Most public LLM benchmarks (MMLU, AIME, LiveCodeBench, …) measure raw text generation. They do not measure the four things that actually break when you stick an LLM in front of an engineering toolchain:
- Numerical sanity. Does the model know that a transonic
narrowbody cruises at
CL ≈ 0.55, or does it make upCL = 7.2? - Tool routing. Given "give me a STEP file of the geometry", does the model call the geometry-export tool, or the CFD tool?
- Argument extraction. When the user says "Mach 0.78, AoA 2.5°,
FL350, workstation preset", does the model put 0.78 in
mach, 2.5 inaoa_deg, 35 000 inaltitude_ft, andworkstationinpreset? Or does it scramble them? - Multi-step planning. Given "take the geometry through aero, propulsion, and mission to get block fuel", does it produce a coherent ordered plan, or a soup of out-of-order tool calls?
- Multimodal verdicts (optional). Given a CFD post-processing image, can the model judge whether the mesh looks under-resolved?
Every task type has a deterministic scorer in [0, 1], and the suite
collapses into one weighted aggregate loss so you can rank models
on a single number.
pip install -e .
# or
pip install -e .[dev] # adds pytest + ruffYou also need an LLM backend. The reference adapter is Ollama:
brew install ollama
brew services start ollama
ollama pull gemma4:e4b # the default model in the default suite
ollama pull qwen2.5:7b # comparison baselineagentic-bench run \
--backend ollama \
--model gemma4:e4b \
--suite agentic_bench/tasks/aircraft_design.yaml \
--report reports/gemma4_e4b.jsonOutput:
=== AGENTIC-BENCH REPORT ===
adapter : ollama:gemma4:e4b
suite : aircraft_design_v1
items : 19
wall (s) : 509.91
loss : 0.2537
per-category scores:
args 0.739
numerical 0.854
planning 0.525
routing 0.800
The full per-item report is written to reports/gemma4_e4b.json.
A suite is a single YAML file containing:
- A
toolslist of OpenAI-shaped function specs. Adapters that support function calling get these natively; adapters that don't see only the names in planning-style prompts. - A list of
items. Each item has akind:numerical— single-number QA, scored by relative-error decay.routing— first-tool-call must matchexpected_tool.args— first-tool-call arguments are compared toexpected_args.planning— model emits a{"plan": [...]}JSON of tool names; scored by normalised Levenshtein distance.multimodal— model is given an image and a prompt; verdict is scored by substring match againstexpected_label.
- An optional
weightsblock overriding category weights in the aggregate loss. Default weights are{numerical: 0.25, routing: 0.25, args: 0.20, planning: 0.20, multimodal: 0.10}, renormalised over the categories that actually appear in the suite.
See agentic_bench/tasks/aircraft_design.yaml
for the reference suite (19 items, aerospace-flavoured).
| Kind | Scorer |
|---|---|
numerical |
1.0 if first emitted number is within tolerance_pct; exp-decay outside. |
routing |
1.0 if first tool call name matches; 0.5 if it appears later; else 0. |
args |
Per-key score, mean over keys. Numbers use exp-decay; strings/enums exact. |
planning |
`1 - levenshtein(expected_plan, got_plan) / max( |
multimodal |
1.0 if expected_label.lower() is a substring of the reply; else 0.0. |
The aggregate loss is
L = sum_c w_c * (1 - mean_score_c) over categories c that appear
with weights renormalised over present categories.
Drop a file in agentic_bench/adapters/ that implements LLMAdapter
from agentic_bench.adapters.base. The protocol is two methods:
class LLMAdapter(Protocol):
def name(self) -> str: ...
def chat(self, messages, tools=None, temperature=0.0) -> ChatResult: ...
def chat_with_image(self, messages, image_path, temperature=0.0) -> ChatResult: ...Then register it in agentic_bench/adapters/__init__.py REGISTRY.
~50 lines for a typical provider — see
ollama_adapter.py.
Tasks are plain YAML, no Python required. The simplest is a numerical item:
- id: my_question
kind: numerical
prompt: "What is the typical Reynolds number for a 737 wing at cruise?"
expected: 2.0e7
tolerance_pct: 30For routing / args / planning items, see the reference suite.
- Every adapter calls the underlying model at
temperature=0.0. - The Ollama adapter sets
keep_alive: "10m"so the model stays loaded across items — otherwise short-task suites measure model-load latency instead of model quality. - Reports include the adapter name, the model tag, per-item raw outputs (truncated), and the aggregate loss — sufficient to diff two runs.
- Aerospace-specific suite. We expect users to write domain-specific suites for their own engineering toolchains.
- No reasoning-trace scoring. Open question whether to add it.
- No stochastic re-runs / confidence intervals yet. On the roadmap.
- Multimodal images are passed verbatim; some Ollama models silently ignore them. The adapter has no way to detect this.
Apache-2.0. See LICENSE.
If you use agentic-bench in published work, please cite the aircraft-analysis project as the originating context:
@misc{dixit2026agentic_bench,
author = {Dixit, Mayank and McComb, Christopher},
title = {agentic-bench: a small benchmark for local agentic LLMs in engineering domains},
year = {2026},
howpublished = {\url{https://github.com/cmudrc/agentic-bench}},
}