ToolCall-15 is a visual benchmark for comparing LLM tool use. It runs 15 fixed scenarios through an OpenAI-compatible chat completions interface, scores each result deterministically, and renders the full matrix in a live dashboard.
The suite is designed for practical evaluation rather than abstract benchmark math: every scenario has a clear expected behavior, a mocked tool environment, and an inspectable pass, partial, or fail outcome.
ToolCall-15 is organized into 5 categories, with 3 scenarios per category:
- Tool Selection
- Parameter Precision
- Multi-Step Chains
- Restraint and Refusal
- Error Recovery
Each scenario is scored as:
- 2 points for a pass
- 1 point for a partial pass
- 0 points for a fail
Each category is worth 6 points. The final score is the average of the 5 category percentages, rounded to a whole number.
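That arithmetic can be sketched as follows (function names here are illustrative, not the ones in lib/benchmark.ts):

```typescript
// Scores per scenario: 2 = pass, 1 = partial, 0 = fail.
type ScenarioScore = 0 | 1 | 2;

// Each category holds 3 scenarios, so its maximum is 6 points.
function categoryPercent(scores: ScenarioScore[]): number {
  const earned = scores.reduce((sum, s) => sum + s, 0);
  return (earned / 6) * 100;
}

// Final score: average of the 5 category percentages, rounded
// to a whole number.
function finalScore(categories: ScenarioScore[][]): number {
  const percents = categories.map(categoryPercent);
  const avg = percents.reduce((a, b) => a + b, 0) / percents.length;
  return Math.round(avg);
}
```

For example, a category scored `[2, 1, 0]` contributes 50%, and a model passing every scenario scores 100.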
The benchmark spec is documented in METHODOLOGY.md and implemented in lib/benchmark.ts.
- Reproducible: the system prompt, tool schema, mocked tool outputs, and scoring logic are all versioned in the repo.
- Visual: the dashboard makes the outcome of each scenario obvious without external scoring scripts.
- Balanced: the suite spreads scenarios across distinct tool-use failure modes instead of over-indexing on one skill.
- Deterministic: tool results are mocked and the benchmark uses `temperature: 0`.
- Inspectable: every scenario stores a raw trace so failures can be audited.
For every scenario, each model receives:
- A shared system prompt.
- A fixed benchmark context message that sets the reference date to `2026-03-20` (Friday) for relative-time tasks.
- The scenario user message.
- The same universal tool set of 12 functions.
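The actual 12 functions are defined in lib/benchmark.ts; for illustration only, one entry in such a universal tool set might look like this in the OpenAI function-calling format (the function name and fields below are hypothetical):

```typescript
// Hypothetical example of one tool definition in the OpenAI
// chat-completions "tools" format. The real benchmark's 12
// functions live in lib/benchmark.ts.
const exampleTool = {
  type: "function" as const,
  function: {
    name: "get_weather", // hypothetical function name
    description: "Look up the current weather for a city.",
    parameters: {
      type: "object",
      properties: {
        city: { type: "string", description: "City name" },
        unit: { type: "string", enum: ["celsius", "fahrenheit"] },
      },
      required: ["city"],
    },
  },
};
```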
The runner then:
- Calls the model through `/chat/completions`.
- Executes any requested tool calls against deterministic mock handlers.
- Appends tool results back into the conversation.
- Repeats for up to 8 assistant turns.
- Evaluates the final trace against scenario-specific scoring logic.
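The loop above can be sketched roughly like this (the function signatures are illustrative, not the ones in lib/orchestrator.ts):

```typescript
// Minimal sketch of the tool-call loop; names are illustrative.
type Message = {
  role: string;
  content: string | null;
  tool_calls?: { id: string; function: { name: string; arguments: string } }[];
  tool_call_id?: string;
};

async function runScenario(
  callModel: (messages: Message[]) => Promise<Message>, // POST /chat/completions
  mockTool: (name: string, args: string) => string,     // deterministic handler
  messages: Message[],                                  // system + context + user
): Promise<Message[]> {
  for (let turn = 0; turn < 8; turn++) {     // up to 8 assistant turns
    const assistant = await callModel(messages);
    messages.push(assistant);
    if (!assistant.tool_calls?.length) break; // no tools requested: done
    for (const call of assistant.tool_calls) {
      messages.push({                         // append mocked tool result
        role: "tool",
        tool_call_id: call.id,
        content: mockTool(call.function.name, call.function.arguments),
      });
    }
  }
  return messages; // final trace, handed to scenario-specific scoring
}
```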
Provider errors matching `provider returned error` are retried up to 3 times with backoff. Model requests time out after 30 seconds by default, and the timeout can be overridden with `MODEL_REQUEST_TIMEOUT_SECONDS` in `.env`.
- `pass`: the model followed the preferred tool behavior exactly enough to earn full credit.
- `partial`: the model was functional but suboptimal or overly conservative.
- `fail`: the model hallucinated, chose the wrong tool, missed required parameters, or broke the intended flow.
The dashboard also distinguishes timeout failures visually so stalled runs are easy to spot.
ToolCall-15 accepts models from five OpenAI-compatible providers:
- openrouter
- ollama
- llamacpp
- mlx
- lmstudio
Model configuration uses comma-separated `provider:model` entries.
Examples:
OPENROUTER_API_KEY=...
OLLAMA_HOST=http://localhost:11434
LLAMACPP_HOST=http://localhost:8080
MLX_HOST=http://localhost:8082
LMSTUDIO_HOST=http://localhost:1234
LLM_MODELS=openrouter:openai/gpt-4.1,ollama:qwen3.5:4b,llamacpp:local-model,lmstudio:qwen3.5-0.8b
LLM_MODELS_2=mlx:mlx-community/Qwen3.5-0.8B-8bit

Notes:
- `LLM_MODELS` is the primary table.
- `LLM_MODELS_2` is an optional secondary table for a separate comparison group.
- `OLLAMA_HOST`, `LLAMACPP_HOST`, `MLX_HOST`, and `LMSTUDIO_HOST` should be configured as raw hosts. The app normalizes them to the OpenAI-compatible `/v1` base URL.
- `MODEL_REQUEST_TIMEOUT_SECONDS` controls the per-request timeout for every provider. The default is `30`.
- Every configured `provider:model` must be unique across both env vars.
- Provider support is transport-level. Actual benchmark quality still depends on the specific model's tool-calling behavior.
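The entry parsing and host normalization can be sketched as follows (the real logic lives in lib/models.ts; the splitting rules here are assumptions):

```typescript
// Parse comma-separated provider:model entries. Model IDs may
// themselves contain colons (e.g. ollama:qwen3.5:4b), so only the
// first colon separates provider from model.
function parseModels(raw: string): { provider: string; model: string }[] {
  return raw
    .split(",")
    .map((entry) => entry.trim())
    .filter(Boolean)
    .map((entry) => {
      const i = entry.indexOf(":");
      return { provider: entry.slice(0, i), model: entry.slice(i + 1) };
    });
}

// Normalize a raw host like http://localhost:11434 to its
// OpenAI-compatible base URL by appending /v1.
function toBaseUrl(host: string): string {
  return host.replace(/\/+$/, "") + "/v1";
}
```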
- Node.js 20 or newer
- npm
- At least one reachable OpenAI-compatible provider
npm install
cp .env.example .env

Then edit `.env` with your providers and models.
npm run dev

Open http://localhost:3000.
npm run lint
npm run typecheck

- The runner advances scenario-by-scenario, not model-by-model. Every displayed model completes the current scenario before the dashboard moves to the next column.
- The run button starts all configured models against all 15 scenarios.
- The config button opens a modal for generation parameters: `temperature`, `top_p`, `top_k`, and `min_p`.
- Benchmark config is stored in `localStorage`, so the same browser keeps your latest settings between sessions.
- `Shift+Click` a scenario header to rerun only that scenario across all displayed models.
- Clicking a failed or timed-out cell opens the raw trace for that model and scenario.
- If `LLM_MODELS_2` is empty, the second table stays hidden.
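Persisting generation parameters between sessions can be sketched like this (the storage key and config shape are assumptions, not the app's actual schema):

```typescript
// Hypothetical persistence of generation parameters; the key name
// and shape are illustrative, not the app's real storage schema.
type GenConfig = { temperature: number; top_p: number; top_k: number; min_p: number };
type StorageLike = {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
};

const STORAGE_KEY = "toolcall15-config"; // hypothetical key

function saveConfig(storage: StorageLike, config: GenConfig): void {
  storage.setItem(STORAGE_KEY, JSON.stringify(config));
}

function loadConfig(storage: StorageLike, fallback: GenConfig): GenConfig {
  const raw = storage.getItem(STORAGE_KEY);
  // Merge over the fallback so missing fields keep their defaults.
  return raw ? { ...fallback, ...JSON.parse(raw) } : fallback;
}
```

In the browser, `window.localStorage` satisfies the `StorageLike` interface directly.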
- app/ contains the Next.js app router entry points and styles.
- components/dashboard.tsx renders the benchmark UI and live event handling.
- app/api/run/route.ts streams benchmark progress over Server-Sent Events.
- lib/benchmark.ts defines the benchmark spec, mocked tools, and scoring logic.
- lib/orchestrator.ts runs scenarios and captures traces.
- lib/llm-client.ts contains the OpenAI-compatible client adapter.
- lib/models.ts parses provider configuration and model groups.
- This is not a general intelligence benchmark. It isolates tool-use behavior under a fixed tool schema.
- The suite uses mocked tools, so it measures orchestration quality rather than live external service quality.
- The benchmark uses one universal system prompt and one deterministic date anchor; prompt-sensitive rankings may change under different instructions.
- Models are compared through OpenAI-compatible endpoints. Provider-specific extras outside that interface are intentionally ignored.
This project is licensed under the MIT License. See LICENSE.
Created by stevibe.
