- Why Cobalt
- Quickstart
- Core Concepts
- AI-First
- CI/CD
- Integrations
- Configuration
- CLI
- Roadmap
- Contributing
- License
Cobalt is a TypeScript testing framework built for AI agents and LLM-powered applications. Define datasets, run your agent, and evaluate outputs with LLM judges, custom functions, or pre-built evaluators — all from the command line. Results are tracked in SQLite with built-in comparison tools, cost estimation, and CI/CD quality gates. Cobalt ships with an MCP server so AI coding assistants can run experiments and improve your agents directly.
npm install @basalt-ai/cobalt
npx cobalt init
npx cobalt run

That's it. cobalt init creates an example experiment that runs out of the box.
Now write your own. Create experiments/my-agent.cobalt.ts:
import { experiment, Dataset, Evaluator } from '@basalt-ai/cobalt'

// Define your test data
const dataset = new Dataset({
  items: [
    { input: 'What is 2+2?', expectedOutput: '4' },
    { input: 'Capital of France?', expectedOutput: 'Paris' },
  ],
})

// Define how outputs are scored
const evaluators = [
  new Evaluator({
    name: 'Correctness',
    type: 'llm-judge',
    prompt: 'Is the output correct?\nExpected: {{expectedOutput}}\nActual: {{output}}',
  }),
]

// Run your agent and evaluate
// (myAgent is a placeholder for your own agent function)
experiment('qa-agent', dataset, async ({ item }) => {
  const result = await myAgent(item.input)
  return { output: result }
}, { evaluators })

npx cobalt run --file experiments/my-agent.cobalt.ts

A dataset is your test data. Load it from JSON, JSONL, CSV, inline objects, or pull it directly from platforms like Langfuse, LangSmith, Braintrust, and Basalt. Datasets are immutable and chainable: transform them with filter(), map(), sample(), and slice().
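To make "immutable and chainable" concrete, here is a minimal, self-contained sketch. MiniDataset is a hypothetical stand-in written for this illustration, not Cobalt's Dataset class, and it leaves out sample() to stay deterministic. It only shows the general pattern: every transform returns a new object and the original stays unchanged.

```typescript
// Illustrative only: MiniDataset is a hypothetical stand-in, not Cobalt's Dataset class.
interface Item {
  input: string
  expectedOutput: string
}

class MiniDataset<T> {
  private readonly items: readonly T[]

  constructor(items: readonly T[]) {
    // Copy and freeze so later changes to the caller's array cannot leak in.
    this.items = Object.freeze([...items])
  }

  get size(): number {
    return this.items.length
  }

  toArray(): T[] {
    return [...this.items]
  }

  // Each transform returns a new dataset and leaves this one untouched.
  filter(predicate: (item: T) => boolean): MiniDataset<T> {
    return new MiniDataset(this.items.filter(predicate))
  }

  map<U>(fn: (item: T) => U): MiniDataset<U> {
    return new MiniDataset(this.items.map(fn))
  }

  slice(start: number, end?: number): MiniDataset<T> {
    return new MiniDataset(this.items.slice(start, end))
  }
}

const base = new MiniDataset<Item>([
  { input: 'What is 2+2?', expectedOutput: '4' },
  { input: 'Capital of France?', expectedOutput: 'Paris' },
  { input: 'Capital of Spain?', expectedOutput: 'Madrid' },
])

// Chain transforms; `base` still holds all three items afterwards.
const capitals = base
  .filter((item) => item.input.startsWith('Capital'))
  .map((item) => ({ ...item, input: item.input.toLowerCase() }))
  .slice(0, 1)

console.log(base.size, capitals.size) // prints: 3 1
```

Because no transform mutates its receiver, the same base dataset can safely feed several experiments with different filters.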
An evaluator scores your agent's output. There are four built-in types: LLM judge (boolean pass/fail or a 0-1 scale), custom functions (write your own logic), semantic similarity (cosine or dot product), and Autoevals (11 battle-tested evaluators from Braintrust). Extend with plugins for domain-specific evaluators.
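For readers unfamiliar with the two similarity measures named above, the following sketch shows how dot product and cosine similarity are computed on embedding vectors. It is a generic illustration, not Cobalt's evaluator code; the function names and the toy 3-dimensional vectors are assumptions made for the example.

```typescript
// Illustrative sketch: Cobalt's own similarity evaluator may differ.
function dot(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length')
  }
  let sum = 0
  for (let i = 0; i < a.length; i++) {
    sum += a[i] * b[i]
  }
  return sum
}

function cosineSimilarity(a: readonly number[], b: readonly number[]): number {
  const normA = Math.sqrt(dot(a, a))
  const normB = Math.sqrt(dot(b, b))
  // A zero vector has no direction, so treat the similarity as 0.
  if (normA === 0 || normB === 0) return 0
  return dot(a, b) / (normA * normB)
}

// Toy 3-dimensional "embeddings"; real ones come from an embedding model.
const expected = [1, 0, 1]
const actual = [1, 0, 1]
const unrelated = [0, 1, 0]

console.log(cosineSimilarity(expected, actual)) // very close to 1
console.log(cosineSimilarity(expected, unrelated)) // 0 for orthogonal vectors
```

Cosine similarity ignores vector length and compares direction only, while the raw dot product also grows with magnitude, which is one reason a tool might offer both.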
An experiment is the core loop. It runs your agent against every item in a dataset, evaluates each output, and produces a structured report with per-evaluator statistics (avg, min, max, p50, p95, p99). It supports parallel execution, multiple runs with aggregation, timeouts, and CI thresholds.
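As a rough idea of what the per-evaluator statistics mean, the sketch below summarizes a list of scores. It assumes the nearest-rank percentile method; the README does not say which method Cobalt uses, and ScoreStats, percentile, and summarize are names invented for this example.

```typescript
// Illustrative sketch: the percentile method (nearest-rank) is an assumption.
interface ScoreStats {
  avg: number
  min: number
  max: number
  p50: number
  p95: number
  p99: number
}

// Nearest-rank percentile: the smallest value with at least p% of scores at or below it.
function percentile(sorted: readonly number[], p: number): number {
  const rank = Math.ceil((p / 100) * sorted.length)
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1))
  return sorted[index]
}

function summarize(scores: readonly number[]): ScoreStats {
  if (scores.length === 0) throw new Error('No scores to summarize')
  const sorted = [...scores].sort((a, b) => a - b)
  const total = sorted.reduce((sum, s) => sum + s, 0)
  return {
    avg: total / sorted.length,
    min: sorted[0],
    max: sorted[sorted.length - 1],
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
  }
}

// Five scores from one evaluator across five dataset items.
const stats = summarize([0.5, 1, 0.75, 0.25, 1])
console.log(stats)
```

With only a handful of items, p95 and p99 collapse to the maximum, so the high percentiles become informative mainly on larger datasets or multiple runs.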
Cobalt is built for AI-assisted development. Connect the MCP server, and your AI coding assistant can run experiments, analyze failures, and iterate on your agent — all from a single conversation.
Get started in 30 seconds:
- Add the Cobalt MCP server to your assistant config
- Ask it to run your experiments
- Let it analyze failures and suggest improvements
"Compare gpt 5.1 and 5.2 on my agent and tell me which one is best"
"Run my QA experiment and tell me which test cases are failing"
"Generate a Cobalt experiment for my agent at src/agents/summarizer.ts"
"Compare my last two runs and check for regressions"
"My agent is scoring 60% on correctness. Analyze the failures and suggest code fixes"
The built-in MCP server gives Claude Code (and other MCP clients) direct access to your experiments:
{
  "mcpServers": {
    "cobalt": {
      "command": "npx",
      "args": ["cobalt", "mcp"]
    }
  }
}

| Tools | Resources | Prompts |
|---|---|---|
| cobalt_run - Run experiments | cobalt://config - Current config | improve-agent - Analyze failures |
| cobalt_results - View results | cobalt://experiments - List experiments | generate-tests - Add test cases |
| cobalt_compare - Diff two runs | cobalt://latest-results - Latest results | regression-check - Detect regressions |
| cobalt_generate - Generate experiments | | |
cobalt init generates a .cobalt/SKILLS.md file and integrates with your AI instruction files (CLAUDE.md, AGENTS.md, .github/copilot-instructions.md) so your assistant knows how to use Cobalt from day one. After upgrading the SDK, run cobalt update to regenerate the skills file and check for updates.
Cobalt is built to run in your CI pipeline. Define quality thresholds for your agents, and Cobalt will enforce them on every commit — ensuring your AI systems stay reliable over time, not just at launch.
The GitHub Action is the easiest way to integrate Cobalt into your CI. It runs experiments, posts rich PR comments with score tables, auto-compares against the base branch, and optionally generates AI-powered analysis.
- uses: basalt-ai/cobalt@v1
  with:
    api_key: ${{ secrets.OPENAI_API_KEY }}

For any CI provider, use the CLI directly with --ci to enforce quality thresholds:
npx cobalt run --ci
# Exit code 1 if any threshold is violated

Define thresholds per evaluator, latency, cost, or overall score, so Cobalt catches regressions before they reach production.
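The idea behind a quality gate can be sketched in a few lines: compare observed metrics against limits and turn any violation into a non-zero exit code. Everything below is hypothetical and written only for illustration. The Threshold shape, the metric names, and findViolations are assumptions, not Cobalt's threshold configuration format, which is documented in the configuration reference.

```typescript
// Hypothetical shapes for illustration; Cobalt's real threshold config may look different.
interface Threshold {
  metric: string
  min?: number // fail when the observed value is below this
  max?: number // fail when the observed value is above this
}

function findViolations(
  observed: Partial<Record<string, number>>,
  thresholds: readonly Threshold[],
): string[] {
  const violations: string[] = []
  for (const t of thresholds) {
    const value = observed[t.metric]
    if (value === undefined) {
      violations.push(`${t.metric}: no value reported`)
      continue
    }
    if (t.min !== undefined && value < t.min) {
      violations.push(`${t.metric}: ${value} is below ${t.min}`)
    }
    if (t.max !== undefined && value > t.max) {
      violations.push(`${t.metric}: ${value} is above ${t.max}`)
    }
  }
  return violations
}

const violations = findViolations(
  { correctness: 0.6, avgLatencyMs: 1200 },
  [
    { metric: 'correctness', min: 0.8 },
    { metric: 'avgLatencyMs', max: 2000 },
  ],
)

// A CI wrapper would pass this code to process.exit.
const exitCode = violations.length > 0 ? 1 : 0
console.log(exitCode, violations) // prints: 1 [ 'correctness: 0.6 is below 0.8' ]
```

Treating a missing metric as a violation is a deliberate choice in this sketch: a gate that silently passes when an evaluator fails to report would hide regressions.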
Load datasets from your existing evaluation platforms:
| Platform | Loader | Docs |
|---|---|---|
| Langfuse | Dataset.fromLangfuse('dataset-name') | Setup → |
| LangSmith | Dataset.fromLangsmith('dataset-name') | Setup → |
| Braintrust | Dataset.fromBraintrust('project', 'dataset') | Setup → |
| Basalt | Dataset.fromBasalt('dataset-id') | Setup → |
File formats: JSON, JSONL, CSV, HTTP/HTTPS remote URLs.
LLM providers: OpenAI and Anthropic (auto-detected from model name).
// cobalt.config.ts
import { defineConfig } from '@basalt-ai/cobalt'
export default defineConfig({
  testDir: './experiments',
  judge: { model: 'gpt-5-mini', provider: 'openai' },
  concurrency: 5,
  timeout: 30_000,
  cache: { enabled: true, ttl: '7d' },
})

| Option | Default | Description |
|---|---|---|
| testDir | './experiments' | Experiment file directory |
| judge.model | 'gpt-5-mini' | Default LLM judge model |
| concurrency | 5 | Max parallel executions |
| timeout | 30000 | Per-item timeout (ms) |
| reporters | ['cli', 'json'] | Output reporters |
| cache.ttl | '7d' | LLM response cache TTL |
| plugins | [] | Custom evaluator plugins |
| thresholds | -- | CI quality gates |
Full Configuration Reference →
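To illustrate what the concurrency and timeout options control, here is a small sketch of a worker-pool runner with a per-item timeout. It is an assumption about the general technique, not Cobalt's runner; withTimeout, runAll, Outcome, and fakeAgent are hypothetical names used only in this example.

```typescript
// Illustrative sketch; names and behavior here are assumptions, not Cobalt's runner.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms)
    work.then(
      (value) => { clearTimeout(timer); resolve(value) },
      (error) => { clearTimeout(timer); reject(error) },
    )
  })
}

type Outcome<R> = { ok: true; value: R } | { ok: false; error: string }

async function runAll<T, R>(
  items: readonly T[],
  task: (item: T) => Promise<R>,
  concurrency: number,
  timeoutMs: number,
): Promise<Outcome<R>[]> {
  const results: Outcome<R>[] = new Array(items.length)
  let next = 0

  // Each worker pulls the next unclaimed index until the list is exhausted.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++
      try {
        const value = await withTimeout(task(items[index]), timeoutMs)
        results[index] = { ok: true, value }
      } catch (error) {
        results[index] = { ok: false, error: String(error) }
      }
    }
  }

  const workerCount = Math.max(1, Math.min(concurrency, items.length))
  await Promise.all(Array.from({ length: workerCount }, () => worker()))
  return results
}

// Stand-in agent: upper-cases the input after a short delay.
const fakeAgent = (input: string): Promise<string> =>
  new Promise((resolve) => setTimeout(() => resolve(input.toUpperCase()), 10))

// At most 2 items are in flight at once; each item gets up to 1000 ms.
runAll(['a', 'b', 'c'], fakeAgent, 2, 1000).then((outcomes) => {
  console.log(outcomes)
})
```

Results are stored by index, so the output order matches the input order even though items finish at different times. A timed-out item is recorded as a failure instead of aborting the whole run.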
cobalt run <file|dir> # Run experiments
cobalt init # Initialize project
cobalt update # Update skills file & check for SDK updates
cobalt history # View past runs
cobalt compare <id1> <id2> # Compare two runs
cobalt serve # Start dashboard
cobalt clean # Clean cache/results
cobalt mcp # Start MCP server

Cobalt is open source and community-driven. The roadmap is shaped by what you need, so tell us what matters to you.
| Status | Feature |
|---|---|
| ✅ | Core experiment runner, evaluators, datasets, CLI |
| ✅ | MCP server for AI-assisted testing |
| ✅ | CI mode with quality thresholds |
| ✅ | Plugin system & Autoevals integration |
| 🚧 | Vibe code your test reports - A vibe-coded dashboard UI you can shape however you want |
| ✅ | GitHub Action - First-class CI integration with PR comments |
| 🚧 | Tracing - Full agent tracing to give evaluations more context |
| 🔮 | Python version - Bring Cobalt to the Python ecosystem |
| 🔮 | VS Code extension - Run experiments from your editor |
| 🔮 | More integrations - Integrations with frameworks like Mastra or LangChain |
| 🔮 | Multi-platform export - Push results to BigQuery, Snowflake or other tools |
We welcome contributions! See our Contributing Guide for development setup, code standards, and PR process.
- Report bugs: Open an issue
- Suggest features: GitHub Issues
MIT — see LICENSE for details.
