Find the cheese in your maze.
LabRat is a terminal-native AI data agent. Connect to your warehouse, ask a question in plain English, and watch the agent explore your schema, write dialect-correct SQL in real time, and surface the answer β all without leaving your terminal.
Note
Status: feature-complete v0 alpha. 542 tests passing. End-to-end demo operational against DuckDB. Evaluated on ADE-bench (dbt Labs): 100% easy Β· 80% medium Β· 60% hard Β· 80% overall (48/60 tasks, DuckDB+dbt, claude-sonnet-4-6, best-of-3) β 82% on the 39-task subset that overlaps with Altimate Code's published DuckDB results (vs. their 77%). Evaluated on the full DataAgentBench (UC Berkeley) 54-query / 12-dataset benchmark: 50.5% (pass@5, claude-sonnet-4-6, LabRat tools via MCP), corrected after a self-audit (see below). Strongest single-dataset signal: crmarenapro 82% on a 6-database hybrid query set. Full write-up: docs/dab-progress-report.md.
Correction (2026-06-03): the initial submission reported 58.0%, but a trace audit prompted by the DAB maintainers found that our claude-mcp harness left the agent with unrestricted Bash/web access, so on some queries it read the benchmark's ground_truth.csv/validate.py off disk or pulled external labels (load_dataset("ag_news")). 18 contaminated trials (16 of 20 agnews, plus one bookreview and one yelp) are withdrawn; the remaining trials stand at 50.5%, with agnews effectively unscored. A fully clean re-run with the agent sandboxed (MCP-only tools, no repo on disk, no network) is pending.
- The SQL editor is the agent's whiteboard. Watch SQL stream into the editor character-by-character in your warehouse's dialect as the agent thinks. Edit it. Run it. The agent learns from your edits.
- Learns from you. Per Meta's research, 88% of data scientists' queries hit tables they've used before. LabRat captures your query history, infers your domain, and applies your past corrections automatically. Day-30 LabRat is meaningfully better than day-1 LabRat.
- Audit-ready by default. Every interaction is event-sourced and logged. Pin findings and export polished HTML reports with full provenance β query, results, chart, timestamps, lineage.
- Safe by default. Read-only roles enforced at connection. Mutations and multi-statement injection refused (sqlglot AST-checked). Queries gated by EXPLAIN-estimated cost. Spend tracked per session. Destructive mistakes are physically impossible.
- Catalog-native. Reads your dbt project's
schema.yml,manifest.json, lineage, and tags. Connects to DataHub, OpenMetadata, or any MCP-compatible data catalog. Surfaces the canonical models in your warehouse, not just whatever the LLM guesses. - Mouse-native, keyboard-first. Composes with your shell, your SSH sessions, your tmux setup. Every feature works without the mouse.
LabRat is feature-complete for v0 alpha. Below is the full capability inventory.
| Layer | Status | Details |
|---|---|---|
| 7 warehouse adapters | β | DuckDB, Postgres, Snowflake, BigQuery, Redshift, Trino, MySQL |
| 3 LLM providers | β | Anthropic API, Claude Code CLI (Mac OAuth), OpenAI-compatible |
| Agent tool loop | β | one-call profile_dataset grounding, schema exploration, SQL execution, safety gates (mutation + statement-stacking refusal), multi-DB routing, attach_database for cross-DB JOINs, load_file (CSV/TSV/JSON/Parquet), opt-in LLM-as-judge verifier loop, configurable max_turns / max_tool_calls caps |
| MCP server | β | python -m labrat.mcp.server mounts the LabRat tool registry over MCP stdio β drop into Claude Code, Codex, Cursor, OpenCode, or any MCP-supporting host. Reads connection spec from LABRAT_MCP_CONNECTIONS env var. |
| Query history | β | always-on, PII-redacted JSONL per profile |
| Personal context engine | β | table relevance scoring, LLM-generated descriptions |
| dbt catalog integration | β | manifest.json + schema.yml + catalog.json + lineage |
| MCP catalog integration | β | generic async client for any MCP-compatible catalog |
| Self-healing memory | β | edit-derived + chat-correction memories, retrieval |
| Custom validations | β | natural-language rules, warn/block severity |
| Eval framework | β | ADE-bench 60 tasks: 80% overall (100% easy Β· 80% medium Β· 60% hard, DuckDB+dbt, Sonnet 4.6, best-of-3). DataAgentBench Phase 1b: 48.5% (17 queries, 5 DuckDB+SQLite datasets, pass@5, raw Claude + prompt engineering baseline). |
| 3-pane TUI | β | chat + SQL whiteboard + schema browser |
| Charts | β | unicode (plotext) + image protocol (matplotlib/kitty) |
| HTML export | β | findings with full provenance |
| Audit log | β | JSONL event sourcing |
Test coverage: 542 passing, 10 skipped (gated by ANTHROPIC_API_KEY / LABRAT_RUN_LLM_TESTS).
ADE-bench (dbt Labs, DuckDB+dbt, claude-sonnet-4-6, best-of-3, via LabratLocalAgent):
| Tier | Tasks | Score |
|---|---|---|
| Easy | 15 | 100% (15/15) |
| Medium | 30 | 80% (24/30) |
| Hard | 15 | 60% (9/15) |
| Overall | 60 | 80% (48/60) |
On the 39 tasks shared with Altimate Code's published DuckDB results: LabRat 82% (32/39) vs. Altimate 77% (30/39), same model (Sonnet 4.6), same best-of-3 methodology. Full write-up: docs/ade-bench-progress-report.md.
DataAgentBench (UC Berkeley, multi-DB query answering, claude-sonnet-4-6):
| Phase | Datasets | Tasks | Score | Notes |
|---|---|---|---|---|
| Phase 1a baseline | 5 (DuckDB+SQLite) | 17 | 43% | n_trials=1, raw-bash driver |
| Phase 1b | 5 (DuckDB+SQLite) | 17 | 48.5% | pass@5, ATTACH preamble, raw-bash driver β raw-Claude floor |
| Phase 4 | 5 (DuckDB+SQLite) | 17 | 54.0% | pass@5, claude-mcp driver. +5.5pp over Phase 1b = measured tool-layer value |
| Phase 5 (full DAB) | 12 (all official) | 54 | 50.5% (corrected; submitted 58.0%) | pass@5, claude-mcp driver, full Phase 2+3 substrate. 18 contaminated trials withdrawn after self-audit β see correction note below |
Phase 5 correction (2026-06-03): the submitted 58.0% was inflated by a harness flaw β the claude-mcp driver left the agent with unrestricted Bash/web access, so on some queries it read ground_truth.csv/validate.py off disk or loaded external labels. Withdrawing the 18 contaminated trials (16 of 20 agnews; one bookreview; one yelp) gives 50.5%, with agnews effectively unscored (95% β 15%). The other nine datasets show no such access. A clean sandboxed re-run is pending. Details: docs/dab-progress-report.md.
Phase 5 per-dataset highlights (clean datasets): crmarenapro 82% on 6 databases (substrate's strongest single signal); stockindex 100%, stockmarket 80%; pancancer_atlas 67%; github_repos 50%, googlelocal 50%; deps_dev_v1 10% (stochastic regression from Phase 4); music_brainz_20k 7% and patents 0% (Sonnet ceiling, not infra). bookreview 87% and yelp 60% after withdrawing one contaminated trial each. Three drivers coexist (raw-bash for baseline reproducibility, labrat-agent for multi-provider standalone use, claude-mcp for the Max-plan tool path). Full write-up: docs/dab-progress-report.md.
# Coming soon
uv tool install labratUntil then, build from source:
git clone https://github.com/esagduyu/labrat
cd labrat
uv sync
uv run labratRequires Python 3.12+.
labratOn first run, an onboarding wizard walks you through:
- Picking a database dialect (DuckDB / PostgreSQL / Snowflake / BigQuery / Redshift / Trino / MySQL)
- Entering credentials (stored encrypted in OS keyring)
- Testing the connection
- Optionally linking a dbt project or data catalog
Then you're in. Ask a question:
> show me Q4 revenue by region
LabRat will:
- Explore your schema (tables, columns, relationships)
- Sample data to understand value distributions
- Consult your query history and any applicable memories
- Write dialect-correct SQL in the editor pane as it thinks
- Run the query (with safety gates)
- Render the results
- Offer to chart or pin the finding
Press ? at any time for the in-app keyboard reference.
DuckDB, PostgreSQL, Snowflake, BigQuery, Redshift, Trino/Presto, MySQL.
Pull requests for additional warehouses welcome β the Connection abstract base class makes new adapters straightforward.
- Anthropic API β direct SDK access; recommended when you have API credits
- Claude Code CLI β shells out to the local
claudebinary, authenticated via Mac OAuth (Claude Max subscription). Used byLabratLocalAgentfor ADE-bench runs where the API key has no credits - OpenAI-compatible endpoints β Azure OpenAI, LiteLLM gateways, vLLM, Together, Fireworks, Ollama for local
Configure per profile. The default model is claude-sonnet-4-6; switch in settings.
We started flirting with ratatui and Rust. We landed on Python + Textual because the agent is the point of the product and Python's iteration speed on prompts and tools dominates. The name stuck. The rat got a lab coat.
AGPL-3.0. You can use, modify, and redistribute LabRat for any purpose, including commercial. If you distribute a modified version, or run it as a service, you must release your modifications under the same license.
If you want LabRat under a more permissive license for proprietary use, get in touch.
LabRat stands on shoulders:
- Textual by Will McGugan β the framework that made mouse-native, async-native TUIs a real option in Python
- rich-pyfiglet for the placeholder banner that ships with v0
- Harlequin by Ted Conbeer β proved a terminal SQL editor can feel professional and shipped real adapter abstractions
- DuckDB β the universal SQL engine that makes "just connect to anything" actually work
- SQLGlot by Toby Mao β the dialect handling we could never have built ourselves
- Polars β fast Arrow-backed DataFrames
- uv by Astral β finally, a Python package manager that doesn't make you sigh
- dbt Labs ADE-bench team β the most serious public benchmark for data-engineering agents, Docker-sandboxed, execution-based, no LLM judges
- Meta's Analytics at Meta team β their writeup of the home-grown analytics agent is the architectural foundation for LabRat's personal context layer
- The folks at SignalPilot and Databao β competitors in adjacent categories. They've validated the space and set a high bar. We watch their work closely.
Issues, discussions, and PRs welcome. See CONTRIBUTING.md (coming soon) for details. Until then, the simplest contribution is to use LabRat, hit a wall, and open an issue describing what broke.
# Run the full test suite
uv run pytest
# Lint and type-check
uv run ruff check . && uv run ruff format --check . && uv run pyright
# Evaluate against the sample DuckDB
# Schema/SQL tests run with no auth; the agent NLβSQL step uses ANTHROPIC_API_KEY
# if set, otherwise falls back to the local `claude` CLI (Mac OAuth)
uv run python scripts/eval_duckdb.py
# ADE-bench evaluation (requires Docker + ade-bench repo at ~/repos/ade-bench)
# Uses local Claude Code via Mac OAuth β no API credits needed
cd ~/repos/ade-bench && uv run ade run helixops_saas001 airbnb001 --db duckdb --project-type dbt --agent labrat_local --no-diffs
# DataAgentBench evaluation (requires DataAgentBench repo at ~/repos/DataAgentBench)
# Phase 1b: pass@5 by default; --n-trials 1 for a quick single-trial run
uv run python scripts/eval_dab.py --datasets deps_dev_v1,github_repos,music_brainz_20k,stockindex,stockmarket
# Phase 4 measurement via the LabRat MCP server inside claude --print (Max-plan billing)
uv run python scripts/eval_dab.py --driver claude-mcp --n-trials 5
# Opt-in LLM-as-judge verifier loop on the labrat-agent driver (default off; extra LLM call/answer).
# --agent-timeout raises the claude-code per-call subprocess timeout to absorb the extra round-trips:
uv run python scripts/eval_dab.py --driver labrat-agent --agent-verify --agent-timeout 300
# Resume a crashed run:
uv run python scripts/eval_dab.py --output-dir runs/dab/dab-<id>
# Standalone LabRat agent against an arbitrary prompt (any provider, any DuckDB connection)
uv run python scripts/run_task.py \
--prompt "How many rows in orders?" \
--connections '{"main":{"db_type":"duckdb","db_path":"/path.duckdb"}}' \
--provider anthropic --model claude-sonnet-4-6
# Mount the LabRat MCP server inside Claude Code / Codex / Cursor / OpenCode
LABRAT_MCP_CONNECTIONS='{"main":{"db_type":"duckdb","db_path":"/path.duckdb"}}' \
uv run python -m labrat.mcp.server
# Generate UI screenshots (no API key needed)
uv run python scripts/take_screenshots.pyv0 alpha is feature-complete. Post-v0 priorities:
- ADE-bench improvements: 80% overall β next target is
compare_schemaandtrace_column_lineagetools (Tier 2) to close the remaining output-schema and dependency-discovery gaps - DataAgentBench clean re-run (top priority): re-run the full 54-query benchmark with the agent sandboxed β tool access restricted to the MCP server (no Bash/WebFetch/subagents via
--allowedTools), benchmark repo off the agent's filesystem, no network egress β so answer-key/external-label access is impossible by construction. This produces a defensible replacement for the corrected 50.5%. Then Phase 6 (closing the gap to MinusX 63.1%): force-query prompt rule to recover music_brainz_20k (7% β likely β₯40% β the agent has tools and currently chooses not to use them); investigate patents 0% (Sonnet ceiling vs. prompt structure); raise pass@5 to pass@10 to tighten dataset-mean variance (deps_dev_v1 regressed from Phase 4's 40% to 10% β likely n=5 noise) - DAB harness ergonomics: per-trial exception isolation now records provider failures as
infra:timeout/infra:agent_error(so one error can't crash a run); remaining work is detecting session-limit text and sleep-until-reset rather than fast-failing the rest of the queue (today's runs still need a few manual--output-dirresume cycles for a 270-trial sweep) - testcontainers integration tests: full Postgres/MySQL/Trino adapters against live containers
- v1 GA: dogfooded for one week, P0 bug-free, README demo gif
A small rat in a big maze. Finding cheese, one query at a time.