🐀 LabRat

Find the cheese in your maze.

LabRat is a terminal-native AI data agent. Connect to your warehouse, ask a question in plain English, and watch the agent explore your schema, write dialect-correct SQL in real time, and surface the answer — all without leaving your terminal.

Note

Status: feature-complete v0 alpha. 542 tests passing. End-to-end demo operational against DuckDB. Evaluated on ADE-bench (dbt Labs): 100% easy · 80% medium · 60% hard · 80% overall (48/60 tasks, DuckDB+dbt, claude-sonnet-4-6, best-of-3) — 82% on the 39-task subset that overlaps with Altimate Code's published DuckDB results (vs. their 77%). Evaluated on the full DataAgentBench (UC Berkeley) 54-query / 12-dataset benchmark: 50.5% (pass@5, claude-sonnet-4-6, LabRat tools via MCP), corrected after a self-audit (see below). Strongest single-dataset signal: crmarenapro 82% on a 6-database hybrid query set. Full write-up: docs/dab-progress-report.md.

Correction (2026-06-03): the initial submission reported 58.0%, but a trace audit prompted by the DAB maintainers found that our claude-mcp harness left the agent with unrestricted Bash/web access, so on some queries it read the benchmark's ground_truth.csv/validate.py off disk or pulled external labels (load_dataset("ag_news")). 18 contaminated trials (16 of 20 agnews, plus one bookreview and one yelp) are withdrawn; the remaining trials stand at 50.5%, with agnews effectively unscored. A fully clean re-run with the agent sandboxed (MCP-only tools, no repo on disk, no network) is pending.

What makes LabRat different

The SQL editor is the agent's whiteboard. Watch SQL stream into the editor character-by-character in your warehouse's dialect as the agent thinks. Edit it. Run it. The agent learns from your edits.
Learns from you. Per Meta's research, 88% of data scientists' queries hit tables they've used before. LabRat captures your query history, infers your domain, and applies your past corrections automatically. Day-30 LabRat is meaningfully better than day-1 LabRat.
Audit-ready by default. Every interaction is event-sourced and logged. Pin findings and export polished HTML reports with full provenance — query, results, chart, timestamps, lineage.
Safe by default. Read-only roles enforced at connection. Mutations and multi-statement injection refused (sqlglot AST-checked). Queries gated by EXPLAIN-estimated cost. Spend tracked per session. Destructive mistakes are physically impossible.
Catalog-native. Reads your dbt project's schema.yml, manifest.json, lineage, and tags. Connects to DataHub, OpenMetadata, or any MCP-compatible data catalog. Surfaces the canonical models in your warehouse, not just whatever the LLM guesses.
Mouse-native, keyboard-first. Composes with your shell, your SSH sessions, your tmux setup. Every feature works without the mouse.

Status

LabRat is feature-complete for v0 alpha. Below is the full capability inventory.

Layer	Status	Details
7 warehouse adapters	✅	DuckDB, Postgres, Snowflake, BigQuery, Redshift, Trino, MySQL
3 LLM providers	✅	Anthropic API, Claude Code CLI (Mac OAuth), OpenAI-compatible
Agent tool loop	✅	one-call `profile_dataset` grounding, schema exploration, SQL execution, safety gates (mutation + statement-stacking refusal), multi-DB routing, `attach_database` for cross-DB JOINs, `load_file` (CSV/TSV/JSON/Parquet), opt-in LLM-as-judge verifier loop, configurable `max_turns` / `max_tool_calls` caps
MCP server	✅	`python -m labrat.mcp.server` mounts the LabRat tool registry over MCP stdio — drop into Claude Code, Codex, Cursor, OpenCode, or any MCP-supporting host. Reads connection spec from `LABRAT_MCP_CONNECTIONS` env var.
Query history	✅	always-on, PII-redacted JSONL per profile
Personal context engine	✅	table relevance scoring, LLM-generated descriptions
dbt catalog integration	✅	manifest.json + schema.yml + catalog.json + lineage
MCP catalog integration	✅	generic async client for any MCP-compatible catalog
Self-healing memory	✅	edit-derived + chat-correction memories, retrieval
Custom validations	✅	natural-language rules, warn/block severity
Eval framework	✅	ADE-bench 60 tasks: 80% overall (100% easy · 80% medium · 60% hard, DuckDB+dbt, Sonnet 4.6, best-of-3). DataAgentBench Phase 1b: 48.5% (17 queries, 5 DuckDB+SQLite datasets, pass@5, raw Claude + prompt engineering baseline).
3-pane TUI	✅	chat + SQL whiteboard + schema browser
Charts	✅	unicode (plotext) + image protocol (matplotlib/kitty)
HTML export	✅	findings with full provenance
Audit log	✅	JSONL event sourcing

Test coverage: 542 passing, 10 skipped (gated by ANTHROPIC_API_KEY / LABRAT_RUN_LLM_TESTS).

ADE-bench (dbt Labs, DuckDB+dbt, claude-sonnet-4-6, best-of-3, via LabratLocalAgent):

Tier	Tasks	Score
Easy	15	100% (15/15)
Medium	30	80% (24/30)
Hard	15	60% (9/15)
Overall	60	80% (48/60)

On the 39 tasks shared with Altimate Code's published DuckDB results: LabRat 82% (32/39) vs. Altimate 77% (30/39), same model (Sonnet 4.6), same best-of-3 methodology. Full write-up: docs/ade-bench-progress-report.md.

DataAgentBench (UC Berkeley, multi-DB query answering, claude-sonnet-4-6):

Phase	Datasets	Tasks	Score	Notes
Phase 1a baseline	5 (DuckDB+SQLite)	17	43%	n_trials=1, raw-bash driver
Phase 1b	5 (DuckDB+SQLite)	17	48.5%	pass@5, ATTACH preamble, raw-bash driver — raw-Claude floor
Phase 4	5 (DuckDB+SQLite)	17	54.0%	pass@5, claude-mcp driver. +5.5pp over Phase 1b = measured tool-layer value
Phase 5 (full DAB)	12 (all official)	54	50.5% (corrected; submitted 58.0%)	pass@5, claude-mcp driver, full Phase 2+3 substrate. 18 contaminated trials withdrawn after self-audit — see correction note below

Phase 5 correction (2026-06-03): the submitted 58.0% was inflated by a harness flaw — the claude-mcp driver left the agent with unrestricted Bash/web access, so on some queries it read ground_truth.csv/validate.py off disk or loaded external labels. Withdrawing the 18 contaminated trials (16 of 20 agnews; one bookreview; one yelp) gives 50.5%, with agnews effectively unscored (95% → 15%). The other nine datasets show no such access. A clean sandboxed re-run is pending. Details: docs/dab-progress-report.md.

Phase 5 per-dataset highlights (clean datasets): crmarenapro 82% on 6 databases (substrate's strongest single signal); stockindex 100%, stockmarket 80%; pancancer_atlas 67%; github_repos 50%, googlelocal 50%; deps_dev_v1 10% (stochastic regression from Phase 4); music_brainz_20k 7% and patents 0% (Sonnet ceiling, not infra). bookreview 87% and yelp 60% after withdrawing one contaminated trial each. Three drivers coexist (raw-bash for baseline reproducibility, labrat-agent for multi-provider standalone use, claude-mcp for the Max-plan tool path). Full write-up: docs/dab-progress-report.md.

Install

# Coming soon
uv tool install labrat

Until then, build from source:

git clone https://github.com/esagduyu/labrat
cd labrat
uv sync
uv run labrat

Requires Python 3.12+.

Quickstart

labrat

On first run, an onboarding wizard walks you through:

Picking a database dialect (DuckDB / PostgreSQL / Snowflake / BigQuery / Redshift / Trino / MySQL)
Entering credentials (stored encrypted in OS keyring)
Testing the connection
Optionally linking a dbt project or data catalog

Then you're in. Ask a question:

> show me Q4 revenue by region

LabRat will:

Explore your schema (tables, columns, relationships)
Sample data to understand value distributions
Consult your query history and any applicable memories
Write dialect-correct SQL in the editor pane as it thinks
Run the query (with safety gates)
Render the results
Offer to chart or pin the finding

Press ? at any time for the in-app keyboard reference.

Supported warehouses

DuckDB, PostgreSQL, Snowflake, BigQuery, Redshift, Trino/Presto, MySQL.

Pull requests for additional warehouses welcome — the Connection abstract base class makes new adapters straightforward.

Supported LLM providers

Anthropic API — direct SDK access; recommended when you have API credits
Claude Code CLI — shells out to the local claude binary, authenticated via Mac OAuth (Claude Max subscription). Used by LabratLocalAgent for ADE-bench runs where the API key has no credits
OpenAI-compatible endpoints — Azure OpenAI, LiteLLM gateways, vLLM, Together, Fireworks, Ollama for local

Configure per profile. The default model is claude-sonnet-4-6; switch in settings.

Why "LabRat"?

We started flirting with ratatui and Rust. We landed on Python + Textual because the agent is the point of the product and Python's iteration speed on prompts and tools dominates. The name stuck. The rat got a lab coat.

License

AGPL-3.0. You can use, modify, and redistribute LabRat for any purpose, including commercial. If you distribute a modified version, or run it as a service, you must release your modifications under the same license.

If you want LabRat under a more permissive license for proprietary use, get in touch.

Acknowledgments

LabRat stands on shoulders:

Textual by Will McGugan — the framework that made mouse-native, async-native TUIs a real option in Python
rich-pyfiglet for the placeholder banner that ships with v0
Harlequin by Ted Conbeer — proved a terminal SQL editor can feel professional and shipped real adapter abstractions
DuckDB — the universal SQL engine that makes "just connect to anything" actually work
SQLGlot by Toby Mao — the dialect handling we could never have built ourselves
Polars — fast Arrow-backed DataFrames
uv by Astral — finally, a Python package manager that doesn't make you sigh
dbt Labs ADE-bench team — the most serious public benchmark for data-engineering agents, Docker-sandboxed, execution-based, no LLM judges
Meta's Analytics at Meta team — their writeup of the home-grown analytics agent is the architectural foundation for LabRat's personal context layer
The folks at SignalPilot and Databao — competitors in adjacent categories. They've validated the space and set a high bar. We watch their work closely.

Contributing

Issues, discussions, and PRs welcome. See CONTRIBUTING.md (coming soon) for details. Until then, the simplest contribution is to use LabRat, hit a wall, and open an issue describing what broke.

Development

# Run the full test suite
uv run pytest

# Lint and type-check
uv run ruff check . && uv run ruff format --check . && uv run pyright

# Evaluate against the sample DuckDB
# Schema/SQL tests run with no auth; the agent NL→SQL step uses ANTHROPIC_API_KEY
# if set, otherwise falls back to the local `claude` CLI (Mac OAuth)
uv run python scripts/eval_duckdb.py

# ADE-bench evaluation (requires Docker + ade-bench repo at ~/repos/ade-bench)
# Uses local Claude Code via Mac OAuth — no API credits needed
cd ~/repos/ade-bench && uv run ade run helixops_saas001 airbnb001 --db duckdb --project-type dbt --agent labrat_local --no-diffs

# DataAgentBench evaluation (requires DataAgentBench repo at ~/repos/DataAgentBench)
# Phase 1b: pass@5 by default; --n-trials 1 for a quick single-trial run
uv run python scripts/eval_dab.py --datasets deps_dev_v1,github_repos,music_brainz_20k,stockindex,stockmarket
# Phase 4 measurement via the LabRat MCP server inside claude --print (Max-plan billing)
uv run python scripts/eval_dab.py --driver claude-mcp --n-trials 5
# Opt-in LLM-as-judge verifier loop on the labrat-agent driver (default off; extra LLM call/answer).
# --agent-timeout raises the claude-code per-call subprocess timeout to absorb the extra round-trips:
uv run python scripts/eval_dab.py --driver labrat-agent --agent-verify --agent-timeout 300
# Resume a crashed run:
uv run python scripts/eval_dab.py --output-dir runs/dab/dab-<id>

# Standalone LabRat agent against an arbitrary prompt (any provider, any DuckDB connection)
uv run python scripts/run_task.py \
    --prompt "How many rows in orders?" \
    --connections '{"main":{"db_type":"duckdb","db_path":"/path.duckdb"}}' \
    --provider anthropic --model claude-sonnet-4-6

# Mount the LabRat MCP server inside Claude Code / Codex / Cursor / OpenCode
LABRAT_MCP_CONNECTIONS='{"main":{"db_type":"duckdb","db_path":"/path.duckdb"}}' \
    uv run python -m labrat.mcp.server

# Generate UI screenshots (no API key needed)
uv run python scripts/take_screenshots.py

Roadmap

v0 alpha is feature-complete. Post-v0 priorities:

ADE-bench improvements: 80% overall — next target is compare_schema and trace_column_lineage tools (Tier 2) to close the remaining output-schema and dependency-discovery gaps
DataAgentBench clean re-run (top priority): re-run the full 54-query benchmark with the agent sandboxed — tool access restricted to the MCP server (no Bash/WebFetch/subagents via --allowedTools), benchmark repo off the agent's filesystem, no network egress — so answer-key/external-label access is impossible by construction. This produces a defensible replacement for the corrected 50.5%. Then Phase 6 (closing the gap to MinusX 63.1%): force-query prompt rule to recover music_brainz_20k (7% → likely ≥40% — the agent has tools and currently chooses not to use them); investigate patents 0% (Sonnet ceiling vs. prompt structure); raise pass@5 to pass@10 to tighten dataset-mean variance (deps_dev_v1 regressed from Phase 4's 40% to 10% — likely n=5 noise)
DAB harness ergonomics: per-trial exception isolation now records provider failures as infra:timeout/infra:agent_error (so one error can't crash a run); remaining work is detecting session-limit text and sleep-until-reset rather than fast-failing the rest of the queue (today's runs still need a few manual --output-dir resume cycles for a 270-trial sweep)
testcontainers integration tests: full Postgres/MySQL/Trino adapters against live containers
v1 GA: dogfooded for one week, P0 bug-free, README demo gif

A small rat in a big maze. Finding cheese, one query at a time.

Name		Name	Last commit message	Last commit date
Latest commit History 176 Commits
.github/workflows		.github/workflows
docs		docs
findings		findings
screenshots		screenshots
scripts		scripts
src/labrat		src/labrat
tests		tests
.gitignore		.gitignore
.python-version		.python-version
CLAUDE.md		CLAUDE.md
README.md		README.md
TESTING.md		TESTING.md
benchmark_plan.md		benchmark_plan.md
decisions.md		decisions.md
labrat_tour.html		labrat_tour.html
pyproject.toml		pyproject.toml
snapshot_report.html		snapshot_report.html
test_findings.md		test_findings.md
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🐀 LabRat

What makes LabRat different

Status

Install

Quickstart

Supported warehouses

Supported LLM providers

Why "LabRat"?

License

Acknowledgments

Contributing

Development

Roadmap

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🐀 LabRat

What makes LabRat different

Status

Install

Quickstart

Supported warehouses

Supported LLM providers

Why "LabRat"?

License

Acknowledgments

Contributing

Development

Roadmap

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages