An autonomous agent for mathematical reasoning and proof.
How It Works · Quick Start · Tools · Skills · Autonomous Mode · Commands
Claude Code gives an LLM a shell and dev tools so it can write and run code. Bourbaki does the same thing for math: it gives an LLM a computer algebra system (SymPy), a proof assistant (Lean 4), and research APIs (OEIS, arXiv).
You ask a question in the TUI, the agent computes, verifies, looks things up, and streams the answer back. If it writes a proof, it can formalize it. If it makes a claim, it can check it.
- You ask a question in the TUI
- The backend agent reasons about the approach
- It calls tools: SymPy for computation, Lean for verification, OEIS/arXiv for lookup
- Results feed back into the agent, which iterates if needed
- A scratchpad enforces limits and deduplicates repeated calls
- The final answer streams back to the TUI as it's generated
The TUI is a pure display client. All reasoning, tool calls, and state live in the Python backend.
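The reason/tool/iterate cycle above can be sketched in a few lines of Python. Everything here (`run_llm`, `call_tool`, the step-dict shape) is an illustrative stand-in, not the backend's actual API; only the scratchpad-dedup idea comes from the design above.

```python
# Hypothetical sketch of the agent loop: reason, call tools, iterate.
def run_agent(question, run_llm, call_tool, max_steps=10):
    scratchpad = {}   # caches tool results so repeated calls are deduplicated
    history = [question]
    for _ in range(max_steps):
        step = run_llm(history)          # model decides: final answer or tool call
        if step["type"] == "answer":
            return step["text"]          # in the real app this streams to the TUI
        key = (step["tool"], step["args"])
        if key not in scratchpad:        # enforce limits / dedupe repeated calls
            scratchpad[key] = call_tool(step["tool"], step["args"])
        history.append(scratchpad[key])  # result feeds back into the agent
    return "step budget exhausted"
```

The real backend adds streaming, event mapping for the TUI, and per-tool limits, but the control flow has this shape.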
```bash
# Clone the repo
git clone https://github.com/0bserver07/bourbaki.git
cd bourbaki

# Start the backend
cd backend
pip install -e .
uvicorn bourbaki.main:app --reload --port 8000

# In another terminal — start the TUI
bun install
bun start
```

The TUI connects to localhost:8000 by default. Override with `BOURBAKI_BACKEND_URL`.
- Python 3.11+
- Bun v1.0+
- An LLM API key (set `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, or `GOOGLE_API_KEY`)
- Lean 4 with Mathlib (optional, for formal verification)
| Tool | What it does |
|---|---|
| Symbolic Compute | Native SymPy: simplification, integration, solving, 30+ operations |
| Lean Prover | Lean 4 + Mathlib, machine-checked formal proofs |
| Sequence Lookup | OEIS: identify and explore integer sequences |
| Paper Search | arXiv: find relevant papers and results |
| Web Search | Exa: search the web for mathematical references |
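Because the Symbolic Compute tool is native SymPy, its operations map directly onto library calls. A few representative ones:

```python
import sympy as sp

n, k, x = sp.symbols("n k x")

# Summation: closed form for the sum of the first n integers
s = sp.summation(k, (k, 1, n))
assert sp.simplify(s - n * (n + 1) / 2) == 0

# Integration and equation solving
assert sp.integrate(x**2, x) == x**3 / 3
assert sp.solve(x**2 - 4, x) == [-2, 2]

# Factorization: 84 = 2^2 * 3 * 7, with 12 divisors
assert sp.factorint(84) == {2: 2, 3: 1, 7: 1}
assert len(sp.divisors(84)) == 12
```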
Skills are proof techniques loaded from SKILL.md files. They tell the agent how to approach a specific type of proof step by step, instead of letting it improvise.
21 built-in skills across five categories:
- Basic: induction, strong induction, direct proof, contradiction, pigeonhole, counting
- Analysis: epsilon-delta, convergence tests, sequence limits, inequality chains
- Geometry: coordinate proof, synthetic construction, transformations
- Algebra: group homomorphisms, ring ideals, polynomials
- Advanced: extremal arguments, probabilistic method, conjecture exploration, formalization, proof explanation
Skills can be added at three levels: built-in (src/skills/), user (~/.bourbaki/skills/), or project (.bourbaki/skills/).
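A hedged sketch of how three-level resolution could work. Only the three directories are from the docs; the precedence order (project over user over built-in) and the `<name>/SKILL.md` layout are assumptions for illustration.

```python
from pathlib import Path

# Assumed precedence: project overrides user, user overrides built-in.
SKILL_DIRS = [
    Path(".bourbaki/skills"),                  # project
    Path("~/.bourbaki/skills").expanduser(),   # user
    Path("src/skills"),                        # built-in
]

def resolve_skill(name, dirs=SKILL_DIRS):
    """Return the first SKILL.md found for `name`, searching by precedence."""
    for d in dirs:
        candidate = d / name / "SKILL.md"
        if candidate.exists():
            return candidate
    return None
```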
Long-running proof search via a proposer-builder-reviewer loop driven by GLM-5.1 and a warm LeanREPLSession. One proposal per iteration, bounded by max_iterations (default 50, 8 for interactive). Every reported solve is gated by a lean_prover whole-file compile — no REPL-only claims.
Drive the loop from backend/bourbaki/benchmarks/minif2f.py::attempt_proof_loop or from the FastAPI /query endpoint with use_loop=True. (The TUI's /prove <id> command still points at the legacy /autonomous/start route, which now returns HTTP 410 Gone — the legacy pipeline was deleted in commit 2113629. Rewiring the TUI to the new loop is tracked separately.)
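The proposer-builder-reviewer shape can be sketched as below. The function names and signatures are illustrative, not the backend's actual API; the real loop also keeps memory across iterations and reuses a warm LeanREPLSession.

```python
def prove_loop(theorem, propose, build, review, max_iterations=50):
    """One proposal per iteration; a solve counts only if review()
    (standing in for the whole-file lean_prover compile gate) accepts it."""
    feedback = None
    for _ in range(max_iterations):
        candidate = propose(theorem, feedback)       # proposer drafts a proof
        proof, feedback = build(theorem, candidate)  # builder elaborates it
        if proof is not None and review(theorem, proof):
            return proof                             # gated, reportable solve
    return None                                      # budget exhausted
```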
Verified pass rates on miniF2F valid (every solve confirmed by lean_prover standalone compile — see docs/REALITY_CHECK.md for the audit of the earlier REPL-only era):
| Date | Approach | Verified | Sample |
|---|---|---|---|
| 2026-02-22 (audit) | v0.2.1 code, lean_prover-gated | 6.2% (15/244) | full 244 |
| 2026-03-08 (v0.2.2) | + REPL pipe-recovery + tactic blocklist | 25.8% (63/244) | full 244 |
| 2026-04-01 | + HILBERT decomposer + in-context solving | 50.0% (5/10) | 10-problem |
| 2026-04-25 | proposer-builder-reviewer loop (GLM-5.1) | 90.0% (9/10) | 10-problem · 0 false positives |
| 2026-05-09 | same loop on a wider sample | 62.9% (22/35) † | 35-problem stratified · 0 false positives |
† This number is a lower bound. The reviewer's final-gate lean_prover call was using a 30s default timeout in tools/lean_prover.py, which is too short for standalone lake env lean + import Mathlib (typically 60-180s). On the 2026-05-09 run, an unknown number of correct proofs were rejected by this timeout. Commit 7b07c07 raises the reviewer's timeout to 240s; a re-run with the fix is tracked in issue #19.
The 2026-02-17 v0.2.0 and 2026-02-18 v0.2.1 releases claimed 91.8% / 94.3% on the valid/test splits. Both numbers were inflated ~15× by REPL false positives and were retracted in the v0.2.2 audit (both GitHub releases now read "RETRACTED (inflated numbers)" in their titles). The current proposer-builder-reviewer architecture (commits 49211ce through 7b07c07) replaces the prior HILBERT-style pipeline; the full 244-problem run with the new architecture is pending (tracked in issue #14).
- Reviewer's `lean_prover` timeout was 30s until commit `7b07c07` (#19). Standalone `lake env lean` + `import Mathlib` typically needs 60-180s on a cold cache, so on busy/cold systems the gate silently rejected correct proofs. Bumped to 240s — the 2026-05-09 22/35 result is a lower bound; the new headline lands once #19's re-run completes.
- The Bash-tool-managed sessions some agents use kill long background processes when their parent shell exits. The standalone scripts in `backend/scripts/` (and their `just` wrappers) launch the asyncio loop inside a single Python process, but the launching shell still owns the process group — run benchmarks from a normal terminal (or with `nohup`/`tmux`/`screen`) rather than from inside an ephemeral agent session.
- z.ai has two separate billing pools. `glm:` routes via the Anthropic-compat endpoint, `glm-oai:` via the OpenAI-compat one. Funds in one pool do NOT cover the other — verify your balance is on the side your prefix targets (the loop defaults to `glm:` after commit `66cba4c`).
- The loop's ceiling on AIME/IMO problems is GLM-5.1's familiarity, not a code bug. On the 2026-05-09 35-problem run, aime 0/3 and imo 0/3 was a model-capability floor — the proposer simply did not produce proof candidates for problems in styles GLM-5.1 has weak coverage of. A stronger LLM would lift these without architectural changes.
- Pass@N shares a single REPL session across attempts. `attempt_proof_pass_at_n` runs the loop up to N times per problem but does not currently reset the Lean session between attempts. The lean4-repl has no clean `:reset` primitive, so state from attempt K can leak into attempt K+1. Documented in the function docstring; A/B tracked in #18.
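The session-sharing caveat is easiest to see in a sketch of the pass@N driver shape. The names and signature here are illustrative, not the actual `attempt_proof_pass_at_n` API:

```python
def attempt_pass_at_n(problem, attempt, session, n=4):
    """Illustrative pass@N driver: up to n attempts per problem.

    NOTE: `session` is shared across attempts. lean4-repl has no clean
    :reset primitive, so state from attempt k can leak into attempt k+1
    (issue #18 tracks an A/B comparison against fresh sessions).
    """
    for i in range(n):
        proof = attempt(problem, session)  # same warm Lean session each time
        if proof is not None:
            return proof, i + 1            # proof plus attempts used
    return None, n
```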
Prove a theorem:
```
❯ Prove that the sum of the first n integers equals n(n+1)/2
⏺ Thinking...
⏺ Symbolic Compute (expression=Sum(k, (k, 1, n)))
⎿ Computed result
⏺ Lean Prover
⎿ ✓ Verified in 2.3s
Proof by induction. Base case: n = 1, sum = 1 = 1·2/2. ✓
Inductive step: assume ∑_{k=1}^{n} k = n(n+1)/2.
Then ∑_{k=1}^{n+1} k = n(n+1)/2 + (n+1) = (n+1)(n+2)/2. ∎
```
Compute symbolically:
```
❯ Factor 84 and find its divisors
⏺ Symbolic Compute (operation=factor, expression=84)
⎿ Computed result
84 = 2² × 3 × 7
Divisors: {1, 2, 3, 4, 6, 7, 12, 14, 21, 28, 42, 84}
```
Identify a sequence:
```
❯ What sequence is 1, 1, 2, 3, 5, 8, 13?
⏺ Sequence Lookup (query="1,1,2,3,5,8,13")
⎿ Found 1 results
A000045 — Fibonacci numbers: F(n) = F(n-1) + F(n-2) with F(0) = 0 and F(1) = 1.
```
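Under the hood the Sequence Lookup tool queries OEIS. A minimal direct query, bypassing the agent, might look like the following; this assumes OEIS's public search endpoint and its `fmt=json` parameter, not Bourbaki's internal wrapper:

```python
from urllib.parse import urlencode

def oeis_search_url(terms):
    """Build an OEIS search URL for a list of integer terms."""
    query = ",".join(str(t) for t in terms)
    return "https://oeis.org/search?" + urlencode({"q": query, "fmt": "json"})

url = oeis_search_url([1, 1, 2, 3, 5, 8, 13])
# Fetching this URL returns JSON search results; for these terms the top
# hit is A000045, the Fibonacci numbers.
```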
| Command | What it does |
|---|---|
| `/help` | Show all commands |
| `/model <name>` | Switch LLM model |
| `/skills` | List available proof technique skills |
| `/problems` | Browse the problem database |
| `/prove <id>` | Start proof attempt (legacy TUI handler still POSTs to `/autonomous/start`, which now returns 410; use the `attempt_proof_loop` driver or `/query` with `use_loop=True` for the new loop) |
| `/pause` | Pause proof search (legacy, 410) |
| `/progress` | Show proof search progress (legacy, 410) |
| `/sessions` | List saved sessions |
| `/new` | Start a new session |
| `/export [format]` | Export last answer (latex, lean, markdown) |
| `/debug` | Toggle debug mode |
| `/clear` | Clear the screen |
```
src/                   React + Ink TUI (display client)
├── components/        UI components (Input, AgentEventView, AnswerView)
├── hooks/             useAgentRunner (SSE bridge), useModelSelection
└── skills/            21 SKILL.md proof technique files
backend/bourbaki/      Python backend (owns all state)
├── agent/             Pydantic AI agent, prompts, scratchpad, event mapper
├── tools/             SymPy, Lean 4, OEIS, arXiv, Web Search, Skills
├── sessions/          Persistence + context compaction
├── prover/            Proposer-builder-reviewer-memory loop
├── autonomous/        Phase-3 vestige — only tactics.py survives (blocklist)
├── benchmarks/        miniF2F + PutnamBench runners
├── problems/          13 classic problems database
└── server/routes/     FastAPI endpoints (query, sessions, skills, ...)
```
- Backend: Python, FastAPI, Pydantic AI, SymPy, httpx
- TUI: Bun, React + Ink, TypeScript
- Verification: Lean 4 + Mathlib
- Sequences: OEIS API
- Papers: arXiv API
Named after Nicolas Bourbaki, the collective pseudonym of a group of mathematicians who tried to rewrite all of mathematics from scratch using set theory.