diff --git a/README.md b/README.md index 9f5c5a8..045e0ec 100644 --- a/README.md +++ b/README.md @@ -23,7 +23,17 @@ A search engine for Lean 4 declarations. This project provides tools and resources for exploring the Lean 4 ecosystem. -**For full documentation, please visit: [https://www.leanexplore.com/docs](https://www.leanexplore.com/docs)** +The current indexed projects include: + +* Batteries +* CSLib +* FLT (Fermat's Last Theorem) +* FormalConjectures +* Init +* Lean +* Mathlib +* PhysLean +* Std ## Installation @@ -46,19 +56,20 @@ lean-explore data fetch lean-explore mcp serve --backend local ``` -The current indexed projects include: +## Documentation -* Batteries -* CSLib -* FLT (Fermat's Last Theorem) -* FormalConjectures -* Init -* Lean -* Mathlib -* PhysLean -* Std +Full docs live in the [`docs/`](docs/README.md) folder, or at [https://www.leanexplore.com/docs](https://www.leanexplore.com/docs). -This code is distributed under an Apache License (see [LICENSE](LICENSE)). +| Page | Description | +|---|---| +| [Getting Started](docs/getting-started.md) | Install and run your first search. | +| [CLI Reference](docs/cli.md) | Every `lean-explore` command and flag. | +| [MCP Server](docs/mcp-server.md) | Wire LeanExplore into Claude, Cursor, or any MCP client. | +| [API Client](docs/api-client.md) | Use `ApiClient` from Python. | +| [Local Search Backend](docs/local-backend.md) | How hybrid BM25 + FAISS + reranking works. | +| [Configuration](docs/configuration.md) | Environment variables and data layout. | +| [Data Models](docs/data-models.md) | `SearchResult`, `SearchResponse`, and related types. | +| [Extraction Pipeline](docs/extraction-pipeline.md) | Rebuild the dataset from Lean source (contributors). | ## Contributing @@ -82,3 +93,7 @@ Justin Asher. (2025). *LeanExplore: A search engine for Lean 4 declarations*. [h url = {https://arxiv.org/abs/2506.11085} } ``` + +## License + +This code is distributed under an Apache License (see [LICENSE](LICENSE)). diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000..e9386e5 --- /dev/null +++ b/docs/README.md @@ -0,0 +1,44 @@ +# LeanExplore Documentation + +LeanExplore is a search engine for Lean 4 declarations (theorems, definitions, +lemmas, instances, and more) from Mathlib and other major Lean packages. It +provides hybrid retrieval that matches both by declaration name (e.g., +`List.map`, `Nat.Prime`) and by informal natural-language meaning (e.g., +"continuous function on a compact set"). + +This documentation covers how to install, configure, and use LeanExplore from +the CLI, as a Python library, and as an MCP server for AI assistants. + +## Contents + +| Page | Description | +|---|---| +| [Getting Started](./getting-started.md) | Install the package, fetch data, and run your first search. | +| [CLI Reference](./cli.md) | Every `lean-explore` command, flag, and option, with examples. | +| [MCP Server](./mcp-server.md) | Run the MCP server and connect it to Claude, Cursor, and other MCP clients. Full reference for every tool exposed. | +| [API Client](./api-client.md) | Use `ApiClient` to query the remote LeanExplore API from Python. | +| [Local Search Backend](./local-backend.md) | How the on-device BM25 + FAISS + cross-encoder pipeline works. | +| [Configuration](./configuration.md) | Environment variables, cache paths, and data layout on disk. | +| [Data Models](./data-models.md) | `SearchResult`, `SearchResponse`, and related types returned from search. | +| [Extraction Pipeline](./extraction-pipeline.md) | Rebuild the dataset from Lean source (contributor-focused). | + +## Which backend should I use? + +LeanExplore has two backends and you pick one per task: + +| | **Remote API** | **Local backend** | +|---|---|---| +| Install | `pip install lean-explore` | `pip install lean-explore[local]` | +| Requires | API key | ~1 GB of data + a few GB of model weights | +| Network | Required per query | Only for initial data fetch | +| Use when | You want zero setup | You want offline, private, or tunable search | + +If you don't know yet, start with the remote API. It is the fastest path to a +working search. You can switch to local later without changing your code. + +## Links + +- Website: +- Repository: +- Paper: +- Contributing: [../CONTRIBUTING.md](../CONTRIBUTING.md) diff --git a/docs/api-client.md b/docs/api-client.md new file mode 100644 index 0000000..ad596d9 --- /dev/null +++ b/docs/api-client.md @@ -0,0 +1,136 @@ +# API Client + +`lean_explore.api.ApiClient` is an async HTTP client for the hosted +LeanExplore API. It ships with the base package: no PyTorch, no local +indices, no data download required. + +## Install and authenticate + +```bash +pip install lean-explore +``` + +Get an API key from and set it as an +environment variable: + +```bash +export LEANEXPLORE_API_KEY="your-key-here" +``` + +Or pass it explicitly to the client constructor. + +## Quick start + +```python +import asyncio +from lean_explore.api import ApiClient + +async def main(): + client = ApiClient() # reads LEANEXPLORE_API_KEY + + response = await client.search("prime number divisibility", limit=5) + for result in response.results: + print(f"{result.name} ({result.module})") + +asyncio.run(main()) +``` + +The `search` and `get_by_id` methods are both `async`, so call them from +inside an async function or via `asyncio.run`. + +## `ApiClient` + +```python +ApiClient(api_key: str | None = None, timeout: float = 10.0) +``` + +| Parameter | Default | Description | +|---|---|---| +| `api_key` | `None` | API key. Falls back to `LEANEXPLORE_API_KEY` env var. Raises `ValueError` if neither is provided. | +| `timeout` | `10.0` | HTTP timeout in seconds for every request. | + +The client hits `https://www.leanexplore.com/api/v2` by default. + +### `search()` + +```python +async def search( + query: str, + limit: int = 20, + rerank_top: int | None = None, + packages: list[str] | None = None, +) -> SearchResponse +``` + +| Parameter | Default | Description | +|---|---|---| +| `query` | *required* | Declaration name or natural-language text. | +| `limit` | `20` | Maximum results to return. | +| `rerank_top` | `None` | Ignored by the API backend (reranking is server-side). Accepted for interface parity with the local `Service`. | +| `packages` | `None` | Filter by package, e.g., `["Mathlib", "Std"]`. | + +Returns a [`SearchResponse`](./data-models.md#searchresponse) with `query`, +`results`, `count`, and `processing_time_ms`. + +Raises: + +- `httpx.HTTPStatusError`: the API returned a non-2xx status. +- `httpx.RequestError`: network, DNS, or timeout failure. + +### `get_by_id()` + +```python +async def get_by_id(declaration_id: int) -> SearchResult | None +``` + +Fetches a single declaration by its numeric id (ids come from `search` +results). Returns a [`SearchResult`](./data-models.md#searchresult) on hit, +or `None` on a 404. Other non-2xx statuses raise +`httpx.HTTPStatusError`. + +## Full example + +```python +import asyncio +from lean_explore.api import ApiClient + +async def main(): + client = ApiClient(api_key="sk-...", timeout=15.0) + + response = await client.search( + query="continuous function on a compact set", + limit=10, + packages=["Mathlib"], + ) + print(f"{response.count} results in {response.processing_time_ms} ms") + + # Grab the first result's full source + if response.results: + first = response.results[0] + print(f"\n{first.name}") + print(first.source_link) + print(first.source_text) + + # Round-trip by id + same = await client.get_by_id(first.id) + assert same is not None and same.id == first.id + +asyncio.run(main()) +``` + +## Notes + +- The client creates a fresh `httpx.AsyncClient` per call. If you make many + rapid requests, consider wrapping them in `asyncio.gather(...)`; they run + concurrently without extra setup. +- Package names follow the casing shown in results (e.g., `Mathlib`, `Std`, + `PhysLean`). +- For offline use or custom tuning, see [Local Search Backend](./local-backend.md) + and use `lean_explore.search.Service` instead. + +## See also + +- [Data Models](./data-models.md): field reference for `SearchResult` and + `SearchResponse`. +- [Configuration](./configuration.md): environment variables including + `LEANEXPLORE_API_KEY`. diff --git a/docs/cli.md b/docs/cli.md new file mode 100644 index 0000000..f511f0e --- /dev/null +++ b/docs/cli.md @@ -0,0 +1,164 @@ +# CLI Reference + +The `lean-explore` command is installed alongside the Python package. It is +organized as a small command tree: + +``` +lean-explore +├── search # Query Lean declarations +├── mcp +│ └── serve # Run the MCP server (api or local backend) +└── data + ├── fetch # Download the data toolchain + └── clean # Delete all cached data +``` + +Run `lean-explore --help` or `lean-explore --help` for built-in +help at any time. + +## `lean-explore search` + +Search the Lean Explore index and print results to your terminal. + +```bash +lean-explore search QUERY [OPTIONS] +``` + +### Arguments + +| Argument | Description | +|---|---| +| `QUERY` | The search query. Can be a declaration name (`List.map`, `Nat.Prime`) or natural-language text (`"continuous function on a compact set"`). Required. | + +### Options + +| Flag | Default | Description | +|---|---|---| +| `--limit`, `-n` | `5` | Number of results to display. | +| `--package`, `-p` | (all) | Filter by package. Repeatable: `-p Mathlib -p Std`. | + +### Requirements + +`lean-explore search` uses the remote API. You must have `LEANEXPLORE_API_KEY` +set in your environment: + +```bash +export LEANEXPLORE_API_KEY="your-key-here" +``` + +### Examples + +```bash +lean-explore search "List.map" +lean-explore search "prime number divisibility" --limit 10 +lean-explore search "fundamental theorem of calculus" -p Mathlib +lean-explore search "continuous" -p Mathlib -p Std -n 20 +``` + +## `lean-explore mcp serve` + +Launch the Model Context Protocol server so an MCP client (Claude, Cursor, +etc.) can call LeanExplore's search tools. The server speaks MCP over stdio, +so you normally do not run it manually; your MCP client launches it. Running +it directly is mostly useful for debugging. + +```bash +lean-explore mcp serve [OPTIONS] +``` + +### Options + +| Flag | Default | Description | +|---|---|---| +| `--backend`, `-b` | `api` | Backend to use: `api` or `local`. | +| `--api-key` | (none) | API key for the `api` backend. Overrides `LEANEXPLORE_API_KEY`. | + +### Backends + +- **`api`**: Delegates every query to the hosted LeanExplore API. Requires an + API key (via env var or `--api-key`). +- **`local`**: Runs the full hybrid search pipeline on-device. Requires + `pip install lean-explore[local]` and `lean-explore data fetch`. + +### Examples + +```bash +# Remote API (most users) +lean-explore mcp serve --backend api + +# Remote API with an inline key +lean-explore mcp serve --backend api --api-key sk-... + +# Local, fully offline backend +lean-explore mcp serve --backend local +``` + +For MCP tool schemas and client configuration, see +[MCP Server](./mcp-server.md). + +## `lean-explore data fetch` + +Download the prebuilt data toolchain used by the local search backend. This +pulls the SQLite database, the FAISS semantic index, and the BM25 indices from +remote storage into `~/.lean_explore/cache//`. + +```bash +lean-explore data fetch [OPTIONS] +``` + +### Options + +| Flag | Default | Description | +|---|---|---| +| `--version`, `-v` | latest | Specific version to install (e.g., `20260127_103630`). | + +### Behavior + +1. Fetches the current version tag (or uses the one you passed). +2. Downloads all required files to `~/.lean_explore/cache//`: + - `lean_explore.db` + - `informalization_faiss.index` and `informalization_faiss_ids_map.json` + - `bm25_ids_map.json` + - `bm25_name_raw/` and `bm25_name_spaced/` directories +3. Writes `~/.lean_explore/active_version` so future commands resolve the + correct cache directory. +4. Removes any older cached versions to reclaim disk space. + +Already-present files are skipped, so re-running `fetch` after an interrupted +download will resume rather than start over. + +### Examples + +```bash +# Install the latest version +lean-explore data fetch + +# Pin a specific version +lean-explore data fetch --version 20260127_103630 +``` + +## `lean-explore data clean` + +Remove every cached data toolchain and clear the active-version pointer. +Useful if you want to reset state or free disk. + +```bash +lean-explore data clean +``` + +This command prompts for confirmation before deleting anything. It does not +touch downloaded model weights; those live under `~/.cache/huggingface/`. + +## Exit codes + +All CLI commands follow standard conventions: + +- `0`: success +- non-zero: an error occurred (missing API key, failed download, etc.). + An error message is printed to stderr. + +## See also + +- [Getting Started](./getting-started.md) +- [MCP Server](./mcp-server.md) +- [Configuration](./configuration.md) diff --git a/docs/configuration.md b/docs/configuration.md new file mode 100644 index 0000000..2acf6f7 --- /dev/null +++ b/docs/configuration.md @@ -0,0 +1,143 @@ +# Configuration + +This page lists every environment variable LeanExplore reads, where files +live on disk, and how to override defaults. Configuration is centralized in +`src/lean_explore/config.py`. + +## Environment variables + +### Authentication + +| Variable | Default | Used by | +|---|---|---| +| `LEANEXPLORE_API_KEY` | (required for API use) | `ApiClient`, `lean-explore search`, `lean-explore mcp serve --backend api` | + +### Paths + +| Variable | Default | Purpose | +|---|---|---| +| `LEAN_EXPLORE_CACHE_DIR` | `~/.lean_explore/cache` | Where `lean-explore data fetch` downloads data. Read by the local backend. | +| `LEAN_EXPLORE_DATA_DIR` | `/data` | Local extraction output directory. Only relevant if you run the extraction pipeline yourself. | +| `LEAN_EXPLORE_PACKAGES_ROOT` | `/lean` | Root for per-package Lean workspaces. Only relevant for extraction. | +| `LEAN_EXPLORE_VERSION` | contents of `~/.lean_explore/active_version`, or `v4.24.0` | Pin a specific cached version. | + +### Local backend tuning + +| Variable | Default | Purpose | +|---|---|---| +| `LEAN_EXPLORE_EMBEDDING_BATCH_SIZE` | `8` | Batch size used when embedding the query with the sentence-transformer. Raise on GPUs with more VRAM. | +| `LEAN_EXPLORE_RERANKER_BATCH_SIZE` | `16` on CUDA, `32` on CPU | Batch size for the cross-encoder reranker. | + +### Extraction pipeline (only for contributors rebuilding the index) + +| Variable | Default | Purpose | +|---|---|---| +| `OPENROUTER_API_KEY` | (none) | Used by the informalization step to call OpenRouter. | + +## On-disk layout + +### Cache directory + +Populated by `lean-explore data fetch`: + +``` +~/.lean_explore/ +├── active_version # Text file: currently active version +└── cache/ + └── / # e.g. 20260127_103630 + ├── lean_explore.db # SQLite database + ├── informalization_faiss.index # FAISS semantic index + ├── informalization_faiss_ids_map.json + ├── bm25_ids_map.json + ├── bm25_name_raw/ # BM25 index (raw tokenization) + │ ├── data.csc.index.npy + │ ├── indices.csc.index.npy + │ ├── indptr.csc.index.npy + │ ├── nonoccurrence_array.index.npy + │ ├── params.index.json + │ └── vocab.index.json + └── bm25_name_spaced/ # BM25 index (spaced tokenization) + └── (same six files) +``` + +Change this location with `LEAN_EXPLORE_CACHE_DIR`. + +### Model cache + +First-run local-backend models are cached under `~/.cache/huggingface/` by +`sentence-transformers` and `transformers`. This is controlled by +Hugging Face, not LeanExplore. See `HF_HOME` / `TRANSFORMERS_CACHE` if you +need to move it. + +## Programmatic access + +All of the above resolves through the `Config` class: + +```python +from lean_explore.config import Config + +Config.CACHE_DIRECTORY # ~/.lean_explore/cache +Config.ACTIVE_VERSION # current version string +Config.ACTIVE_CACHE_PATH # ~/.lean_explore/cache/ +Config.DATABASE_PATH # .../lean_explore.db +Config.DATABASE_URL # sqlite+aiosqlite:///.../lean_explore.db +Config.FAISS_INDEX_PATH +Config.FAISS_IDS_MAP_PATH +Config.BM25_RAW_PATH +Config.BM25_SPACED_PATH +Config.BM25_IDS_MAP_PATH +Config.API_BASE_URL # https://www.leanexplore.com/api/v2 +Config.R2_ASSETS_BASE_URL # data download base URL +``` + +`ACTIVE_VERSION` is resolved, in order: + +1. `LEAN_EXPLORE_VERSION` environment variable, if set. +2. Contents of `~/.lean_explore/active_version` (written by `data fetch`). +3. Fallback: `v4.24.0`. + +## Common recipes + +**Move the cache to a shared drive.** + +```bash +export LEAN_EXPLORE_CACHE_DIR=/mnt/shared/lean-explore-cache +lean-explore data fetch +``` + +**Pin a specific data version across a team.** + +```bash +export LEAN_EXPLORE_VERSION=20260127_103630 +``` + +**Tune for a beefier GPU.** + +```bash +export LEAN_EXPLORE_EMBEDDING_BATCH_SIZE=64 +export LEAN_EXPLORE_RERANKER_BATCH_SIZE=128 +lean-explore mcp serve --backend local +``` + +**Run the MCP server under a non-default config in Claude Desktop.** + +```json +{ + "mcpServers": { + "lean-explore": { + "command": "lean-explore", + "args": ["mcp", "serve", "--backend", "local"], + "env": { + "LEAN_EXPLORE_CACHE_DIR": "/Users/me/lean-cache", + "LEAN_EXPLORE_EMBEDDING_BATCH_SIZE": "32" + } + } + } +} +``` + +## See also + +- [Getting Started](./getting-started.md) +- [Local Search Backend](./local-backend.md) +- [MCP Server](./mcp-server.md) diff --git a/docs/data-models.md b/docs/data-models.md new file mode 100644 index 0000000..9d7d340 --- /dev/null +++ b/docs/data-models.md @@ -0,0 +1,101 @@ +# Data Models + +All public search results are Pydantic v2 models, defined in +`lean_explore.models`. Both the remote `ApiClient` and the local `Service` +return the same types, so code written against one works against the other. + +## Imports + +```python +from lean_explore.models import ( + SearchResult, + SearchResponse, + SearchResultSummary, + SearchSummaryResponse, + extract_bold_description, +) +``` + +## `SearchResult` + +A single Lean declaration returned from a search. + +| Field | Type | Description | +|---|---|---| +| `id` | `int` | Primary key identifier. | +| `name` | `str` | Fully qualified Lean name (e.g., `Nat.add`). | +| `module` | `str` | Module path (e.g., `Mathlib.Data.List.Basic`). | +| `docstring` | `str \| None` | Doc comment from the Lean source. May be `None`. | +| `source_text` | `str` | Full Lean source code for the declaration. | +| `source_link` | `str` | GitHub URL to the source. | +| `dependencies` | `str \| None` | JSON-encoded array of declaration names this depends on. Parse with `json.loads` if you need the list. | +| `informalization` | `str \| None` | AI-generated natural-language description. Typically starts with a bold header: `**Title.** Rest...`. | + +Example: + +```python +for result in response.results: + print(result.name) + print(" module:", result.module) + print(" link: ", result.source_link) + if result.docstring: + print(" doc: ", result.docstring.splitlines()[0]) +``` + +## `SearchResponse` + +Envelope returned by `ApiClient.search()` and `Service.search()`. + +| Field | Type | Description | +|---|---|---| +| `query` | `str` | The original query string. | +| `results` | `list[SearchResult]` | Hits, ordered best-first. | +| `count` | `int` | `len(results)`. | +| `processing_time_ms` | `int \| None` | Server/engine latency in milliseconds. | + +## `SearchResultSummary` + +Slim form used by the MCP `search_summary` tool. Contains only enough to let +a caller decide which ids to drill into. + +| Field | Type | Description | +|---|---|---| +| `id` | `int` | Declaration id (feed into per-field getters). | +| `name` | `str` | Fully qualified Lean name. | +| `description` | `str \| None` | The bold header extracted from the informalization. | + +## `SearchSummaryResponse` + +Envelope for slim results. + +| Field | Type | Description | +|---|---|---| +| `query` | `str` | The original query string. | +| `results` | `list[SearchResultSummary]` | Slim hits, ordered best-first. | +| `count` | `int` | `len(results)`. | +| `processing_time_ms` | `int \| None` | Latency in milliseconds. | + +## `extract_bold_description` + +```python +extract_bold_description(informalization: str | None) -> str | None +``` + +Pulls the `**Bold Title.**` prefix out of an informalization. Returns `None` +if the input is `None` or does not start with a bold header. This is the +helper used to build `SearchResultSummary.description`. + +## ORM model (advanced) + +If you are working directly against the SQLite database (e.g., in the +extraction pipeline), `lean_explore.models.Declaration` is the SQLAlchemy +ORM model that backs the `declarations` table. Most users should stick to +the Pydantic models above. + +## See also + +- [API Client](./api-client.md): returns `SearchResponse` / `SearchResult`. +- [Local Search Backend](./local-backend.md): same return types via + `Service`. +- [MCP Server](./mcp-server.md): tool return payloads serialize these + models to JSON. diff --git a/docs/extraction-pipeline.md b/docs/extraction-pipeline.md new file mode 100644 index 0000000..ca022d0 --- /dev/null +++ b/docs/extraction-pipeline.md @@ -0,0 +1,182 @@ +# Extraction Pipeline + +The extraction pipeline is how LeanExplore *builds* the dataset that the +search engine queries. For everyday use you do not run it; `lean-explore +data fetch` downloads a prebuilt version. This page is for contributors who +want to regenerate the data (e.g., against a newer Mathlib) or run a custom +build locally. + +The pipeline lives in `lean_explore.extract` and is invoked as a Python +module: + +```bash +python -m lean_explore.extract [OPTIONS] +``` + +## What the pipeline does + +Four stages, run in order: + +1. **doc-gen4** (optional, `--run-doc-gen4`): runs Lake + doc-gen4 against + each package workspace in `lean//` to produce structured + declaration data (BMP files). This is the slow step: it builds Lean + packages from source. +2. **Parse** (`--parse-docs`): reads the doc-gen4 output and writes one + row per declaration into a fresh SQLite database under a new + `YYYYMMDD_HHMMSS` directory in `LEAN_EXPLORE_DATA_DIR`. +3. **Informalize** (`--informalize`): calls an LLM (through OpenRouter) to + generate a natural-language description for each declaration. +4. **Embed** (`--embeddings`): runs each informalization through the + sentence-transformer to produce vector embeddings. +5. **Index** (`--index`): builds the FAISS semantic index and the two BM25 + indices over the declaration names. + +Running with no flags runs all stages. Passing any stage flag switches to +"only run the stages I asked for" mode, which is how you resume after a +failure, or iterate on a single step. + +## Prerequisites + +### Python install + +```bash +pip install lean-explore[extract] +``` + +This adds PyTorch, `sentence-transformers`, and `networkx` on top of the +base dependencies. + +### Lean toolchain + +You only need this if you plan to run `--run-doc-gen4`. Install +[`elan`](https://github.com/leanprover/elan) (the Lean version manager), +which provides `lake`. The pipeline fetches the correct Lean toolchain per +package from GitHub. + +### Environment variables + +| Variable | Required for | Purpose | +|---|---|---| +| `OPENROUTER_API_KEY` | informalize stage | LLM calls go through [OpenRouter](https://openrouter.ai). | +| `LEAN_EXPLORE_DATA_DIR` | optional | Where extractions are written. Default: `/data`. | +| `LEAN_EXPLORE_PACKAGES_ROOT` | optional | Root for per-package Lean workspaces. Default: `/lean`. | +| `LEAN_EXPLORE_EMBEDDING_BATCH_SIZE` | optional | Overrides the embedding batch size. | + +See [Configuration](./configuration.md) for the full list. + +## Usage + +### Full rebuild from scratch + +```bash +export OPENROUTER_API_KEY=sk-or-... +python -m lean_explore.extract --run-doc-gen4 --fresh +``` + +- `--run-doc-gen4` runs the Lake/doc-gen4 step for every registered package. +- `--fresh` clears cached Lake dependencies, forcing the latest compatible + versions to be resolved. Use this for nightly-style updates. + +Expect this to take a while (hours on a cold machine), dominated by the +Lake build. + +### Re-run a single stage on the latest extraction + +Each stage flag accepts `--/--no-`. After a fresh `parse-docs` +creates a new timestamped directory, subsequent stages target the most +recent extraction automatically. + +```bash +# Just rebuild embeddings + indices against the latest extraction +python -m lean_explore.extract --no-parse-docs --no-informalize --embeddings --index + +# Only rebuild the FAISS/BM25 indices +python -m lean_explore.extract --index +``` + +### Smoke-test on a small slice + +```bash +python -m lean_explore.extract \ + --informalize-limit 50 \ + --embedding-limit 50 +``` + +## Selected flags + +Run `python -m lean_explore.extract --help` for the full list. The most +commonly used flags: + +| Flag | Default | Purpose | +|---|---|---| +| `--run-doc-gen4` | off | Regenerate Lean documentation with Lake + doc-gen4. | +| `--fresh` | off | Clear Lake caches before building. | +| `--parse-docs / --no-parse-docs` | on* | Parse doc-gen4 output into the database. | +| `--informalize / --no-informalize` | on* | Generate natural-language descriptions. | +| `--embeddings / --no-embeddings` | on* | Generate embeddings. | +| `--index / --no-index` | on* | Build FAISS + BM25 indices. | +| `--informalize-model` | `google/gemini-3-flash-preview` | OpenRouter model id. | +| `--informalize-max-concurrent` | `10` | Parallel informalization requests. | +| `--informalize-limit` | none | Cap informalization count (for testing). | +| `--embedding-model` | `Qwen/Qwen3-Embedding-0.6B` | Sentence-transformer model. | +| `--embedding-batch-size` | `250` | Lower this if you hit OOM. | +| `--embedding-max-seq-length` | `512` | Lower this to reduce memory. | +| `--embedding-server-url` | none | Delegate embeddings to a separate process (e.g., `http://localhost:5001`) to keep GPU memory free elsewhere. | +| `--verbose` | off | Verbose logging. | + +*When you pass any stage flag explicitly, stages you did not mention default +to `off`. + +## Output layout + +Every run writes to a new timestamped directory under +`LEAN_EXPLORE_DATA_DIR`: + +``` +/ +└── 20260127_103630/ + ├── lean_explore.db + ├── informalization_faiss.index + ├── informalization_faiss_ids_map.json + ├── bm25_ids_map.json + ├── bm25_name_raw/ + └── bm25_name_spaced/ +``` + +This is the same layout that `lean-explore data fetch` populates into the +cache directory. To make the local backend use your extraction instead of a +downloaded toolchain, construct a `SearchEngine` with `use_local_data=True`: + +```python +from lean_explore.search import SearchEngine, Service + +engine = SearchEngine(use_local_data=True) +service = Service(engine=engine) +``` + +`SearchEngine` with `use_local_data=True` resolves the latest complete +extraction directory automatically. + +## Packages extracted + +The registry lives in `src/lean_explore/extract/package_registry.py`: + +- `mathlib`: also supplies `Batteries`, `Init`, `Lean`, `Std` from its + transitive dependencies. +- `physlean` +- `flt` +- `formal-conjectures` +- `cslib` + +Each package entry specifies its git URL, version strategy (`LATEST` or +`TAGGED`), and dependencies. Adding a new package means adding a +`PackageConfig` to that registry and creating a corresponding workspace +under `lean//`. + +## See also + +- [Configuration](./configuration.md): environment variables and cache/data + paths. +- [Local Search Backend](./local-backend.md): how to consume your + extraction output with `SearchEngine`. +- [CONTRIBUTING.md](../CONTRIBUTING.md): general contributor guide. diff --git a/docs/getting-started.md b/docs/getting-started.md new file mode 100644 index 0000000..28f3827 --- /dev/null +++ b/docs/getting-started.md @@ -0,0 +1,123 @@ +# Getting Started + +This guide takes you from zero to your first search in under five minutes. +LeanExplore has two paths: + +1. **Remote API**: a small pure-Python install that talks to the hosted + search service. Best for getting started. +2. **Local backend**: runs the full search pipeline on your machine. Larger + install, no network calls after the initial data fetch. + +## Requirements + +- Python 3.10 or newer (3.11, 3.12 also supported) +- `pip` (or `uv`, `poetry`, etc.) +- For the local backend: roughly 1 GB of disk for data files and a few GB for + the first-run model download (embedding + reranker models from Hugging Face) + +## Option 1: Remote API (recommended to start) + +### 1. Install + +```bash +pip install lean-explore +``` + +This installs the CLI, the `ApiClient`, and the MCP server, roughly 50 MB +of pure-Python and C-extension dependencies. No PyTorch. + +### 2. Get an API key + +Sign up and generate a key at . Then export it: + +```bash +export LEANEXPLORE_API_KEY="your-key-here" +``` + +You can also add it to your shell profile (`~/.zshrc`, `~/.bashrc`) so it +persists between sessions. + +### 3. Run a search + +```bash +lean-explore search "prime number divisibility" +lean-explore search "List.map" --limit 10 +lean-explore search "fundamental theorem of arithmetic" --package Mathlib +``` + +The first argument is the query. It can be a Lean declaration name, a partial +name, or a natural-language description. The search engine handles both at +once; you don't need to pick a mode. + +### 4. (Optional) Run the MCP server + +If you want to give Claude, Cursor, or another MCP client access to +LeanExplore: + +```bash +lean-explore mcp serve --backend api +``` + +See [MCP Server](./mcp-server.md) for client configuration. + +## Option 2: Local backend + +Use this when you need offline search, want more control over parameters, or +want to avoid per-query network latency. + +### 1. Install with the `[local]` extra + +```bash +pip install lean-explore[local] +``` + +This adds PyTorch and `sentence-transformers`. Expect roughly 1–2 GB of +Python dependencies. + +### 2. Fetch the data + +```bash +lean-explore data fetch +``` + +This downloads the latest prebuilt database, FAISS index, and BM25 indices to +`~/.lean_explore/cache//`. The total download is on the order of +1 GB. + +### 3. Run the MCP server with the local backend + +```bash +lean-explore mcp serve --backend local +``` + +The first run will also download the embedding and reranker models from +Hugging Face (`Qwen/Qwen3-Embedding-0.6B` and `Qwen/Qwen3-Reranker-0.6B`), +caching them under `~/.cache/huggingface/`. Subsequent runs are fast. + +The `lean-explore search` CLI currently uses the remote API. For programmatic +local search, see [Local Search Backend](./local-backend.md) and use +`SearchEngine` / `Service` directly. + +## What's indexed + +LeanExplore currently covers: + +- Batteries +- CSLib +- FLT (Fermat's Last Theorem) +- FormalConjectures +- Init +- Lean (core) +- Mathlib +- PhysLean +- Std + +You can filter any search to a subset with `--package` (CLI) or the +`packages=[...]` argument (API/MCP). + +## Next steps + +- [CLI Reference](./cli.md): every command and flag. +- [MCP Server](./mcp-server.md): wire LeanExplore into Claude or Cursor. +- [API Client](./api-client.md): use `ApiClient` from Python. +- [Configuration](./configuration.md): environment variables and data layout. diff --git a/docs/local-backend.md b/docs/local-backend.md new file mode 100644 index 0000000..f8c9387 --- /dev/null +++ b/docs/local-backend.md @@ -0,0 +1,158 @@ +# Local Search Backend + +The local backend runs LeanExplore's full hybrid retrieval pipeline on your +machine. Once the data is fetched, no network calls are needed; queries +execute entirely locally. + +This page explains: + +- [When to use it](#when-to-use-it) +- [Installation and data setup](#installation-and-data-setup) +- [How hybrid retrieval works](#how-hybrid-retrieval-works) +- [Using it from Python](#using-it-from-python) + +## When to use it + +- **Offline use**: no network per query after the initial data fetch. +- **Privacy**: queries never leave your machine. +- **Tuning**: you can adjust candidate pool sizes and rerank depth. +- **Integration**: embed `SearchEngine` or `Service` directly in your + application or data pipeline. + +For quick exploration or low-setup use, prefer the +[remote API client](./api-client.md) instead. + +## Installation and data setup + +```bash +pip install lean-explore[local] +lean-explore data fetch +``` + +This adds PyTorch and `sentence-transformers`, then downloads the prebuilt +database and indices into `~/.lean_explore/cache//`. The first +search will additionally download the embedding model +(`Qwen/Qwen3-Embedding-0.6B`) and reranker model +(`Qwen/Qwen3-Reranker-0.6B`) from Hugging Face, cached under +`~/.cache/huggingface/`. + +GPU is optional but used automatically when available. + +## How hybrid retrieval works + +A query flows through five stages: + +1. **BM25 lexical search on names.** Two BM25 indices score the query against + declaration names: one using raw tokenization (entire dotted name as a + single token), one using spaced tokenization (splitting on dots, + underscores, and camelCase). This catches both exact and fuzzy name + matches (e.g., `list.map` → `List.map`). +2. **FAISS semantic search on informalizations.** Each declaration has an + AI-generated natural-language description (its *informalization*). The + query is embedded with the Qwen embedding model and the top FAISS + neighbours are retrieved. This catches "what does this thing mean" style + queries. +3. **Reciprocal Rank Fusion.** The candidate lists are merged by summing + `1/rank` across the BM25 and FAISS results, producing a single fused + ranking. +4. **Dependency boosting.** Candidates that are depended on by top-ranked + results get a small score boost, surfacing foundational lemmas. +5. **Cross-encoder reranking.** The top N candidates (default 25–50) are + rescored pairwise with the Qwen reranker, which looks at the query and + candidate text together. The final top results are returned. + +Auto-generated declarations (e.g., `.mk` constructors) are filtered from the +final list. + +Switches available on every search call: + +- `limit`: how many results to return (default 20 to 50). +- `rerank_top`: how deep to rerank. Larger is slower but more precise. + Setting it to `0` or `None` skips reranking entirely. +- `packages`: restrict to a subset of packages. + +## Using it from Python + +Two layers are available: + +- **`Service`**: the convenient wrapper. Returns a `SearchResponse` with + timing metadata. +- **`SearchEngine`**: the lower-level engine. Returns a plain + `list[SearchResult]` and exposes all retrieval knobs. + +### `Service` + +```python +import asyncio +from lean_explore.search import Service + +async def main(): + service = Service() # spins up a default SearchEngine + + response = await service.search( + query="continuous function on a compact set", + limit=10, + rerank_top=50, + packages=["Mathlib"], + ) + for result in response.results: + print(result.name, "-", result.module) + + # Look up one declaration by id + first = response.results[0] + same = await service.get_by_id(first.id) + +asyncio.run(main()) +``` + +### `SearchEngine` + +```python +import asyncio +from lean_explore.search import SearchEngine + +async def main(): + engine = SearchEngine() + + results = await engine.search( + query="List.map", + limit=25, + faiss_k=1000, # how many FAISS candidates to pull + bm25_k=1000, # how many BM25 candidates to pull + rerank_top=25, # depth of cross-encoder reranking + packages=["Mathlib", "Std"], + ) + for result in results: + print(result.name) + +asyncio.run(main()) +``` + +Key constructor options (all optional; sensible defaults): + +| Parameter | Default | Purpose | +|---|---|---| +| `db_url` | configured URL | SQLAlchemy database URL. | +| `embedding_model_name` | `Qwen/Qwen3-Embedding-0.6B` | Sentence-transformer model for queries. | +| `reranker_model_name` | `Qwen/Qwen3-Reranker-0.6B` | Cross-encoder used for final reranking. | +| `use_local_data` | `False` | `True` reads from `DATA_DIRECTORY` (extraction output). `False` reads from `CACHE_DIRECTORY` (downloaded data). | + +For the full set of parameters, see the class docstring in +`src/lean_explore/search/engine.py`. + +## Tuning tips + +- **Small queries, fast responses**: set `rerank_top=0` to skip the + cross-encoder. You lose some precision but gain significant speed. +- **Recall-heavy queries**: bump `faiss_k` and `bm25_k` to widen the + candidate pool before reranking. +- **Memory-constrained machines**: lower `LEAN_EXPLORE_EMBEDDING_BATCH_SIZE` + and `LEAN_EXPLORE_RERANKER_BATCH_SIZE`. See [Configuration](./configuration.md). + +## See also + +- [Data Models](./data-models.md) for the shape of `SearchResult`. +- [Configuration](./configuration.md) for env vars that control batch sizes + and cache paths. +- [MCP Server](./mcp-server.md): the local backend powers + `lean-explore mcp serve --backend local`. diff --git a/docs/mcp-server.md b/docs/mcp-server.md new file mode 100644 index 0000000..799350f --- /dev/null +++ b/docs/mcp-server.md @@ -0,0 +1,183 @@ +# MCP Server + +LeanExplore ships with a [Model Context Protocol](https://modelcontextprotocol.io) +server that exposes Lean search as a set of tools. Any MCP-compatible client +(Claude, Cursor, Zed, Continue, etc.) can launch it and let its model search +Mathlib and other Lean packages during a conversation. + +This page covers: + +- [Running the server](#running-the-server) +- [Connecting from MCP clients](#connecting-from-mcp-clients) +- [The tools](#the-tools) and their schemas +- [Recommended agent workflow](#recommended-agent-workflow) + +## Running the server + +The server speaks MCP over stdio; your client launches it as a subprocess. +You rarely invoke it directly except for debugging. + +### Remote API backend (default) + +```bash +lean-explore mcp serve --backend api +``` + +Requires `LEANEXPLORE_API_KEY` in the environment, or pass `--api-key`: + +```bash +lean-explore mcp serve --backend api --api-key sk-... +``` + +### Local backend + +```bash +lean-explore mcp serve --backend local +``` + +Requires `pip install lean-explore[local]` and a prior +`lean-explore data fetch`. The first run also downloads the embedding and +reranker models from Hugging Face. + +If required data files are missing, the server exits immediately with a +message pointing at `~/.lean_explore/cache//`. + +## Connecting from MCP clients + +### Claude Desktop / Claude Code + +Add an entry to your MCP settings file (location depends on client; Claude +Desktop uses `~/Library/Application Support/Claude/claude_desktop_config.json` +on macOS): + +```json +{ + "mcpServers": { + "lean-explore": { + "command": "lean-explore", + "args": ["mcp", "serve", "--backend", "api"], + "env": { + "LEANEXPLORE_API_KEY": "your-key-here" + } + } + } +} +``` + +For the local backend, swap `"api"` for `"local"` and drop the `env` block: + +```json +{ + "mcpServers": { + "lean-explore": { + "command": "lean-explore", + "args": ["mcp", "serve", "--backend", "local"] + } + } +} +``` + +### Cursor, Zed, Continue, and others + +Any client that accepts a command + args will work. Point it at the +`lean-explore` executable with the same arguments shown above. + +### Troubleshooting + +- **"API key required"**: set `LEANEXPLORE_API_KEY` in the `env` block (for + MCP clients that support it) or pass `--api-key` in `args`. +- **"Essential data files for the local backend are missing"**: run + `lean-explore data fetch` first. +- **Tools do not appear in the client**: check the client's MCP logs. The + server logs to stderr; increasing verbosity with + `python -m lean_explore.mcp.server --backend local --log-level DEBUG` + helps diagnose startup issues. + +## The tools + +The server registers eight tools. IDs returned from `search` or +`search_summary` can be passed to the per-field getters to fetch exactly the +field you need, which keeps token usage low. + +### `search_summary`: the preferred starting point + +Returns only `id`, `name`, and a short description per hit. Use this first, +then fetch details for the handful of entries you care about. + +**Parameters** + +| Name | Type | Default | Description | +|---|---|---|---| +| `query` | string | *required* | Declaration name or natural-language text. | +| `limit` | int | `10` | Maximum results to return. | +| `rerank_top` | int or null | `50` | Candidates to rerank with the cross-encoder. Set `0` or `null` to skip. Only used by the local backend. | +| `packages` | list of strings | `null` | Filter by package name (e.g., `["Mathlib", "Std"]`). | + +**Returns** + +```json +{ + "query": "prime number divisibility", + "results": [ + {"id": 12345, "name": "Nat.Prime.dvd_mul", "description": "Divisibility of a product by a prime"} + ], + "count": 1, + "processing_time_ms": 84 +} +``` + +### `search`: full results + +Same parameters as `search_summary`, but each result includes every field: +`id`, `name`, `module`, `docstring`, `source_text`, `source_link`, +`dependencies`, and `informalization`. Use this when you genuinely need all +fields at once. For large `limit` values, prefer `search_summary` plus the +per-field getters. + +### Per-field getters + +All six take a single parameter `declaration_id` (integer) and return `null` +if the id does not exist. The id comes from a prior `search` or +`search_summary` result. + +| Tool | Returns | +|---|---| +| `get_source_code` | `{id, name, source_text}`: the Lean source. | +| `get_source_link` | `{id, name, source_link}`: GitHub URL to the source. | +| `get_docstring` | `{id, name, docstring}`: the doc comment (may be null). | +| `get_description` | `{id, name, informalization}`: AI-generated natural-language description. | +| `get_module` | `{id, name, module}`: module path (e.g., `Mathlib.Data.List.Basic`). | +| `get_dependencies` | `{id, name, dependencies}`: JSON array of declaration names this depends on. | + +## Recommended agent workflow + +The server's built-in instructions ask models to follow this pattern: + +1. **Browse with `search_summary`** for low token cost. +2. **Fetch only what you need** with the per-field getters (`get_source_code`, + `get_docstring`, `get_description`, `get_module`, `get_dependencies`, + `get_source_link`). +3. **Use `search`** only when you genuinely need all fields for every result + at once. + +This keeps context short while still giving the model access to full source +code and dependency chains when it matters. + +## Query styles + +Both styles work through the same endpoints; the hybrid retriever decides +internally which signal applies: + +- **By name**: `List.map`, `Nat.Prime`, `CategoryTheory.Functor.map`. +- **By meaning**: `"continuous function on a compact set"`, `"sum of a + geometric series"`, `"a group homomorphism preserving multiplication"`. + +You do not need to specify which mode you want. + +## See also + +- [CLI Reference](./cli.md) for `lean-explore mcp serve` options. +- [Local Search Backend](./local-backend.md) for how hybrid retrieval works + under the hood. +- [Configuration](./configuration.md) for environment variables and cache + paths.