Support doc-gen4 SQLite database format in extractor by justincasher · Pull Request #80 · justincasher/lean-explore

justincasher · 2026-04-19T15:05:00Z

Summary

Doc-gen4 >= v4.29.0-rc2 (leanprover/doc-gen4#347) switched from per-declaration BMP JSON files to a single SQLite database (api-docs.db). The extractor now auto-detects which format each package emits and parses accordingly.
Adds a SQLite parser that reads name_info, declaration_ranges, modules, and the markdown/Verso docstring tables; deserializes the leansqlite RenderedCode BLOB to recover const references for dependency tracking; and validates api-docs.db (non-empty, required tables present) before trusting the new format.
For SQLite-format packages the workspace cache clear is skipped during fresh builds since the DB handles incremental updates correctly. BMP packages still get the full clear to avoid stale output files.
Source links for core modules (Init/Lean/Std/Lake) now use the real toolchain tag from each workspace's lean-toolchain instead of a placeholder. BMP parser now skips declarations whose source files are missing locally instead of crashing.
Build pipeline: :docs transitively builds the library, so the redundant explicit library build step is removed; lake build <lib>:docs is allowed to tolerate per-module failures the way doc-gen4 itself does.
Splits off of Support doc-gen4 SQLite database format for extraction #75; the lean/ workspace bumps landed separately in Bump extractor workspaces for doc-gen4 SQLite support #79.

Doc-gen4 >= v4.29.0-rc1 outputs declaration data to a SQLite database (api-docs.db) instead of individual BMP JSON files. The extraction pipeline now auto-detects which format each package uses and parses accordingly. For packages using the new SQLite format, the workspace cache clear is skipped during fresh builds since api-docs.db handles incremental updates correctly. The full cache clear is still performed for packages using the legacy BMP format to avoid stale output files.

The check `kind == "constructor"` never matched because doc-gen4 stores the kind as "ctor" in the name_info table. Removing it rather than fixing the string keeps behavior consistent with the BMP parser, which also only filters by the .mk suffix.

Convert remaining f-string logger calls to %-style formatting to match the convention used throughout the rest of the PR.

These counters were tracked but never printed, making it harder to debug extraction issues. Now logged alongside the existing skipped_no_source and source_errors messages.

The parameter was documented but never used in the function body.

The SQLite extraction path was hard-coding dependencies=None for every declaration, silently breaking dependency-based search ranking and informalization layering for mathlib/cslib. Add a RenderedCode BLOB parser that deserializes the leansqlite binary format used in name_info.type to extract const references (declaration dependencies) from type signatures. The parser handles the full TaggedText/RenderedCode.Tag/Name encoding including variable-length Nat and nested structures. Also adds the n.type column to the SQLite query so the BLOB is available for parsing.

The fallback source links for Init/Lean/Std/Lake modules used a 'toolchain' placeholder as the git ref, producing non-functional GitHub URLs. Doc-gen4 itself uses the actual Lean compiler hash for the same purpose. Read the lean-toolchain file from each workspace to get the real version tag (e.g., v4.29.0-rc6) and use it as the git ref. Falls back to 'master' when the toolchain file is unavailable.

The SQLite query only joined declaration_markdown_docstrings, silently dropping docstrings stored in Verso format (declaration_verso_docstrings). Doc-gen4 stores each docstring as either markdown text or a Verso binary BLOB, never both. Add a LEFT JOIN on declaration_verso_docstrings to detect Verso-only cases. Since Verso BLOBs use a complex nested serialization format (Doc.Block/Doc.Inline trees with opaque Dynamic values), full deserialization is deferred. For now, Verso-only docstrings are counted and logged as a warning so the gap is visible rather than silent.

_detect_docgen_format previously checked only file existence, which would misclassify a zero-byte, corrupt, or incomplete api-docs.db as usable. Now _validate_docgen_sqlite verifies the file is non-empty, opens as valid SQLite, and contains the required tables (name_info, declaration_ranges, modules). On failure the detector falls back to BMP if available, or returns "none".

Move the render=1 check from a Python-side continue into a WHERE clause on the SQLite query. This avoids fetching and iterating rows that will be immediately discarded, reducing work for large packages like mathlib.

The BMP parser crashed with FileNotFoundError when a source file referenced in a declaration's sourceLink was not present locally (e.g. after a package rename). The SQLite parser already handled this with a try/except. Apply the same pattern to the BMP parser: skip the declaration, count the error, and log a warning at the end.

Pinning doc-gen4 to specific release tags (e.g. @ "v4.30.0-rc1") caused transitive dependency conflicts with packages like plausible, breaking lake exe cache get. Using @ "main" lets Lake resolve a compatible version automatically since doc-gen4 main tracks the latest Lean toolchain. Also reorder require statements so doc-gen4 comes before the main package dependency, ensuring the main package's dependency versions take precedence during resolution.

Pinning doc-gen4 to @ "main" breaks packages on older toolchains (e.g. PhysLean at v4.29.0 vs doc-gen4 main at v4.30.0-rc1) due to incompatible olean headers. Restore version-matched pinning so each package gets a doc-gen4 compatible with its Lean toolchain. The require ordering fix (doc-gen4 first, main package last) is kept to avoid transitive dependency conflicts.

The cache-skip optimization for SQLite-format packages assumed oleans were always compatible. When the toolchain version changes between runs (e.g. from v4.30.0-rc1 to v4.29.0), stale oleans cause "incompatible header" build failures. Now the old toolchain is read before setup, and if it differs from the new one, the cache is cleared regardless of SQLite support.

The toolchain change detection was an overcorrection for a one-off issue caused by the @ "main" experiment. Lake handles toolchain changes and olean staleness itself — the cache-skip logic only needs to distinguish SQLite (incremental) from BMP (stale files) format.

Doc-gen4 can fail on individual modules (e.g. timeout computing equational lemmata) without invalidating the rest of the output. The explicit library build step should tolerate these failures like the :docs step already does, rather than aborting the entire pipeline.

The :docs facet transitively depends on :docInfo, which depends on the library oleans. Running lake build <lib> separately before lake build <lib>:docs causes Lake to process the doc-gen4 analysis twice — once during the explicit library build, then again as a replay during the :docs step. For mathlib this doubled the wall time (21K jobs instead of ~10K). The PhysLean issue this step was meant to fix is already resolved by the correct imports in PhysExtract.lean; the :docs target alone now builds everything it needs transitively.

The :docs facet populates api-docs.db via :docInfo, then runs doc-gen4 fromDb to generate every HTML page and the search index. We read api-docs.db directly and never consume the HTML, so the fromDb pass is wasted work — on Mathlib that's a second full scan over thousands of modules per rebuild. :docInfo stops after the database is populated, which is all the extractor needs. It was added alongside the SQLite output in doc-gen4 #347 (v4.29.0-rc2), so fall back to :docs on older toolchains where the facet doesn't exist.

justincasher added 21 commits April 19, 2026 11:01

Use %-style logging consistently in new SQLite code

d2c7882

Convert remaining f-string logger calls to %-style formatting to match the convention used throughout the rest of the PR.

Log skipped_prefix and skipped_constructor counters

9360b49

These counters were tracked but never printed, making it harder to debug extraction issues. Now logged alongside the existing skipped_no_source and source_errors messages.

Remove unused module_name parameter from _construct_source_link

6934848

The parameter was documented but never used in the function body.

Fix doc data script lint

a2ba5c5

Fix doc-gen4 version handling and tests

e25c77d

Correct doc-gen4 SQLite cutoff references

8224f5d

Filter non-rendered declarations in SQL instead of Python

60fa48a

Move the render=1 check from a Python-side continue into a WHERE clause on the SQLite query. This avoids fetching and iterating rows that will be immediately discarded, reducing work for large packages like mathlib.

Fix PhysLean extractor build targets

4d19e47

justincasher mentioned this pull request Apr 19, 2026

Support doc-gen4 SQLite database format for extraction #75

Closed

justincasher merged commit 9da7566 into main Apr 20, 2026
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support doc-gen4 SQLite database format in extractor#80

Support doc-gen4 SQLite database format in extractor#80
justincasher merged 22 commits into
mainfrom
docgen4-sqlite-python

justincasher commented Apr 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

justincasher commented Apr 19, 2026

Summary

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant