Support doc-gen4 SQLite database format for extraction by justincasher · Pull Request #75 · justincasher/lean-explore

justincasher · 2026-03-22T01:07:00Z

Summary

Doc-gen4 >= v4.29.0-rc2 (PR #347) switched from outputting individual BMP JSON files to a SQLite database (api-docs.db)
The extraction pipeline now auto-detects which format each package uses and parses accordingly
For packages with api-docs.db, the workspace cache clear is skipped during fresh builds since the SQLite DB handles incremental updates correctly
Currently affects mathlib and cslib (v4.29.0-rc6); physlean, flt, and formal-conjectures still use the legacy BMP format (v4.28.0 / v4.27.0)

Doc-gen4 >= v4.29.0-rc1 outputs declaration data to a SQLite database (api-docs.db) instead of individual BMP JSON files. The extraction pipeline now auto-detects which format each package uses and parses accordingly. For packages using the new SQLite format, the workspace cache clear is skipped during fresh builds since api-docs.db handles incremental updates correctly. The full cache clear is still performed for packages using the legacy BMP format to avoid stale output files.

The check `kind == "constructor"` never matched because doc-gen4 stores the kind as "ctor" in the name_info table. Removing it rather than fixing the string keeps behavior consistent with the BMP parser, which also only filters by the .mk suffix.

Convert remaining f-string logger calls to %-style formatting to match the convention used throughout the rest of the PR.

These counters were tracked but never printed, making it harder to debug extraction issues. Now logged alongside the existing skipped_no_source and source_errors messages.

The parameter was documented but never used in the function body.

The SQLite extraction path was hard-coding dependencies=None for every declaration, silently breaking dependency-based search ranking and informalization layering for mathlib/cslib. Add a RenderedCode BLOB parser that deserializes the leansqlite binary format used in name_info.type to extract const references (declaration dependencies) from type signatures. The parser handles the full TaggedText/RenderedCode.Tag/Name encoding including variable-length Nat and nested structures. Also adds the n.type column to the SQLite query so the BLOB is available for parsing.

The fallback source links for Init/Lean/Std/Lake modules used a 'toolchain' placeholder as the git ref, producing non-functional GitHub URLs. Doc-gen4 itself uses the actual Lean compiler hash for the same purpose. Read the lean-toolchain file from each workspace to get the real version tag (e.g., v4.29.0-rc6) and use it as the git ref. Falls back to 'master' when the toolchain file is unavailable.

The SQLite query only joined declaration_markdown_docstrings, silently dropping docstrings stored in Verso format (declaration_verso_docstrings). Doc-gen4 stores each docstring as either markdown text or a Verso binary BLOB, never both. Add a LEFT JOIN on declaration_verso_docstrings to detect Verso-only cases. Since Verso BLOBs use a complex nested serialization format (Doc.Block/Doc.Inline trees with opaque Dynamic values), full deserialization is deferred. For now, Verso-only docstrings are counted and logged as a warning so the gap is visible rather than silent.

_detect_docgen_format previously checked only file existence, which would misclassify a zero-byte, corrupt, or incomplete api-docs.db as usable. Now _validate_docgen_sqlite verifies the file is non-empty, opens as valid SQLite, and contains the required tables (name_info, declaration_ranges, modules). On failure the detector falls back to BMP if available, or returns "none".

Move the render=1 check from a Python-side continue into a WHERE clause on the SQLite query. This avoids fetching and iterating rows that will be immediately discarded, reducing work for large packages like mathlib.

The BMP parser crashed with FileNotFoundError when a source file referenced in a declaration's sourceLink was not present locally (e.g. after a package rename). The SQLite parser already handled this with a try/except. Apply the same pattern to the BMP parser: skip the declaration, count the error, and log a warning at the end.

Pinning doc-gen4 to specific release tags (e.g. @ "v4.30.0-rc1") caused transitive dependency conflicts with packages like plausible, breaking lake exe cache get. Using @ "main" lets Lake resolve a compatible version automatically since doc-gen4 main tracks the latest Lean toolchain. Also reorder require statements so doc-gen4 comes before the main package dependency, ensuring the main package's dependency versions take precedence during resolution.

Refresh lake-manifest.json and lean-toolchain files across all package workspaces after switching doc-gen4 to main branch.

Pinning doc-gen4 to @ "main" breaks packages on older toolchains (e.g. PhysLean at v4.29.0 vs doc-gen4 main at v4.30.0-rc1) due to incompatible olean headers. Restore version-matched pinning so each package gets a doc-gen4 compatible with its Lean toolchain. The require ordering fix (doc-gen4 first, main package last) is kept to avoid transitive dependency conflicts.

The cache-skip optimization for SQLite-format packages assumed oleans were always compatible. When the toolchain version changes between runs (e.g. from v4.30.0-rc1 to v4.29.0), stale oleans cause "incompatible header" build failures. Now the old toolchain is read before setup, and if it differs from the new one, the cache is cleared regardless of SQLite support.

The toolchain change detection was an overcorrection for a one-off issue caused by the @ "main" experiment. Lake handles toolchain changes and olean staleness itself — the cache-skip logic only needs to distinguish SQLite (incremental) from BMP (stale files) format.

Doc-gen4 can fail on individual modules (e.g. timeout computing equational lemmata) without invalidating the rest of the output. The explicit library build step should tolerate these failures like the :docs step already does, rather than aborting the entire pipeline.

The :docs facet transitively depends on :docInfo, which depends on the library oleans. Running lake build <lib> separately before lake build <lib>:docs causes Lake to process the doc-gen4 analysis twice — once during the explicit library build, then again as a replay during the :docs step. For mathlib this doubled the wall time (21K jobs instead of ~10K). The PhysLean issue this step was meant to fix is already resolved by the correct imports in PhysExtract.lean; the :docs target alone now builds everything it needs transitively.

justincasher · 2026-04-19T15:06:34Z

Superseded by the split:

Bump extractor workspaces for doc-gen4 SQLite support #79 (merged): lean/ workspace bumps + fresh lake update manifests
Support doc-gen4 SQLite database format in extractor #80: Python/scripts/tests for doc-gen4 SQLite support, rebased onto main

justincasher force-pushed the support-docgen4-sqlite-format branch from 03fcee4 to 444c0be Compare April 4, 2026 17:54

justincasher added 21 commits April 4, 2026 15:12

Use %-style logging consistently in new SQLite code

a042661

Convert remaining f-string logger calls to %-style formatting to match the convention used throughout the rest of the PR.

Log skipped_prefix and skipped_constructor counters

aaae8bb

These counters were tracked but never printed, making it harder to debug extraction issues. Now logged alongside the existing skipped_no_source and source_errors messages.

Remove unused module_name parameter from _construct_source_link

af03b63

The parameter was documented but never used in the function body.

Fix doc data script lint

e1d2e16

Fix doc-gen4 version handling and tests

37be254

Correct doc-gen4 SQLite cutoff references

3148f47

Filter non-rendered declarations in SQL instead of Python

03ff5d0

Move the render=1 check from a Python-side continue into a WHERE clause on the SQLite query. This avoids fetching and iterating rows that will be immediately discarded, reducing work for large packages like mathlib.

Update workspace toolchains and manifests

16b7474

Refresh lake-manifest.json and lean-toolchain files across all package workspaces after switching doc-gen4 to main branch.

Fix PhysLean extractor build targets

c89bae3

This was referenced Apr 19, 2026

Bump extractor workspaces for doc-gen4 SQLite support #79

Merged

Support doc-gen4 SQLite database format in extractor #80

Merged

justincasher closed this Apr 19, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support doc-gen4 SQLite database format for extraction#75

Support doc-gen4 SQLite database format for extraction#75
justincasher wants to merge 22 commits into
mainfrom
support-docgen4-sqlite-format

justincasher commented Mar 22, 2026 •

edited

Loading

Uh oh!

justincasher commented Apr 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

justincasher commented Mar 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Uh oh!

justincasher commented Apr 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

justincasher commented Mar 22, 2026 •

edited

Loading