Support doc-gen4 SQLite database format for extraction#75
Closed
justincasher wants to merge 22 commits into
Closed
Support doc-gen4 SQLite database format for extraction#75justincasher wants to merge 22 commits into
justincasher wants to merge 22 commits into
Conversation
Doc-gen4 >= v4.29.0-rc1 outputs declaration data to a SQLite database (api-docs.db) instead of individual BMP JSON files. The extraction pipeline now auto-detects which format each package uses and parses accordingly. For packages using the new SQLite format, the workspace cache clear is skipped during fresh builds since api-docs.db handles incremental updates correctly. The full cache clear is still performed for packages using the legacy BMP format to avoid stale output files.
03fcee4 to
444c0be
Compare
The check `kind == "constructor"` never matched because doc-gen4 stores the kind as "ctor" in the name_info table. Removing it rather than fixing the string keeps behavior consistent with the BMP parser, which also only filters by the .mk suffix.
Convert remaining f-string logger calls to %-style formatting to match the convention used throughout the rest of the PR.
These counters were tracked but never printed, making it harder to debug extraction issues. Now logged alongside the existing skipped_no_source and source_errors messages.
The parameter was documented but never used in the function body.
The SQLite extraction path was hard-coding dependencies=None for every declaration, silently breaking dependency-based search ranking and informalization layering for mathlib/cslib. Add a RenderedCode BLOB parser that deserializes the leansqlite binary format used in name_info.type to extract const references (declaration dependencies) from type signatures. The parser handles the full TaggedText/RenderedCode.Tag/Name encoding including variable-length Nat and nested structures. Also adds the n.type column to the SQLite query so the BLOB is available for parsing.
The fallback source links for Init/Lean/Std/Lake modules used a 'toolchain' placeholder as the git ref, producing non-functional GitHub URLs. Doc-gen4 itself uses the actual Lean compiler hash for the same purpose. Read the lean-toolchain file from each workspace to get the real version tag (e.g., v4.29.0-rc6) and use it as the git ref. Falls back to 'master' when the toolchain file is unavailable.
The SQLite query only joined declaration_markdown_docstrings, silently dropping docstrings stored in Verso format (declaration_verso_docstrings). Doc-gen4 stores each docstring as either markdown text or a Verso binary BLOB, never both. Add a LEFT JOIN on declaration_verso_docstrings to detect Verso-only cases. Since Verso BLOBs use a complex nested serialization format (Doc.Block/Doc.Inline trees with opaque Dynamic values), full deserialization is deferred. For now, Verso-only docstrings are counted and logged as a warning so the gap is visible rather than silent.
_detect_docgen_format previously checked only file existence, which would misclassify a zero-byte, corrupt, or incomplete api-docs.db as usable. Now _validate_docgen_sqlite verifies the file is non-empty, opens as valid SQLite, and contains the required tables (name_info, declaration_ranges, modules). On failure the detector falls back to BMP if available, or returns "none".
Move the render=1 check from a Python-side continue into a WHERE clause on the SQLite query. This avoids fetching and iterating rows that will be immediately discarded, reducing work for large packages like mathlib.
The BMP parser crashed with FileNotFoundError when a source file referenced in a declaration's sourceLink was not present locally (e.g. after a package rename). The SQLite parser already handled this with a try/except. Apply the same pattern to the BMP parser: skip the declaration, count the error, and log a warning at the end.
Pinning doc-gen4 to specific release tags (e.g. @ "v4.30.0-rc1") caused transitive dependency conflicts with packages like plausible, breaking lake exe cache get. Using @ "main" lets Lake resolve a compatible version automatically since doc-gen4 main tracks the latest Lean toolchain. Also reorder require statements so doc-gen4 comes before the main package dependency, ensuring the main package's dependency versions take precedence during resolution.
Refresh lake-manifest.json and lean-toolchain files across all package workspaces after switching doc-gen4 to main branch.
Pinning doc-gen4 to @ "main" breaks packages on older toolchains (e.g. PhysLean at v4.29.0 vs doc-gen4 main at v4.30.0-rc1) due to incompatible olean headers. Restore version-matched pinning so each package gets a doc-gen4 compatible with its Lean toolchain. The require ordering fix (doc-gen4 first, main package last) is kept to avoid transitive dependency conflicts.
The cache-skip optimization for SQLite-format packages assumed oleans were always compatible. When the toolchain version changes between runs (e.g. from v4.30.0-rc1 to v4.29.0), stale oleans cause "incompatible header" build failures. Now the old toolchain is read before setup, and if it differs from the new one, the cache is cleared regardless of SQLite support.
The toolchain change detection was an overcorrection for a one-off issue caused by the @ "main" experiment. Lake handles toolchain changes and olean staleness itself — the cache-skip logic only needs to distinguish SQLite (incremental) from BMP (stale files) format.
Doc-gen4 can fail on individual modules (e.g. timeout computing equational lemmata) without invalidating the rest of the output. The explicit library build step should tolerate these failures like the :docs step already does, rather than aborting the entire pipeline.
The :docs facet transitively depends on :docInfo, which depends on the library oleans. Running lake build <lib> separately before lake build <lib>:docs causes Lake to process the doc-gen4 analysis twice — once during the explicit library build, then again as a replay during the :docs step. For mathlib this doubled the wall time (21K jobs instead of ~10K). The PhysLean issue this step was meant to fix is already resolved by the correct imports in PhysExtract.lean; the :docs target alone now builds everything it needs transitively.
This was referenced Apr 19, 2026
Owner
Author
|
Superseded by the split:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
api-docs.db)api-docs.db, the workspace cache clear is skipped during fresh builds since the SQLite DB handles incremental updates correctly