Skip to content

Support doc-gen4 SQLite database format for extraction#75

Closed
justincasher wants to merge 22 commits into
mainfrom
support-docgen4-sqlite-format
Closed

Support doc-gen4 SQLite database format for extraction#75
justincasher wants to merge 22 commits into
mainfrom
support-docgen4-sqlite-format

Conversation

@justincasher
Copy link
Copy Markdown
Owner

@justincasher justincasher commented Mar 22, 2026

Summary

  • Doc-gen4 >= v4.29.0-rc2 (PR #347) switched from outputting individual BMP JSON files to a SQLite database (api-docs.db)
  • The extraction pipeline now auto-detects which format each package uses and parses accordingly
  • For packages with api-docs.db, the workspace cache clear is skipped during fresh builds since the SQLite DB handles incremental updates correctly
  • Currently affects mathlib and cslib (v4.29.0-rc6); physlean, flt, and formal-conjectures still use the legacy BMP format (v4.28.0 / v4.27.0)

Doc-gen4 >= v4.29.0-rc1 outputs declaration data to a SQLite database
(api-docs.db) instead of individual BMP JSON files. The extraction
pipeline now auto-detects which format each package uses and parses
accordingly.

For packages using the new SQLite format, the workspace cache clear
is skipped during fresh builds since api-docs.db handles incremental
updates correctly. The full cache clear is still performed for packages
using the legacy BMP format to avoid stale output files.
@justincasher justincasher force-pushed the support-docgen4-sqlite-format branch from 03fcee4 to 444c0be Compare April 4, 2026 17:54
The check `kind == "constructor"` never matched because doc-gen4
stores the kind as "ctor" in the name_info table. Removing it rather
than fixing the string keeps behavior consistent with the BMP parser,
which also only filters by the .mk suffix.
Convert remaining f-string logger calls to %-style formatting to
match the convention used throughout the rest of the PR.
These counters were tracked but never printed, making it harder to
debug extraction issues. Now logged alongside the existing
skipped_no_source and source_errors messages.
The parameter was documented but never used in the function body.
The SQLite extraction path was hard-coding dependencies=None for every
declaration, silently breaking dependency-based search ranking and
informalization layering for mathlib/cslib.

Add a RenderedCode BLOB parser that deserializes the leansqlite binary
format used in name_info.type to extract const references (declaration
dependencies) from type signatures. The parser handles the full
TaggedText/RenderedCode.Tag/Name encoding including variable-length Nat
and nested structures.

Also adds the n.type column to the SQLite query so the BLOB is available
for parsing.
The fallback source links for Init/Lean/Std/Lake modules used a
'toolchain' placeholder as the git ref, producing non-functional GitHub
URLs. Doc-gen4 itself uses the actual Lean compiler hash for the same
purpose.

Read the lean-toolchain file from each workspace to get the real version
tag (e.g., v4.29.0-rc6) and use it as the git ref. Falls back to
'master' when the toolchain file is unavailable.
The SQLite query only joined declaration_markdown_docstrings, silently
dropping docstrings stored in Verso format (declaration_verso_docstrings).
Doc-gen4 stores each docstring as either markdown text or a Verso binary
BLOB, never both.

Add a LEFT JOIN on declaration_verso_docstrings to detect Verso-only
cases. Since Verso BLOBs use a complex nested serialization format
(Doc.Block/Doc.Inline trees with opaque Dynamic values), full
deserialization is deferred. For now, Verso-only docstrings are counted
and logged as a warning so the gap is visible rather than silent.
_detect_docgen_format previously checked only file existence, which
would misclassify a zero-byte, corrupt, or incomplete api-docs.db as
usable. Now _validate_docgen_sqlite verifies the file is non-empty,
opens as valid SQLite, and contains the required tables (name_info,
declaration_ranges, modules). On failure the detector falls back to
BMP if available, or returns "none".
Move the render=1 check from a Python-side continue into a WHERE
clause on the SQLite query. This avoids fetching and iterating rows
that will be immediately discarded, reducing work for large packages
like mathlib.
The BMP parser crashed with FileNotFoundError when a source file
referenced in a declaration's sourceLink was not present locally
(e.g. after a package rename). The SQLite parser already handled
this with a try/except. Apply the same pattern to the BMP parser:
skip the declaration, count the error, and log a warning at the end.
Pinning doc-gen4 to specific release tags (e.g. @ "v4.30.0-rc1")
caused transitive dependency conflicts with packages like plausible,
breaking lake exe cache get. Using @ "main" lets Lake resolve a
compatible version automatically since doc-gen4 main tracks the
latest Lean toolchain.

Also reorder require statements so doc-gen4 comes before the main
package dependency, ensuring the main package's dependency versions
take precedence during resolution.
Refresh lake-manifest.json and lean-toolchain files across all
package workspaces after switching doc-gen4 to main branch.
Pinning doc-gen4 to @ "main" breaks packages on older toolchains
(e.g. PhysLean at v4.29.0 vs doc-gen4 main at v4.30.0-rc1) due to
incompatible olean headers. Restore version-matched pinning so each
package gets a doc-gen4 compatible with its Lean toolchain. The
require ordering fix (doc-gen4 first, main package last) is kept to
avoid transitive dependency conflicts.
The cache-skip optimization for SQLite-format packages assumed oleans
were always compatible. When the toolchain version changes between
runs (e.g. from v4.30.0-rc1 to v4.29.0), stale oleans cause
"incompatible header" build failures. Now the old toolchain is read
before setup, and if it differs from the new one, the cache is
cleared regardless of SQLite support.
The toolchain change detection was an overcorrection for a one-off
issue caused by the @ "main" experiment. Lake handles toolchain
changes and olean staleness itself — the cache-skip logic only needs
to distinguish SQLite (incremental) from BMP (stale files) format.
Doc-gen4 can fail on individual modules (e.g. timeout computing
equational lemmata) without invalidating the rest of the output.
The explicit library build step should tolerate these failures like
the :docs step already does, rather than aborting the entire pipeline.
The :docs facet transitively depends on :docInfo, which depends on
the library oleans. Running lake build <lib> separately before
lake build <lib>:docs causes Lake to process the doc-gen4 analysis
twice — once during the explicit library build, then again as a
replay during the :docs step. For mathlib this doubled the wall
time (21K jobs instead of ~10K).

The PhysLean issue this step was meant to fix is already resolved
by the correct imports in PhysExtract.lean; the :docs target alone
now builds everything it needs transitively.
@justincasher
Copy link
Copy Markdown
Owner Author

Superseded by the split:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant