Skip to content

Support doc-gen4 SQLite database format in extractor#80

Merged
justincasher merged 22 commits into
mainfrom
docgen4-sqlite-python
Apr 20, 2026
Merged

Support doc-gen4 SQLite database format in extractor#80
justincasher merged 22 commits into
mainfrom
docgen4-sqlite-python

Conversation

@justincasher
Copy link
Copy Markdown
Owner

Summary

  • Doc-gen4 >= v4.29.0-rc2 (leanprover/doc-gen4#347) switched from per-declaration BMP JSON files to a single SQLite database (api-docs.db). The extractor now auto-detects which format each package emits and parses accordingly.
  • Adds a SQLite parser that reads name_info, declaration_ranges, modules, and the markdown/Verso docstring tables; deserializes the leansqlite RenderedCode BLOB to recover const references for dependency tracking; and validates api-docs.db (non-empty, required tables present) before trusting the new format.
  • For SQLite-format packages the workspace cache clear is skipped during fresh builds since the DB handles incremental updates correctly. BMP packages still get the full clear to avoid stale output files.
  • Source links for core modules (Init/Lean/Std/Lake) now use the real toolchain tag from each workspace's lean-toolchain instead of a placeholder. BMP parser now skips declarations whose source files are missing locally instead of crashing.
  • Build pipeline: :docs transitively builds the library, so the redundant explicit library build step is removed; lake build <lib>:docs is allowed to tolerate per-module failures the way doc-gen4 itself does.
  • Splits off of Support doc-gen4 SQLite database format for extraction #75; the lean/ workspace bumps landed separately in Bump extractor workspaces for doc-gen4 SQLite support #79.

Doc-gen4 >= v4.29.0-rc1 outputs declaration data to a SQLite database
(api-docs.db) instead of individual BMP JSON files. The extraction
pipeline now auto-detects which format each package uses and parses
accordingly.

For packages using the new SQLite format, the workspace cache clear
is skipped during fresh builds since api-docs.db handles incremental
updates correctly. The full cache clear is still performed for packages
using the legacy BMP format to avoid stale output files.
The check `kind == "constructor"` never matched because doc-gen4
stores the kind as "ctor" in the name_info table. Removing it rather
than fixing the string keeps behavior consistent with the BMP parser,
which also only filters by the .mk suffix.
Convert remaining f-string logger calls to %-style formatting to
match the convention used throughout the rest of the PR.
These counters were tracked but never printed, making it harder to
debug extraction issues. Now logged alongside the existing
skipped_no_source and source_errors messages.
The parameter was documented but never used in the function body.
The SQLite extraction path was hard-coding dependencies=None for every
declaration, silently breaking dependency-based search ranking and
informalization layering for mathlib/cslib.

Add a RenderedCode BLOB parser that deserializes the leansqlite binary
format used in name_info.type to extract const references (declaration
dependencies) from type signatures. The parser handles the full
TaggedText/RenderedCode.Tag/Name encoding including variable-length Nat
and nested structures.

Also adds the n.type column to the SQLite query so the BLOB is available
for parsing.
The fallback source links for Init/Lean/Std/Lake modules used a
'toolchain' placeholder as the git ref, producing non-functional GitHub
URLs. Doc-gen4 itself uses the actual Lean compiler hash for the same
purpose.

Read the lean-toolchain file from each workspace to get the real version
tag (e.g., v4.29.0-rc6) and use it as the git ref. Falls back to
'master' when the toolchain file is unavailable.
The SQLite query only joined declaration_markdown_docstrings, silently
dropping docstrings stored in Verso format (declaration_verso_docstrings).
Doc-gen4 stores each docstring as either markdown text or a Verso binary
BLOB, never both.

Add a LEFT JOIN on declaration_verso_docstrings to detect Verso-only
cases. Since Verso BLOBs use a complex nested serialization format
(Doc.Block/Doc.Inline trees with opaque Dynamic values), full
deserialization is deferred. For now, Verso-only docstrings are counted
and logged as a warning so the gap is visible rather than silent.
_detect_docgen_format previously checked only file existence, which
would misclassify a zero-byte, corrupt, or incomplete api-docs.db as
usable. Now _validate_docgen_sqlite verifies the file is non-empty,
opens as valid SQLite, and contains the required tables (name_info,
declaration_ranges, modules). On failure the detector falls back to
BMP if available, or returns "none".
Move the render=1 check from a Python-side continue into a WHERE
clause on the SQLite query. This avoids fetching and iterating rows
that will be immediately discarded, reducing work for large packages
like mathlib.
The BMP parser crashed with FileNotFoundError when a source file
referenced in a declaration's sourceLink was not present locally
(e.g. after a package rename). The SQLite parser already handled
this with a try/except. Apply the same pattern to the BMP parser:
skip the declaration, count the error, and log a warning at the end.
Pinning doc-gen4 to specific release tags (e.g. @ "v4.30.0-rc1")
caused transitive dependency conflicts with packages like plausible,
breaking lake exe cache get. Using @ "main" lets Lake resolve a
compatible version automatically since doc-gen4 main tracks the
latest Lean toolchain.

Also reorder require statements so doc-gen4 comes before the main
package dependency, ensuring the main package's dependency versions
take precedence during resolution.
Pinning doc-gen4 to @ "main" breaks packages on older toolchains
(e.g. PhysLean at v4.29.0 vs doc-gen4 main at v4.30.0-rc1) due to
incompatible olean headers. Restore version-matched pinning so each
package gets a doc-gen4 compatible with its Lean toolchain. The
require ordering fix (doc-gen4 first, main package last) is kept to
avoid transitive dependency conflicts.
The cache-skip optimization for SQLite-format packages assumed oleans
were always compatible. When the toolchain version changes between
runs (e.g. from v4.30.0-rc1 to v4.29.0), stale oleans cause
"incompatible header" build failures. Now the old toolchain is read
before setup, and if it differs from the new one, the cache is
cleared regardless of SQLite support.
The toolchain change detection was an overcorrection for a one-off
issue caused by the @ "main" experiment. Lake handles toolchain
changes and olean staleness itself — the cache-skip logic only needs
to distinguish SQLite (incremental) from BMP (stale files) format.
Doc-gen4 can fail on individual modules (e.g. timeout computing
equational lemmata) without invalidating the rest of the output.
The explicit library build step should tolerate these failures like
the :docs step already does, rather than aborting the entire pipeline.
The :docs facet transitively depends on :docInfo, which depends on
the library oleans. Running lake build <lib> separately before
lake build <lib>:docs causes Lake to process the doc-gen4 analysis
twice — once during the explicit library build, then again as a
replay during the :docs step. For mathlib this doubled the wall
time (21K jobs instead of ~10K).

The PhysLean issue this step was meant to fix is already resolved
by the correct imports in PhysExtract.lean; the :docs target alone
now builds everything it needs transitively.
The :docs facet populates api-docs.db via :docInfo, then runs
doc-gen4 fromDb to generate every HTML page and the search index.
We read api-docs.db directly and never consume the HTML, so the
fromDb pass is wasted work — on Mathlib that's a second full scan
over thousands of modules per rebuild.

:docInfo stops after the database is populated, which is all the
extractor needs. It was added alongside the SQLite output in
doc-gen4 #347 (v4.29.0-rc2), so fall back to :docs on older
toolchains where the facet doesn't exist.
@justincasher justincasher merged commit 9da7566 into main Apr 20, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant