PDF pipeline round 2: pool memory halved, KV-cache TableFormer export, 17× SIMD page downscale #27
Merged
Each pool worker owned a full TableFormer (encoder+decoder+bbox, ~0.4 GB of weights and arenas), multiplying that footprint by the pool size even though tables appear on a minority of pages. The pool now shares a single instance behind a mutex, loaded lazily on the first table region any worker sees, with the full intra-op thread budget (tables serialise on the mutex anyway, so one wide instance replaces several narrow ones).

Peak RSS on the 16-page table-heavy paper (INT8 stack): 3816 -> 1880 MB with 4 workers, 2183 -> 1517 MB with 2; a table-free document drops 682 -> 331 MB because TableFormer never loads at all.

Output is byte-identical under deterministic single-thread inference (FLEISCHWOLF_PDF_THREADS=1) across the corpus spot-set. Multi-threaded runs were already non-deterministic run-to-run in borderline table cells before this change (ONNX Runtime float-reduction order; the same reason ocr.rs pins one thread); this is now documented in PDF_PERFORMANCE.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj
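The sharing scheme described in this commit can be sketched roughly as follows. This is a minimal illustration, not the pipeline's code: `TableModel`, `SharedTableModel`, `run_workers`, and the load counter are hypothetical stand-ins, and the real instance wraps ONNX Runtime sessions rather than a unit struct.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for the encoder+decoder+bbox sessions.
struct TableModel;

impl TableModel {
    fn load() -> Self {
        TableModel // the real load would read weights and build sessions
    }

    fn predict(&mut self, region_id: u32) -> u32 {
        region_id + 1 // placeholder for actual table-structure inference
    }
}

/// One slot shared by every pool worker; empty until the first table region.
struct SharedTableModel {
    slot: Mutex<Option<TableModel>>,
    loads: AtomicUsize, // counts loads, only to make the laziness observable
}

impl SharedTableModel {
    fn new() -> Self {
        SharedTableModel { slot: Mutex::new(None), loads: AtomicUsize::new(0) }
    }

    /// Locks the slot, loads the model on first use, then runs inference.
    /// Table regions serialise here, as the commit message notes.
    fn predict(&self, region_id: u32) -> u32 {
        let mut guard = self.slot.lock().unwrap();
        let model = guard.get_or_insert_with(|| {
            self.loads.fetch_add(1, Ordering::SeqCst);
            TableModel::load()
        });
        model.predict(region_id)
    }

    fn is_loaded(&self) -> bool {
        self.slot.lock().unwrap().is_some()
    }

    fn load_count(&self) -> usize {
        self.loads.load(Ordering::SeqCst)
    }
}

/// Spawns `workers` threads that each hit one table region.
fn run_workers(shared: &Arc<SharedTableModel>, workers: u32) -> Vec<u32> {
    let handles: Vec<_> = (0..workers)
        .map(|i| {
            let shared = Arc::clone(shared);
            thread::spawn(move || shared.predict(i))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

A table-free document never calls `predict`, so the slot stays `None` and nothing is loaded, which mirrors the 682 -> 331 MB case above.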
The legacy decoder graph re-projects self-attention K/V over the whole cached prefix in every layer on every autoregressive step (and re-embeds the full tag sequence for layer 0), which costs O(n^2) matmuls per table. The export now also produces decoder_kv.onnx: a one-tag-per-step graph whose per-layer projected K/V are the cache, verified argmax-identical to the legacy graph over a 64-step rollout at export time and byte-identical on corpus output in the Rust pipeline.

The Rust decode loop auto-detects the graph flavour from the session's input names, so DOCLING_TABLEFORMER_DECODER works with either file; quantize_models.py and publish-models.yml cover the new graph too.

Measured on corpus-sized tables (~100-300 tokens) the two are at parity (ORT batches the legacy prefix re-projection into one GEMM), so the smaller legacy file keeps default preference and decoder_kv* serves very-large-table workloads where O(past)-per-step wins; see PDF_PERFORMANCE.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj
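To make the O(n^2) versus O(n) claim concrete, here is a toy model of the two decode loops for a single layer. It only counts projection work; `project`, `legacy_rollout`, and `cached_rollout` are hypothetical names, and the scalar "projection" stands in for the real K/V matmuls.

```rust
/// Toy stand-in for projecting one tag embedding to a key (hypothetical).
fn project(token: u32) -> u32 {
    token.wrapping_mul(31).wrapping_add(7)
}

/// Legacy flavour: every step re-projects the whole prefix seen so far.
/// Returns the final keys and the number of projections performed.
fn legacy_rollout(tokens: &[u32]) -> (Vec<u32>, usize) {
    let mut projections = 0;
    let mut keys = Vec::new();
    for step in 1..=tokens.len() {
        keys = tokens[..step]
            .iter()
            .map(|&t| {
                projections += 1;
                project(t)
            })
            .collect();
    }
    (keys, projections)
}

/// KV-cache flavour: each step projects only the newest tag and appends it.
fn cached_rollout(tokens: &[u32]) -> (Vec<u32>, usize) {
    let mut projections = 0;
    let mut cache = Vec::with_capacity(tokens.len());
    for &t in tokens {
        cache.push(project(t));
        projections += 1;
    }
    (cache, projections)
}
```

For a 200-tag table the legacy loop performs 200·201/2 = 20100 projections per layer against 200 for the cached loop, with identical keys at the end. The count is not wall time: as the commit message notes, ORT batches the legacy re-projection into one GEMM per step, which is why the two graphs measure at parity on corpus-sized tables.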
…size)

The 3x->2x supersample downscale ran through the image crate's scalar CatmullRom, ~15% of a text-heavy conversion (~25% after INT8). fast_image_resize's convolution uses the same a=-0.5 Catmull-Rom kernel with SIMD fixed-point arithmetic: image.resize drops 2607 -> 152 ms summed on the 16-page paper.

The fixed-point path differs from the scalar one by ±1/255 on some pixels, which can flip borderline table cells downstream, so the switch was gated like the INT8 models: a deterministic corpus run scored against the committed groundtruth gives 817 (SIMD) vs 818 (scalar) summed diff-lines, which is conformance-neutral, with 9/16 fixtures byte-identical.

FLEISCHWOLF_SLOW_RESIZE=1 restores the scalar path; pdf_conformance.sh / pdf_groundtruth.sh pin it (alongside the fp32 models they already pin) so the committed snapshot baselines stay valid.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj
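For reference, the a=-0.5 Catmull-Rom kernel mentioned above is the standard Keys cubic. The sketch below evaluates it in floating point and is not taken from either crate; `catmull_rom` and `taps` are illustrative names, and a real downscale additionally stretches the kernel support by the scale factor.

```rust
/// Keys cubic convolution kernel with a = -0.5 (Catmull-Rom).
/// `x` is the distance from the sample position to a source pixel centre.
fn catmull_rom(x: f64) -> f64 {
    const A: f64 = -0.5;
    let x = x.abs();
    if x <= 1.0 {
        (A + 2.0) * x.powi(3) - (A + 3.0) * x.powi(2) + 1.0
    } else if x < 2.0 {
        A * x.powi(3) - 5.0 * A * x.powi(2) + 8.0 * A * x - 4.0 * A
    } else {
        0.0
    }
}

/// Weights of the four nearest source pixels for a sample that sits
/// `phase` (in 0..1) past the second of them, at unit scale.
fn taps(phase: f64) -> [f64; 4] {
    [
        catmull_rom(1.0 + phase),
        catmull_rom(phase),
        catmull_rom(1.0 - phase),
        catmull_rom(2.0 - phase),
    ]
}
```

At `phase = 0.5` the weights are -0.0625, 0.5625, 0.5625, -0.0625 and sum to 1. A fixed-point implementation has to round weights like these to integers, which is one plausible source of the ±1/255 pixel differences the commit reports.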
…arse

The content-stream interpreter re-parsed every font (ToUnicode CMap decompression + tokenization, embedded Type1 program scan, width tables) on every page and every Form XObject invocation, and re-inflated + re-decoded form content streams on every Do, even though fonts and forms are indirect objects shared across the whole document. pdf_all_cells (and the debug walks) now thread per-document caches keyed by the referenced object id (fonts also by resource name, which feeds the docling-parse font hash) through run_content; inline (non-reference) dicts stay uncached.

Output is identical across the corpus (--no-ocr byte-for-byte on every PDF fixture; ML path spot-checked deterministic). textparse drops 3-10% on the test fixtures, whose ToUnicode CMaps are small; CJK/form-heavy documents, where font parsing dominates, benefit far more.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj
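A per-document cache of this shape might look like the sketch below. All names (`FontCache`, `ParsedFont`, `ObjectId`, the `get` signature) are assumptions made for illustration, and the "parse" is a placeholder; only the keying rule (object id plus resource name, inline dicts uncached) comes from the commit message.

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// (object number, generation) of an indirect PDF object.
type ObjectId = (u32, u16);

/// Hypothetical result of the expensive font parse.
#[derive(Debug, PartialEq)]
struct ParsedFont {
    byte_len: usize,
}

/// Lives for one document and is threaded through every page and form.
#[derive(Default)]
struct FontCache {
    by_ref: HashMap<(ObjectId, String), Rc<ParsedFont>>,
    parses: usize, // counts real parses, to make cache hits observable
}

impl FontCache {
    /// Placeholder for CMap decompression, Type1 scan, and width tables.
    fn parse(&mut self, raw: &[u8]) -> Rc<ParsedFont> {
        self.parses += 1;
        Rc::new(ParsedFont { byte_len: raw.len() })
    }

    /// `id` is `None` for inline (non-reference) dicts, which stay uncached.
    fn get(&mut self, id: Option<ObjectId>, name: &str, raw: &[u8]) -> Rc<ParsedFont> {
        let id = match id {
            Some(id) => id,
            None => return self.parse(raw),
        };
        let key = (id, name.to_string());
        if let Some(hit) = self.by_ref.get(&key) {
            return Rc::clone(hit);
        }
        let font = self.parse(raw);
        self.by_ref.insert(key, Rc::clone(&font));
        font
    }
}
```

With this shape, a font referenced on five pages is parsed once, while the same bytes supplied as an inline dict are parsed on every use.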
scripts/install.sh installs a self-contained tree for dev boxes and CI
pipelines: checks for a Rust toolchain (installs via rustup when cargo
is missing), builds the CLI in release, lays out
/usr/local/fleischwolf/{bin,models,.pdfium} via
download_dependencies.sh, symlinks /usr/local/bin/fleischwolf, and
writes /etc/profile.d/fleischwolf.sh with the DOCLING_*/PDFIUM_*
exports. FLEISCHWOLF_PREFIX/BIN_DIR/REF/NO_ASR/SUDO knobs; idempotent
re-runs; smoke-tests the installed command.
To make the bare symlink work from any directory with no environment,
default asset paths now resolve CWD-relative first, then next to the
(symlink-canonicalized) executable and one level above it — applied to
the layout/OCR/TableFormer models, pdfium, and the ASR models. Explicit
env overrides are untouched, and a repo checkout resolves exactly as
before (verified byte-identical).
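The lookup order described in this commit message can be sketched as a pure function over paths. The function names and the example asset path are hypothetical; in a real binary the inputs would presumably come from `std::env::current_dir()` and `std::env::current_exe()` followed by `canonicalize()`, with explicit env overrides handled before this fallback.

```rust
use std::path::{Path, PathBuf};

/// Candidate locations for a default asset path, in priority order:
/// CWD-relative, next to the canonicalized executable, one level above it.
fn asset_candidates(rel: &Path, cwd: &Path, canonical_exe: &Path) -> Vec<PathBuf> {
    let mut out = vec![cwd.join(rel)];
    if let Some(exe_dir) = canonical_exe.parent() {
        out.push(exe_dir.join(rel));
        if let Some(above) = exe_dir.parent() {
            out.push(above.join(rel));
        }
    }
    out
}

/// Returns the first candidate for which `exists` is true.
/// `exists` is injected so the order can be checked without touching disk.
fn resolve_asset(
    rel: &Path,
    cwd: &Path,
    canonical_exe: &Path,
    exists: impl Fn(&Path) -> bool,
) -> Option<PathBuf> {
    asset_candidates(rel, cwd, canonical_exe)
        .into_iter()
        .find(|p| exists(p.as_path()))
}
```

For an install under `/usr/local/fleischwolf/bin`, the third candidate lands in `/usr/local/fleischwolf/`, which is where the installer places `models` and `.pdfium`; a repo checkout still hits the CWD-relative candidate first.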
Docs: README install section + crate table (asr, rag rows);
MIGRATION.md extensions brought up to date (fleischwolf-rag and the
node bindings are shipped) and pointed at PDF_PERFORMANCE.md, which
now opens with a results-at-a-glance summary of both optimization
rounds.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj
Summary
Follow-up to #26: the next items from the `PDF_PERFORMANCE.md` backlog plus install tooling, one commit each, every change regression-gated against the corpus.

- `9d44ed5`: one lazily-loaded TableFormer shared behind a mutex instead of a copy per worker. Peak RSS on the 16-page table-heavy paper: 3816 → 1880 MB (4 workers); table-free docs 682 → 331 MB (TableFormer never loads). Byte-identical output.
- `422812d`: `decoder_kv.onnx` export (one tag per step), argmax-identical at export, byte-identical in the pipeline. Honest result: parity at corpus table sizes, so it is kept as an opt-in for very-large-table workloads; Rust auto-detects either graph.
- `a995706`: `fast_image_resize`, same Catmull-Rom kernel: `image.resize` 2607 → 152 ms (17×). Gated like INT8: groundtruth distance 817 (SIMD) vs 818 (scalar). `FLEISCHWOLF_SLOW_RESIZE=1` opts out; conformance scripts pin the scalar path so snapshots stay valid.
- `0d38da5`: fonts (ToUnicode/Type1/width parsing) and decoded Form XObjects cached per document instead of re-parsed per page / per `Do`. Identical output; 3–10% off `textparse` on this corpus, far more on CJK/form-heavy PDFs.
- `scripts/install.sh` + exe-relative asset resolution (`c4ae69c`): `curl | bash` installer for dev boxes and CI. It checks/installs Rust, builds the CLI, installs `/usr/local/fleischwolf/{bin,models,.pdfium}`, symlinks `/usr/local/bin/fleischwolf`, and writes `/etc/profile.d/fleischwolf.sh`. Default asset paths now also resolve next to the (symlink-canonicalized) binary, so the installed command works from any directory with no env vars. Docs: README install section, crate-table rows for `fleischwolf-asr` / `-rag`, MIGRATION extensions brought up to date, and a results-at-a-glance summary opening `PDF_PERFORMANCE.md`.

Also documented: multi-threaded ORT inference was already non-deterministic run-to-run in borderline table cells (0–20 diff-lines between identical invocations); regression checks here compare under `FLEISCHWOLF_PDF_THREADS=1`, where output is byte-stable.

Testing
- `FLEISCHWOLF_PDF_THREADS=1` (byte-identical for structural changes; groundtruth-distance gate for SIMD resize).
- `env -i` conversion of a real PDF from a foreign CWD through the symlink (full ML output, zero env vars).

After merge
Re-run `gh workflow run publish-models.yml` to add `decoder_kv.onnx` / `decoder_kv_int8.onnx` to the models release (existing assets unaffected).

🤖 Generated with Claude Code
https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj