PDF pipeline round 2: pool memory halved, KV-cache TableFormer export, 17× SIMD page downscale#27

Merged: artiz merged 5 commits into master from claude/docling-rust-review-profile-d26fdq on Jul 3, 2026

@artiz artiz commented Jul 3, 2026

Summary

Follow-up to #26 — the next items from the PDF_PERFORMANCE.md backlog plus install tooling, one commit each, every change regression-gated against the corpus.

  • Worker-pool memory halved (9d44ed5): one lazily-loaded TableFormer shared behind a mutex instead of a copy per worker. Peak RSS on the 16-page table-heavy paper: 3816 → 1880 MB (4 workers); table-free docs 682 → 331 MB (TableFormer never loads). Byte-identical output.
  • True-KV-cache TableFormer decoder (422812d): decoder_kv.onnx export (one tag per step), argmax-identical at export, byte-identical in the pipeline. Honest result: parity at corpus table sizes — kept as an opt-in for very-large-table workloads; Rust auto-detects either graph.
  • SIMD page downscale (a995706): fast_image_resize, same Catmull-Rom kernel: image.resize 2607 → 152 ms (17×). Gated like INT8: groundtruth distance 817 (SIMD) vs 818 (scalar). FLEISCHWOLF_SLOW_RESIZE=1 opts out; conformance scripts pin the scalar path so snapshots stay valid.
  • textparse caches (0d38da5): fonts (ToUnicode/Type1/width parsing) and decoded Form XObjects cached per document instead of re-parsed per page / per Do. Identical output; 3–10% off textparse on this corpus, far more on CJK/form-heavy PDFs.
  • scripts/install.sh + exe-relative asset resolution (c4ae69c): curl | bash installer for dev boxes and CI — checks/installs Rust, builds the CLI, installs /usr/local/fleischwolf/{bin,models,.pdfium}, symlinks /usr/local/bin/fleischwolf, writes /etc/profile.d/fleischwolf.sh. Default asset paths now also resolve next to the (symlink-canonicalized) binary, so the installed command works from any directory with no env vars. Docs: README install section, crate-table rows for fleischwolf-asr/-rag, MIGRATION extensions brought up to date, and a results-at-a-glance summary opening PDF_PERFORMANCE.md.

Also documented: multi-threaded ORT inference was already non-deterministic run-to-run in borderline table cells (0–20 diff-lines between identical invocations) — regression checks here compare under FLEISCHWOLF_PDF_THREADS=1, where output is byte-stable.

Testing

  • Full workspace test suite green (16 suites), fmt/pinned-clippy clean.
  • Every commit A/B'd old-vs-new binary under FLEISCHWOLF_PDF_THREADS=1 (byte-identical for structural changes; groundtruth-distance gate for SIMD resize).
  • Installer exercised end-to-end in a clean container: rustup path, build, install, env -i conversion of a real PDF from a foreign CWD through the symlink (full ML output, zero env vars).

After merge

Re-run gh workflow run publish-models.yml to add decoder_kv.onnx / decoder_kv_int8.onnx to the models release (existing assets unaffected).

🤖 Generated with Claude Code

https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj

claude added 5 commits July 3, 2026 08:30

Commit 9d44ed5 (worker-pool memory):
Each pool worker owned a full TableFormer (encoder+decoder+bbox, ~0.4 GB
of weights and arenas), multiplying that footprint by the pool size even
though tables appear on a minority of pages. The pool now shares a
single instance behind a mutex, loaded lazily on the first table region
any worker sees, with the full intra-op thread budget (tables serialise
on the mutex anyway, so one wide instance replaces several narrow ones).
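
A minimal sketch of the lazy, mutex-shared pattern this paragraph describes, assuming nothing about the actual pipeline code. `TableFormer`, `SharedTableFormer`, `predict`, and the load counter are hypothetical stand-ins; the real model wraps ONNX Runtime sessions.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Hypothetical stand-in for the real encoder+decoder+bbox model.
pub struct TableFormer;

impl TableFormer {
    /// Placeholder for the expensive load of weights and arenas.
    fn load() -> Self {
        TableFormer
    }

    /// Placeholder inference call; takes `&mut self` like a stateful session.
    pub fn predict(&mut self, region: &str) -> String {
        format!("table({region})")
    }
}

/// One instance for the whole pool, created on first use.
pub struct SharedTableFormer {
    slot: Mutex<Option<TableFormer>>,
    loads: AtomicUsize, // instrumentation for the example only
}

impl SharedTableFormer {
    pub fn new() -> Self {
        Self {
            slot: Mutex::new(None),
            loads: AtomicUsize::new(0),
        }
    }

    /// Workers call this only when a page has a table region, so a
    /// table-free document never pays for the load.
    pub fn predict(&self, region: &str) -> String {
        let mut guard = self.slot.lock().expect("TableFormer mutex poisoned");
        let model = guard.get_or_insert_with(|| {
            self.loads.fetch_add(1, Ordering::Relaxed);
            TableFormer::load()
        });
        model.predict(region)
    }

    pub fn load_count(&self) -> usize {
        self.loads.load(Ordering::Relaxed)
    }
}
```

Wrapping the shared value in an `Arc` and cloning it into each worker thread gives every worker access to the same slot; because inference runs under the lock, table regions are processed one at a time, which is the serialisation the commit message mentions.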

Peak RSS on the 16-page table-heavy paper (INT8 stack): 3816 -> 1880 MB
with 4 workers, 2183 -> 1517 MB with 2; a table-free document drops
682 -> 331 MB because TableFormer never loads at all.

Output is byte-identical under deterministic single-thread inference
(FLEISCHWOLF_PDF_THREADS=1) across the corpus spot-set. Multi-threaded
runs were already non-deterministic run-to-run in borderline table
cells before this change (ONNX Runtime float-reduction order; the same
reason ocr.rs pins one thread) — now documented in PDF_PERFORMANCE.md.
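
To illustrate why float-reduction order matters, here is a small self-contained sketch (not taken from the pipeline) showing that f32 addition is not associative. The helper name `sum_in_order` is made up for this example.

```rust
/// Sums `values` strictly left to right in f32.
pub fn sum_in_order(values: &[f32]) -> f32 {
    values.iter().fold(0.0_f32, |acc, v| acc + v)
}

/// The same three numbers in two orders. Near 1e8 adjacent f32 values are
/// 8 apart, so adding 1.0 to 1e8 is lost to rounding in the first order.
pub fn order_dependent_sums() -> (f32, f32) {
    let a = sum_in_order(&[1.0e8, 1.0, -1.0e8]);
    let b = sum_in_order(&[1.0e8, -1.0e8, 1.0]);
    (a, b)
}
```

A multi-threaded reduction can combine partial sums in a different order on each run, so results that sit close to a decision threshold may land on either side of it. Pinning one thread fixes the order.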

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj

Commit 422812d (true-KV-cache TableFormer decoder):
The legacy decoder graph re-projects self-attention K/V over the whole
cached prefix in every layer on every autoregressive step (and re-embeds
the full tag sequence for layer 0) — O(n^2) matmuls per table. The
export now also produces decoder_kv.onnx: a one-tag-per-step graph
whose per-layer projected K/V are the cache, verified argmax-identical
to the legacy graph over a 64-step rollout at export time and
byte-identical on corpus output in the Rust pipeline.
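
As a rough worked count of the difference, assuming step t of the legacy graph re-projects a prefix of exactly t positions per layer (a simplification for illustration, not a measurement), the number of projected positions over an n-step rollout can be sketched as:

```rust
/// Positions whose K/V are projected, per layer, over an n-step rollout when
/// every step re-projects the whole prefix: 1 + 2 + ... + n = n(n+1)/2.
pub fn legacy_projected_positions(n: u64) -> u64 {
    n * (n + 1) / 2
}

/// The same count when projected K/V are cached and each step adds one position.
pub fn kv_cache_projected_positions(n: u64) -> u64 {
    n
}
```

For a 300-token table this gives 45,150 projected positions per layer against 300. The count says nothing about wall-clock time on its own, since the legacy projections at each step can be batched into a single matrix multiply.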

The Rust decode loop auto-detects the graph flavour from the session's
input names, so DOCLING_TABLEFORMER_DECODER works with either file;
quantize_models.py and publish-models.yml cover the new graph too.
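
One way such detection could look, as a sketch only: the input names used here (a `past_` prefix for cached K/V tensors) are an assumed naming convention for illustration, not the names in the exported graphs, and `DecoderFlavour` and `detect_flavour` are hypothetical.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum DecoderFlavour {
    /// Graph that takes the full tag sequence on every step.
    Legacy,
    /// Graph that takes one tag plus cached per-layer K/V.
    KvCache,
}

/// Classifies a decoder graph by its declared input names.
/// ASSUMPTION: the KV graph exposes cache inputs prefixed with "past_".
pub fn detect_flavour(input_names: &[&str]) -> DecoderFlavour {
    if input_names.iter().any(|name| name.starts_with("past_")) {
        DecoderFlavour::KvCache
    } else {
        DecoderFlavour::Legacy
    }
}
```

Keying off the graph's own inputs rather than the file name means a renamed or user-supplied file is still driven by the matching decode loop.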

Measured on corpus-sized tables (~100-300 tokens) the two are at
parity — ORT batches the legacy prefix re-projection into one GEMM —
so the smaller legacy file keeps default preference and decoder_kv*
serves very-large-table workloads where O(past)-per-step wins; see
PDF_PERFORMANCE.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj

Commit a995706 (SIMD page downscale):
…size)

The 3x->2x supersample downscale ran through the image crate's scalar
CatmullRom — ~15% of a text-heavy conversion (~25% after INT8).
fast_image_resize's convolution uses the same a=-0.5 Catmull-Rom kernel
with SIMD fixed-point arithmetic: image.resize drops 2607 -> 152 ms
summed on the 16-page paper.
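
For reference, the a=-0.5 cubic convolution kernel named above can be written out as follows. This is the textbook scalar form in f64, not the code of either crate; the function names are chosen for this sketch.

```rust
/// Cubic convolution kernel with free parameter `a`.
pub fn cubic_kernel(x: f64, a: f64) -> f64 {
    let x = x.abs();
    if x <= 1.0 {
        (a + 2.0) * x.powi(3) - (a + 3.0) * x.powi(2) + 1.0
    } else if x < 2.0 {
        a * x.powi(3) - 5.0 * a * x.powi(2) + 8.0 * a * x - 4.0 * a
    } else {
        0.0
    }
}

/// Catmull-Rom is the a = -0.5 member of the family.
pub fn catmull_rom(x: f64) -> f64 {
    cubic_kernel(x, -0.5)
}

/// Weights of the four nearest samples for a point at fractional offset
/// `t` in [0, 1) past the second sample (unit spacing, no kernel stretch).
pub fn tap_weights(t: f64) -> [f64; 4] {
    [
        catmull_rom(t + 1.0),
        catmull_rom(t),
        catmull_rom(1.0 - t),
        catmull_rom(2.0 - t),
    ]
}
```

The four weights sum to 1 for any offset, so flat regions keep their value. When downscaling, resizers typically stretch the kernel by the scale factor and use more taps; the sketch omits that step.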

The fixed-point path differs from the scalar one by ±1/255 on some
pixels, which can flip borderline table cells downstream, so the switch
was gated like the INT8 models: deterministic corpus run scored against
the committed groundtruth gives 817 (SIMD) vs 818 (scalar) summed
diff-lines — conformance-neutral, 9/16 fixtures byte-identical.

FLEISCHWOLF_SLOW_RESIZE=1 restores the scalar path;
pdf_conformance.sh / pdf_groundtruth.sh pin it (alongside the fp32
models they already pin) so the committed snapshot baselines stay
valid.
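
A sketch of how an opt-out like this might be read, with the parsing split from the environment lookup so it can be tested. Only the value `1` is documented above; treating every other value as "keep the SIMD path" is an assumption of this example, and both function names are hypothetical.

```rust
/// Decides from the raw variable value whether the scalar path is requested.
/// ASSUMPTION: only the exact value "1" opts out; unset or anything else does not.
pub fn scalar_resize_requested(value: Option<&str>) -> bool {
    value == Some("1")
}

/// Reads FLEISCHWOLF_SLOW_RESIZE from the process environment.
pub fn scalar_resize_from_env() -> bool {
    scalar_resize_requested(std::env::var("FLEISCHWOLF_SLOW_RESIZE").ok().as_deref())
}
```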

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj

Commit 0d38da5 (textparse caches):
…arse

The content-stream interpreter re-parsed every font — ToUnicode CMap
decompression + tokenization, embedded Type1 program scan, width
tables — on every page and every Form XObject invocation, and
re-inflated + re-decoded form content streams on every Do, even though
fonts and forms are indirect objects shared across the whole document.

pdf_all_cells (and the debug walks) now thread per-document caches
keyed by the referenced object id — fonts also by resource name, which
feeds the docling-parse font hash — through run_content; inline
(non-reference) dicts stay uncached.
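
The caching rule can be sketched roughly as below. All types here (`ObjectId`, `FontRef`, `ParsedFont`, `FontCache`) are hypothetical illustrations rather than the crate's actual ones, and the parse step is a placeholder that only counts calls.

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// (object number, generation) of an indirect PDF object.
pub type ObjectId = (u32, u16);

/// How a page or form refers to a font dictionary.
pub enum FontRef {
    Indirect(ObjectId),
    Inline,
}

#[derive(Debug)]
pub struct ParsedFont {
    pub resource_name: String,
}

/// Per-document cache: created once, threaded through every page.
#[derive(Default)]
pub struct FontCache {
    fonts: HashMap<(ObjectId, String), Rc<ParsedFont>>,
    pub parses: usize, // instrumentation for the example only
}

impl FontCache {
    pub fn get(&mut self, font: &FontRef, resource_name: &str) -> Rc<ParsedFont> {
        match font {
            FontRef::Indirect(id) => {
                // Key on object id and resource name, as the text describes.
                let key = (*id, resource_name.to_string());
                if let Some(hit) = self.fonts.get(&key) {
                    return Rc::clone(hit);
                }
                let parsed = Rc::new(self.parse(resource_name));
                self.fonts.insert(key, Rc::clone(&parsed));
                parsed
            }
            // Inline dictionaries have no stable identity, so parse every time.
            FontRef::Inline => Rc::new(self.parse(resource_name)),
        }
    }

    /// Placeholder for CMap, Type1 and width-table parsing.
    fn parse(&mut self, resource_name: &str) -> ParsedFont {
        self.parses += 1;
        ParsedFont {
            resource_name: resource_name.to_string(),
        }
    }
}
```

Returning `Rc` handles lets many pages share one parsed font without copying it, and dropping the cache with the document keeps memory bounded per conversion.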

Output is identical across the corpus (--no-ocr byte-for-byte on every
PDF fixture; ML path spot-checked deterministic). textparse drops
3-10% on the test fixtures, whose ToUnicode CMaps are small —
CJK/form-heavy documents where font parsing dominates benefit far
more.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj

Commit c4ae69c (scripts/install.sh + exe-relative asset resolution):
scripts/install.sh installs a self-contained tree for dev boxes and CI
pipelines: checks for a Rust toolchain (installs via rustup when cargo
is missing), builds the CLI in release, lays out
/usr/local/fleischwolf/{bin,models,.pdfium} via
download_dependencies.sh, symlinks /usr/local/bin/fleischwolf, and
writes /etc/profile.d/fleischwolf.sh with the DOCLING_*/PDFIUM_*
exports. The FLEISCHWOLF_PREFIX/BIN_DIR/REF/NO_ASR/SUDO knobs configure
the install; re-runs are idempotent, and the script ends by
smoke-testing the installed command.

To make the bare symlink work from any directory with no environment,
default asset paths now resolve CWD-relative first, then next to the
(symlink-canonicalized) executable and one level above it — applied to
the layout/OCR/TableFormer models, pdfium, and the ASR models. Explicit
env overrides are untouched, and a repo checkout resolves exactly as
before (verified byte-identical).
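
The lookup order described here could be sketched as follows, with the filesystem check injected so the ordering can be exercised without touching disk. The function names and the example file name `models/layout.onnx` are hypothetical; only the order (CWD, next to the canonicalized executable, one level above it) comes from the text.

```rust
use std::path::{Path, PathBuf};

/// Candidate locations for a relative asset path, in lookup order.
pub fn candidate_paths(rel: &Path, cwd: &Path, canonical_exe: &Path) -> Vec<PathBuf> {
    let mut out = vec![cwd.join(rel)];
    if let Some(exe_dir) = canonical_exe.parent() {
        out.push(exe_dir.join(rel));
        if let Some(above) = exe_dir.parent() {
            out.push(above.join(rel));
        }
    }
    out
}

/// Returns the first candidate for which `exists` reports true.
pub fn resolve_asset(
    rel: &Path,
    cwd: &Path,
    canonical_exe: &Path,
    exists: impl Fn(&Path) -> bool,
) -> Option<PathBuf> {
    candidate_paths(rel, cwd, canonical_exe)
        .into_iter()
        .find(|p| exists(p.as_path()))
}

/// Same lookup against the real process: canonicalizing the executable
/// path follows the /usr/local/bin symlink back into the install tree.
pub fn resolve_asset_on_disk(rel: &Path) -> Option<PathBuf> {
    let cwd = std::env::current_dir().ok()?;
    let exe = std::env::current_exe().ok()?.canonicalize().ok()?;
    resolve_asset(rel, &cwd, &exe, |p| p.exists())
}
```

With the binary at /usr/local/fleischwolf/bin/fleischwolf, the "one level above" candidate is what reaches /usr/local/fleischwolf/models from a foreign working directory.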

Docs: README install section + crate table (asr, rag rows);
MIGRATION.md extensions brought up to date (fleischwolf-rag and the
node bindings are shipped) and pointed at PDF_PERFORMANCE.md, which
now opens with a results-at-a-glance summary of both optimization
rounds.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj
@artiz artiz merged commit b743781 into master Jul 3, 2026
3 checks passed
@artiz artiz deleted the claude/docling-rust-review-profile-d26fdq branch July 3, 2026 18:33