PDF pipeline: performance review, INT8 quantization, decode-loop and text-layer optimizations by artiz · Pull Request #26 · artiz/fleischwolf

artiz · 2026-07-02T14:51:39Z

Summary

Post-migration performance review of the PDF pipeline: profiling across the test corpus (including the 1913-page dotnet-csharp-language-reference.pdf), validated optimizations, and a ranked backlog in the new PDF_PERFORMANCE.md.

Profiling shows ONNX inference is 85–95% of PDF conversion time (layout.predict alone is 80% on large text documents), so the work targets the model path:

INT8 quantization, on by default — scripts/quantize_models.py produces a calibrated static QDQ INT8 (Conv-only) layout model (~2.4× faster layout, 1.83× end-to-end measured on the 1913-page doc; groundtruth conformance unchanged — 812 vs 833 summed diff-lines; the rejected full-INT8 and dynamic variants are documented) and a dynamic-INT8 TableFormer decoder (byte-identical output, ~10% faster tables). The pipeline prefers the *_int8 files automatically when they sit next to the fp32 models; FLEISCHWOLF_FP32=1 forces full precision and explicit DOCLING_* paths always win. The conformance/groundtruth scripts pin fp32, so snapshots stay deterministic.
Toolchain integration — publish-models.yml quantizes after export and publishes the int8 release assets; download_dependencies.sh fetches them by default (--no-int8 skips); pdf_setup.sh quantizes locally unless FLEISCHWOLF_FP32=1; examples/Dockerfile bakes both precisions and defaults to int8 (--build-arg INT8=0).
TableFormer decode loop — the growing KV cache was extracted and re-wrapped on every autoregressive step (O(steps²) copy traffic); it now feeds the previous step's out_cache value straight back, and the encoder outputs stay owned ort values. ~9% faster structure decode, byte-identical output.
Text layer — line_cells/word_cells ran the identical build+contract pass twice per page; one shared contraction now emits both views (~1.25× faster --no-ocr conversion, identical output).
Fix: the " (move to next line and show text) operator ignored its word-spacing (aw) and character-spacing (ac) operands (PDF 32000-1 §9.4.3), so documents that use the aw ac string " form got wrong inter-word advances.
Instrumentation: ocr.page is now a timed stage; tableformer splits into structure / inter_area sub-stages.
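
A minimal sketch of how per-stage timing of this kind can be accumulated. The StageTimer type and the dotted names for the TableFormer sub-stages are illustrative assumptions, not the pipeline's actual instrumentation API:

```rust
use std::collections::BTreeMap;
use std::time::{Duration, Instant};

/// Hypothetical accumulator: total wall-clock time per named stage.
#[derive(Default)]
struct StageTimer {
    totals: BTreeMap<String, Duration>,
}

impl StageTimer {
    /// Runs `work`, adds its elapsed time to `stage`, and returns its result.
    fn time<T>(&mut self, stage: &str, work: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = work();
        *self.totals.entry(stage.to_string()).or_default() += start.elapsed();
        out
    }

    /// Stage names seen so far, in sorted order.
    fn stages(&self) -> Vec<&str> {
        self.totals.keys().map(String::as_str).collect()
    }
}
```

Wrapping a call as timer.time("ocr.page", || run_ocr(page)) (run_ocr being a placeholder) makes a stage show up in the profile; repeated calls under the same name add to one total, which is how a stage can be split into sub-stages without losing the aggregate.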

Measured head-to-head vs Python docling (16-page 2203.01017v2.pdf, int8): 4.3× faster warm conversion, 4.7× end-to-end.

Testing

Full workspace test suite green (12 suites incl. the new fleischwolf-rag), rebased on v0.13.0 master.
INT8 quality gate: Markdown A/B over the whole PDF+scanned corpus vs fp32, plus per-fixture diff-line distance against tests/data/pdf/groundtruth (see PDF_PERFORMANCE.md tables).
All model-resolution paths verified against reference outputs: int8 default, FLEISCHWOLF_FP32=1, explicit DOCLING_* override, and fp32 fallback when the int8 files are absent.
Both code optimizations verified byte-identical on corpus output with interleaved old/new-binary benchmarks.

After merge

Run gh workflow run publish-models.yml once so the models-v1 release hosts the int8 assets and download_dependencies.sh picks them up with zero local Python.

🤖 Generated with Claude Code

https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj

The "aw ac string" form of the show-text operator must set word spacing (Tw) and char spacing (Tc) from its first two operands before showing the string (PDF 32000-1 §9.4.3); they were silently dropped, giving wrong inter-word advances in documents that use " instead of explicit Tw/Tc.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj
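
The operator semantics described in this commit can be sketched as follows. This is a simplified illustration, not the parser's code: TextState, show_text and quote_show_text are hypothetical names, it assumes a simple (single-byte) font and 100% horizontal scaling, and it omits the move to the next line that the operator also performs.

```rust
/// Subset of the PDF text state that affects glyph advances.
#[derive(Debug, Default, PartialEq)]
struct TextState {
    char_spacing: f64, // Tc
    word_spacing: f64, // Tw
    font_size: f64,    // Tfs
}

/// Total horizontal advance of `bytes` in unscaled text-space units.
/// `glyph_width` returns the glyph width in 1/1000 em (hypothetical lookup).
fn show_text(state: &TextState, bytes: &[u8], glyph_width: impl Fn(u8) -> f64) -> f64 {
    bytes
        .iter()
        .map(|&b| {
            let mut tx = glyph_width(b) / 1000.0 * state.font_size + state.char_spacing;
            if b == b' ' {
                // Word spacing applies only to the single-byte code 32.
                tx += state.word_spacing;
            }
            tx
        })
        .sum()
}

/// `aw ac string "`: set Tw = aw and Tc = ac first, then show the string.
fn quote_show_text(
    state: &mut TextState,
    aw: f64,
    ac: f64,
    bytes: &[u8],
    glyph_width: impl Fn(u8) -> f64,
) -> f64 {
    state.word_spacing = aw;
    state.char_spacing = ac;
    show_text(state, bytes, glyph_width)
}
```

Because the operands are written into the text state, they also stay in effect for later show-text operators, which is why dropping them skews every following advance until Tw/Tc are set again.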

Profiling the pipeline (FLEISCHWOLF_TIMING) shows ONNX inference is 85-95% of PDF conversion time: layout.predict alone is 80% on a 1913-page text document, TableFormer up to 50% on table-heavy pages.

- scripts/quantize_models.py: calibrated static QDQ INT8 (Conv-only) of the layout model: 2.4x faster layout, ~1.4-1.7x end-to-end, corpus groundtruth conformance unchanged (812 vs 833 summed diff-lines); plus dynamic INT8 of the TableFormer decoder (byte-identical output, ~10% faster tables). Opt in via DOCLING_LAYOUT_ONNX / DOCLING_TABLEFORMER_DECODER.
- timing: instrument ocr.page (previously invisible in profiles) and split tableformer into structure vs inter_area sub-stages.
- PDF_PERFORMANCE.md: full profiling data, validated/rejected quantization variants, and a ranked backlog of further ideas.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj

The autoregressive decode loop extracted the whole self-attention cache (out_cache) to a Vec and re-wrapped it as a tensor on every step; the cache grows each step, so that was O(steps^2) float copy traffic per table. Feed the previous step's out_cache value straight back into the next run instead, and hold the encoder's cross_k/cross_v/enc_out as owned ort values so decode steps and the bbox run borrow them directly (the bbox path also cloned the full encoder output per table). ~9% faster table-structure decode on the 12-table 2203.01017v2.pdf (structure stage 10.3 -> 9.4 s), byte-identical corpus output.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj
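
To see where the quadratic copy traffic comes from, here is a toy model of the two loop shapes. It uses a plain Vec in place of ort values, and decode_step is a stand-in for one decoder run, so it only illustrates the ownership pattern and is not the actual TableFormer code:

```rust
/// Stand-in for one decoder run: consumes the previous cache and returns
/// it grown by `per_step` entries.
fn decode_step(mut cache: Vec<f32>, per_step: usize) -> Vec<f32> {
    let new_len = cache.len() + per_step;
    cache.resize(new_len, 0.0);
    cache
}

/// Old shape: copy the whole cache out after every step before feeding it back.
/// Returns (final cache length, total floats copied).
fn decode_with_copies(steps: usize, per_step: usize) -> (usize, usize) {
    let mut cache: Vec<f32> = Vec::new();
    let mut copied = 0;
    for _ in 0..steps {
        let out = decode_step(cache, per_step);
        let extracted = out.clone(); // full copy of the ever-growing cache
        copied += extracted.len();
        cache = extracted;
    }
    (cache.len(), copied)
}

/// New shape: hand the previous step's output straight to the next step.
fn decode_with_handoff(steps: usize, per_step: usize) -> (usize, usize) {
    let mut cache: Vec<f32> = Vec::new();
    for _ in 0..steps {
        cache = decode_step(cache, per_step); // moved, never copied
    }
    (cache.len(), 0)
}
```

With n steps the copying variant moves per_step * n(n+1)/2 floats in total, while the hand-off variant moves none beyond what the step itself produces.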

line_cells and word_cells each ran the identical build_cells + contract pass over the page's glyphs (the expensive step of the text layer), and the default path calls both for every page. line_and_word_cells now runs the contraction once and emits both views from the contracted cells (lines from cell text/box, words from the recorded segments). This roughly halves the parser text-layer cost; the --no-ocr conversion of a 16-page paper is ~1.25x faster end-to-end. Corpus output is identical on both the ML and --no-ocr paths.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj
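
The idea of one contraction feeding both views can be sketched on a toy glyph model. Everything here is assumed for illustration: glyphs are (char, line index) pairs, contract stands in for the real build_cells + contract pass, and only the name line_and_word_cells comes from the commit.

```rust
/// One contracted cell: the merged text of a line plus the byte ranges of its words.
#[derive(Debug)]
struct Cell {
    text: String,
    segments: Vec<(usize, usize)>,
}

/// Stand-in for the expensive pass: merges glyphs sharing a line index into
/// one cell and records word boundaries while doing so.
fn contract(glyphs: &[(char, u32)]) -> Vec<Cell> {
    let mut cells: Vec<Cell> = Vec::new();
    let mut current_line: Option<u32> = None;
    let mut word_start: Option<usize> = None;
    for &(ch, line) in glyphs {
        if current_line != Some(line) {
            // Close a word still open on the previous line, then start a new cell.
            if let (Some(start), Some(cell)) = (word_start.take(), cells.last_mut()) {
                cell.segments.push((start, cell.text.len()));
            }
            cells.push(Cell { text: String::new(), segments: Vec::new() });
            current_line = Some(line);
        }
        let cell = cells.last_mut().expect("a cell exists at this point");
        if ch == ' ' {
            if let Some(start) = word_start.take() {
                cell.segments.push((start, cell.text.len()));
            }
        } else if word_start.is_none() {
            word_start = Some(cell.text.len());
        }
        cell.text.push(ch);
    }
    if let (Some(start), Some(cell)) = (word_start.take(), cells.last_mut()) {
        cell.segments.push((start, cell.text.len()));
    }
    cells
}

/// Runs the contraction once and derives both views from the same cells.
fn line_and_word_cells(glyphs: &[(char, u32)]) -> (Vec<String>, Vec<String>) {
    let cells = contract(glyphs);
    let lines = cells.iter().map(|c| c.text.clone()).collect();
    let words = cells
        .iter()
        .flat_map(|c| c.segments.iter().map(move |&(s, e)| c.text[s..e].to_string()))
        .collect();
    (lines, words)
}
```

Deriving the two views is cheap compared with the contraction itself, so sharing the pass removes roughly half the text-layer work when both views are requested.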

Make the validated INT8 quantization (scripts/quantize_models.py) a first-class option everywhere the fp32 models flow:

- publish-models.yml quantizes after export and publishes layout_heron_int8.onnx / decoder_int8.onnx as release assets (continue-on-error: fp32 assets publish regardless).
- download_dependencies.sh --int8 fetches them, printing the DOCLING_* opt-in exports; falls back to local-quantization instructions when a release tag doesn't host them.
- pdf_setup.sh quantizes locally with FLEISCHWOLF_INT8=1 and echoes the int8 export lines when the files exist.
- performance.sh benchmarks the int8 stack with FLEISCHWOLF_INT8=1.
- examples/Dockerfile bakes both precisions, calibrating on the corpus copied into the models stage; the runtime's *_active symlinks default to int8 (--build-arg INT8=0 for fp32, or override per container via the DOCLING_* env vars).
- quantize_models.py takes FLEISCHWOLF_MODELS_DIR / FLEISCHWOLF_CALIBRATION_DIR so it runs outside a checkout layout (verified byte-identical output under the override).
- README: INT8 section under 'Getting the ML models'.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj

The PDF head-to-head predates the pipeline optimizations on this branch; re-measured with scripts/performance.sh (fp32): 2.3x warm conversion, 4.1x end-to-end, 2.3x less peak memory. Note the INT8 option on top and fix the stale 47-page label (the fixture row now uses the re-measured picture_classification.pdf).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj

When layout_heron_int8.onnx / tableformer/decoder_int8.onnx sit next to the fp32 models at the default paths, the pipeline now loads them automatically. The quantized stack is conformance-validated (see PDF_PERFORMANCE.md) and ~2x faster on layout, so it is the right default for CPU deployments. Resolution order per model: explicit DOCLING_* env override > int8 default file (unless FLEISCHWOLF_FP32 is set) > fp32 default file.

- download_dependencies.sh fetches the int8 assets by default (--no-int8 skips; --int8 kept as a no-op for compatibility).
- pdf_setup.sh quantizes locally unless FLEISCHWOLF_FP32=1.
- performance.sh mirrors the pipeline default (FLEISCHWOLF_FP32=1 benchmarks fp32); the short-lived FLEISCHWOLF_INT8 toggle is gone.
- pdf_conformance.sh / pdf_groundtruth.sh already pin explicit fp32 paths, so snapshot baselines stay byte-deterministic.

Verified: default resolves to int8 when present, FLEISCHWOLF_FP32=1 and explicit env paths both produce the fp32 reference output, and the fp32 fallback engages when the int8 files are absent.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj
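
The resolution order stated above can be written as a small pure function. This is a sketch under assumptions: resolve_model and its parameters are hypothetical, the caller is expected to read the DOCLING_* variable and the FLEISCHWOLF_FP32 flag from the environment, and the file-existence check is injected so the logic can be exercised without touching the filesystem.

```rust
use std::path::{Path, PathBuf};

/// Picks the model file for one model slot.
/// Order: explicit override > int8 default (unless fp32 is forced) > fp32 default.
fn resolve_model(
    env_override: Option<&Path>, // value of the model's DOCLING_* variable, if set
    force_fp32: bool,            // true when FLEISCHWOLF_FP32 is set
    int8_default: &Path,
    fp32_default: &Path,
    exists: impl Fn(&Path) -> bool,
) -> PathBuf {
    if let Some(path) = env_override {
        // An explicit path always wins, whatever precision it points at.
        return path.to_path_buf();
    }
    if !force_fp32 && exists(int8_default) {
        return int8_default.to_path_buf();
    }
    // Fallback: the int8 file is absent or full precision was requested.
    fp32_default.to_path_buf()
}
```

In production the last argument would be something like |p| p.exists(); in a test a closure over a fixed set of paths is enough to cover all four verified cases (int8 default, forced fp32, explicit override, fp32 fallback).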

claude added 9 commits July 2, 2026 16:29

docs(pdf): mark implemented backlog items in PDF_PERFORMANCE.md

9a4e4a0

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj

docs(pdf): add the 1913-page INT8 vs fp32 measurement (1.83x end-to-end)

0d05fd2

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj
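
A rough cross-check of the 1.83x figure, assuming layout.predict takes about 80% of the fp32 run (the share reported for large text documents) and that only this stage speeds up, by ~2.4x. Under those assumptions Amdahl's law predicts roughly 1.875x, in the same range as the measured value. The function name is illustrative and not part of the repository.

```rust
/// Amdahl's law: overall speedup when a fraction `p` of the runtime
/// becomes `s` times faster and the rest is unchanged.
fn amdahl_speedup(p: f64, s: f64) -> f64 {
    1.0 / ((1.0 - p) + p / s)
}
```

For example, amdahl_speedup(0.8, 2.4) evaluates 1 / (0.2 + 0.8 / 2.4); the remaining 20% of the runtime caps the end-to-end gain at 5x no matter how fast layout becomes.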

artiz force-pushed the claude/docling-rust-review-profile-d26fdq branch from 074b820 to 4038d22 Compare July 2, 2026 16:44

artiz merged commit 67cd148 into master Jul 3, 2026
3 checks passed

artiz deleted the claude/docling-rust-review-profile-d26fdq branch July 3, 2026 08:12

artiz mentioned this pull request Jul 3, 2026

PDF pipeline round 2: pool memory halved, KV-cache TableFormer export, 17× SIMD page downscale #27

Merged
