PDF pipeline: performance review, INT8 quantization, decode-loop and text-layer optimizations #26

Merged
artiz merged 9 commits into master from claude/docling-rust-review-profile-d26fdq
Jul 3, 2026

artiz (Owner) commented Jul 2, 2026

Summary

Post-migration performance review of the PDF pipeline: profiling across the test corpus (including the 1913-page dotnet-csharp-language-reference.pdf), validated optimizations, and a ranked backlog in the new PDF_PERFORMANCE.md.

Profiling shows ONNX inference is 85–95% of PDF conversion time (layout.predict alone is 80% on large text documents), so the work targets the model path:

  • INT8 quantization, on by default: scripts/quantize_models.py produces a calibrated static QDQ INT8 (Conv-only) layout model (~2.4× faster layout, 1.83× end-to-end measured on the 1913-page doc; groundtruth conformance unchanged at 812 vs 833 summed diff-lines; the rejected full-INT8 and dynamic variants are documented) and a dynamic-INT8 TableFormer decoder (byte-identical output, ~10% faster tables). The pipeline prefers the *_int8 files automatically when they sit next to the fp32 models; FLEISCHWOLF_FP32=1 forces full precision and explicit DOCLING_* paths always win. The conformance/groundtruth scripts pin fp32, so snapshots stay deterministic.
  • Toolchain integration: publish-models.yml quantizes after export and publishes the int8 release assets; download_dependencies.sh fetches them by default (--no-int8 skips); pdf_setup.sh quantizes locally unless FLEISCHWOLF_FP32=1; examples/Dockerfile bakes both precisions and defaults to int8 (--build-arg INT8=0 selects fp32).
  • TableFormer decode loop: the growing KV cache was extracted and re-wrapped on every autoregressive step (O(steps²) copy traffic); it now feeds the previous step's out_cache value straight back, and the encoder outputs stay owned ort values. ~9% faster structure decode, byte-identical output.
  • Text layer: line_cells/word_cells ran the identical build+contract pass twice per page; one shared contraction now emits both views (~1.25× faster --no-ocr conversion, identical output).
  • Fix: the " (double-quote) show-text operator ignored its word-spacing and char-spacing operands (PDF 32000-1 §9.4.3), so documents using the aw ac string " form got wrong inter-word advances.
  • Instrumentation: ocr.page is now a timed stage; tableformer splits into structure / inter_area sub-stages.

Measured head-to-head vs Python docling (16-page 2203.01017v2.pdf, int8): 4.3× faster warm conversion, 4.7× end-to-end.

Testing

  • Full workspace test suite green (12 suites incl. the new fleischwolf-rag), rebased on v0.13.0 master.
  • INT8 quality gate: Markdown A/B over the whole PDF+scanned corpus vs fp32, plus per-fixture diff-line distance against tests/data/pdf/groundtruth (see PDF_PERFORMANCE.md tables).
  • All model-resolution paths verified against reference outputs: int8 default, FLEISCHWOLF_FP32=1, explicit DOCLING_* override, and fp32 fallback when the int8 files are absent.
  • Both code optimizations verified byte-identical on corpus output with interleaved old/new-binary benchmarks.
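The diff-line distance used in the quality gate can be illustrated with a small sketch. The exact metric lives in the repository's scripts and is not shown in this PR, so the definition below (removed plus added lines between groundtruth and output) is an assumption, and `diff_lines` and `summed_diff_lines` are hypothetical names.

```python
import difflib


def diff_lines(actual: str, expected: str) -> int:
    """Count lines removed from `expected` plus lines added in `actual`."""
    matcher = difflib.SequenceMatcher(
        a=expected.splitlines(), b=actual.splitlines(), autojunk=False
    )
    count = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A replace contributes both sides; insert/delete only one.
            count += (i2 - i1) + (j2 - j1)
    return count


def summed_diff_lines(pairs: list[tuple[str, str]]) -> int:
    """Sum the per-fixture distance over (actual, expected) pairs."""
    return sum(diff_lines(actual, expected) for actual, expected in pairs)


fixtures = [
    ("# Title\nbody\n", "# Title\nbody\n"),        # identical output
    ("# Title\nb0dy\n", "# Title\nbody\n"),        # one changed line
    ("# Title\nbody\nextra\n", "# Title\nbody\n"), # one added line
]
print(summed_diff_lines(fixtures))
```

Comparing the summed value for the int8 and fp32 runs against the same groundtruth gives a single number per precision, which is the shape of the "812 vs 833" comparison quoted in the summary.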

After merge

Run gh workflow run publish-models.yml once so the models-v1 release hosts the int8 assets and download_dependencies.sh picks them up with zero local Python.

🤖 Generated with Claude Code

https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj

claude added 9 commits July 2, 2026 16:29
The "aw ac string" form of the show-text operator must set word spacing
(Tw) and char spacing (Tc) from its first two operands before showing the
string (PDF 32000-1 §9.4.3); they were silently dropped, giving wrong
inter-word advances in documents that use " instead of explicit Tw/Tc.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj
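The effect of the fix above can be sketched in a few lines of Python. This is an illustration only, not the project's parser: `TextState`, `string_advance` and `show_text_dquote` are hypothetical names, the font is a toy one with a single fixed glyph width, and horizontal scaling is taken as 1. PDF 32000-1 defines `aw ac string "` as equivalent to `aw Tw`, `ac Tc`, then `string '`.

```python
from dataclasses import dataclass


@dataclass
class TextState:
    """Subset of the PDF text state that affects glyph advances."""
    char_spacing: float = 0.0  # Tc
    word_spacing: float = 0.0  # Tw
    font_size: float = 1.0     # Tfs
    glyph_width: float = 0.5   # w0 in text space units (toy fixed-width font)


def string_advance(state: TextState, text: str) -> float:
    """Total horizontal advance of `text`.

    Per glyph: w0 * Tfs + Tc, plus Tw when the glyph is a space (code 32).
    """
    total = 0.0
    for ch in text:
        total += state.glyph_width * state.font_size + state.char_spacing
        if ch == " ":
            total += state.word_spacing
    return total


def show_text_dquote(state: TextState, aw: float, ac: float, text: str) -> float:
    """Handle `aw ac string "`: set Tw and Tc first, then show the string.

    The move to the next line that the operator also performs is omitted.
    """
    state.word_spacing = aw
    state.char_spacing = ac
    return string_advance(state, text)


state = TextState(font_size=10.0)
buggy = string_advance(state, "a b")              # operands dropped: Tw = Tc = 0
fixed = show_text_dquote(state, 2.0, 0.5, "a b")  # operands applied
print(buggy, fixed)
```

Note that the operator updates the text state, so the new spacing also applies to later show-text operators until another `Tw`/`Tc` changes it.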
Profiling the pipeline (FLEISCHWOLF_TIMING) shows ONNX inference is
85-95% of PDF conversion time: layout.predict alone is 80% on a
1913-page text document, TableFormer up to 50% on table-heavy pages.

- scripts/quantize_models.py: calibrated static QDQ INT8 (Conv-only) of
  the layout model — 2.4x faster layout, ~1.4-1.7x end-to-end, corpus
  groundtruth conformance unchanged (812 vs 833 summed diff-lines);
  plus dynamic INT8 of the TableFormer decoder (byte-identical output,
  ~10% faster tables). Opt in via DOCLING_LAYOUT_ONNX /
  DOCLING_TABLEFORMER_DECODER.
- timing: instrument ocr.page (previously invisible in profiles) and
  split tableformer into structure vs inter_area sub-stages.
- PDF_PERFORMANCE.md: full profiling data, validated/rejected
  quantization variants, and a ranked backlog of further ideas.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj
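As a rough consistency check on the figures above, Amdahl's law relates a per-stage speedup to the end-to-end one. The sketch below assumes the 80% layout.predict share and the 2.4x layout speedup from the commit message and that every other stage is unchanged; `end_to_end_speedup` is a hypothetical helper, not part of the repository.

```python
def end_to_end_speedup(stage_share: float, stage_speedup: float) -> float:
    """Amdahl's law: overall speedup when a single stage gets faster.

    stage_share   -- fraction of total time spent in the stage (0..1)
    stage_speedup -- how many times faster that stage becomes (> 0)
    """
    if not 0.0 <= stage_share <= 1.0:
        raise ValueError("stage_share must be within [0, 1]")
    if stage_speedup <= 0.0:
        raise ValueError("stage_speedup must be positive")
    return 1.0 / ((1.0 - stage_share) + stage_share / stage_speedup)


layout_share = 0.80    # layout.predict share on the 1913-page text document
layout_speedup = 2.4   # static QDQ INT8 (Conv-only) layout model

estimate = end_to_end_speedup(layout_share, layout_speedup)
ceiling = end_to_end_speedup(layout_share, float("inf"))
print(f"estimated end-to-end: {estimate:.3f}x, ceiling: {ceiling:.1f}x")
```

The estimate comes out at 1.875x for that document, and the ceiling of 5x shows how much the remaining 20% limits any further layout-only gains. Documents where layout has a smaller share would land lower, which is one plausible reading of the ~1.4-1.7x range quoted above.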
The autoregressive decode loop extracted the whole self-attention cache
(out_cache) to a Vec and re-wrapped it as a tensor on every step; the
cache grows each step, so that was O(steps^2) float copy traffic per
table. Feed the previous step's out_cache value straight back into the
next run instead, and hold the encoder's cross_k/cross_v/enc_out as
owned ort values so decode steps and the bbox run borrow them directly
(the bbox path also cloned the full encoder output per table).

~9% faster table-structure decode on the 12-table 2203.01017v2.pdf
(structure stage 10.3 -> 9.4 s), byte-identical corpus output.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj
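The quadratic copy traffic described above can be made concrete with a toy model. This is a sketch in Python, not the ort-based Rust code: `decode_step` is a hypothetical stand-in for one decoder run, the cache grows by one float per step, and only the host-side extract-and-re-wrap copies are counted.

```python
def decode_step(cache: list[float], token: int) -> list[float]:
    """Stand-in for one decoder run: returns the cache grown by one entry."""
    return cache + [float(token)]


def run_old(steps: int) -> tuple[list[float], int]:
    """Extract the whole cache to a host vector and re-wrap it every step."""
    cache: list[float] = []
    host_copies = 0
    for token in range(steps):
        out_cache = decode_step(cache, token)
        extracted = list(out_cache)    # full copy of the growing cache
        host_copies += len(extracted)
        cache = extracted              # re-wrapped as the next input
    return cache, host_copies


def run_new(steps: int) -> tuple[list[float], int]:
    """Feed the previous step's output value straight back, no host copy."""
    cache: list[float] = []
    for token in range(steps):
        cache = decode_step(cache, token)
    return cache, 0


old_cache, old_copies = run_old(100)
new_cache, new_copies = run_new(100)
print(old_copies, new_copies, old_cache == new_cache)
```

With n steps the old loop copies 1 + 2 + ... + n = n(n+1)/2 elements, so the cost grows quadratically with table length while the decoded result stays the same.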
line_cells and word_cells each ran the identical build_cells + contract
pass over the page's glyphs — the expensive step of the text layer —
and the default path calls both for every page. line_and_word_cells
runs the contraction once and emits both views from the contracted
cells (lines from cell text/box, words from the recorded segments).

Roughly halves the parser text-layer cost; the --no-ocr conversion of
a 16-page paper is ~1.25x faster end-to-end. Corpus output is identical on
both the ML and --no-ocr paths.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj
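The shape of this change can be sketched as follows. The function names line_cells, word_cells and line_and_word_cells come from the commit message; everything else is an assumption made for illustration: glyphs are single characters, `contract` is a toy contraction that splits on newlines and spaces, and the call counter exists only to show how often the expensive pass runs.

```python
from dataclasses import dataclass, field

contract_calls = 0  # illustration only: counts runs of the expensive pass


@dataclass
class Cell:
    text: str                                          # full line text
    segments: list[str] = field(default_factory=list)  # recorded word segments


def contract(glyphs: list[str]) -> list[Cell]:
    """Toy contraction: a newline ends a line, whitespace separates words."""
    global contract_calls
    contract_calls += 1
    cells = []
    for line in "".join(glyphs).split("\n"):
        if line:
            cells.append(Cell(text=line, segments=line.split()))
    return cells


def line_cells(glyphs: list[str]) -> list[str]:
    return [cell.text for cell in contract(glyphs)]


def word_cells(glyphs: list[str]) -> list[str]:
    return [word for cell in contract(glyphs) for word in cell.segments]


def line_and_word_cells(glyphs: list[str]) -> tuple[list[str], list[str]]:
    """Run the contraction once and derive both views from the same cells."""
    cells = contract(glyphs)
    lines = [cell.text for cell in cells]
    words = [word for cell in cells for word in cell.segments]
    return lines, words


page = list("ab cd\nef")
separate = (line_cells(page), word_cells(page))  # two contraction passes
calls_separate = contract_calls
shared = line_and_word_cells(page)               # one contraction pass
calls_shared = contract_calls - calls_separate
print(calls_separate, calls_shared, separate == shared)
```

Both paths yield the same lines and words; the shared entry point simply avoids repeating the pass when a caller needs both views, which is the default path per page.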
Make the validated INT8 quantization (scripts/quantize_models.py) a
first-class option everywhere the fp32 models flow:

- publish-models.yml quantizes after export and publishes
  layout_heron_int8.onnx / decoder_int8.onnx as release assets
  (continue-on-error: fp32 assets publish regardless).
- download_dependencies.sh --int8 fetches them, printing the DOCLING_*
  opt-in exports; falls back to local-quantization instructions when a
  release tag doesn't host them.
- pdf_setup.sh quantizes locally with FLEISCHWOLF_INT8=1 and echoes the
  int8 export lines when the files exist.
- performance.sh benchmarks the int8 stack with FLEISCHWOLF_INT8=1.
- examples/Dockerfile bakes both precisions, calibrating on the corpus
  copied into the models stage; the runtime's *_active symlinks default
  to int8 (--build-arg INT8=0 for fp32, or override per container via
  the DOCLING_* env vars).
- quantize_models.py takes FLEISCHWOLF_MODELS_DIR /
  FLEISCHWOLF_CALIBRATION_DIR so it runs outside a checkout layout
  (verified byte-identical output under the override).
- README: INT8 section under 'Getting the ML models'.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj
The PDF head-to-head predates the pipeline optimizations on this branch;
re-measured with scripts/performance.sh (fp32): 2.3x warm conversion,
4.1x end-to-end, 2.3x less peak memory. Note the INT8 option on top and
fix the stale 47-page label (the fixture row now uses the re-measured
picture_classification.pdf).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj
When layout_heron_int8.onnx / tableformer/decoder_int8.onnx sit next to
the fp32 models at the default paths, the pipeline now loads them
automatically — the quantized stack is conformance-validated (see
PDF_PERFORMANCE.md) and ~2x faster on layout, so it is the right
default for CPU deployments. Resolution order per model: explicit
DOCLING_* env override > int8 default file (unless FLEISCHWOLF_FP32 is
set) > fp32 default file.

- download_dependencies.sh fetches the int8 assets by default
  (--no-int8 skips; --int8 kept as a no-op for compatibility).
- pdf_setup.sh quantizes locally unless FLEISCHWOLF_FP32=1.
- performance.sh mirrors the pipeline default (FLEISCHWOLF_FP32=1
  benchmarks fp32); the short-lived FLEISCHWOLF_INT8 toggle is gone.
- pdf_conformance.sh / pdf_groundtruth.sh already pin explicit fp32
  paths, so snapshot baselines stay byte-deterministic.

Verified: default resolves to int8 when present, FLEISCHWOLF_FP32=1
and explicit env paths both produce the fp32 reference output, and the
fp32 fallback engages when the int8 files are absent.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj
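The resolution order stated above can be written down as a small function. This is a Python sketch of the described behaviour, not the pipeline's Rust code: `resolve_model` is a hypothetical name, the fp32 file name layout_heron.onnx is assumed from the int8 name, and FLEISCHWOLF_FP32 is treated as set for any non-empty value.

```python
import tempfile
from pathlib import Path
from typing import Mapping


def resolve_model(env: Mapping[str, str], override_var: str, fp32_path: Path) -> Path:
    """Pick a model file: explicit override > int8 sibling > fp32 default."""
    # 1. An explicit DOCLING_* override always wins.
    override = env.get(override_var)
    if override:
        return Path(override)
    # 2. The int8 sibling, unless FLEISCHWOLF_FP32 is set (assumed: non-empty).
    int8_path = fp32_path.with_name(f"{fp32_path.stem}_int8{fp32_path.suffix}")
    if not env.get("FLEISCHWOLF_FP32") and int8_path.is_file():
        return int8_path
    # 3. Fall back to the fp32 default.
    return fp32_path


with tempfile.TemporaryDirectory() as tmp:
    fp32 = Path(tmp) / "layout_heron.onnx"  # assumed fp32 file name
    fp32.touch()
    # No int8 file yet: the fp32 fallback engages.
    before = resolve_model({}, "DOCLING_LAYOUT_ONNX", fp32).name
    (Path(tmp) / "layout_heron_int8.onnx").touch()
    # int8 file present: it becomes the default.
    after = resolve_model({}, "DOCLING_LAYOUT_ONNX", fp32).name
    # FLEISCHWOLF_FP32 forces full precision even when int8 exists.
    forced = resolve_model({"FLEISCHWOLF_FP32": "1"}, "DOCLING_LAYOUT_ONNX", fp32).name
    print(before, after, forced)
```

Passing the environment as a mapping rather than reading os.environ directly keeps each of the four verified paths easy to exercise in isolation.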
artiz force-pushed the claude/docling-rust-review-profile-d26fdq branch from 074b820 to 4038d22 July 2, 2026 16:44
artiz merged commit 67cd148 into master Jul 3, 2026
3 checks passed
artiz deleted the claude/docling-rust-review-profile-d26fdq branch July 3, 2026 08:12