PDF pipeline: performance review, INT8 quantization, decode-loop and text-layer optimizations #26
Merged
The " show-text operator (operand form: aw ac string ") must set word spacing (Tw) and character spacing (Tc) from its first two operands before showing the string (PDF 32000-1 §9.4.3); the operands were silently dropped, giving wrong inter-word advances in documents that use " instead of explicit Tw/Tc. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj
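The effect of the dropped operands can be illustrated with a minimal sketch of the text state, assuming a simple single-byte font. TextState, string_advance, show_text_quote and the widths table are hypothetical names for illustration, not the parser's actual types; the per-glyph advance follows the spec's formula (w0 × Tfs + Tc + Tw) × Th, with Tw applied only to the space code 32.

```python
from dataclasses import dataclass


@dataclass
class TextState:
    """Subset of the PDF text state that affects glyph advances (hypothetical type)."""
    font_size: float = 12.0     # Tfs
    char_spacing: float = 0.0   # Tc
    word_spacing: float = 0.0   # Tw
    horiz_scale: float = 1.0    # Th (1.0 means 100%)


def string_advance(state: TextState, text: bytes, widths: dict) -> float:
    """Total horizontal advance of `text` for a simple single-byte font.

    `widths` maps character codes to glyph widths in 1/1000 text-space units.
    """
    total = 0.0
    for code in text:
        w0 = widths.get(code, 500.0) / 1000.0
        # Word spacing applies only to the single-byte code 32 (space).
        tw = state.word_spacing if code == 32 else 0.0
        total += (w0 * state.font_size + state.char_spacing + tw) * state.horiz_scale
    return total


def show_text_quote(state: TextState, aw: float, ac: float, text: bytes, widths: dict) -> float:
    """Handle `aw ac string "`: set Tw and Tc first, then show the string."""
    state.word_spacing = aw   # first operand becomes Tw
    state.char_spacing = ac   # second operand becomes Tc
    # The move-to-next-line step is omitted here; it does not change the advance.
    return string_advance(state, text, widths)
```

With a 10-unit font, a 500-wide "a" and a 250-wide space, the string "a a" advances 12.5 units when the operands are ignored and 17.5 units with aw = 2 and ac = 1, which is the kind of drift the fix removes.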
Profiling the pipeline (FLEISCHWOLF_TIMING) shows ONNX inference is 85-95% of PDF conversion time: layout.predict alone is 80% on a 1913-page text document, TableFormer up to 50% on table-heavy pages.
- scripts/quantize_models.py: calibrated static QDQ INT8 (Conv-only) quantization of the layout model, giving 2.4x faster layout, ~1.4-1.7x end-to-end, and unchanged corpus groundtruth conformance (812 vs 833 summed diff-lines); plus dynamic INT8 of the TableFormer decoder (byte-identical output, ~10% faster tables). Opt in via DOCLING_LAYOUT_ONNX / DOCLING_TABLEFORMER_DECODER.
- timing: instrument ocr.page (previously invisible in profiles) and split tableformer into structure vs inter_area sub-stages.
- PDF_PERFORMANCE.md: full profiling data, validated/rejected quantization variants, and a ranked backlog of further ideas.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj
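The relation between the layout speedup and the end-to-end figure can be sanity-checked with Amdahl's law. This is only a back-of-the-envelope sketch: end_to_end_speedup is a hypothetical helper, and it assumes everything outside the layout stage runs at unchanged speed.

```python
def end_to_end_speedup(stage_share: float, stage_speedup: float) -> float:
    """Amdahl's law: overall speedup when one stage becomes `stage_speedup`x faster.

    `stage_share` is the fraction of total wall time spent in that stage
    before the optimization (0.0 to 1.0).
    """
    if not 0.0 <= stage_share <= 1.0:
        raise ValueError("stage_share must be within [0, 1]")
    if stage_speedup <= 0.0:
        raise ValueError("stage_speedup must be positive")
    return 1.0 / ((1.0 - stage_share) + stage_share / stage_speedup)


# Figures quoted in the commit: layout.predict is ~80% of the time on the
# large text document, and the INT8 layout model is ~2.4x faster.
estimate = end_to_end_speedup(0.80, 2.4)
print(f"estimated end-to-end speedup: {estimate:.3f}x")
```

An 80% layout share gives an upper-end estimate of roughly 1.9x. Documents where layout is a smaller share of the time land lower; an assumed 50% share, for instance, gives about 1.4x.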
The autoregressive decode loop extracted the whole self-attention cache (out_cache) to a Vec and re-wrapped it as a tensor on every step; the cache grows each step, so that was O(steps^2) float copy traffic per table. Feed the previous step's out_cache value straight back into the next run instead, and hold the encoder's cross_k/cross_v/enc_out as owned ort values so decode steps and the bbox run borrow them directly (the bbox path also cloned the full encoder output per table). ~9% faster table-structure decode on the 12-table 2203.01017v2.pdf (structure stage 10.3 -> 9.4 s), byte-identical corpus output. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj
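The quadratic copy traffic can be seen in a small counting model. This is a sketch in Python rather than the pipeline's own code: fake_decoder_step stands in for the ONNX decoder run, and the two loop functions are hypothetical names that only count host-side float copies.

```python
def fake_decoder_step(cache: list, floats_per_step: int) -> list:
    """Stand-in for one decoder run: returns the cache grown by one step."""
    return cache + [0.0] * floats_per_step


def copied_with_round_trip(steps: int, floats_per_step: int) -> int:
    """Host-side floats copied when the cache is extracted and re-wrapped each step."""
    host_cache: list = []
    copied = 0
    for _ in range(steps):
        tensor_in = list(host_cache)      # re-wrap the host copy as a tensor
        copied += len(tensor_in)
        out_cache = fake_decoder_step(tensor_in, floats_per_step)
        host_cache = list(out_cache)      # extract the whole cache to the host
        copied += len(host_cache)
    return copied


def copied_with_feedback(steps: int, floats_per_step: int) -> int:
    """Host-side floats copied when out_cache is fed straight back in."""
    out_cache: list = []
    copied = 0
    for _ in range(steps):
        # The previous output is passed on as-is; no host round trip happens.
        out_cache = fake_decoder_step(out_cache, floats_per_step)
    return copied
```

In this model the round-trip variant copies floats_per_step × steps² floats, so doubling the number of decode steps quadruples the traffic, while the feedback variant copies none on the host side.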
line_cells and word_cells each ran the identical build_cells + contract pass over the page's glyphs (the expensive step of the text layer), and the default path calls both for every page. line_and_word_cells runs the contraction once and emits both views from the contracted cells (lines from cell text/box, words from the recorded segments). This roughly halves the parser text-layer cost; the --no-ocr conversion of a 16-page paper runs ~1.25x faster end-to-end. Corpus output is identical on both the ML and --no-ocr paths. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj
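The shared-contraction idea can be sketched as follows, under heavy simplification: glyphs are (char, x0, x1) tuples on a single baseline, and Cell, contract and the max_gap threshold are hypothetical stand-ins for the parser's real structures. Only the name line_and_word_cells comes from the commit.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    """A contracted run of glyphs plus the word segments seen while building it."""
    text: str
    box: tuple                      # (x0, x1) extent of the merged run
    segments: list = field(default_factory=list)  # (word, x0, x1) tuples


def contract(glyphs, max_gap: float = 1.0) -> list:
    """Merge glyphs (sorted by x0) into cells, recording word segments on the way."""
    cells: list = []
    open_word = False
    for ch, x0, x1 in glyphs:
        # A large horizontal gap starts a new cell.
        if not cells or x0 - cells[-1].box[1] > max_gap:
            cells.append(Cell(text="", box=(x0, x1)))
            open_word = False
        cell = cells[-1]
        cell.text += ch
        cell.box = (cell.box[0], x1)
        if ch == " ":
            open_word = False               # a space closes the current word
        elif open_word:
            word, wx0, _ = cell.segments[-1]
            cell.segments[-1] = (word + ch, wx0, x1)
        else:
            cell.segments.append((ch, x0, x1))
            open_word = True
    return cells


def line_and_word_cells(glyphs):
    """Run the contraction once and derive both views from its result."""
    cells = contract(glyphs)
    lines = [(c.text, c.box) for c in cells]
    words = [seg for c in cells for seg in c.segments]
    return lines, words
```

Because the word segments are recorded during the single contraction pass, the word view costs only a flattening step instead of a second pass over the glyphs.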
Make the validated INT8 quantization (scripts/quantize_models.py) a first-class option everywhere the fp32 models flow:
- publish-models.yml quantizes after export and publishes layout_heron_int8.onnx / decoder_int8.onnx as release assets (continue-on-error: fp32 assets publish regardless).
- download_dependencies.sh --int8 fetches them, printing the DOCLING_* opt-in exports; falls back to local-quantization instructions when a release tag doesn't host them.
- pdf_setup.sh quantizes locally with FLEISCHWOLF_INT8=1 and echoes the int8 export lines when the files exist.
- performance.sh benchmarks the int8 stack with FLEISCHWOLF_INT8=1.
- examples/Dockerfile bakes both precisions, calibrating on the corpus copied into the models stage; the runtime's *_active symlinks default to int8 (--build-arg INT8=0 for fp32, or override per container via the DOCLING_* env vars).
- quantize_models.py takes FLEISCHWOLF_MODELS_DIR / FLEISCHWOLF_CALIBRATION_DIR so it runs outside a checkout layout (verified byte-identical output under the override).
- README: INT8 section under 'Getting the ML models'.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj
The PDF head-to-head predates the pipeline optimizations on this branch; re-measured with scripts/performance.sh (fp32): 2.3x warm conversion, 4.1x end-to-end, 2.3x less peak memory. Note the INT8 option on top and fix the stale 47-page label (the fixture row now uses the re-measured picture_classification.pdf). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj
When layout_heron_int8.onnx / tableformer/decoder_int8.onnx sit next to the fp32 models at the default paths, the pipeline now loads them automatically: the quantized stack is conformance-validated (see PDF_PERFORMANCE.md) and ~2x faster on layout, so it is the right default for CPU deployments. Resolution order per model: explicit DOCLING_* env override > int8 default file (unless FLEISCHWOLF_FP32 is set) > fp32 default file.
- download_dependencies.sh fetches the int8 assets by default (--no-int8 skips; --int8 kept as a no-op for compatibility).
- pdf_setup.sh quantizes locally unless FLEISCHWOLF_FP32=1.
- performance.sh mirrors the pipeline default (FLEISCHWOLF_FP32=1 benchmarks fp32); the short-lived FLEISCHWOLF_INT8 toggle is gone.
- pdf_conformance.sh / pdf_groundtruth.sh already pin explicit fp32 paths, so snapshot baselines stay byte-deterministic.
Verified: default resolves to int8 when present, FLEISCHWOLF_FP32=1 and explicit env paths both produce the fp32 reference output, and the fp32 fallback engages when the int8 files are absent.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj
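The resolution order described in this commit can be sketched as a small function. This is an illustration in Python, not the pipeline's implementation: resolve_model is a hypothetical name, the fp32 file name passed by a caller is an assumption, and "set" is taken to mean any non-empty value of FLEISCHWOLF_FP32.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_model(env_var: str, int8_path: Path, fp32_path: Path,
                  env: Optional[Mapping[str, str]] = None) -> Path:
    """Pick a model file: explicit override > int8 default > fp32 default."""
    env = os.environ if env is None else env

    # 1. An explicit DOCLING_* path always wins.
    override = env.get(env_var)
    if override:
        return Path(override)

    # 2. Prefer the int8 file when it exists, unless fp32 is forced.
    #    Assumption: any non-empty FLEISCHWOLF_FP32 value counts as "set".
    if not env.get("FLEISCHWOLF_FP32") and int8_path.is_file():
        return int8_path

    # 3. Fall back to the fp32 default.
    return fp32_path
```

Passing the environment as a mapping keeps the function easy to check against all four cases the commit lists as verified: int8 default, forced fp32, explicit override, and fallback when the int8 file is absent.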
Summary
Post-migration performance review of the PDF pipeline: profiling across the test corpus (including the 1913-page dotnet-csharp-language-reference.pdf), validated optimizations, and a ranked backlog in the new PDF_PERFORMANCE.md.

Profiling shows ONNX inference is 85–95% of PDF conversion time (layout.predict alone is 80% on large text documents), so the work targets the model path:
- scripts/quantize_models.py produces a calibrated static QDQ INT8 (Conv-only) layout model (~2.4× faster layout, 1.83× end-to-end measured on the 1913-page doc; groundtruth conformance unchanged at 812 vs 833 summed diff-lines; the rejected full-INT8 and dynamic variants are documented) and a dynamic-INT8 TableFormer decoder (byte-identical output, ~10% faster tables).
- The pipeline prefers the *_int8 files automatically when they sit next to the fp32 models; FLEISCHWOLF_FP32=1 forces full precision and explicit DOCLING_* paths always win. The conformance/groundtruth scripts pin fp32, so snapshots stay deterministic.
- publish-models.yml quantizes after export and publishes the int8 release assets; download_dependencies.sh fetches them by default (--no-int8 skips); pdf_setup.sh quantizes locally unless FLEISCHWOLF_FP32=1; examples/Dockerfile bakes both precisions and defaults to int8 (--build-arg INT8=0).
- The decode loop now feeds the previous step's out_cache value straight back, and the encoder outputs stay owned ort values. ~9% faster structure decode, byte-identical output.
- line_cells / word_cells ran the identical build+contract pass twice per page; one shared contraction now emits both views (~1.25× faster --no-ocr conversion, identical output).
- The " show-text operator ignored its word/char-spacing operands (PDF 32000-1 §9.4.3); documents using aw ac string " got wrong inter-word advances.
- ocr.page is now a timed stage; tableformer splits into structure / inter_area sub-stages.

Measured head-to-head vs Python docling (16-page 2203.01017v2.pdf, int8): 4.3× faster warm conversion, 4.7× end-to-end.

Testing
- […] fleischwolf-rag), rebased on v0.13.0 master.
- […] tests/data/pdf/groundtruth (see PDF_PERFORMANCE.md tables).
- […] FLEISCHWOLF_FP32=1, explicit DOCLING_* override, and fp32 fallback when the int8 files are absent.

After merge
Run gh workflow run publish-models.yml once so the models-v1 release hosts the int8 assets and download_dependencies.sh picks them up with zero local Python.

🤖 Generated with Claude Code
https://claude.ai/code/session_017fKFbatf55BK5j8vr3dZjj