feat(pdf-quality) [slice 5 r1]: drop O'Reilly running headers by mrviduus · Pull Request #241 · mrviduus/textstack

mrviduus · 2026-05-23T00:50:33Z

Summary

First heuristic-ratchet round (feat-0007 slice 5). Studied the (messy →
cleaned) pair Phase 3 produced on AI Engineering ch5; the highest-signal
recurring fix Claude made was removing running headers in two shapes:

<p><strong>4 | Chapter 1: Introduction to Building AI Applications…</strong></p>
<p><strong>The Rise of AI Engineering | 3</strong></p>

The page number varies per page, so PdfTextExtractor's cross-page
identical-text filter can't catch them — but the structural signature
(small int + " | " + text on a short paragraph) is distinctive.
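To illustrate why an identical-text filter misses these headers, here is a minimal Python sketch of a cross-page repeat detector. It is not the project's PdfTextExtractor code: the function name `repeated_across_pages`, the `min_pages` threshold, and the sample page data are all hypothetical, chosen only to show the failure mode.

```python
from collections import Counter

def repeated_across_pages(pages, min_pages=3):
    """Return paragraph texts that appear verbatim on at least min_pages pages.

    Hypothetical stand-in for a cross-page identical-text filter.
    """
    counts = Counter()
    for paragraphs in pages:
        # Count each distinct text at most once per page.
        counts.update(set(paragraphs))
    return {text for text, n in counts.items() if n >= min_pages}

# Made-up sample pages: one truly identical line plus running headers
# whose page number changes on every page.
pages = [
    ["Sample footer", "The Rise of AI Engineering | 3", "Body text A."],
    ["Sample footer", "4 | Chapter 1: Introduction", "Body text B."],
    ["Sample footer", "The Rise of AI Engineering | 5", "Body text C."],
]

repeated = repeated_across_pages(pages)
print(repeated)
```

Only the line that is byte-for-byte identical on every page is flagged. Each running header is unique because of its page number, so a filter of this kind never sees it repeat.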

Changes

- PdfPageTextExtractor.IsArtifactNoise: new regex catches the two
  running-header shapes; capped at 200 chars to avoid false positives.
- IsArtifactNoise opened up to internal for direct tests
  (InternalsVisibleTo("TextStack.Extraction.Tests") in the csproj).
- 13 new tests: 5 header positives, 3 legacy artifact regressions, 4 prose
  non-matches, 1 length-cap guard.
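The header check described above can be sketched in Python as follows. This is an assumption-laden illustration, not the regex in IsArtifactNoise: the name `looks_like_running_header`, the 1-4 digit reading of "small int", and the choice to run on plain paragraph text (markup already stripped) are all hypothetical. Only the two header shapes and the 200-character cap come from the PR description.

```python
import re

MAX_HEADER_CHARS = 200  # length cap stated in the PR description

# Shape 1: page number first, e.g. "4 | Chapter 1: Introduction…"
# Shape 2: page number last,  e.g. "The Rise of AI Engineering | 3"
# Assumption: "small int" is read here as 1-4 digits.
_PAGE_FIRST = re.compile(r"\d{1,4} \| \S.*")
_PAGE_LAST = re.compile(r"\S.* \| \d{1,4}")

def looks_like_running_header(text):
    """Return True if a paragraph's text matches either running-header shape."""
    text = text.strip()
    if not text or len(text) > MAX_HEADER_CHARS:
        return False
    # fullmatch requires the whole single-line paragraph to fit the shape;
    # "." does not cross newlines, so multi-line paragraphs never match.
    return bool(_PAGE_FIRST.fullmatch(text) or _PAGE_LAST.fullmatch(text))

for sample in [
    "4 | Chapter 1: Introduction to Building AI Applications…",
    "The Rise of AI Engineering | 3",
    "Plain prose with no separator.",
    "7 | " + "x" * 300,
]:
    print(looks_like_running_header(sample), repr(sample[:40]))
```

Anchoring the page number to the very start or very end of the paragraph keeps prose that merely contains " | " from matching, and the length cap guards against long paragraphs that happen to begin or end with a number and a pipe.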

Now: these headers drop at extraction time, never reach the body, never
trigger Phase 3. Claude usage on books with this pattern shrinks.

Tests

220 extraction tests pass.

Notes

Ratchet round 2 needs more pairs — gated on real user uploads accumulating
material in data/pdf-cleanup-dataset/.

🤖 Generated with Claude Code

First heuristic-ratchet round. Studied the Claude cleanup pair from the prod test run; the highest-signal recurring fix Claude made on AI Engineering was removing running-header paragraphs in two shapes:

4 | Chapter 1: Introduction…
The Rise of AI Engineering | 3

The page number varies per page, so PdfTextExtractor's cross-page identical-text filter couldn't catch them, but the structural signature (small int + " | " + text on a short paragraph) is distinctive. Encoded as a regex in PdfPageTextExtractor.IsArtifactNoise: these now drop at extraction time, never reach the body, never need an LLM call.

Made IsArtifactNoise internal + InternalsVisibleTo for direct unit tests. 13 new tests (5 header positives, 3 legacy regressions, 4 prose non-matches, 1 length cap).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mrviduus merged commit ed5d54a into main May 23, 2026
5 checks passed

mrviduus deleted the feat/pdf-quality-ratchet-r1 branch May 23, 2026 01:45

mrviduus mentioned this pull request May 23, 2026

fix(pdf-quality): preserve typography in Phase 3 + ratchet R1 log #242

Merged

mrviduus merged 1 commit into main from feat/pdf-quality-ratchet-r1
