feat(pdf-quality) [slice 5 r1]: drop O'Reilly running headers #241
Merged
First heuristic-ratchet round. Studied the Claude cleanup pair from the
prod test run; the highest-signal recurring fix Claude made on AI
Engineering was removing running-header paragraphs in two shapes:
<p><strong>4 | Chapter 1: Introduction…</strong></p>
<p><strong>The Rise of AI Engineering | 3</strong></p>
The page number varies per page, so PdfTextExtractor's cross-page
identical-text filter couldn't catch them — but the structural signature
(small int + " | " + text on a short paragraph) is distinctive. Encoded
as a regex in PdfPageTextExtractor.IsArtifactNoise — these now drop at
extraction time, never reach the body, never need an LLM call.
Made IsArtifactNoise internal + InternalsVisibleTo for direct unit tests.
13 new tests (5 header positives, 3 legacy regressions, 4 prose
non-matches, 1 length cap).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
First heuristic-ratchet round (feat-0007 slice 5). Studied the (messy →
cleaned) pair Phase 3 produced on AI Engineering ch5; the highest-signal
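recurring fix Claude made was removing running headers in two shapes:
<p><strong>4 | Chapter 1: Introduction…</strong></p>
<p><strong>The Rise of AI Engineering | 3</strong></p>
The page number varies per page, so PdfTextExtractor's cross-page
identical-text filter can't catch them, but the structural signature
(a small integer and a title joined by " | ", in either order, on a
short paragraph) is distinctive.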
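To illustrate why an exact-match filter misses these headers, here is a minimal Python sketch. It is not the project's PdfTextExtractor code (which is .NET and not shown in this PR); the function name `repeated_lines`, the `min_pages` threshold, and the sample pages are all hypothetical.

```python
from collections import Counter


def repeated_lines(pages, min_pages=3):
    """Return paragraph texts that appear verbatim on at least min_pages pages.

    Hypothetical stand-in for a cross-page identical-text filter.
    """
    counts = Counter()
    for paragraphs in pages:
        # Count each distinct text once per page.
        counts.update(set(paragraphs))
    return {text for text, n in counts.items() if n >= min_pages}


# Made-up sample pages: the running headers differ only by page number.
pages = [
    ["4 | Chapter 1: Introduction", "Body text A", "Example Publisher"],
    ["The Rise of AI Engineering | 5", "Body text B", "Example Publisher"],
    ["6 | Chapter 1: Introduction", "Body text C", "Example Publisher"],
    ["The Rise of AI Engineering | 7", "Body text D", "Example Publisher"],
]

repeated = repeated_lines(pages)
```

In this sample only the byte-identical footer is flagged. Each running header is unique as a string because of its page number, so it slips through and needs a structural check instead.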
Changes
- PdfPageTextExtractor.IsArtifactNoise: new regex catches the two running-header shapes; capped at 200 chars to avoid false positives.
- IsArtifactNoise opened up to internal for direct tests (InternalsVisibleTo("TextStack.Extraction.Tests") in the csproj).
- 13 new tests: 5 header positives, 3 legacy regressions, 4 prose non-matches, 1 length-cap guard.
Now: these headers drop at extraction time, never reach the body, never
trigger Phase 3. Claude usage on books with this pattern shrinks.
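The structural check described above could look roughly like the following Python sketch. It is an approximation for illustration, not the regex that landed in IsArtifactNoise: the pattern, the one-to-four-digit reading of "small int", and the name `is_running_header` are assumptions. Only the two header shapes and the 200-character cap come from this PR.

```python
import re

# Length cap from the PR description; longer paragraphs are treated as prose.
MAX_HEADER_LEN = 200

# Assumed pattern. "Small int" is read here as one to four digits.
# Shape 1: "<page> | <title>"    e.g. "4 | Chapter 1: Introduction"
# Shape 2: "<title> | <page>"    e.g. "The Rise of AI Engineering | 3"
RUNNING_HEADER = re.compile(r"\d{1,4} \| \S.*|.*\S \| \d{1,4}")


def is_running_header(text: str) -> bool:
    """Return True if the paragraph text looks like a running header."""
    text = text.strip()
    if not text or len(text) > MAX_HEADER_LEN:
        return False
    # fullmatch requires the whole paragraph to fit one of the two shapes.
    return RUNNING_HEADER.fullmatch(text) is not None


paragraphs = [
    "4 | Chapter 1: Introduction",
    "The Rise of AI Engineering | 3",
    "The model has 7 layers.",
    "4 | " + "x" * 300,  # exceeds the length cap
]
kept = [p for p in paragraphs if not is_running_header(p)]
```

Requiring a full match keeps ordinary sentences that merely contain a pipe from being dropped, and the length cap guards against long body paragraphs that happen to start or end with a number and a pipe.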
Tests
220 extraction tests pass.
Notes
Ratchet round 2 needs more pairs, gated on real user uploads accumulating
material in data/pdf-cleanup-dataset/.
🤖 Generated with Claude Code