fix(pdf): bug-report sweep — content-TOC + multi-col + bullets + file-check + root-cause cleanup by mrviduus · Pull Request #245 · mrviduus/textstack

mrviduus · 2026-05-23T18:24:01Z

Summary

Two rounds of bug-report cleanup on top of PR #244. The second round (latest commit) traces each remaining bug to its underlying invariant.

| # | Bug | Fix |
|---|-----|-----|
| Critical | Round-1 content-TOC detection silently never matched in prod | `ProcessingPipeline.ExtractPlainText` collapses `\n` → " ", so the plainText input had one "line". Detection now runs against chapter HTML, splitting on `</p>\|</h\d>\|</li>`. |
| 1 | Index/Glossary false positive (also leader-dotted) | Two guards: (a) only drop chapters in the front half of the book; (b) `IsKnownBackMatter` veto when bookmark title matches Index/Glossary/Bibliography/References/Notes (en + ru/uk). |
| 2 | Multi-column 50% threshold flip-flop on borderline pages | Dominance ratio instead: modal margin trusted only when ≥ 2.5× the runner-up. Real 2-column pages sit at ~1×, single-column at ~17×. |
| 3 | `RetryAsync` opened a file stream just to probe existence | New `IFileStorageService.ExistsAsync(path)`; `LocalFileStorageService` implements as `File.Exists`. |
| Drive-by | `PdfToHtmlConverter.plainBuilder` was dead code | Built via `AppendLine` in the loop, never used (return value comes from `HtmlCleaner.Clean`). Removed. |
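
The root cause in the Critical row is easy to reproduce in isolation. The following is a minimal Python sketch, not the project's .NET code: `collapse_whitespace` and `split_html_blocks` are hypothetical stand-ins for the whitespace-collapsing step and the HTML-based split described in the table, and the sample strings are invented for illustration.

```python
import re

# Hypothetical stand-in for the whitespace-collapsing step described above:
# every run of whitespace, including newlines, becomes a single space.
WHITESPACE = re.compile(r"\s+")


def collapse_whitespace(text: str) -> str:
    return WHITESPACE.sub(" ", text).strip()


def split_html_blocks(html: str) -> list:
    """Split on closing block tags, then strip any remaining tags."""
    parts = re.split(r"</p>|</h\d>|</li>", html)
    texts = [re.sub(r"<[^>]+>", "", part).strip() for part in parts]
    return [text for text in texts if text]


# Invented sample input: three TOC-like entries, one per line.
raw = "Chapter 1 .... 3\nChapter 2 .... 17\nChapter 3 .... 42"
collapsed = collapse_whitespace(raw)

# After collapsing, splitting on "\n" can only ever yield one "line".
lines_after_collapse = collapsed.split("\n")

# The same entries as HTML keep their boundaries via the closing tags.
html = "<p>Chapter 1 .... 3</p><p>Chapter 2 .... 17</p><p>Chapter 3 .... 42</p>"
blocks = split_html_blocks(html)
```

With only one line to inspect, any detector that requires a minimum number of lines rejects every input, which matches the "silently never matched" symptom.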

Plus the original PR #245 changes:

- Content-level TOC detection (`LooksLikeTableOfContentsBody`).
- Bullet glyph set extended (◇ ❖ ▶ ➤ ★ ✓ ✗ etc.).

Tests

- `dotnet test tests/TextStack.Extraction.Tests` → 287 passing (14 new in round 2: IsKnownBackMatter matches/non-matches, HTML-based LooksLikeTableOfContentsBody for leader-dot, ellipsis, prose negative, too-short, null/empty).
- Solution build clean.

Rollback

No flag. Revert the commit(s) to restore prior behaviour. Re-extracts only — already-extracted chapters are not retroactively re-evaluated.

🤖 Generated with Claude Code

…-check

Four follow-ups from PR #244's bug report list:

1. Content-level TOC detection. The bookmark-title-only path missed TOCs that came in via the page-split fallback (no bookmark, chapter labeled "Pages 1–15"). Now FrontMatterFilter.LooksLikeTableOfContentsBody inspects the plain text: ≥40% of substantive lines ending in a leader-dot run (or "…") + page number ⇒ TOC. Same single-chapter safety guard.
2. Multi-column / mixed-layout guard for StartsWithIndent. The modal left margin is now only trusted when it covers ≥50% of all lines. On a 2-column academic paper the modal share is well under half; we fall back to y-gap and bullet detection only, instead of over-splitting on every column shift.
3. Bullet glyph set expanded (◇ ❖ ❍ ▶ ▸ ▻ ➤ ➔ ➢ ★ ☆ ✓ ✔ ✗ ✘) to cover modern textbook list markers.
4. RetryAsync now probes the backing file via storage.GetFileAsync before queuing the job. Previously, a missing source returned success and left the book stuck in Processing forever.

Tests: 273 passing in Extraction.Tests, 8 new across content-TOC detection (leader-dot, ellipsis, prose negative, too-short, null) and expanded bullet glyph coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
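
The 40% rule in item 1 can be sketched as follows. This is an illustrative Python translation, not the project's `LooksLikeTableOfContentsBody`: the regex, the definition of a "substantive" line (any non-blank line), and the minimum of 5 lines are assumptions; only the 40% share and the leader-dot/ellipsis + page-number shape come from the commit message.

```python
import re
from typing import Iterable, Optional

# Assumed pattern: a run of 3+ dots or a single ellipsis character,
# optional whitespace, then a page number at the end of the line.
TOC_ENTRY = re.compile(r"(?:\.{3,}|…)\s*\d+\s*$")


def looks_like_toc_body(
    lines: Optional[Iterable[str]],
    min_lines: int = 5,        # assumption: too-short inputs never match
    min_share: float = 0.40,   # from the commit message: >= 40% of lines
) -> bool:
    if lines is None:
        return False
    # Assumption: "substantive" simply means non-blank after trimming.
    substantive = [ln.strip() for ln in lines if ln and ln.strip()]
    if len(substantive) < min_lines:
        return False
    hits = sum(1 for ln in substantive if TOC_ENTRY.search(ln))
    return hits / len(substantive) >= min_share


# Invented sample inputs.
toc = [
    "Preface ..... 1",
    "Chapter 1 ..... 5",
    "Chapter 2 … 19",
    "Notes on the text",
    "Index ..... 200",
]
prose = [
    "It was a quiet morning.",
    "Nobody had expected the rain.",
    "The train left at 9.",
    "She counted the cars as they passed.",
    "Then the platform was empty again.",
]
```

Note that the last sample entry, "Index ..... 200", matches the same pattern as a TOC line, which is why a body-only signal cannot tell a table of contents from an index.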

Senior-dev pass over the bug reports from round 1. Each one was traced to its underlying invariant; the fix removes the bug instead of just working around the symptom.

PR #245 round-1 plain-text TOC detection silently DID NOT WORK
============================================================

Root cause: ProcessingPipeline.ExtractPlainText runs WhitespaceRegex.Replace(text, " ") which collapses '\n' into a single space. LooksLikeTableOfContentsBody split on '\n', got 1 line, < 5 significant → always returned false.

Fix: detection now operates on the chapter HTML, splitting on </p>|</h\d>|</li> instead of newlines. The HTML retains paragraph boundaries by construction (PdfToHtmlConverter emits one <p> per extracted paragraph).

#1 Index/Glossary false positive on content-detected TOC
=========================================================

Root cause: Index and Glossary are *also* leader-dotted "term … 47". The single "look like TOC body" signal isn't enough.

Fix: two extra guards.
- Position: only drop chapters in the front half of the book.
- Title: new IsKnownBackMatter vetoes the drop when the bookmark title is Index / Glossary / Bibliography / References / Notes / Abbreviations / Colophon (en + ru/uk).

#2 Multi-column threshold flips on borderline pages
====================================================

Root cause: hard 50% modal-coverage cutoff is brittle.

Fix: dominance ratio. Modal margin trusted only when its count is ≥ 2.5× the runner-up. Real 2-column pages sit near 1.0× (~40/40); single-column body pages sit near 17× (~85/5). Cutoff is far from either distribution.

#3 RetryAsync opened a file stream just to probe existence
===========================================================

Root cause: IFileStorageService had no existence primitive, so the guard had to use the heavy GetFileAsync.

Fix: new ExistsAsync(path) on the interface, implemented in LocalFileStorageService as File.Exists(GetFullPath). RetryAsync uses it; old GetFileAsync path replaced.

Drive-by cleanup: PdfToHtmlConverter.plainBuilder
=================================================

Built up via AppendLine in the loop, never used; the returned plainText comes from HtmlCleaner.Clean's pipeline. Removed to stop implying the local copy is authoritative.

Tests: 287 passing in Extraction.Tests (+14 new across IsKnownBackMatter, HTML-based LooksLikeTableOfContentsBody).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
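
The dominance-ratio rule from #2 can be sketched in a few lines. This is a hedged Python illustration, not the project's code: the function name, the input shape (a list of left-margin x-coordinates), the bucketing of margins by rounding, and the handling of empty or single-margin input are all assumptions. Only the 2.5× cutoff and the example splits (40/40, 85/5) come from the commit message.

```python
from collections import Counter
from typing import Iterable

DOMINANCE_RATIO = 2.5  # from the commit message


def modal_margin_is_trusted(
    left_margins: Iterable[float],
    ratio: float = DOMINANCE_RATIO,
    bucket: float = 1.0,  # assumption: margins are grouped by rounding
) -> bool:
    """Trust the most common left margin only if it dominates the runner-up."""
    counts = Counter(round(x / bucket) for x in left_margins)
    if not counts:
        return False  # assumption: with no lines there is nothing to trust
    ranked = counts.most_common(2)
    if len(ranked) == 1:
        return True  # assumption: a single margin trivially dominates
    top, runner_up = ranked[0][1], ranked[1][1]
    return top >= ratio * runner_up


# Invented pages, built from the splits quoted in the commit message.
two_column = [72.0] * 40 + [320.0] * 40      # ~1.0x
single_column = [72.0] * 85 + [90.0] * 5     # 17x
```

Comparing the top two counts, rather than the top count against the page total, keeps the decision stable when a page carries extra lines (headers, captions) that belong to neither margin.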

Three remaining bug reports addressed at their underlying invariant.

#2 (HTML coupling). Root cause: detection ran on chapter HTML and split on </p>|</h\d>|</li>, coupling it to PdfToHtmlConverter's markup. The actual signal was always per-paragraph plain text; we just lost it during HTML conversion. Refactor: LooksLikeTableOfContentsBody now takes IEnumerable<string> of paragraph texts, called BEFORE HTML conversion in PdfTextExtractor. Bonus: skips the HtmlCleaner pipeline entirely for chapters we're about to drop.

#4 (localized back-matter). Root cause: whitelist was English + Russian/Ukrainian only. Extended to German, French, Spanish, Italian, Portuguese for Glossary / Bibliography / References / Notes / Appendix plus their localized forms.

#3 (sidenote dominance). Analysis: with sidenotes the typical split is ~70/25 = 2.8× which is above the 2.5× threshold, so the multi-column guard correctly keeps trusting the modal margin. Sidenote lines (a small minority) are then NOT treated as indent breaks; they just register as regular paragraph content. The body remains correctly split. No code change needed; documenting the analysis.

Latent bug: the IsTableOfContents drop had no position guard, so an Italian "Indice" / Spanish "Índice" (same word means "Index" at the back of the book) would be mis-dropped when it's the Index. Added isFrontHalf guard to the title-based drop path too.

Tests: 302 passing in Extraction.Tests (+15 new across IsKnownBackMatter in 7 languages + string-list LooksLikeTableOfContentsBody).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
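
How the position guard and the back-matter veto combine can be sketched as below. This is an illustrative Python version under stated assumptions, not the project's logic: the title sets are small invented subsets (the real whitelists are not shown in the commit), "front half" is assumed to mean `index < total / 2`, and the single-chapter guard is assumed to mean "never drop the only chapter".

```python
# Illustrative subsets only; the actual whitelists are not shown in the commit.
KNOWN_BACK_MATTER = {"index", "glossary", "bibliography", "references", "notes", "appendix"}
TOC_TITLES = {"contents", "table of contents", "indice", "índice"}


def normalize(title: str) -> str:
    return title.strip().casefold()


def is_known_back_matter(title: str) -> bool:
    return normalize(title) in KNOWN_BACK_MATTER


def is_front_half(index: int, total: int) -> bool:
    # Assumption: zero-based chapter index, strict first half.
    return total > 0 and index < total / 2


def should_drop_chapter(title: str, index: int, total: int, body_looks_like_toc: bool) -> bool:
    """Decide whether a chapter is dropped as a table of contents."""
    if total <= 1:
        return False  # assumed single-chapter safety guard
    if not is_front_half(index, total):
        return False  # position guard covers title- and body-based paths
    if is_known_back_matter(title):
        return False  # veto: Index/Glossary are leader-dotted too
    return normalize(title) in TOC_TITLES or body_looks_like_toc
```

Under these assumptions an "Índice" at position 0 of 20 is dropped as a table of contents, while the same title at position 19 of 20 is kept as the index.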

Last actionable item from the bug list: hardcoded bullet set missed custom dingbat-font glyphs in modern textbooks.

Root cause: detection was an enumerated whitelist; new fonts shipped new shapes; we kept playing whack-a-mole.

Generalization: after the fast-path whitelist check, fall back to Unicode category lookup. If the first word is a single character in category "Symbol, Other" (So) AND not in the existing NoisePunctuation set, treat it as a bullet. Po (Punctuation Other) is deliberately excluded: that category contains † ‡ § ¶ ※ which are footnote-reference markers, not paragraph starts.

Tests assert both directions:
- ☑ ☐ ✦ ✺ ♦ ⇒ bullet (So)
- † ‡ § ¶ ※ ⇒ NOT bullet (Po)

Tests: 312 passing in Extraction.Tests (+10 new).

Other items remaining on the bug list are tradeoffs we declined:
• sidenote columns = XYCut territory (intentionally not pursued)
• wrapped TOC entries with hanging indent = context-aware extraction, larger refactor

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
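
The whitelist-then-category fallback can be illustrated with Python's standard `unicodedata` module. This is a sketch of the technique, not the project's .NET implementation: `BULLET_WHITELIST` is a small illustrative subset and `NOISE_PUNCTUATION` is a hypothetical placeholder, since the contents of the real sets are not shown in the commit.

```python
import unicodedata

# Illustrative fast-path subset; "•" is itself category Po, which is one
# reason a whitelist is still needed alongside the category fallback.
BULLET_WHITELIST = frozenset("•◇❖▶➤★✓✗")

# Hypothetical placeholder: So-category symbols that should not open a list item.
NOISE_PUNCTUATION = frozenset("©®")


def is_bullet_token(first_word: str) -> bool:
    """Return True if the first word of a line should be treated as a bullet."""
    if len(first_word) != 1:
        return False  # only single-character tokens qualify
    if first_word in BULLET_WHITELIST:
        return True   # fast path
    if first_word in NOISE_PUNCTUATION:
        return False
    # Fallback: "Symbol, Other" counts as a bullet; "Punctuation, Other"
    # († ‡ § ¶ ※) does not, because those mark footnote references.
    return unicodedata.category(first_word) == "So"
```

The category lookup depends on the Unicode database bundled with the runtime, so a glyph reclassified between Unicode versions could change behavior; pinning the expectations in tests, as the commit does, guards against that.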

In 3e90b5c (PR #245 amend) I used `git add -A` without a path scope, which swept up 58 files from the working tree that were never meant to be committed: personal resumes, marketing drafts, LibreOffice lock files, scratch .tmp blobs, the v1.0.0 .apk, gemma4 submission package binaries, and the daily marketing routine notes.

This change:
• `git rm --cached` for all 58 files (kept on disk locally, removed from the repo).
• .gitignore patterns so a future `git add -A` won't re-add the same set: by literal name where reasonable, by pattern where the set is open-ended (Vasyl-*, *.tmp, *.apk, *.pyc, __pycache__).

Big .apk (96 MB) is gone from HEAD but stays in history; a full purge would need git-filter-repo + force-push of main, deliberately deferred.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mrviduus and others added 2 commits May 23, 2026 14:23

mrviduus changed the title ~~fix(pdf): bug-report sweep — content-TOC detect, multi-col guard, bullets, file-check~~ fix(pdf): bug-report sweep — content-TOC + multi-col + bullets + file-check + root-cause cleanup May 23, 2026

mrviduus and others added 2 commits May 23, 2026 14:35

mrviduus force-pushed the fix/pdf-extraction-bug-reports branch from d9a8117 to 37c0344 May 23, 2026 19:02

mrviduus merged commit 3e90b5c into main May 23, 2026
5 checks passed

mrviduus deleted the fix/pdf-extraction-bug-reports branch May 23, 2026 19:55

mrviduus mentioned this pull request May 23, 2026

chore(repo): purge 58 accidentally-committed personal/draft files #246

Merged

mrviduus merged 4 commits into main from fix/pdf-extraction-bug-reports
