mrviduus · mrviduus · May 23, 2026 · May 23, 2026
diff --git a/README.md b/README.md
@@ -93,6 +93,12 @@ out; only technical vocabulary gets surfaced.
 **Library**
 - 1,500+ curated technical and classic books (starter corpus, self-hostable)
 - Your own uploads — EPUB / PDF / FB2, auto-parsed with metadata enrichment
+- **PDF content-quality pipeline** — heuristic extraction first (instant,
+  readable), then a gated Claude pass cleans flagged chapters in the
+  background (running headers, fragmented paragraphs, line-wrap hyphens).
+  Every fix is logged so deterministic rules can absorb recurring patterns
+  and Claude usage trends down. See
+  [architecture](docs/01-architecture/README.md#pdf-content-quality-pipeline).
 - Reading progress sync, bookmarks, highlights, reading stats
 
 **Mobile**

diff --git a/docs/01-architecture/README.md b/docs/01-architecture/README.md
@@ -144,6 +144,94 @@ packages/             # Shared TS code
 
 **Cache keys**: Server: `SHA256(text+voice+rate)[:16].mp3`. Client: `{lang}:{SHA256(text)[:16]}`.
 
+## PDF Content Quality Pipeline
+
+Self-improving cleanup for PDF-extracted books. Heuristics (PdfPig + the
+post-processing chain) reach roughly 70-75% content quality; the gap to ~90%
+is semantic: running headers in the body, fragmented paragraphs, unmerged
+line-wrap hyphens, inlined footnotes. The pipeline closes that gap with a
+gated Claude pass and logs every fix, so the deterministic processors can
+absorb recurring patterns and the Claude dependency shrinks over time. Full design: [`docs/05-features/feat-0007-pdf-content-quality.md`](../05-features/feat-0007-pdf-content-quality.md).
+
+### Synchronous path — ingest
+
+```
+┌──────────┐    ┌──────────────────┐    ┌──────────────────────────────────────┐
+│  Upload  │───►│ POST /me/books/  │───►│ Worker — UserIngestionService         │
+│   PDF    │    │     upload       │    │   PdfPig extract                      │
+└──────────┘    └──────────────────┘    │   Spelling→Hyphenation→Typography→    │
+                                        │   Semantic→Linter   ◄── ratchet lands │
+                                        │   ChapterContentQualityAnalyzer       │
+                                        │   → ContentQualityScore (0-100) per ch│
+                                        │   TryQueueQualityJobAsync             │
+                                        └──────────────────────────────────────┘
+                                                       │
+                       ┌───────────────────────────────┴─────────────────┐
+                       ▼                                                 ▼
+              ┌────────────────┐                               ┌──────────────────┐
+              │   user_books   │                               │ book_quality_jobs│
+              │  status=Ready  │   (reader can open it now)   │  status=Queued   │
+              └────────────────┘                               └──────────────────┘
+```
+
+The book is readable immediately at heuristic quality (~70-75%). The Phase 3
+cleanup pass runs asynchronously and refines flagged chapters in place.
+
+### Asynchronous path — cleanup
+
+```
+┌──────────────────────────────────┐
+│ quality-poller.service (systemd) │     polls every 30s
+│ infra/scripts/quality-poll.sh    │     claude CLI on host (Max sub)
+└──────────────────────────────────┘
+                │
+                ▼  claim Queued BookQualityJob
+┌───────────────────────────────────────────────────────────────────────────┐
+│ Phase 1 — validate chapter STRUCTURE                                       │
+│   claude -p → ISSUES_JSON  (empty / fragment / giant / placeholder title) │
+├───────────────────────────────────────────────────────────────────────────┤
+│ Phase 2 — apply structure fixes via internal API                           │
+│   DELETE / PUT(rename) / POST(merge)   /internal/.../chapters/{n}         │
+├───────────────────────────────────────────────────────────────────────────┤
+│ Phase 3 — content cleanup (gated by per-chapter ContentQualityScore)       │
+│                                                                            │
+│   for each chapter where score < CONTENT_QUALITY_THRESHOLD:                │
+│       GET  /internal/.../chapters/{n}/content   ── messy HTML              │
+│       claude -p (preserve verbatim, fix structure only)                    │
+│       pdf-cleanup-gate.py: word-multiset diff                              │
+│           ACCEPT → PUT cleaned + write (messy→clean) pair to dataset       │
+│           REJECT → keep original, write rejected pair                      │
+└───────────────────────────────────────────────────────────────────────────┘
+                │
+                ▼
+        data/pdf-cleanup-dataset/   ──►  manual study (Slice 5)
+                                        ──►  new processors land in the
+                                             Semantic / Linter chain above
+                                             — Claude usage trends down.
+```
+
+### Components
+
+| Piece | Where | Role |
+|-------|-------|------|
+| `ChapterContentQualityAnalyzer` | `backend/src/Extraction/.../Quality/` | Deterministic 0-100 score + issue codes — the gate that decides which chapters reach the LLM |
+| `BookQualityJob` | DB entity | Tracks Phase 1-2 (`IssuesFound/Fixed`) and Phase 3 (`ContentChaptersCleaned/Rejected/Skipped`) |
+| `quality-poll.sh` | host systemd | Three-phase orchestrator, calls `claude` CLI |
+| `pdf-cleanup-gate.py` | host | Preservation gate — strips whitespace + hyphens, rejects hallucination (>3% novel tokens) or over-deletion (<70% retention) |
+| Internal chapter endpoints | API | `GET .../chapters/{n}/content` + `PUT .../chapters/{n}` with `{html}` — single source of truth for chapter HTML |
+| `data/pdf-cleanup-dataset/` | host | Append-only pair log — fuel for the heuristic ratchet |
+
+### Configuration
+
+| Knob | Default | Effect |
+|---|---|---|
+| `CONTENT_CLEANUP_ENABLED` (`.env`) | `false` | Master switch; Phase 3 is a no-op when off |
+| `CONTENT_QUALITY_THRESHOLD` (`.env`) | `60` | Chapters scoring below this go through Claude. At 60, only obviously broken chapters qualify |
+| `CLEANUP_TIMEOUT` (`.env`) | `1500` (s) | Per-chapter Claude budget; 20k-word chapters take ~15-20 min |
+| `quality.autoQueueForUserBooks` (`admin_settings`) | `false` | Auto-enqueue a `BookQualityJob` after every user-book ingest |
+
+Setup: `make quality-poll-setup` (one-time host systemd install).
+
 ## See Also
 
 - [Multisite](multisite.md) — Host resolution and data isolation
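
Review note on the preservation gate: the Components table describes `pdf-cleanup-gate.py` as a word-multiset diff that strips whitespace and hyphens, then rejects on more than 3% novel tokens or less than 70% retention. The sketch below is one way such a check could look, not the script in this PR. The function names, the tokenizer, and the exact ratio definitions (novel tokens measured against the cleaned text, retention against the original) are assumptions for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (assumed adequate for chapter HTML)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def word_multiset(html: str) -> Counter:
    """Lowercased word counts with line-wrap hyphens merged and other hyphens dropped."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    text = re.sub(r"-\s+", "", text)  # "la-\nzy" -> "lazy"
    text = text.replace("-", "")      # "well-known" -> "wellknown"
    return Counter(re.findall(r"\w+", text))


def preservation_gate(messy_html, clean_html, max_novel=0.03, min_retention=0.70):
    """Return (accepted, reason). Thresholds mirror the 3% / 70% figures in the doc."""
    before = word_multiset(messy_html)
    after = word_multiset(clean_html)
    total_before = sum(before.values())
    total_after = sum(after.values())
    if total_before == 0 or total_after == 0:
        return False, "empty"
    # Tokens present in the cleaned text beyond what the original contained.
    novel = sum((after - before).values())
    # Tokens of the original that survive in the cleaned text.
    kept = sum((after & before).values())
    if novel / total_after > max_novel:
        return False, "hallucination"
    if kept / total_before < min_retention:
        return False, "over-deletion"
    return True, "ok"


messy = "<p>The quick brown fox jumps over the la-\nzy dog.</p><p>It then sleeps in the warm sun.</p>"
clean = "<p>The quick brown fox jumps over the lazy dog. It then sleeps in the warm sun.</p>"
print(preservation_gate(messy, clean))
```

Comparing multisets rather than sequences means merged paragraphs and re-joined hyphenated words pass unchanged, while invented or dropped words still show up in the counts. The trade-off is that reordered text would also pass, so this check only guards vocabulary, not order.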