diff --git a/README.md b/README.md
index c83a7a8f..fc6f742d 100644
--- a/README.md
+++ b/README.md
@@ -93,6 +93,12 @@ out; only technical vocabulary gets surfaced.
 **Library**
 - 1,500+ curated technical and classic books (starter corpus, self-hostable)
 - Your own uploads — EPUB / PDF / FB2, auto-parsed with metadata enrichment
+- **PDF content-quality pipeline** — heuristic extraction first (instant,
+  readable), then a gated Claude pass cleans flagged chapters in the
+  background (running headers, fragmented paragraphs, line-wrap hyphens).
+  Every fix is logged so deterministic rules can absorb recurring patterns
+  and Claude usage trends down. See
+  [architecture](docs/01-architecture/README.md#pdf-content-quality-pipeline).
 - Reading progress sync, bookmarks, highlights, reading stats
 
 **Mobile**
diff --git a/docs/01-architecture/README.md b/docs/01-architecture/README.md
index e1e92c53..59663d7f 100644
--- a/docs/01-architecture/README.md
+++ b/docs/01-architecture/README.md
@@ -144,6 +144,94 @@ packages/ # Shared TS code
 
 **Cache keys**: Server: `SHA256(text+voice+rate)[:16].mp3`. Client: `{lang}:{SHA256(text)[:16]}`.
 
+## PDF Content Quality Pipeline
+
+Self-improving cleanup for PDF-extracted books. Heuristics (PdfPig + the
+post-processing chain) hit ~70-75%; the gap to ~90% is semantic — running
+headers in body, fragmented paragraphs, unmerged line-wrap hyphens, inlined
+footnotes. The pipeline closes it with a gated Claude pass, and logs every fix
+so the deterministic processors can ratchet up over time and the Claude
+dependency shrinks. Full design: [`docs/05-features/feat-0007-pdf-content-quality.md`](../05-features/feat-0007-pdf-content-quality.md).
+
+### Synchronous path — ingest
+
+```
+┌──────────┐    ┌──────────────────┐    ┌──────────────────────────────────────┐
+│  Upload  │───►│ POST /me/books/  │───►│ Worker — UserIngestionService        │
+│  PDF     │    │ upload           │    │  PdfPig extract                      │
+└──────────┘    └──────────────────┘    │  Spelling→Hyphenation→Typography→    │
+                                        │    Semantic→Linter ◄── ratchet lands │
+                                        │  ChapterContentQualityAnalyzer       │
+                                        │  → ContentQualityScore (0-100) per ch│
+                                        │  TryQueueQualityJobAsync             │
+                                        └──────────────────────────────────────┘
+                                          │
+          ┌───────────────────────────────┴─────────────────┐
+          ▼                                                 ▼
+  ┌────────────────┐                              ┌──────────────────┐
+  │ user_books     │                              │ book_quality_jobs│
+  │ status=Ready   │   (reader can open it now)   │ status=Queued    │
+  └────────────────┘                              └──────────────────┘
+```
+
+Book is readable immediately at heuristic quality (~70-75%). Phase 3 cleanup
+runs asynchronously and refines flagged chapters in place.
+
+### Asynchronous path — cleanup
+
+```
+┌──────────────────────────────────┐
+│ quality-poller.service (systemd) │  polls every 30s
+│ infra/scripts/quality-poll.sh    │  claude CLI on host (Max sub)
+└──────────────────────────────────┘
+                │
+                ▼  claim Queued BookQualityJob
+┌───────────────────────────────────────────────────────────────────────────┐
+│ Phase 1 — validate chapter STRUCTURE                                      │
+│   claude -p → ISSUES_JSON (empty / fragment / giant / placeholder title)  │
+├───────────────────────────────────────────────────────────────────────────┤
+│ Phase 2 — apply structure fixes via internal API                          │
+│   DELETE / PUT(rename) / POST(merge) /internal/.../chapters/{n}           │
+├───────────────────────────────────────────────────────────────────────────┤
+│ Phase 3 — content cleanup (gated by per-chapter ContentQualityScore)      │
+│                                                                           │
+│   for each chapter where score < CONTENT_QUALITY_THRESHOLD:               │
+│     GET /internal/.../chapters/{n}/content ── messy HTML                  │
+│     claude -p (preserve verbatim, fix structure only)                     │
+│     pdf-cleanup-gate.py: word-multiset diff                               │
+│       ACCEPT → PUT cleaned + write (messy→clean) pair to dataset          │
+│       REJECT → keep original, write rejected pair                         │
+└───────────────────────────────────────────────────────────────────────────┘
+                │
+                ▼
+    data/pdf-cleanup-dataset/ ──► manual study (Slice 5)
+                              ──► new processors land in the
+                                  Semantic / Linter chain above
+                                  — Claude usage trends down.
+```
+
+### Components
+
+| Piece | Where | Role |
+|-------|-------|------|
+| `ChapterContentQualityAnalyzer` | `backend/src/Extraction/.../Quality/` | Deterministic 0-100 score + issue codes — the gate that decides which chapters reach the LLM |
+| `BookQualityJob` | DB entity | Tracks Phase 1-2 (`IssuesFound/Fixed`) and Phase 3 (`ContentChaptersCleaned/Rejected/Skipped`) |
+| `quality-poll.sh` | host systemd | Three-phase orchestrator, calls `claude` CLI |
+| `pdf-cleanup-gate.py` | host | Preservation gate — strips whitespace + hyphens, rejects hallucination (>3% novel tokens) or over-deletion (<70% retention) |
+| Internal chapter endpoints | API | `GET .../chapters/{n}/content` + `PUT .../chapters/{n}` with `{html}` — single source of truth for chapter HTML |
+| `data/pdf-cleanup-dataset/` | host | Append-only pair log — fuel for the heuristic ratchet |
+
+### Configuration
+
+| Knob | Default | Effect |
+|---|---|---|
+| `CONTENT_CLEANUP_ENABLED` (`.env`) | `false` | Master switch — Phase 3 no-op when off |
+| `CONTENT_QUALITY_THRESHOLD` (`.env`) | `60` | Chapters scoring below this go through Claude. 60 ≈ obviously broken only |
+| `CLEANUP_TIMEOUT` (`.env`) | `1500` (s) | Per-chapter Claude budget; 20k-word chapters hit ~15-20 min |
+| `quality.autoQueueForUserBooks` (`admin_settings`) | `false` | Auto-enqueue a `BookQualityJob` after every user-book ingest |
+
+Setup: `make quality-poll-setup` (one-time host systemd install).
+
 ## See Also
 
 - [Multisite](multisite.md) — Host resolution and data isolation
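
The Components table describes the preservation gate (`pdf-cleanup-gate.py`) only by its thresholds: reject when more than 3% of the cleaned tokens are novel, or when less than 70% of the original tokens survive. The Python sketch below shows one way a word-multiset check of that kind could work. It is not the actual script: the names `tokenize`, `run_gate`, and `GateResult`, the HTML text extraction, the hyphen handling, and the choice of denominators (novel tokens over the cleaned token count, retained tokens over the original token count) are all assumptions made for illustration.

```python
import re
from collections import Counter
from dataclasses import dataclass
from html.parser import HTMLParser

# Thresholds quoted in the Components table; everything else is assumed.
MAX_NOVEL_RATIO = 0.03  # more than 3% novel tokens -> treat as hallucination
MIN_RETENTION = 0.70    # under 70% of original tokens kept -> over-deletion

# ASCII hyphen first so it stays literal inside a regex character class,
# then soft hyphen, Unicode hyphen, and non-breaking hyphen.
_HYPHENS = "-\u00ad\u2010\u2011"


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def tokenize(html: str) -> Counter:
    """Return a lowercase word multiset with whitespace and hyphens stripped."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    # Join words split by a line-wrap hyphen ("la- zy" -> "lazy").
    text = re.sub(rf"(?<=\w)[{_HYPHENS}]\s+(?=\w)", "", text)
    # Drop remaining hyphens so "well-known" compares equal on both sides.
    text = re.sub(rf"[{_HYPHENS}]", "", text)
    return Counter(re.findall(r"\w+", text))


@dataclass
class GateResult:
    accepted: bool
    reason: str
    novel_ratio: float
    retention: float


def run_gate(messy_html: str, cleaned_html: str) -> GateResult:
    """Compare word multisets of the original and cleaned chapter HTML."""
    original = tokenize(messy_html)
    cleaned = tokenize(cleaned_html)
    original_total = sum(original.values())
    cleaned_total = sum(cleaned.values())

    novel = sum((cleaned - original).values())  # tokens the cleanup added
    kept = sum((cleaned & original).values())   # tokens that survived

    novel_ratio = novel / cleaned_total if cleaned_total else 0.0
    retention = kept / original_total if original_total else 1.0

    if novel_ratio > MAX_NOVEL_RATIO:
        return GateResult(False, "hallucination", novel_ratio, retention)
    if retention < MIN_RETENTION:
        return GateResult(False, "over-deletion", novel_ratio, retention)
    return GateResult(True, "ok", novel_ratio, retention)


if __name__ == "__main__":
    messy = (
        "<p>RUNNING HEADER</p>"
        "<p>The quick brown fox jumps over the la-</p>"
        "<p>zy dog near the river bank today.</p>"
    )
    clean = (
        "<p>The quick brown fox jumps over the lazy dog "
        "near the river bank today.</p>"
    )
    print(run_gate(messy, clean))
```

`Counter` subtraction and intersection give the two multiset quantities directly, so the check stays order-insensitive: merging fragmented paragraphs or removing a running header changes neither ratio much, while inserted sentences raise the novel ratio and dropped passages lower retention.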