6 changes: 6 additions & 0 deletions README.md
@@ -93,6 +93,12 @@ out; only technical vocabulary gets surfaced.
**Library**
- 1,500+ curated technical and classic books (starter corpus, self-hostable)
- Your own uploads — EPUB / PDF / FB2, auto-parsed with metadata enrichment
- **PDF content-quality pipeline** — heuristic extraction first (instant,
readable), then a gated Claude pass cleans flagged chapters in the
background (running headers, fragmented paragraphs, line-wrap hyphens).
Every fix is logged so deterministic rules can absorb recurring patterns
and Claude usage trends down. See
[architecture](docs/01-architecture/README.md#pdf-content-quality-pipeline).
- Reading progress sync, bookmarks, highlights, reading stats

**Mobile**
88 changes: 88 additions & 0 deletions docs/01-architecture/README.md
@@ -144,6 +144,94 @@ packages/ # Shared TS code

**Cache keys**: Server: `SHA256(text+voice+rate)[:16].mp3`. Client: `{lang}:{SHA256(text)[:16]}`.
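
A minimal sketch of how the two cache keys above could be derived. The source only gives the formulas; plain string concatenation, UTF-8 encoding, and a hex digest are assumptions here, and both function names are hypothetical.

```python
import hashlib


def server_cache_name(text: str, voice: str, rate: str) -> str:
    """SHA256(text+voice+rate)[:16].mp3, assuming plain concatenation and hex output."""
    digest = hashlib.sha256(f"{text}{voice}{rate}".encode("utf-8")).hexdigest()
    return f"{digest[:16]}.mp3"


def client_cache_key(lang: str, text: str) -> str:
    """{lang}:{SHA256(text)[:16]}, with the same encoding assumptions."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{lang}:{digest[:16]}"
```

Truncating to 16 hex characters keeps 64 bits of the digest, which keeps file names short while leaving accidental collisions unlikely for a cache of this kind.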

## PDF Content Quality Pipeline

Self-improving cleanup for PDF-extracted books. Heuristics (PdfPig + the
post-processing chain) reach ~70-75% quality; the gap to ~90% is semantic:
running headers in body text, fragmented paragraphs, unmerged line-wrap
hyphens, and inlined footnotes. The pipeline closes that gap with a gated
Claude pass and logs every fix, so the deterministic processors can ratchet up
over time and the Claude dependency can shrink. Full design: [`docs/05-features/feat-0007-pdf-content-quality.md`](../05-features/feat-0007-pdf-content-quality.md).

### Synchronous path — ingest

```
┌──────────┐    ┌──────────────────┐    ┌──────────────────────────────────────┐
│ Upload   │───►│ POST /me/books/  │───►│ Worker: UserIngestionService         │
│ PDF      │    │ upload           │    │ PdfPig extract                       │
└──────────┘    └──────────────────┘    │ Spelling→Hyphenation→Typography→     │
                                        │ Semantic→Linter ◄── ratchet lands    │
                                        │ ChapterContentQualityAnalyzer        │
                                        │  → ContentQualityScore (0-100) per ch│
                                        │ TryQueueQualityJobAsync              │
                                        └──────────────────────────────────────┘
            ┌───────────────────────────────┴─────────────────┐
            ▼                                                 ▼
    ┌────────────────┐                               ┌──────────────────┐
    │ user_books     │                               │ book_quality_jobs│
    │ status=Ready   │   (reader can open it now)    │ status=Queued    │
    └────────────────┘                               └──────────────────┘
```

The book is readable immediately at heuristic quality (~70-75%). Phase 3 cleanup
(part of the asynchronous path below) runs in the background and refines
flagged chapters in place.

### Asynchronous path — cleanup

```
┌──────────────────────────────────┐
│ quality-poller.service (systemd) │  polls every 30s
│ infra/scripts/quality-poll.sh    │  claude CLI on host (Max sub)
└──────────────────────────────────┘
                 ▼  claim Queued BookQualityJob
┌───────────────────────────────────────────────────────────────────────────┐
│ Phase 1: validate chapter STRUCTURE                                       │
│   claude -p → ISSUES_JSON (empty / fragment / giant / placeholder title)  │
├───────────────────────────────────────────────────────────────────────────┤
│ Phase 2: apply structure fixes via internal API                           │
│   DELETE / PUT(rename) / POST(merge) /internal/.../chapters/{n}           │
├───────────────────────────────────────────────────────────────────────────┤
│ Phase 3: content cleanup (gated by per-chapter ContentQualityScore)       │
│                                                                           │
│   for each chapter where score < CONTENT_QUALITY_THRESHOLD:               │
│     GET /internal/.../chapters/{n}/content ── messy HTML                  │
│     claude -p (preserve verbatim, fix structure only)                     │
│     pdf-cleanup-gate.py: word-multiset diff                               │
│       ACCEPT → PUT cleaned + write (messy→clean) pair to dataset          │
│       REJECT → keep original, write rejected pair                         │
└───────────────────────────────────────────────────────────────────────────┘
  data/pdf-cleanup-dataset/  ──► manual study (Slice 5)
                             ──► new processors land in the
                                 Semantic / Linter chain above
                                 (Claude usage trends down)
```
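
The Phase 3 control flow in the diagram can be sketched as a plain loop. This is an illustration, not the logic of `quality-poll.sh` itself: `run_phase3`, `Phase3Counts`, and the injected callables (`fetch`, `clean`, `gate`, `put`, `log_pair`) are hypothetical stand-ins for the HTTP calls, the `claude -p` invocation, and the dataset writer. Counting above-threshold chapters as "skipped" is also an assumption about what `ContentChaptersSkipped` means.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Phase3Counts:
    cleaned: int = 0
    rejected: int = 0
    skipped: int = 0


def run_phase3(
    scores: Dict[int, int],
    threshold: int,
    fetch: Callable[[int], str],
    clean: Callable[[str], str],
    gate: Callable[[str, str], bool],
    put: Callable[[int, str], None],
    log_pair: Callable[[int, str, str, bool], None],
) -> Phase3Counts:
    counts = Phase3Counts()
    for chapter, score in sorted(scores.items()):
        if score >= threshold:
            # Good enough at heuristic quality: never sent to the LLM.
            counts.skipped += 1
            continue
        messy = fetch(chapter)
        cleaned = clean(messy)
        accepted = gate(messy, cleaned)
        # Both outcomes are logged, so rejected pairs can be studied too.
        log_pair(chapter, messy, cleaned, accepted)
        if accepted:
            put(chapter, cleaned)
            counts.cleaned += 1
        else:
            # The original chapter HTML stays untouched.
            counts.rejected += 1
    return counts
```

Passing the side effects in as callables keeps the loop testable without a running API or a `claude` binary.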

### Components

| Piece | Where | Role |
|-------|-------|------|
| `ChapterContentQualityAnalyzer` | `backend/src/Extraction/.../Quality/` | Deterministic 0-100 score + issue codes — the gate that decides which chapters reach the LLM |
| `BookQualityJob` | DB entity | Tracks Phase 1-2 (`IssuesFound/Fixed`) and Phase 3 (`ContentChaptersCleaned/Rejected/Skipped`) |
| `quality-poll.sh` | host systemd | Three-phase orchestrator, calls `claude` CLI |
| `pdf-cleanup-gate.py` | host | Preservation gate — strips whitespace + hyphens, rejects hallucination (>3% novel tokens) or over-deletion (<70% retention) |
| Internal chapter endpoints | API | `GET .../chapters/{n}/content` + `PUT .../chapters/{n}` with `{html}` — single source of truth for chapter HTML |
| `data/pdf-cleanup-dataset/` | host | Append-only pair log — fuel for the heuristic ratchet |
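
To make the preservation gate concrete, here is a hedged sketch of a word-multiset check using the two thresholds from the table. It is not the contents of `pdf-cleanup-gate.py`: the tag stripping, lowercasing, the choice of denominators, and the names `word_counts` and `gate` are all assumptions made for illustration.

```python
import re
from collections import Counter

NOVEL_LIMIT = 0.03      # reject when more than 3% of cleaned tokens are new
RETENTION_FLOOR = 0.70  # reject when under 70% of original tokens survive


def word_counts(html: str) -> Counter:
    """Multiset of words, ignoring tags, hyphens, case, and whitespace layout."""
    text = re.sub(r"<[^>]+>", " ", html)      # crude tag removal (assumption)
    text = text.replace("-", "").lower()
    return Counter(text.split())


def gate(messy_html: str, clean_html: str) -> str:
    before = word_counts(messy_html)
    after = word_counts(clean_html)
    total_before = sum(before.values())
    total_after = sum(after.values())
    if total_before == 0 or total_after == 0:
        return "REJECT"
    novel = sum((after - before).values())    # words the cleanup introduced
    kept = sum((before & after).values())     # words that survived the cleanup
    if novel / total_after > NOVEL_LIMIT:
        return "REJECT"                       # likely hallucination
    if kept / total_before < RETENTION_FLOOR:
        return "REJECT"                       # likely over-deletion
    return "ACCEPT"
```

Comparing multisets rather than sequences lets the cleanup merge paragraphs or move a running header without being penalized, while added or dropped words still show up in the counts.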

### Configuration

| Knob | Default | Effect |
|---|---|---|
| `CONTENT_CLEANUP_ENABLED` (`.env`) | `false` | Master switch; Phase 3 is a no-op when off |
| `CONTENT_QUALITY_THRESHOLD` (`.env`) | `60` | Chapters scoring below this go through Claude. At 60, only obviously broken chapters qualify |
| `CLEANUP_TIMEOUT` (`.env`) | `1500` (s) | Per-chapter Claude budget; 20k-word chapters take ~15-20 min |
| `quality.autoQueueForUserBooks` (`admin_settings`) | `false` | Auto-enqueue a `BookQualityJob` after every user-book ingest |

Setup: `make quality-poll-setup` (one-time host systemd install).
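
As an illustration of how the `.env` knobs interact, the sketch below parses them with the documented defaults and decides whether a chapter qualifies for Phase 3. The helper names `load_settings` and `should_clean` are hypothetical, and treating only the literal `true` (case-insensitive) as enabled is an assumption.

```python
from typing import Dict, Mapping, Union


def load_settings(env: Mapping[str, str]) -> Dict[str, Union[bool, int]]:
    """Read the three .env knobs, falling back to the documented defaults."""
    return {
        "enabled": env.get("CONTENT_CLEANUP_ENABLED", "false").strip().lower() == "true",
        "threshold": int(env.get("CONTENT_QUALITY_THRESHOLD", "60")),
        "timeout_s": int(env.get("CLEANUP_TIMEOUT", "1500")),
    }


def should_clean(score: int, settings: Mapping[str, Union[bool, int]]) -> bool:
    """A chapter goes to the LLM only if the switch is on and its score is below the threshold."""
    return bool(settings["enabled"]) and score < settings["threshold"]
```

In a real script the mapping would typically be `os.environ`; a plain dict is enough to try the logic, for example `should_clean(42, load_settings({"CONTENT_CLEANUP_ENABLED": "true"}))`.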

## See Also

- [Multisite](multisite.md) — Host resolution and data isolation