From f4b9c74148623dc3bc7018846b89b23df2f95ccc Mon Sep 17 00:00:00 2001 From: Vasyl Vdovychenko Date: Fri, 22 May 2026 12:32:38 -0400 Subject: [PATCH 1/7] =?UTF-8?q?docs:=20plan=20=E2=80=94=20PDF=20content=20?= =?UTF-8?q?quality=20(Claude=20cleanup=20+=20heuristic=20ratchet)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Architecture + 5-slice breakdown for feat-0007. Marker shelved (4GB VRAM infeasible); this reuses quality-poll.sh + BookQualityJob + internal chapter endpoints — extends rather than builds new. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../feat-0007-pdf-content-quality.md | 228 ++++++++++++++++++ 1 file changed, 228 insertions(+) create mode 100644 docs/05-features/feat-0007-pdf-content-quality.md diff --git a/docs/05-features/feat-0007-pdf-content-quality.md b/docs/05-features/feat-0007-pdf-content-quality.md new file mode 100644 index 00000000..925a3393 --- /dev/null +++ b/docs/05-features/feat-0007-pdf-content-quality.md @@ -0,0 +1,228 @@ +# PDF Content Quality — Claude-assisted cleanup + heuristic ratchet + +Make PDF-extracted books actually readable. PdfPig + heuristics get us to +~70-75%; the gap to ~90% is semantic (running headers in body, fragmented +paragraphs, hyphenation, footnotes mixed in). Close it with a gated Claude +cleanup pass — and feed what Claude does back into deterministic rules so the +heuristics keep getting better and the Claude dependency keeps shrinking. + +**Status**: Planned (branch `feat/pdf-content-quality`) +**Author**: Vasyl Vdovychenko + Claude +**Started**: 2026-05-22 + +--- + +## Why + +PDF is the hard format. EPUB/FB2 carry publisher semantics (`

`, `

`, +lists) → our extractors already produce ~90%. PDF carries only glyphs + +geometry; structure must be reconstructed. + +Observed on a real upload (*AI Engineering*, Chip Huyen, 535 pp.) after the +round-1 heuristic fix (`feat`/Option A): + +``` +

<p>Chapter 1: Introduction… | 4</p>      running header leaked into body
+<p>about</p><p>new</p><p>chal‐</p>       fragmented — one word per <p>
+"mod‐ els", "appli‐ cation" line-wrap hyphens not merged +"1 In this book, I use…" footnote inline, unlinked +``` + +These are **semantic** judgements ("is this line chrome or content?") that +pure geometry heuristics handle poorly. An LLM handles them well. + +### Why not Marker + +Evaluated and shelved (`shelf/marker-integration`). Marker (Surya ML pipeline) +needs ~3.6 GB VRAM for its model set; the prod GTX 1650 Ti has 4 GB → CUDA OOM. +CPU mode works but ~1.5 h/book. It is also an 11 GB image + fragile CUDA/torch +deps — a general-purpose document-AI tool, overweight for our narrow case +(digital book PDFs → readable prose). + +### Why this approach + +- **Reuses what exists.** `quality-poll.sh` already runs Claude CLI via a + systemd poller; `BookQualityJob` already enqueues post-ingestion; internal + chapter GET/PUT endpoints already read/write chapter HTML. The new work is + one phase in one script + one analyzer + tracking fields. +- **Claude via the Max subscription** — `$0` marginal, no API key. +- **Self-improving.** Every Claude (messy → clean) pair is logged. Recurring + fixes get distilled into deterministic processors in the existing + `Spelling → Hyphenation → Typography → Semantic → Linter` chain. Heuristics + ratchet up; Claude is called less over time. 
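The (messy → clean) pair log that feeds the ratchet can be very small: one JSON file per cleaned chapter. The following is a minimal sketch only, assuming the `data/pdf-cleanup-dataset/{bookId}-{chapter}.json` naming from Slice 3 of this plan. The `log_pair` helper and the JSON field names are hypothetical and are not part of the existing poller.

```python
import json
import tempfile
from pathlib import Path


def log_pair(dataset_dir: Path, book_id: str, chapter: int,
             messy_html: str, clean_html: str) -> Path:
    """Write one (messy, clean) pair and return the file path."""
    dataset_dir.mkdir(parents=True, exist_ok=True)
    path = dataset_dir / f"{book_id}-{chapter}.json"
    record = {
        "bookId": book_id,      # hypothetical field names
        "chapter": chapter,
        "messy": messy_html,
        "clean": clean_html,
    }
    # Mode "x" raises FileExistsError instead of overwriting,
    # which keeps the dataset append-only.
    with path.open("x", encoding="utf-8") as fh:
        json.dump(record, fh, ensure_ascii=False, indent=2)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        written = log_pair(Path(tmp), "book-123", 4,
                           "<p>chal\u2010</p><p>lenges</p>",
                           "<p>challenges</p>")
        print(written.name)
```

Refusing to overwrite is one way to honour the "append-only" wording. If a chapter can be cleaned more than once, a timestamp suffix in the file name would be the alternative.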
+ +--- + +## Architecture + +``` +┌─ Ingestion — Worker (container) ───────────────────────────┐ +│ PdfPig extract │ +│ → processor chain: Spelling→Hyphenation→Typography→ │ +│ Semantic→Linter ◄── heuristic ratchet lands here │ +│ → ChapterContentQualityAnalyzer → per-chapter score 0-100 │ +│ → enqueue BookQualityJob (records flagged chapters) │ +└────────────────────────────────────────────────────────────┘ + │ DB +┌─ quality-poll.sh — systemd (host) ─────────────────────────┐ +│ Phase 1 validate chapter structure (exists) │ +│ Phase 2 fix structure: delete/rename/merge (exists) │ +│ Phase 3 CONTENT CLEANUP (NEW) │ +│ for each flagged PDF chapter: │ +│ GET /internal/.../chapters/{n}/content (messy HTML) │ +│ → claude -p (cleanup prompt, preserve verbatim) │ +│ → preservation gate (word-set diff; reject drift) │ +│ → PUT /internal/.../chapters/{n} {html} │ +│ → append (in,out) pair to dataset log │ +└────────────────────────────────────────────────────────────┘ + │ + data/pdf-cleanup-dataset/*.json (messy→clean pairs) + │ + periodic (manual): study pairs → encode recurring fixes as + Semantic/Linter processors ──► ratchet back into the chain +``` + +### Components + +| Component | State | Responsibility | +|---|---|---| +| `ChapterContentQualityAnalyzer` | new | Deterministic 0-100 content score + issue list per chapter. Pure C#, no LLM. The **gate** — only low-scoring PDF chapters reach Claude. | +| `BookQualityJob` | extend | + content-phase tracking fields (chapters cleaned / rejected / skipped). | +| `UserChapter` / `Chapter` | extend | + `ContentQualityScore` column. | +| `quality-poll.sh` | extend | + Phase 3 content cleanup. | +| Internal chapter endpoints | reuse | `GET …/content`, `PUT …` already read/write HTML + recompute plainText. | +| `data/pdf-cleanup-dataset/` | new | Append-only (messy → clean) pair log — the ratchet's fuel. | +| Processor chain | reuse | Destination for distilled deterministic rules. | + +### Key design decisions + +1. 
**HTML-cleanup, not geometry-classifier.** Claude rewrites the chapter HTML + rather than labelling geometry lines. Reason: the geometry path needs heavy + plumbing (capture/persist X-Y, new endpoints) for a zero-hallucination + guarantee that the **preservation gate** delivers deterministically and + nearly for free. We lose geometry signal, but text patterns ("`Chapter 1:… + | 4`" repeats" → header) are enough for digital PDFs. + +2. **Preservation gate** — the anti-hallucination guard. Strip whitespace **and + hyphens** from original + cleaned plaintext, tokenize, compare multisets: + - cleaned introduces tokens absent from original → **reject** (hallucination) + - cleaned drops > N% of original tokens → **reject** (over-deletion) + - else → accept. + Hyphen-stripping is load-bearing: a legit `chal‐ lenges → challenges` merge + must not look like a hallucinated new word. + +3. **One job, one poller.** Content cleanup is Phase 3 of the existing + `BookQualityJob` / `quality-poll.sh`, not a parallel system. Phase 3 runs + after Phase 1-2 (structure fixes renumber chapters). + +4. **PDF-only.** EPUB/FB2 already carry semantics. Phase 3 skips non-PDF books + by source-format check. + +5. **Gated.** Only chapters the analyzer scores below threshold get a Claude + call. Clean books cost `$0` and add no latency. + +6. **Per-chapter Claude calls within a per-book job.** Smaller input/output, + per-chapter preservation gate, partial success. The *job* is per-book. + +### Feature flag & rollback + +- Config flag `Quality:ContentCleanupEnabled` (default off). Phase 3 is a no-op + when off. +- Rollback: flip the flag off, or drop Phase 3 from the script and redeploy. + Phases 1-2 unaffected; chapters keep their last-good HTML. + +--- + +## Slices + +Each slice is independently shippable and testable. Branch +`feat/pdf-content-quality`; one commit/PR per slice. + +### Slice 1 — Content-quality analyzer · ~1 day · no infra, no LLM + +`ChapterContentQualityAnalyzer` — pure C#. 
Input: chapter HTML. Output: +score 0-100 + issues (`FragmentedParagraphs`, `RunningHeaderInBody`, +`HyphenationArtifacts`, `OrphanPageNumbers`, `SuspectedFootnotes`). + +- Detectors are deterministic heuristics over the HTML/text. +- Fully unit-tested against known-bad and known-clean fixtures. +- **Acceptance**: bad chapter → low score + correct issue codes; clean + chapter → high score. No DB, no network. +- **Ships value alone**: enables a "may have formatting issues" signal. + +### Slice 2 — Detection wiring + schema · ~1 day + +- Migration: `ContentQualityScore int?` on `UserChapter` + `Chapter`. +- Worker runs the analyzer after ingestion, persists per-chapter score. +- `BookQualityJob` + content-phase fields (`ContentChaptersCleaned`, + `ContentChaptersRejected`, `ContentChaptersSkipped`). +- **Acceptance**: upload a bad PDF → flagged chapters carry a low score in DB; + clean PDF → high scores. Integration-tested. + +### Slice 3 — Claude cleanup phase · ~1.5 days · the core + +- `quality-poll.sh` Phase 3: loop flagged PDF chapters → GET content → + `claude -p` cleanup prompt → preservation gate → PUT back. +- Cleanup prompt: fix structure, drop running headers/page numbers, merge + hyphenation, rejoin fragments — **preserve every word and all code verbatim**. +- Preservation gate (inline python) — see design decision 2. +- Pair logging → `data/pdf-cleanup-dataset/{bookId}-{chapter}.json`. +- Behind `Quality:ContentCleanupEnabled`. +- **Acceptance**: bad chapter → readable HTML, content preserved; injected + hallucination → gate rejects, original kept; pair file written. + +### Slice 4 — Observability & admin · ~0.5 day + +- Admin `BookQualityJob` view surfaces Phase 3 results (cleaned / rejected / + skipped counts, per-chapter). +- Worker logs score distribution per book. +- **Acceptance**: admin can see what Phase 3 did and why anything was rejected. 
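The preservation gate from design decision 2, which Slice 3 runs as inline Python, could look roughly like the sketch below. It is an illustration under stated assumptions, not the script's actual code. The function names, the set of hyphen characters, and the 5% `max_drop` default are assumptions, because the plan leaves N open. It also reads "strip hyphens" as "remove a hyphen together with any whitespace after it", so that a wrapped `chal‐ lenges` and a merged `challenges` produce the same token.

```python
import re
from collections import Counter
from typing import Tuple

# Assumed hyphen set: ASCII hyphen, soft hyphen, U+2010, U+2011.
# A hyphen and any whitespace after it are removed, joining wrapped words.
_HYPHEN_JOIN = re.compile("[-\u00ad\u2010\u2011]\\s*")
_WORD = re.compile(r"\w+")


def tokens(plain_text: str) -> Counter:
    """Multiset of word tokens after hyphen joining."""
    return Counter(_WORD.findall(_HYPHEN_JOIN.sub("", plain_text)))


def preservation_gate(original: str, cleaned: str,
                      max_drop: float = 0.05) -> Tuple[bool, str]:
    """Return (accepted, reason) for a cleaned chapter's plain text."""
    before, after = tokens(original), tokens(cleaned)

    # Tokens that occur more often in the cleaned text than in the original.
    added = after - before
    if added:
        return False, "hallucination: " + ", ".join(sorted(added))

    # Share of original tokens missing from the cleaned text.
    dropped = sum((before - after).values())
    total = sum(before.values())
    if total and dropped / total > max_drop:
        return False, f"over-deletion: {dropped}/{total} tokens dropped"

    return True, "ok"
```

Comparing multisets rather than sets means a word that Claude repeats more often than the original does is also counted as introduced text.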
+ +### Slice 5 — Heuristic ratchet, round 1 · ~1 day · ongoing thereafter + +- Study accumulated pairs (Claude-assisted meta-analysis: "what recurring + fix-patterns appear? propose deterministic rules"). +- Encode the rule-expressible ~70-80% as `Semantic`/`Linter` processors in the + extraction chain (`RULES.md`). +- **Acceptance**: re-running affected fixtures, the new processors fix them + with no Claude call; analyzer score rises; fewer chapters flagged. +- Repeat as a standing maintenance ritual — Claude usage trends down. + +**Total to first useful state (Slices 1-3): ~3.5 days. Full: ~5 days.** + +--- + +## Sequencing & dependencies + +``` +Slice 1 ─► Slice 2 ─► Slice 3 ─► Slice 4 + └─────► Slice 5 (needs pairs from Slice 3) +``` + +Slices 1-2 are pure backend, zero risk, deployable behind the disabled flag. +Slice 3 is the only one touching the host poller + Claude. Slice 5 is recurring. + +## Risks + +| Risk | Mitigation | +|---|---| +| Claude hallucinates / drops content | Preservation gate rejects; original kept. | +| Gate false-rejects legit hyphen merges | Strip hyphens + whitespace before diffing. | +| Claude mangles code blocks | Prompt: "preserve code verbatim"; gate catches token drift. | +| Heavily-flagged book = many Claude calls | Per-book ≈ 5-15 min typical, ~30-45 min worst case. Async job — acceptable. Ratchet shrinks it over time. | +| Poller script grows large | Phase 3 reuses existing helpers; preservation gate is one python block. Acceptable; revisit if it crosses ~600 lines. | + +## Performance + +- Clean book: ~10-30 s (heuristics only, no Claude) — unchanged. +- Lightly-flagged book (2-4 chapters): ~5-15 min async. +- Heavily-flagged book: ~30-45 min async. Still far below Marker's ~1.5 h, and + it trends down as the ratchet absorbs recurring fixes. + +## Open questions + +- Threshold score for flagging a chapter — start ~60, tune on real books? +- Editions (admin catalog) too, or user-books first? Endpoints exist for both. 
+- Cleanup granularity if a chapter is huge (>30k words) — chunk, or rely on + Claude's context window? +- Pair-log retention — keep all, or cap at N most-recent per book? From a78381f82a0e092b17d52847a4c9ae0df83b682d Mon Sep 17 00:00:00 2001 From: Vasyl Vdovychenko Date: Fri, 22 May 2026 12:50:20 -0400 Subject: [PATCH 2/7] feat(pdf-quality) [slice 1]: ChapterContentQualityAnalyzer MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Deterministic 0-100 content-quality score + issue codes for extracted chapter HTML. Detects the recurring PDF-extraction defects: fragmented paragraphs, running headers in body, unmerged hyphenation, orphan page numbers, inlined footnotes. Pure C#, no I/O — the gate that decides which chapters warrant an LLM cleanup pass (slice 3). 12 unit tests. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../Quality/ChapterContentQualityAnalyzer.cs | 151 +++++++++++++++++ .../Quality/ContentQualityReport.cs | 37 +++++ .../ChapterContentQualityAnalyzerTests.cs | 154 ++++++++++++++++++ 3 files changed, 342 insertions(+) create mode 100644 backend/src/Extraction/TextStack.Extraction/Quality/ChapterContentQualityAnalyzer.cs create mode 100644 backend/src/Extraction/TextStack.Extraction/Quality/ContentQualityReport.cs create mode 100644 tests/TextStack.Extraction.Tests/ChapterContentQualityAnalyzerTests.cs diff --git a/backend/src/Extraction/TextStack.Extraction/Quality/ChapterContentQualityAnalyzer.cs b/backend/src/Extraction/TextStack.Extraction/Quality/ChapterContentQualityAnalyzer.cs new file mode 100644 index 00000000..975749e7 --- /dev/null +++ b/backend/src/Extraction/TextStack.Extraction/Quality/ChapterContentQualityAnalyzer.cs @@ -0,0 +1,151 @@ +using System.Text.RegularExpressions; +using HtmlAgilityPack; + +namespace TextStack.Extraction.Quality; + +///

+/// Scores extracted chapter HTML for the structural defects typical of PDF
+/// extraction (see <see cref="ContentQualityIssue"/>). Pure, deterministic,
+/// no I/O — the gate that decides which chapters are worth an LLM cleanup pass.
+///
+/// Score starts at 100; each detected defect subtracts a frequency-scaled
+/// penalty. A defect is only reported once it crosses a floor, so trivial
+/// one-off noise doesn't flag an otherwise-clean chapter.
+///
+public static class ChapterContentQualityAnalyzer
+{
+    // Fragment-fraction is only meaningful once a chapter has enough paragraphs.
+    private const int MinParagraphsForFragmentCheck = 4;
+
+    // Compiled, not [GeneratedRegex] — ARM64 SIGILL bug (see Extraction/RULES.md).
+    private static readonly Regex PageNumberOnly =
+        new(@"^\s*\d{1,4}\s*$", RegexOptions.Compiled);
+    private static readonly Regex RunningHeaderPipe =
+        new(@"(^\s*\d{1,4}\s*\|)|(\|\s*\d{1,4}\s*$)", RegexOptions.Compiled);
+    private static readonly Regex HyphenArtifact =
+        new(@"\p{L}[‐­‑] \p{Ll}", RegexOptions.Compiled);
+    private static readonly Regex FootnoteStart =
+        new(@"^\s*\d{1,3}\s+\p{Lu}", RegexOptions.Compiled);
+    private static readonly Regex Whitespace =
+        new(@"\s+", RegexOptions.Compiled);
+
+    private static readonly HashSet<string> NoiseGlyphs =
+        new(StringComparer.Ordinal) { "|", "•", "·", "*", "■", "□", "—", "–" };
+
+    public static ContentQualityReport Analyze(string? html)
+    {
+        if (string.IsNullOrWhiteSpace(html))
+            return ContentQualityReport.Clean;
+
+        var doc = new HtmlDocument();
+        doc.LoadHtml(html);
+
+        var paragraphs = (doc.DocumentNode.SelectNodes("//p") ?? 
Enumerable.Empty<HtmlNode>())
+            .Select(p => NormalizeText(p.InnerText))
+            .Where(t => t.Length > 0)
+            .ToList();
+
+        if (paragraphs.Count == 0)
+            return ContentQualityReport.Clean;
+
+        var issues = new List<ContentQualityIssue>();
+        var penalty = 0;
+
+        penalty += ScoreFragments(paragraphs, issues);
+        penalty += ScoreRunningHeaders(paragraphs, issues);
+        penalty += ScoreHyphenation(paragraphs, issues);
+        penalty += ScoreOrphanNumbers(paragraphs, issues);
+        penalty += ScoreFootnotes(paragraphs, issues);
+
+        return new ContentQualityReport(Math.Clamp(100 - penalty, 0, 100), issues);
+    }
+
+    // ── Detectors ──────────────────────────────────────────────────────────
+    // Each returns a penalty (0 = nothing wrong) and appends its issue code
+    // when the defect is real, not incidental.
+
+    private static int ScoreFragments(List<string> paragraphs, List<ContentQualityIssue> issues)
+    {
+        if (paragraphs.Count < MinParagraphsForFragmentCheck)
+            return 0;
+
+        var fragments = paragraphs.Count(IsFragment);
+        var fraction = (double)fragments / paragraphs.Count;
+
+        // Real signal: ≥12% of paragraphs are fragments, or ≥8 of them outright.
+        if (fraction < 0.12 && fragments < 8)
+            return 0;
+
+        issues.Add(ContentQualityIssue.FragmentedParagraphs);
+        return (int)Math.Min(60, fraction * 150);
+    }
+
+    private static int ScoreRunningHeaders(List<string> paragraphs, List<ContentQualityIssue> issues)
+    {
+        var pipeHeaders = paragraphs.Count(p => RunningHeaderPipe.IsMatch(p));
+
+        // Identical short paragraphs repeating within one chapter = leaked chrome.
+        var repeats = paragraphs
+            .Where(p => p.Length <= 100)
+            .GroupBy(p => p, StringComparer.Ordinal)
+            .Where(g => g.Count() >= 2)
+            .Sum(g => g.Count() - 1);
+
+        var count = pipeHeaders + repeats;
+        if (count < 2)
+            return 0;
+
+        issues.Add(ContentQualityIssue.RunningHeaderInBody);
+        return Math.Min(25, count * 7);
+    }
+
+    private static int ScoreHyphenation(List<string> paragraphs, List<ContentQualityIssue> issues)
+    {
+        var count = paragraphs.Sum(p => HyphenArtifact.Matches(p).Count);
+        if (count < 3)
+            return 0;
+
+        issues.Add(ContentQualityIssue.HyphenationArtifacts);
+        return Math.Min(20, count * 2);
+    }
+
+    private static int ScoreOrphanNumbers(List<string> paragraphs, List<ContentQualityIssue> issues)
+    {
+        var count = paragraphs.Count(IsOrphanNumberOrGlyph);
+        if (count < 2)
+            return 0;
+
+        issues.Add(ContentQualityIssue.OrphanPageNumbers);
+        return Math.Min(15, count * 5);
+    }
+
+    private static int ScoreFootnotes(List<string> paragraphs, List<ContentQualityIssue> issues)
+    {
+        var count = paragraphs.Count(p => FootnoteStart.IsMatch(p));
+        if (count < 3)
+            return 0;
+
+        issues.Add(ContentQualityIssue.SuspectedFootnotes);
+        return Math.Min(10, count * 2);
+    }
+
+    // ── Helpers ────────────────────────────────────────────────────────────
+
+    /// <summary>A stray ≤2-word paragraph that doesn't end a sentence.</summary>
+    private static bool IsFragment(string text)
+    {
+        var words = text.Split(' ', StringSplitOptions.RemoveEmptyEntries);
+        if (words.Length > 2)
+            return false;
+        var last = text[^1];
+        return last is not ('.' or '!' or '?' or '…' or ':' or ';');
+    }
+
+    private static bool IsOrphanNumberOrGlyph(string text)
+        => PageNumberOnly.IsMatch(text)
+           || (text.Length <= 2 && NoiseGlyphs.Contains(text));
+
+    /// <summary>De-entitize, collapse whitespace, trim.</summary>
+    private static string NormalizeText(string raw)
+        => Whitespace.Replace(HtmlEntity.DeEntitize(raw) ?? 
string.Empty, " ").Trim();
+}
diff --git a/backend/src/Extraction/TextStack.Extraction/Quality/ContentQualityReport.cs b/backend/src/Extraction/TextStack.Extraction/Quality/ContentQualityReport.cs
new file mode 100644
index 00000000..7818a533
--- /dev/null
+++ b/backend/src/Extraction/TextStack.Extraction/Quality/ContentQualityReport.cs
@@ -0,0 +1,37 @@
+namespace TextStack.Extraction.Quality;
+
+/// <summary>
+/// A structural defect detected in extracted chapter HTML. These are the
+/// recurring failure modes of PDF text extraction — the things a clean EPUB
+/// never has.
+/// </summary>
+public enum ContentQualityIssue
+{
+    /// <summary>Many one/two-word &lt;p&gt; in a row — paragraph reconstruction failed.</summary>
+    FragmentedParagraphs,
+
+    /// <summary>Running headers/footers ("Title | 4") leaked into the body.</summary>
+    RunningHeaderInBody,
+
+    /// <summary>Line-wrap hyphens left unmerged ("chal‐ lenges").</summary>
+    HyphenationArtifacts,
+
+    /// <summary>Bare page numbers / dividers surviving as their own paragraphs.</summary>
+    OrphanPageNumbers,
+
+    /// <summary>Footnote bodies ("1 In this book…") inlined into the flow.</summary>
+    SuspectedFootnotes,
+}
+
+/// <summary>
+/// Deterministic content-quality verdict for one chapter.
+/// <see cref="Score"/> is 0-100, higher is cleaner. The caller decides the
+/// flag threshold (default 60 — see feat-0007).
+/// </summary>
+public sealed record ContentQualityReport(
+    int Score,
+    IReadOnlyList<ContentQualityIssue> Issues)
+{
+    /// <summary>A chapter with no analyzable paragraphs — nothing to score against.</summary>
+    public static ContentQualityReport Clean { get; } = new(100, []);
+}
diff --git a/tests/TextStack.Extraction.Tests/ChapterContentQualityAnalyzerTests.cs b/tests/TextStack.Extraction.Tests/ChapterContentQualityAnalyzerTests.cs
new file mode 100644
index 00000000..a8c3ecdd
--- /dev/null
+++ b/tests/TextStack.Extraction.Tests/ChapterContentQualityAnalyzerTests.cs
@@ -0,0 +1,154 @@
+using TextStack.Extraction.Quality;
+
+namespace TextStack.Extraction.Tests;
+
+public class ChapterContentQualityAnalyzerTests
+{
+    private const string CleanChapter = """
+

The Rise of AI Engineering

+        <p>Foundation models emerged from large language models, which in turn originated as language models.</p>
+        <p>While applications like ChatGPT may seem to have come out of nowhere, they are the culmination of decades of advancement.</p>
+        <p>This section traces the key breakthroughs that enabled the evolution from language models to AI engineering.</p>
+        <p>A language model encodes statistical information about one or more languages in a compact form.</p>
+        <p>Intuitively, this information tells us how likely a word is to appear in a given context.</p>
+        <p>The statistical nature of language was discovered centuries ago by curious mathematicians.</p>
+ """; + + [Fact] + public void Analyze_CleanChapter_ScoresHighWithNoIssues() + { + var report = ChapterContentQualityAnalyzer.Analyze(CleanChapter); + + Assert.Equal(100, report.Score); + Assert.Empty(report.Issues); + } + + [Fact] + public void Analyze_NullOrEmpty_ReturnsClean() + { + Assert.Equal(100, ChapterContentQualityAnalyzer.Analyze(null).Score); + Assert.Equal(100, ChapterContentQualityAnalyzer.Analyze("").Score); + Assert.Equal(100, ChapterContentQualityAnalyzer.Analyze(" ").Score); + } + + [Fact] + public void Analyze_FragmentedParagraphs_FlagsAndScoresLow() + { + var html = "

<p>about</p><p>new</p><p>possibilities</p><p>and</p><p>new</p>" + "<p>challenges</p><p>which</p><p>are</p>" + "<p>This is one genuine full sentence that ends properly.</p>" + "<p>And here is a second complete sentence with real content.</p>
"; + + var report = ChapterContentQualityAnalyzer.Analyze(html); + + Assert.Contains(ContentQualityIssue.FragmentedParagraphs, report.Issues); + Assert.True(report.Score < 60, $"expected low score, got {report.Score}"); + } + + [Fact] + public void Analyze_FewParagraphs_SkipsFragmentCheck() + { + // Below MinParagraphsForFragmentCheck — fragment fraction is too noisy. + var report = ChapterContentQualityAnalyzer.Analyze("

<p>one</p><p>two</p>
"); + + Assert.DoesNotContain(ContentQualityIssue.FragmentedParagraphs, report.Issues); + } + + [Fact] + public void Analyze_RunningHeaderInBody_Flags() + { + var html = CleanChapter + + "

<p>4 | Chapter 1: Introduction to Building AI Applications</p>" + "<p>The Rise of AI Engineering | 5</p>
"; + + var report = ChapterContentQualityAnalyzer.Analyze(html); + + Assert.Contains(ContentQualityIssue.RunningHeaderInBody, report.Issues); + Assert.True(report.Score < 100); + } + + [Fact] + public void Analyze_RepeatedShortParagraph_FlagsAsRunningHeader() + { + var html = CleanChapter + + "

<p>Introduction to Building AI Applications</p>" + "<p>Introduction to Building AI Applications</p>" + "<p>Introduction to Building AI Applications</p>
"; + + var report = ChapterContentQualityAnalyzer.Analyze(html); + + Assert.Contains(ContentQualityIssue.RunningHeaderInBody, report.Issues); + } + + [Fact] + public void Analyze_HyphenationArtifacts_Flags() + { + // U+2010 hyphen + space + lowercase = unmerged line-wrap hyphen. + var html = CleanChapter + + "

<p>This text discusses mod‐ els and appli‐ cations and the engi‐ neering process.</p>
"; + + var report = ChapterContentQualityAnalyzer.Analyze(html); + + Assert.Contains(ContentQualityIssue.HyphenationArtifacts, report.Issues); + } + + [Fact] + public void Analyze_RealHyphenatedWords_DoNotFlag() + { + // ASCII hyphen with no wrap-space — "self-supervision" is a real word. + var html = CleanChapter + + "

<p>Self-supervision and large-scale pre-training are well-known techniques.</p>
"; + + var report = ChapterContentQualityAnalyzer.Analyze(html); + + Assert.DoesNotContain(ContentQualityIssue.HyphenationArtifacts, report.Issues); + } + + [Fact] + public void Analyze_OrphanPageNumbers_Flags() + { + var html = CleanChapter + "

<p>2</p><p>|</p><p>405</p>
"; + + var report = ChapterContentQualityAnalyzer.Analyze(html); + + Assert.Contains(ContentQualityIssue.OrphanPageNumbers, report.Issues); + } + + [Fact] + public void Analyze_SuspectedFootnotes_Flags() + { + var html = CleanChapter + + "

<p>1 In this book, I use traditional ML to refer to non-foundation models.</p>" + "<p>2 For non-English languages, a character can map to multiple tokens.</p>" + "<p>3 Autoregressive models are sometimes called causal language models.</p>
"; + + var report = ChapterContentQualityAnalyzer.Analyze(html); + + Assert.Contains(ContentQualityIssue.SuspectedFootnotes, report.Issues); + } + + [Fact] + public void Analyze_CombinedGarbage_ScoresVeryLowWithMultipleIssues() + { + var html = "

<p>about</p><p>new</p><p>possibilities</p><p>and</p><p>chal‐</p>" + "<p>large-scale models bring lenges which matter here today now</p>" + "<p>4 | Chapter 1: Introduction</p><p>2</p><p>|</p>" + "<p>1 A footnote body that leaked into the reading flow here.</p>" + "<p>street</p><p>food</p><p>love</p><p>more</p>
"; + + var report = ChapterContentQualityAnalyzer.Analyze(html); + + Assert.True(report.Score < 40, $"expected very low score, got {report.Score}"); + Assert.Contains(ContentQualityIssue.FragmentedParagraphs, report.Issues); + Assert.True(report.Issues.Count >= 2); + } + + [Fact] + public void Analyze_NoParagraphs_ReturnsClean() + { + var report = ChapterContentQualityAnalyzer.Analyze("

Title

"); + + Assert.Equal(100, report.Score); + Assert.Empty(report.Issues); + } +} From 0eeeecb70c6d66a7accb3a24b54cc549c0fe8173 Mon Sep 17 00:00:00 2001 From: Vasyl Vdovychenko Date: Fri, 22 May 2026 13:05:37 -0400 Subject: [PATCH 3/7] feat(pdf-quality) [slice 2]: persist content-quality score on ingest MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - ContentQualityScore (int?) on Chapter + UserChapter — set at chapter creation in both ingestion paths via ChapterContentQualityAnalyzer. - BookQualityJob: ContentChaptersCleaned/Rejected/Skipped (int?) — tracking fields for the Phase 3 cleanup pass (written by the poller). - EF migration AddChapterContentQualityScore (5 nullable columns). Score is informational until slice 3 acts on it. Runs on every format; EPUB/FB2 score high, PDF surfaces the low ones. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../Application/Ingestion/IngestionService.cs | 5 +- backend/src/Domain/Entities/BookQualityJob.cs | 8 + backend/src/Domain/Entities/Chapter.cs | 6 + backend/src/Domain/Entities/UserChapter.cs | 7 + ..._AddChapterContentQualityScore.Designer.cs | 4265 +++++++++++++++++ ...522170250_AddChapterContentQualityScore.cs | 68 + .../Migrations/AppDbContextModelSnapshot.cs | 22 +- .../Worker/Services/UserIngestionService.cs | 7 +- 8 files changed, 4384 insertions(+), 4 deletions(-) create mode 100644 backend/src/Infrastructure/Migrations/20260522170250_AddChapterContentQualityScore.Designer.cs create mode 100644 backend/src/Infrastructure/Migrations/20260522170250_AddChapterContentQualityScore.cs diff --git a/backend/src/Application/Ingestion/IngestionService.cs b/backend/src/Application/Ingestion/IngestionService.cs index 9d559f8a..2d6e058c 100644 --- a/backend/src/Application/Ingestion/IngestionService.cs +++ b/backend/src/Application/Ingestion/IngestionService.cs @@ -4,6 +4,7 @@ using Domain.Enums; using Domain.Utilities; using Microsoft.EntityFrameworkCore; +using 
TextStack.Extraction.Quality; namespace Application.Ingestion; @@ -89,6 +90,7 @@ public async Task ProcessParsedBookAsync( foreach (var ch in parsed.Chapters) { var chapterSlug = SlugGenerator.GenerateChapterSlug(ch.Title, ch.Order); + var chapterHtml = SanitizeText(ch.Html); var chapter = new Chapter { Id = Guid.NewGuid(), @@ -96,9 +98,10 @@ public async Task ProcessParsedBookAsync( ChapterNumber = ch.Order, Slug = chapterSlug, Title = SanitizeText(ch.Title), - Html = SanitizeText(ch.Html), + Html = chapterHtml, PlainText = SanitizeText(ch.PlainText), WordCount = ch.WordCount, + ContentQualityScore = ChapterContentQualityAnalyzer.Analyze(chapterHtml).Score, OriginalChapterNumber = ch.OriginalChapterNumber, PartNumber = ch.PartNumber, TotalParts = ch.TotalParts, diff --git a/backend/src/Domain/Entities/BookQualityJob.cs b/backend/src/Domain/Entities/BookQualityJob.cs index 44e5d566..4d7c9001 100644 --- a/backend/src/Domain/Entities/BookQualityJob.cs +++ b/backend/src/Domain/Entities/BookQualityJob.cs @@ -16,6 +16,14 @@ public class BookQualityJob public int? IssuesFound { get; set; } public int? IssuesFixed { get; set; } + // ── Content-cleanup phase (Phase 3) — populated by quality-poll.sh ── + /// Chapters whose HTML the LLM cleanup pass rewrote and the gate accepted. + public int? ContentChaptersCleaned { get; set; } + /// Chapters where the LLM output was rejected by the preservation gate. + public int? ContentChaptersRejected { get; set; } + /// Flagged chapters skipped (non-PDF, or cleanup disabled). + public int? ContentChaptersSkipped { get; set; } + public string? Error { get; set; } public string? 
LogOutput { get; set; } diff --git a/backend/src/Domain/Entities/Chapter.cs b/backend/src/Domain/Entities/Chapter.cs index 246f3742..86a589ef 100644 --- a/backend/src/Domain/Entities/Chapter.cs +++ b/backend/src/Domain/Entities/Chapter.cs @@ -22,6 +22,12 @@ public class Chapter /// Total parts the original chapter was split into (for "Part 2 of 5" display) public int? TotalParts { get; set; } + /// + /// Deterministic extraction-quality score 0-100 (see ChapterContentQualityAnalyzer). + /// Null = not yet analyzed. Below the flag threshold → candidate for LLM cleanup. + /// + public int? ContentQualityScore { get; set; } + public NpgsqlTsVector SearchVector { get; set; } = null!; public DateTimeOffset CreatedAt { get; set; } public DateTimeOffset UpdatedAt { get; set; } diff --git a/backend/src/Domain/Entities/UserChapter.cs b/backend/src/Domain/Entities/UserChapter.cs index 90fdecee..8e62014d 100644 --- a/backend/src/Domain/Entities/UserChapter.cs +++ b/backend/src/Domain/Entities/UserChapter.cs @@ -10,6 +10,13 @@ public class UserChapter public required string Html { get; set; } public required string PlainText { get; set; } public int? WordCount { get; set; } + + /// + /// Deterministic extraction-quality score 0-100 (see ChapterContentQualityAnalyzer). + /// Null = not yet analyzed. Below the flag threshold → candidate for LLM cleanup. + /// + public int? 
ContentQualityScore { get; set; } + public DateTimeOffset CreatedAt { get; set; } public UserBook UserBook { get; set; } = null!; diff --git a/backend/src/Infrastructure/Migrations/20260522170250_AddChapterContentQualityScore.Designer.cs b/backend/src/Infrastructure/Migrations/20260522170250_AddChapterContentQualityScore.Designer.cs new file mode 100644 index 00000000..44f4e644 --- /dev/null +++ b/backend/src/Infrastructure/Migrations/20260522170250_AddChapterContentQualityScore.Designer.cs @@ -0,0 +1,4265 @@ +// +using System; +using Infrastructure.Persistence; +using Microsoft.EntityFrameworkCore; +using Microsoft.EntityFrameworkCore.Infrastructure; +using Microsoft.EntityFrameworkCore.Migrations; +using Microsoft.EntityFrameworkCore.Storage.ValueConversion; +using Npgsql.EntityFrameworkCore.PostgreSQL.Metadata; +using NpgsqlTypes; + +#nullable disable + +namespace Infrastructure.Migrations +{ + [DbContext(typeof(AppDbContext))] + [Migration("20260522170250_AddChapterContentQualityScore")] + partial class AddChapterContentQualityScore + { + /// + protected override void BuildTargetModel(ModelBuilder modelBuilder) + { +#pragma warning disable 612, 618 + modelBuilder + .HasAnnotation("ProductVersion", "10.0.8") + .HasAnnotation("Relational:MaxIdentifierLength", 63); + + NpgsqlModelBuilderExtensions.UseIdentityByDefaultColumns(modelBuilder); + + modelBuilder.Entity("Domain.Entities.AdminRefreshToken", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("AdminUserId") + .HasColumnType("uuid") + .HasColumnName("admin_user_id"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("ExpiresAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("expires_at"); + + b.Property("Token") + .IsRequired() + .HasColumnType("text") + .HasColumnName("token"); + + b.HasKey("Id") + .HasName("pk_admin_refresh_tokens"); + + 
b.HasIndex("AdminUserId") + .HasDatabaseName("ix_admin_refresh_tokens_admin_user_id"); + + b.HasIndex("ExpiresAt") + .HasDatabaseName("ix_admin_refresh_tokens_expires_at"); + + b.HasIndex("Token") + .IsUnique() + .HasDatabaseName("ix_admin_refresh_tokens_token"); + + b.ToTable("admin_refresh_tokens", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.AdminSettings", b => + { + b.Property("Key") + .HasMaxLength(100) + .HasColumnType("character varying(100)") + .HasColumnName("key"); + + b.Property("UpdatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("updated_at"); + + b.Property("Value") + .IsRequired() + .HasMaxLength(500) + .HasColumnType("character varying(500)") + .HasColumnName("value"); + + b.HasKey("Key") + .HasName("pk_admin_settings"); + + b.ToTable("admin_settings", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.AdminUser", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("Email") + .IsRequired() + .HasColumnType("text") + .HasColumnName("email"); + + b.Property("IsActive") + .HasColumnType("boolean") + .HasColumnName("is_active"); + + b.Property("PasswordHash") + .IsRequired() + .HasColumnType("text") + .HasColumnName("password_hash"); + + b.Property("Role") + .HasColumnType("integer") + .HasColumnName("role"); + + b.Property("UpdatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("updated_at"); + + b.HasKey("Id") + .HasName("pk_admin_users"); + + b.HasIndex("Email") + .IsUnique() + .HasDatabaseName("ix_admin_users_email"); + + b.ToTable("admin_users", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.Author", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("Bio") + .HasColumnType("text") + .HasColumnName("bio"); + + 
b.Property("CanonicalOverride") + .HasColumnType("text") + .HasColumnName("canonical_override"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("ExternalLinksJson") + .HasColumnType("jsonb") + .HasColumnName("external_links_json"); + + b.Property("Indexable") + .HasColumnType("boolean") + .HasColumnName("indexable"); + + b.Property("Name") + .IsRequired() + .HasMaxLength(255) + .HasColumnType("character varying(255)") + .HasColumnName("name"); + + b.Property("PhotoPath") + .HasColumnType("text") + .HasColumnName("photo_path"); + + b.Property("SeoDescription") + .HasColumnType("text") + .HasColumnName("seo_description"); + + b.Property("SeoFaqsJson") + .HasColumnType("text") + .HasColumnName("seo_faqs_json"); + + b.Property("SeoRelevanceText") + .HasColumnType("text") + .HasColumnName("seo_relevance_text"); + + b.Property("SeoSource") + .ValueGeneratedOnAdd() + .HasColumnType("integer") + .HasDefaultValue(0) + .HasColumnName("seo_source"); + + b.Property("SeoThemesJson") + .HasColumnType("text") + .HasColumnName("seo_themes_json"); + + b.Property("SeoTitle") + .HasColumnType("text") + .HasColumnName("seo_title"); + + b.Property("SiteId") + .HasColumnType("uuid") + .HasColumnName("site_id"); + + b.Property("Slug") + .IsRequired() + .HasMaxLength(255) + .HasColumnType("character varying(255)") + .HasColumnName("slug"); + + b.Property("UpdatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("updated_at"); + + b.HasKey("Id") + .HasName("pk_authors"); + + b.HasIndex("SiteId") + .HasDatabaseName("ix_authors_site_id"); + + b.HasIndex("SiteId", "Slug") + .IsUnique() + .HasDatabaseName("ix_authors_site_id_slug"); + + b.ToTable("authors", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.AutoPublishJob", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time 
zone") + .HasColumnName("created_at"); + + b.Property("EditionId") + .HasColumnType("uuid") + .HasColumnName("edition_id"); + + b.Property("Error") + .HasColumnType("text") + .HasColumnName("error"); + + b.Property("FinishedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("finished_at"); + + b.Property("GeneratedAuthorSeo") + .HasColumnType("boolean") + .HasColumnName("generated_author_seo"); + + b.Property("GeneratedEditionSeo") + .HasColumnType("boolean") + .HasColumnName("generated_edition_seo"); + + b.Property("LogOutput") + .HasColumnType("text") + .HasColumnName("log_output"); + + b.Property("Priority") + .HasColumnType("boolean") + .HasColumnName("priority"); + + b.Property("SiteId") + .HasColumnType("uuid") + .HasColumnName("site_id"); + + b.Property("StartedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("started_at"); + + b.Property("Status") + .HasColumnType("integer") + .HasColumnName("status"); + + b.HasKey("Id") + .HasName("pk_auto_publish_jobs"); + + b.HasIndex("EditionId") + .HasDatabaseName("ix_auto_publish_jobs_edition_id"); + + b.HasIndex("SiteId") + .HasDatabaseName("ix_auto_publish_jobs_site_id"); + + b.ToTable("auto_publish_jobs", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.BookAsset", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("ByteSize") + .HasColumnType("bigint") + .HasColumnName("byte_size"); + + b.Property("ContentType") + .IsRequired() + .HasMaxLength(100) + .HasColumnType("character varying(100)") + .HasColumnName("content_type"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("EditionId") + .HasColumnType("uuid") + .HasColumnName("edition_id"); + + b.Property("Kind") + .IsRequired() + .HasMaxLength(20) + .HasColumnType("character varying(20)") + .HasColumnName("kind"); + + b.Property("OriginalPath") + .IsRequired() + .HasMaxLength(500) + 
.HasColumnType("character varying(500)") + .HasColumnName("original_path"); + + b.Property("StoragePath") + .IsRequired() + .HasMaxLength(500) + .HasColumnType("character varying(500)") + .HasColumnName("storage_path"); + + b.HasKey("Id") + .HasName("pk_book_assets"); + + b.HasIndex("EditionId") + .HasDatabaseName("ix_book_assets_edition_id"); + + b.HasIndex("EditionId", "OriginalPath") + .IsUnique() + .HasDatabaseName("ix_book_assets_edition_id_original_path"); + + b.ToTable("book_assets", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.BookCollection", b => + { + b.Property("CollectionId") + .HasColumnType("uuid") + .HasColumnName("collection_id"); + + b.Property("BookId") + .HasColumnType("uuid") + .HasColumnName("book_id"); + + b.Property("BookType") + .HasMaxLength(20) + .HasColumnType("character varying(20)") + .HasColumnName("book_type"); + + b.Property("AddedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("added_at"); + + b.HasKey("CollectionId", "BookId", "BookType") + .HasName("pk_book_collections"); + + b.HasIndex("BookId") + .HasDatabaseName("ix_book_collections_book_id"); + + b.ToTable("book_collections", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.BookFile", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("EditionId") + .HasColumnType("uuid") + .HasColumnName("edition_id"); + + b.Property("Format") + .HasColumnType("integer") + .HasColumnName("format"); + + b.Property("OriginalFileName") + .IsRequired() + .HasColumnType("text") + .HasColumnName("original_file_name"); + + b.Property("Sha256") + .HasColumnType("text") + .HasColumnName("sha256"); + + b.Property("StoragePath") + .IsRequired() + .HasColumnType("text") + .HasColumnName("storage_path"); + + b.Property("UploadedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("uploaded_at"); + + b.HasKey("Id") + .HasName("pk_book_files"); + + b.HasIndex("EditionId") + 
.HasDatabaseName("ix_book_files_edition_id"); + + b.HasIndex("Sha256") + .HasDatabaseName("ix_book_files_sha256"); + + b.ToTable("book_files", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.BookQualityJob", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("ContentChaptersCleaned") + .HasColumnType("integer") + .HasColumnName("content_chapters_cleaned"); + + b.Property("ContentChaptersRejected") + .HasColumnType("integer") + .HasColumnName("content_chapters_rejected"); + + b.Property("ContentChaptersSkipped") + .HasColumnType("integer") + .HasColumnName("content_chapters_skipped"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("EditionId") + .HasColumnType("uuid") + .HasColumnName("edition_id"); + + b.Property("Error") + .HasColumnType("text") + .HasColumnName("error"); + + b.Property("FinishedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("finished_at"); + + b.Property("IssuesFixed") + .HasColumnType("integer") + .HasColumnName("issues_fixed"); + + b.Property("IssuesFound") + .HasColumnType("integer") + .HasColumnName("issues_found"); + + b.Property("IssuesJson") + .HasColumnType("jsonb") + .HasColumnName("issues_json"); + + b.Property("LogOutput") + .HasColumnType("text") + .HasColumnName("log_output"); + + b.Property("StartedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("started_at"); + + b.Property("Status") + .HasColumnType("integer") + .HasColumnName("status"); + + b.Property("UserBookId") + .HasColumnType("uuid") + .HasColumnName("user_book_id"); + + b.HasKey("Id") + .HasName("pk_book_quality_jobs"); + + b.HasIndex("EditionId") + .HasDatabaseName("ix_book_quality_jobs_edition_id"); + + b.HasIndex("Status") + .HasDatabaseName("ix_book_quality_jobs_status"); + + b.HasIndex("UserBookId") + .HasDatabaseName("ix_book_quality_jobs_user_book_id"); + + 
b.ToTable("book_quality_jobs", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.Bookmark", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("ChapterId") + .HasColumnType("uuid") + .HasColumnName("chapter_id"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("EditionId") + .HasColumnType("uuid") + .HasColumnName("edition_id"); + + b.Property("Locator") + .IsRequired() + .HasColumnType("text") + .HasColumnName("locator"); + + b.Property("SiteId") + .HasColumnType("uuid") + .HasColumnName("site_id"); + + b.Property("Title") + .HasColumnType("text") + .HasColumnName("title"); + + b.Property("UserId") + .HasColumnType("uuid") + .HasColumnName("user_id"); + + b.HasKey("Id") + .HasName("pk_bookmarks"); + + b.HasIndex("ChapterId") + .HasDatabaseName("ix_bookmarks_chapter_id"); + + b.HasIndex("EditionId") + .HasDatabaseName("ix_bookmarks_edition_id"); + + b.HasIndex("SiteId") + .HasDatabaseName("ix_bookmarks_site_id"); + + b.HasIndex("UserId", "SiteId", "EditionId") + .HasDatabaseName("ix_bookmarks_user_id_site_id_edition_id"); + + b.ToTable("bookmarks", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.Chapter", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("ChapterNumber") + .HasColumnType("integer") + .HasColumnName("chapter_number"); + + b.Property("ContentQualityScore") + .HasColumnType("integer") + .HasColumnName("content_quality_score"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("EditionId") + .HasColumnType("uuid") + .HasColumnName("edition_id"); + + b.Property("Html") + .IsRequired() + .HasColumnType("text") + .HasColumnName("html"); + + b.Property("OriginalChapterNumber") + .HasColumnType("integer") + .HasColumnName("original_chapter_number"); + + 
b.Property("PartNumber") + .HasColumnType("integer") + .HasColumnName("part_number"); + + b.Property("PlainText") + .IsRequired() + .HasColumnType("text") + .HasColumnName("plain_text"); + + b.Property("SearchVector") + .IsRequired() + .HasColumnType("tsvector") + .HasColumnName("search_vector"); + + b.Property("Slug") + .HasColumnType("text") + .HasColumnName("slug"); + + b.Property("Title") + .IsRequired() + .HasColumnType("text") + .HasColumnName("title"); + + b.Property("TotalParts") + .HasColumnType("integer") + .HasColumnName("total_parts"); + + b.Property("UpdatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("updated_at"); + + b.Property("WordCount") + .HasColumnType("integer") + .HasColumnName("word_count"); + + b.HasKey("Id") + .HasName("pk_chapters"); + + b.HasIndex("SearchVector") + .HasDatabaseName("ix_chapters_search_vector"); + + NpgsqlIndexBuilderExtensions.HasMethod(b.HasIndex("SearchVector"), "GIN"); + + b.HasIndex("EditionId", "ChapterNumber") + .IsUnique() + .HasDatabaseName("ix_chapters_edition_id_chapter_number"); + + b.HasIndex("EditionId", "Slug") + .HasDatabaseName("ix_chapters_edition_id_slug"); + + b.ToTable("chapters", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.Collection", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("Color") + .IsRequired() + .ValueGeneratedOnAdd() + .HasMaxLength(20) + .HasColumnType("character varying(20)") + .HasDefaultValue("default") + .HasColumnName("color"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("Name") + .IsRequired() + .HasMaxLength(100) + .HasColumnType("character varying(100)") + .HasColumnName("name"); + + b.Property("SortOrder") + .HasColumnType("integer") + .HasColumnName("sort_order"); + + b.Property("UpdatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("updated_at"); + + b.Property("UserId") + 
.HasColumnType("uuid") + .HasColumnName("user_id"); + + b.HasKey("Id") + .HasName("pk_collections"); + + b.HasIndex("UserId") + .HasDatabaseName("ix_collections_user_id"); + + b.HasIndex("UserId", "SortOrder") + .HasDatabaseName("ix_collections_user_id_sort_order"); + + b.ToTable("collections", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.Edition", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("CanonicalOverride") + .HasColumnType("text") + .HasColumnName("canonical_override"); + + b.Property("CoverPath") + .HasColumnType("text") + .HasColumnName("cover_path"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("Description") + .HasColumnType("text") + .HasColumnName("description"); + + b.Property("Indexable") + .HasColumnType("boolean") + .HasColumnName("indexable"); + + b.Property("IsPublicDomain") + .HasColumnType("boolean") + .HasColumnName("is_public_domain"); + + b.Property("Language") + .IsRequired() + .HasMaxLength(8) + .HasColumnType("character varying(8)") + .HasColumnName("language"); + + b.Property("PublishedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("published_at"); + + b.Property("SeoDescription") + .HasColumnType("text") + .HasColumnName("seo_description"); + + b.Property("SeoFaqsJson") + .HasColumnType("text") + .HasColumnName("seo_faqs_json"); + + b.Property("SeoRelevanceText") + .HasColumnType("text") + .HasColumnName("seo_relevance_text"); + + b.Property("SeoSource") + .ValueGeneratedOnAdd() + .HasColumnType("integer") + .HasDefaultValue(0) + .HasColumnName("seo_source"); + + b.Property("SeoThemesJson") + .HasColumnType("text") + .HasColumnName("seo_themes_json"); + + b.Property("SeoTitle") + .HasColumnType("text") + .HasColumnName("seo_title"); + + b.Property("SiteId") + .HasColumnType("uuid") + .HasColumnName("site_id"); + + b.Property("Slug") + .IsRequired() + 
.HasColumnType("text") + .HasColumnName("slug"); + + b.Property("SourceEditionId") + .HasColumnType("uuid") + .HasColumnName("source_edition_id"); + + b.Property("Status") + .HasColumnType("integer") + .HasColumnName("status"); + + b.Property("Title") + .IsRequired() + .HasColumnType("text") + .HasColumnName("title"); + + b.Property("TocJson") + .HasColumnType("jsonb") + .HasColumnName("toc_json"); + + b.Property("UpdatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("updated_at"); + + b.Property("WorkId") + .HasColumnType("uuid") + .HasColumnName("work_id"); + + b.HasKey("Id") + .HasName("pk_editions"); + + b.HasIndex("SiteId") + .HasDatabaseName("ix_editions_site_id"); + + b.HasIndex("SourceEditionId") + .HasDatabaseName("ix_editions_source_edition_id"); + + b.HasIndex("Status") + .HasDatabaseName("ix_editions_status"); + + b.HasIndex("WorkId", "Language") + .IsUnique() + .HasDatabaseName("ix_editions_work_id_language"); + + b.HasIndex("SiteId", "Language", "Slug") + .IsUnique() + .HasDatabaseName("ix_editions_site_id_language_slug"); + + b.ToTable("editions", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.EditionAuthor", b => + { + b.Property("EditionId") + .HasColumnType("uuid") + .HasColumnName("edition_id"); + + b.Property("AuthorId") + .HasColumnType("uuid") + .HasColumnName("author_id"); + + b.Property("Order") + .HasColumnType("integer") + .HasColumnName("order"); + + b.Property("Role") + .IsRequired() + .HasMaxLength(50) + .HasColumnType("character varying(50)") + .HasColumnName("role"); + + b.HasKey("EditionId", "AuthorId") + .HasName("pk_edition_authors"); + + b.HasIndex("AuthorId") + .HasDatabaseName("ix_edition_authors_author_id"); + + b.ToTable("edition_authors", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.Genre", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + 
.HasColumnName("created_at"); + + b.Property("Description") + .HasColumnType("text") + .HasColumnName("description"); + + b.Property("Indexable") + .HasColumnType("boolean") + .HasColumnName("indexable"); + + b.Property("Name") + .IsRequired() + .HasMaxLength(100) + .HasColumnType("character varying(100)") + .HasColumnName("name"); + + b.Property("SeoDescription") + .HasColumnType("text") + .HasColumnName("seo_description"); + + b.Property("SeoSource") + .ValueGeneratedOnAdd() + .HasColumnType("integer") + .HasDefaultValue(0) + .HasColumnName("seo_source"); + + b.Property("SeoTitle") + .HasColumnType("text") + .HasColumnName("seo_title"); + + b.Property("SiteId") + .HasColumnType("uuid") + .HasColumnName("site_id"); + + b.Property("Slug") + .IsRequired() + .HasMaxLength(100) + .HasColumnType("character varying(100)") + .HasColumnName("slug"); + + b.Property("UpdatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("updated_at"); + + b.HasKey("Id") + .HasName("pk_genres"); + + b.HasIndex("SiteId") + .HasDatabaseName("ix_genres_site_id"); + + b.HasIndex("SiteId", "Slug") + .IsUnique() + .HasDatabaseName("ix_genres_site_id_slug"); + + b.ToTable("genres", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.Highlight", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("AnchorJson") + .IsRequired() + .HasColumnType("jsonb") + .HasColumnName("anchor_json"); + + b.Property("ChapterId") + .HasColumnType("uuid") + .HasColumnName("chapter_id"); + + b.Property("Color") + .IsRequired() + .HasMaxLength(20) + .HasColumnType("character varying(20)") + .HasColumnName("color"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("EditionId") + .HasColumnType("uuid") + .HasColumnName("edition_id"); + + b.Property("LastReviewedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("last_reviewed_at"); + + 
b.Property("NoteText") + .HasColumnType("text") + .HasColumnName("note_text"); + + b.Property("SelectedText") + .IsRequired() + .HasColumnType("text") + .HasColumnName("selected_text"); + + b.Property("SiteId") + .HasColumnType("uuid") + .HasColumnName("site_id"); + + b.Property("UpdatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("updated_at"); + + b.Property("UserBookId") + .HasColumnType("uuid") + .HasColumnName("user_book_id"); + + b.Property("UserChapterId") + .HasColumnType("uuid") + .HasColumnName("user_chapter_id"); + + b.Property("UserId") + .HasColumnType("uuid") + .HasColumnName("user_id"); + + b.Property("Version") + .HasColumnType("integer") + .HasColumnName("version"); + + b.HasKey("Id") + .HasName("pk_highlights"); + + b.HasIndex("ChapterId") + .HasDatabaseName("ix_highlights_chapter_id"); + + b.HasIndex("EditionId") + .HasDatabaseName("ix_highlights_edition_id"); + + b.HasIndex("SiteId") + .HasDatabaseName("ix_highlights_site_id"); + + b.HasIndex("UserBookId") + .HasDatabaseName("ix_highlights_user_book_id"); + + b.HasIndex("UserChapterId") + .HasDatabaseName("ix_highlights_user_chapter_id"); + + b.HasIndex("UserId", "SiteId", "EditionId") + .HasDatabaseName("ix_highlights_user_id_site_id_edition_id") + .HasFilter("edition_id IS NOT NULL"); + + b.HasIndex("UserId", "SiteId", "UserBookId") + .HasDatabaseName("ix_highlights_user_id_site_id_user_book_id") + .HasFilter("user_book_id IS NOT NULL"); + + b.ToTable("highlights", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.IngestionJob", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("AttemptCount") + .HasColumnType("integer") + .HasColumnName("attempt_count"); + + b.Property("BookFileId") + .HasColumnType("uuid") + .HasColumnName("book_file_id"); + + b.Property("Confidence") + .HasColumnType("double precision") + .HasColumnName("confidence"); + + b.Property("CreatedAt") + .HasColumnType("timestamp 
with time zone") + .HasColumnName("created_at"); + + b.Property("EditionId") + .HasColumnType("uuid") + .HasColumnName("edition_id"); + + b.Property("Error") + .HasColumnType("text") + .HasColumnName("error"); + + b.Property("FinishedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("finished_at"); + + b.Property("SourceEditionId") + .HasColumnType("uuid") + .HasColumnName("source_edition_id"); + + b.Property("SourceFormat") + .HasMaxLength(20) + .HasColumnType("character varying(20)") + .HasColumnName("source_format"); + + b.Property("StartedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("started_at"); + + b.Property("Status") + .HasColumnType("integer") + .HasColumnName("status"); + + b.Property("TargetLanguage") + .IsRequired() + .HasMaxLength(8) + .HasColumnType("character varying(8)") + .HasColumnName("target_language"); + + b.Property("TextSource") + .HasMaxLength(20) + .HasColumnType("character varying(20)") + .HasColumnName("text_source"); + + b.Property("UnitsCount") + .HasColumnType("integer") + .HasColumnName("units_count"); + + b.Property("WarningsJson") + .HasColumnType("jsonb") + .HasColumnName("warnings_json"); + + b.Property("WorkId") + .HasColumnType("uuid") + .HasColumnName("work_id"); + + b.HasKey("Id") + .HasName("pk_ingestion_jobs"); + + b.HasIndex("BookFileId") + .HasDatabaseName("ix_ingestion_jobs_book_file_id"); + + b.HasIndex("CreatedAt") + .HasDatabaseName("ix_ingestion_jobs_created_at"); + + b.HasIndex("EditionId") + .HasDatabaseName("ix_ingestion_jobs_edition_id"); + + b.HasIndex("SourceEditionId") + .HasDatabaseName("ix_ingestion_jobs_source_edition_id"); + + b.HasIndex("Status") + .HasDatabaseName("ix_ingestion_jobs_status"); + + b.HasIndex("WorkId") + .HasDatabaseName("ix_ingestion_jobs_work_id"); + + b.ToTable("ingestion_jobs", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.LintResult", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + 
.HasColumnName("id"); + + b.Property("ChapterNumber") + .HasColumnType("integer") + .HasColumnName("chapter_number"); + + b.Property("Code") + .IsRequired() + .HasMaxLength(10) + .HasColumnType("character varying(10)") + .HasColumnName("code"); + + b.Property("Context") + .HasColumnType("text") + .HasColumnName("context"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("EditionId") + .HasColumnType("uuid") + .HasColumnName("edition_id"); + + b.Property("LineNumber") + .HasColumnType("integer") + .HasColumnName("line_number"); + + b.Property("Message") + .IsRequired() + .HasColumnType("text") + .HasColumnName("message"); + + b.Property("Severity") + .IsRequired() + .HasMaxLength(20) + .HasColumnType("character varying(20)") + .HasColumnName("severity"); + + b.HasKey("Id") + .HasName("pk_lint_results"); + + b.HasIndex("EditionId") + .HasDatabaseName("ix_lint_results_edition_id"); + + b.ToTable("lint_results", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.Note", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("ChapterId") + .HasColumnType("uuid") + .HasColumnName("chapter_id"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("EditionId") + .HasColumnType("uuid") + .HasColumnName("edition_id"); + + b.Property("HighlightId") + .HasColumnType("uuid") + .HasColumnName("highlight_id"); + + b.Property("Locator") + .IsRequired() + .HasColumnType("text") + .HasColumnName("locator"); + + b.Property("SiteId") + .HasColumnType("uuid") + .HasColumnName("site_id"); + + b.Property("Text") + .IsRequired() + .HasColumnType("text") + .HasColumnName("text"); + + b.Property("UpdatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("updated_at"); + + b.Property("UserId") + .HasColumnType("uuid") + .HasColumnName("user_id"); + + 
b.Property("Version") + .HasColumnType("integer") + .HasColumnName("version"); + + b.HasKey("Id") + .HasName("pk_notes"); + + b.HasIndex("ChapterId") + .HasDatabaseName("ix_notes_chapter_id"); + + b.HasIndex("EditionId") + .HasDatabaseName("ix_notes_edition_id"); + + b.HasIndex("HighlightId") + .IsUnique() + .HasDatabaseName("ix_notes_highlight_id"); + + b.HasIndex("SiteId") + .HasDatabaseName("ix_notes_site_id"); + + b.HasIndex("UserId", "SiteId", "EditionId") + .HasDatabaseName("ix_notes_user_id_site_id_edition_id"); + + b.ToTable("notes", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.PasswordResetToken", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("ExpiresAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("expires_at"); + + b.Property("TokenHash") + .IsRequired() + .HasMaxLength(128) + .HasColumnType("character varying(128)") + .HasColumnName("token_hash"); + + b.Property("Used") + .HasColumnType("boolean") + .HasColumnName("used"); + + b.Property("UserId") + .HasColumnType("uuid") + .HasColumnName("user_id"); + + b.HasKey("Id") + .HasName("pk_password_reset_tokens"); + + b.HasIndex("TokenHash") + .IsUnique() + .HasDatabaseName("ix_password_reset_tokens_token_hash"); + + b.HasIndex("UserId") + .HasDatabaseName("ix_password_reset_tokens_user_id"); + + b.ToTable("password_reset_tokens", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.PendingVocabularyWord", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("BookTitle") + .HasMaxLength(500) + .HasColumnType("character varying(500)") + .HasColumnName("book_title"); + + b.Property("ChapterId") + .HasColumnType("uuid") + .HasColumnName("chapter_id"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + 
.HasColumnName("created_at"); + + b.Property("Definition") + .HasMaxLength(2000) + .HasColumnType("character varying(2000)") + .HasColumnName("definition"); + + b.Property("EditionId") + .HasColumnType("uuid") + .HasColumnName("edition_id"); + + b.Property("Language") + .IsRequired() + .HasMaxLength(8) + .HasColumnType("character varying(8)") + .HasColumnName("language"); + + b.Property("Priority") + .HasColumnType("double precision") + .HasColumnName("priority"); + + b.Property("Sentence") + .HasMaxLength(1000) + .HasColumnType("character varying(1000)") + .HasColumnName("sentence"); + + b.Property("SiteId") + .HasColumnType("uuid") + .HasColumnName("site_id"); + + b.Property("Source") + .IsRequired() + .HasMaxLength(40) + .HasColumnType("character varying(40)") + .HasColumnName("source"); + + b.Property("Translation") + .HasMaxLength(500) + .HasColumnType("character varying(500)") + .HasColumnName("translation"); + + b.Property("UserBookId") + .HasColumnType("uuid") + .HasColumnName("user_book_id"); + + b.Property("UserId") + .HasColumnType("uuid") + .HasColumnName("user_id"); + + b.Property("Word") + .IsRequired() + .HasMaxLength(200) + .HasColumnType("character varying(200)") + .HasColumnName("word"); + + b.Property("ZipfRank") + .HasColumnType("integer") + .HasColumnName("zipf_rank"); + + b.Property("ZipfScore") + .HasColumnType("double precision") + .HasColumnName("zipf_score"); + + b.HasKey("Id") + .HasName("pk_pending_vocabulary_words"); + + b.HasIndex("ChapterId") + .HasDatabaseName("ix_pending_vocabulary_words_chapter_id"); + + b.HasIndex("EditionId") + .HasDatabaseName("ix_pending_vocabulary_words_edition_id"); + + b.HasIndex("SiteId") + .HasDatabaseName("ix_pending_vocabulary_words_site_id"); + + b.HasIndex("UserBookId") + .HasDatabaseName("ix_pending_vocabulary_words_user_book_id"); + + b.HasIndex("UserId", "SiteId", "CreatedAt") + .HasDatabaseName("ix_pending_vocabulary_words_user_id_site_id_created_at"); + + b.HasIndex("UserId", "SiteId", "Priority") 
+ .IsDescending(false, false, true) + .HasDatabaseName("ix_pending_vocabulary_words_user_id_site_id_priority"); + + b.HasIndex("UserId", "SiteId", "Word", "Language") + .IsUnique() + .HasDatabaseName("ix_pending_vocabulary_words_user_id_site_id_word_language"); + + b.ToTable("pending_vocabulary_words", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.ReadingGoal", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("GoalType") + .IsRequired() + .HasMaxLength(50) + .HasColumnType("character varying(50)") + .HasColumnName("goal_type"); + + b.Property("IsActive") + .HasColumnType("boolean") + .HasColumnName("is_active"); + + b.Property("SiteId") + .HasColumnType("uuid") + .HasColumnName("site_id"); + + b.Property("StreakMinMinutes") + .HasColumnType("integer") + .HasColumnName("streak_min_minutes"); + + b.Property("TargetValue") + .HasColumnType("integer") + .HasColumnName("target_value"); + + b.Property("UpdatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("updated_at"); + + b.Property("UserId") + .HasColumnType("uuid") + .HasColumnName("user_id"); + + b.Property("Year") + .HasColumnType("integer") + .HasColumnName("year"); + + b.HasKey("Id") + .HasName("pk_reading_goals"); + + b.HasIndex("SiteId") + .HasDatabaseName("ix_reading_goals_site_id"); + + b.HasIndex("UserId", "SiteId") + .HasDatabaseName("ix_reading_goals_user_id_site_id"); + + b.HasIndex("UserId", "SiteId", "GoalType") + .IsUnique() + .HasDatabaseName("ix_reading_goals_user_id_site_id_goal_type"); + + b.ToTable("reading_goals", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.ReadingProgress", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("ChapterId") + .HasColumnType("uuid") + .HasColumnName("chapter_id"); + + 
b.Property("EditionId") + .HasColumnType("uuid") + .HasColumnName("edition_id"); + + b.Property("Locator") + .IsRequired() + .HasColumnType("text") + .HasColumnName("locator"); + + b.Property("Percent") + .HasColumnType("double precision") + .HasColumnName("percent"); + + b.Property("SiteId") + .HasColumnType("uuid") + .HasColumnName("site_id"); + + b.Property("UpdatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("updated_at"); + + b.Property("UserId") + .HasColumnType("uuid") + .HasColumnName("user_id"); + + b.HasKey("Id") + .HasName("pk_reading_progresses"); + + b.HasIndex("ChapterId") + .HasDatabaseName("ix_reading_progresses_chapter_id"); + + b.HasIndex("EditionId") + .HasDatabaseName("ix_reading_progresses_edition_id"); + + b.HasIndex("SiteId") + .HasDatabaseName("ix_reading_progresses_site_id"); + + b.HasIndex("UserId", "SiteId", "EditionId") + .IsUnique() + .HasDatabaseName("ix_reading_progresses_user_id_site_id_edition_id"); + + b.ToTable("reading_progresses", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.ReadingSession", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("DurationSeconds") + .HasColumnType("integer") + .HasColumnName("duration_seconds"); + + b.Property("EditionId") + .HasColumnType("uuid") + .HasColumnName("edition_id"); + + b.Property("EndPercent") + .HasColumnType("double precision") + .HasColumnName("end_percent"); + + b.Property("EndedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("ended_at"); + + b.Property("SiteId") + .HasColumnType("uuid") + .HasColumnName("site_id"); + + b.Property("StartPercent") + .HasColumnType("double precision") + .HasColumnName("start_percent"); + + b.Property("StartedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("started_at"); + + b.Property("UserBookId") + 
.HasColumnType("uuid") + .HasColumnName("user_book_id"); + + b.Property("UserId") + .HasColumnType("uuid") + .HasColumnName("user_id"); + + b.Property("WordsRead") + .HasColumnType("integer") + .HasColumnName("words_read"); + + b.HasKey("Id") + .HasName("pk_reading_sessions"); + + b.HasIndex("EditionId") + .HasDatabaseName("ix_reading_sessions_edition_id"); + + b.HasIndex("SiteId") + .HasDatabaseName("ix_reading_sessions_site_id"); + + b.HasIndex("UserBookId") + .HasDatabaseName("ix_reading_sessions_user_book_id"); + + b.HasIndex("UserId", "SiteId") + .HasDatabaseName("ix_reading_sessions_user_id_site_id"); + + b.HasIndex("UserId", "StartedAt") + .HasDatabaseName("ix_reading_sessions_user_id_started_at"); + + b.HasIndex("UserId", "EditionId", "StartedAt") + .IsUnique() + .HasDatabaseName("ix_reading_sessions_user_id_edition_id_started_at") + .HasFilter("edition_id IS NOT NULL"); + + b.HasIndex("UserId", "UserBookId", "StartedAt") + .IsUnique() + .HasDatabaseName("ix_reading_sessions_user_id_user_book_id_started_at") + .HasFilter("user_book_id IS NOT NULL"); + + b.ToTable("reading_sessions", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.SeoBackfillJob", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("AfterSnapshot") + .HasColumnType("jsonb") + .HasColumnName("after_snapshot"); + + b.Property("ApprovedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("approved_at"); + + b.Property("ApprovedByUserId") + .HasColumnType("uuid") + .HasColumnName("approved_by_user_id"); + + b.Property("BeforeSnapshot") + .HasColumnType("jsonb") + .HasColumnName("before_snapshot"); + + b.Property("CompletedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("completed_at"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("EntityId") + .HasColumnType("uuid") + .HasColumnName("entity_id"); + + 
b.Property("EntityType") + .HasColumnType("integer") + .HasColumnName("entity_type"); + + b.Property("Error") + .HasColumnType("text") + .HasColumnName("error"); + + b.Property("GeneratedContent") + .HasColumnType("jsonb") + .HasColumnName("generated_content"); + + b.Property("InputSnapshot") + .HasColumnType("jsonb") + .HasColumnName("input_snapshot"); + + b.Property("RawOutputs") + .HasColumnType("jsonb") + .HasColumnName("raw_outputs"); + + b.Property("RenderedPrompts") + .HasColumnType("jsonb") + .HasColumnName("rendered_prompts"); + + b.Property("RequiresReview") + .HasColumnType("boolean") + .HasColumnName("requires_review"); + + b.Property("StartedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("started_at"); + + b.Property("Status") + .HasColumnType("integer") + .HasColumnName("status"); + + b.PrimitiveCollection("TargetFields") + .IsRequired() + .HasColumnType("text[]") + .HasColumnName("target_fields"); + + b.PrimitiveCollection("TemplateIds") + .IsRequired() + .HasColumnType("uuid[]") + .HasColumnName("template_ids"); + + b.PrimitiveCollection("TemplateVersions") + .IsRequired() + .HasColumnType("integer[]") + .HasColumnName("template_versions"); + + b.Property("TriggeredBy") + .IsRequired() + .HasMaxLength(200) + .HasColumnType("character varying(200)") + .HasColumnName("triggered_by"); + + b.HasKey("Id") + .HasName("pk_seo_backfill_jobs"); + + b.HasIndex("EntityType", "EntityId") + .HasDatabaseName("ix_seo_backfill_jobs_entity_type_entity_id"); + + b.HasIndex("Status", "CreatedAt") + .HasDatabaseName("ix_seo_backfill_jobs_status_created_at"); + + b.ToTable("seo_backfill_jobs", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.SeoBackfillSettings", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("Enabled") + .HasColumnType("boolean") + .HasColumnName("enabled"); + + b.PrimitiveCollection("EntityTypeFilter") + .IsRequired() + .HasColumnType("text[]") + 
.HasColumnName("entity_type_filter"); + + b.Property("IntervalSeconds") + .HasColumnType("integer") + .HasColumnName("interval_seconds"); + + b.Property("JobsPerRun") + .HasColumnType("integer") + .HasColumnName("jobs_per_run"); + + b.PrimitiveCollection("LanguageFilter") + .IsRequired() + .HasColumnType("text[]") + .HasColumnName("language_filter"); + + b.Property("SsgRebuildBatchMinutes") + .HasColumnType("integer") + .HasColumnName("ssg_rebuild_batch_minutes"); + + b.Property("UpdatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("updated_at"); + + b.HasKey("Id") + .HasName("pk_seo_backfill_settings"); + + b.ToTable("seo_backfill_settings", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.SeoTemplate", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("Description") + .HasMaxLength(500) + .HasColumnType("character varying(500)") + .HasColumnName("description"); + + b.Property("EntityType") + .HasColumnType("integer") + .HasColumnName("entity_type"); + + b.Property("FieldType") + .HasColumnType("integer") + .HasColumnName("field_type"); + + b.Property("IsActive") + .HasColumnType("boolean") + .HasColumnName("is_active"); + + b.Property("LanguageCode") + .IsRequired() + .HasMaxLength(8) + .HasColumnType("character varying(8)") + .HasColumnName("language_code"); + + b.Property("MaxTokens") + .HasColumnType("integer") + .HasColumnName("max_tokens"); + + b.Property("Model") + .IsRequired() + .HasMaxLength(100) + .HasColumnType("character varying(100)") + .HasColumnName("model"); + + b.Property("Name") + .IsRequired() + .HasMaxLength(200) + .HasColumnType("character varying(200)") + .HasColumnName("name"); + + b.Property("OutputSchema") + .IsRequired() + .HasColumnType("jsonb") + .HasColumnName("output_schema"); + + b.Property("PromptTemplate") + .IsRequired() + 
.HasColumnType("text") + .HasColumnName("prompt_template"); + + b.Property("Temperature") + .HasColumnType("double precision") + .HasColumnName("temperature"); + + b.Property("TrustLevel") + .HasColumnType("integer") + .HasColumnName("trust_level"); + + b.Property("UpdatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("updated_at"); + + b.Property("Version") + .HasColumnType("integer") + .HasColumnName("version"); + + b.HasKey("Id") + .HasName("pk_seo_templates"); + + b.HasIndex("EntityType", "FieldType", "LanguageCode", "IsActive") + .HasDatabaseName("ix_seo_templates_entity_type_field_type_language_code_is_active"); + + b.ToTable("seo_templates", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.Site", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("AdsEnabled") + .HasColumnType("boolean") + .HasColumnName("ads_enabled"); + + b.Property("Code") + .IsRequired() + .HasMaxLength(50) + .HasColumnType("character varying(50)") + .HasColumnName("code"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("DefaultLanguage") + .IsRequired() + .HasMaxLength(10) + .HasColumnType("character varying(10)") + .HasColumnName("default_language"); + + b.Property("FeaturesJson") + .IsRequired() + .HasColumnType("jsonb") + .HasColumnName("features_json"); + + b.Property("IndexingEnabled") + .HasColumnType("boolean") + .HasColumnName("indexing_enabled"); + + b.Property("PrimaryDomain") + .IsRequired() + .HasMaxLength(255) + .HasColumnType("character varying(255)") + .HasColumnName("primary_domain"); + + b.Property("SitemapEnabled") + .HasColumnType("boolean") + .HasColumnName("sitemap_enabled"); + + b.Property("Theme") + .IsRequired() + .HasMaxLength(50) + .HasColumnType("character varying(50)") + .HasColumnName("theme"); + + b.Property("UpdatedAt") + .HasColumnType("timestamp with time zone") + 
.HasColumnName("updated_at"); + + b.HasKey("Id") + .HasName("pk_sites"); + + b.HasIndex("Code") + .IsUnique() + .HasDatabaseName("ix_sites_code"); + + b.HasIndex("PrimaryDomain") + .IsUnique() + .HasDatabaseName("ix_sites_primary_domain"); + + b.ToTable("sites", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.SiteDomain", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("Domain") + .IsRequired() + .HasMaxLength(255) + .HasColumnType("character varying(255)") + .HasColumnName("domain"); + + b.Property("IsPrimary") + .HasColumnType("boolean") + .HasColumnName("is_primary"); + + b.Property("SiteId") + .HasColumnType("uuid") + .HasColumnName("site_id"); + + b.HasKey("Id") + .HasName("pk_site_domains"); + + b.HasIndex("Domain") + .IsUnique() + .HasDatabaseName("ix_site_domains_domain"); + + b.HasIndex("SiteId") + .HasDatabaseName("ix_site_domains_site_id"); + + b.ToTable("site_domains", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.SsgRebuildJob", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("AuthorSlugsJson") + .HasColumnType("jsonb") + .HasColumnName("author_slugs_json"); + + b.Property("BookSlugsJson") + .HasColumnType("jsonb") + .HasColumnName("book_slugs_json"); + + b.Property("Concurrency") + .HasColumnType("integer") + .HasColumnName("concurrency"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("Error") + .HasColumnType("text") + .HasColumnName("error"); + + b.Property("FailedCount") + .HasColumnType("integer") + .HasColumnName("failed_count"); + + b.Property("FinishedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("finished_at"); + + b.Property("GenreSlugsJson") + .HasColumnType("jsonb") + 
.HasColumnName("genre_slugs_json"); + + b.Property("Mode") + .IsRequired() + .HasMaxLength(20) + .HasColumnType("character varying(20)") + .HasColumnName("mode"); + + b.Property("RenderedCount") + .HasColumnType("integer") + .HasColumnName("rendered_count"); + + b.Property("SiteId") + .HasColumnType("uuid") + .HasColumnName("site_id"); + + b.Property("StartedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("started_at"); + + b.Property("Status") + .IsRequired() + .HasMaxLength(20) + .HasColumnType("character varying(20)") + .HasColumnName("status"); + + b.Property("TimeoutMs") + .HasColumnType("integer") + .HasColumnName("timeout_ms"); + + b.Property("TotalRoutes") + .HasColumnType("integer") + .HasColumnName("total_routes"); + + b.HasKey("Id") + .HasName("pk_ssg_rebuild_jobs"); + + b.HasIndex("CreatedAt") + .HasDatabaseName("ix_ssg_rebuild_jobs_created_at"); + + b.HasIndex("SiteId") + .HasDatabaseName("ix_ssg_rebuild_jobs_site_id"); + + b.HasIndex("Status") + .HasDatabaseName("ix_ssg_rebuild_jobs_status"); + + b.ToTable("ssg_rebuild_jobs", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.SsgRebuildResult", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("Error") + .HasColumnType("text") + .HasColumnName("error"); + + b.Property("JobId") + .HasColumnType("uuid") + .HasColumnName("job_id"); + + b.Property("RenderTimeMs") + .HasColumnType("integer") + .HasColumnName("render_time_ms"); + + b.Property("RenderedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("rendered_at"); + + b.Property("Route") + .IsRequired() + .HasMaxLength(500) + .HasColumnType("character varying(500)") + .HasColumnName("route"); + + b.Property("RouteType") + .IsRequired() + .HasMaxLength(20) + .HasColumnType("character varying(20)") + .HasColumnName("route_type"); + + b.Property("Success") + .HasColumnType("boolean") + .HasColumnName("success"); + + b.HasKey("Id") + 
.HasName("pk_ssg_rebuild_results"); + + b.HasIndex("JobId") + .HasDatabaseName("ix_ssg_rebuild_results_job_id"); + + b.HasIndex("JobId", "Route") + .IsUnique() + .HasDatabaseName("ix_ssg_rebuild_results_job_id_route"); + + b.ToTable("ssg_rebuild_results", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.TextStackImport", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("EditionId") + .HasColumnType("uuid") + .HasColumnName("edition_id"); + + b.Property("Identifier") + .IsRequired() + .HasMaxLength(500) + .HasColumnType("character varying(500)") + .HasColumnName("identifier"); + + b.Property("ImportedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("imported_at"); + + b.Property("SiteId") + .HasColumnType("uuid") + .HasColumnName("site_id"); + + b.HasKey("Id") + .HasName("pk_text_stack_imports"); + + b.HasIndex("EditionId") + .HasDatabaseName("ix_text_stack_imports_edition_id"); + + b.HasIndex("SiteId") + .HasDatabaseName("ix_text_stack_imports_site_id"); + + b.HasIndex("SiteId", "Identifier") + .IsUnique() + .HasDatabaseName("ix_text_stack_imports_site_id_identifier"); + + b.ToTable("text_stack_imports", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.User", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("AppleSubject") + .HasMaxLength(255) + .HasColumnType("character varying(255)") + .HasColumnName("apple_subject"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("Email") + .IsRequired() + .HasMaxLength(255) + .HasColumnType("character varying(255)") + .HasColumnName("email"); + + b.Property("GoogleSubject") + .HasMaxLength(255) + .HasColumnType("character varying(255)") + .HasColumnName("google_subject"); + + b.Property("IsGuest") + .ValueGeneratedOnAdd() + .HasColumnType("boolean") + .HasDefaultValue(false) + 
.HasColumnName("is_guest"); + + b.Property("LastActiveAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("last_active_at"); + + b.Property("Name") + .HasMaxLength(255) + .HasColumnType("character varying(255)") + .HasColumnName("name"); + + b.Property("NativeLanguage") + .HasMaxLength(16) + .HasColumnType("character varying(16)") + .HasColumnName("native_language"); + + b.Property("PasswordHash") + .HasMaxLength(255) + .HasColumnType("character varying(255)") + .HasColumnName("password_hash"); + + b.Property("Picture") + .HasColumnType("text") + .HasColumnName("picture"); + + b.Property("StorageUsedBytes") + .HasColumnType("bigint") + .HasColumnName("storage_used_bytes"); + + b.HasKey("Id") + .HasName("pk_users"); + + b.HasIndex("AppleSubject") + .IsUnique() + .HasDatabaseName("ix_users_apple_subject") + .HasFilter("apple_subject IS NOT NULL"); + + b.HasIndex("Email") + .IsUnique() + .HasDatabaseName("ix_users_email"); + + b.HasIndex("GoogleSubject") + .IsUnique() + .HasDatabaseName("ix_users_google_subject") + .HasFilter("google_subject IS NOT NULL"); + + b.HasIndex("IsGuest", "LastActiveAt") + .HasDatabaseName("ix_users_guest_cleanup") + .HasFilter("is_guest = true"); + + b.ToTable("users", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.UserAchievement", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("AchievementCode") + .IsRequired() + .HasMaxLength(50) + .HasColumnType("character varying(50)") + .HasColumnName("achievement_code"); + + b.Property("SiteId") + .HasColumnType("uuid") + .HasColumnName("site_id"); + + b.Property("UnlockedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("unlocked_at"); + + b.Property("UserId") + .HasColumnType("uuid") + .HasColumnName("user_id"); + + b.HasKey("Id") + .HasName("pk_user_achievements"); + + b.HasIndex("SiteId") + .HasDatabaseName("ix_user_achievements_site_id"); + + b.HasIndex("UserId", "SiteId") + 
.HasDatabaseName("ix_user_achievements_user_id_site_id"); + + b.HasIndex("UserId", "SiteId", "AchievementCode") + .IsUnique() + .HasDatabaseName("ix_user_achievements_user_id_site_id_achievement_code"); + + b.ToTable("user_achievements", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.UserBook", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("Author") + .HasMaxLength(500) + .HasColumnType("character varying(500)") + .HasColumnName("author"); + + b.Property("CompletedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("completed_at"); + + b.Property("CoverPath") + .HasMaxLength(500) + .HasColumnType("character varying(500)") + .HasColumnName("cover_path"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("Description") + .HasColumnType("text") + .HasColumnName("description"); + + b.Property("ErrorMessage") + .HasColumnType("text") + .HasColumnName("error_message"); + + b.Property("Genre") + .HasMaxLength(200) + .HasColumnType("character varying(200)") + .HasColumnName("genre"); + + b.Property("Language") + .IsRequired() + .HasMaxLength(10) + .HasColumnType("character varying(10)") + .HasColumnName("language"); + + b.Property("MetadataHistoryJson") + .HasColumnType("jsonb") + .HasColumnName("metadata_history_json"); + + b.Property("ProgressChapterSlug") + .HasColumnType("text") + .HasColumnName("progress_chapter_slug"); + + b.Property("ProgressLocator") + .HasColumnType("text") + .HasColumnName("progress_locator"); + + b.Property("ProgressPercent") + .HasColumnType("double precision") + .HasColumnName("progress_percent"); + + b.Property("ProgressUpdatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("progress_updated_at"); + + b.Property("PublishedYear") + .HasColumnType("integer") + .HasColumnName("published_year"); + + b.Property("SeoSource") + .IsRequired() + 
.ValueGeneratedOnAdd() + .HasMaxLength(20) + .HasColumnType("character varying(20)") + .HasDefaultValue("auto") + .HasColumnName("seo_source"); + + b.Property("Slug") + .IsRequired() + .HasMaxLength(500) + .HasColumnType("character varying(500)") + .HasColumnName("slug"); + + b.Property("Status") + .HasColumnType("integer") + .HasColumnName("status"); + + b.PrimitiveCollection("SuggestedTags") + .IsRequired() + .ValueGeneratedOnAdd() + .HasColumnType("text[]") + .HasColumnName("suggested_tags") + .HasDefaultValueSql("ARRAY[]::text[]"); + + b.Property("SuggestedTagsAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("suggested_tags_at"); + + b.PrimitiveCollection("Tags") + .IsRequired() + .ValueGeneratedOnAdd() + .HasColumnType("text[]") + .HasColumnName("tags") + .HasDefaultValueSql("ARRAY[]::text[]"); + + b.Property("TakedownAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("takedown_at"); + + b.Property("TakedownReason") + .HasMaxLength(1000) + .HasColumnType("character varying(1000)") + .HasColumnName("takedown_reason"); + + b.Property("Title") + .IsRequired() + .HasMaxLength(500) + .HasColumnType("character varying(500)") + .HasColumnName("title"); + + b.Property("TocJson") + .HasColumnType("jsonb") + .HasColumnName("toc_json"); + + b.Property("TotalWordCount") + .HasColumnType("integer") + .HasColumnName("total_word_count"); + + b.Property("UpdatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("updated_at"); + + b.Property("UserId") + .HasColumnType("uuid") + .HasColumnName("user_id"); + + b.HasKey("Id") + .HasName("pk_user_books"); + + b.HasIndex("Status") + .HasDatabaseName("ix_user_books_status"); + + b.HasIndex("Tags") + .HasDatabaseName("ix_user_books_tags"); + + NpgsqlIndexBuilderExtensions.HasMethod(b.HasIndex("Tags"), "gin"); + + b.HasIndex("UserId") + .HasDatabaseName("ix_user_books_user_id"); + + b.HasIndex("UserId", "Slug") + .IsUnique() + .HasDatabaseName("ix_user_books_user_id_slug"); + + 
b.ToTable("user_books", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.UserBookBookmark", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("ChapterId") + .HasColumnType("uuid") + .HasColumnName("chapter_id"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("Locator") + .IsRequired() + .HasMaxLength(1000) + .HasColumnType("character varying(1000)") + .HasColumnName("locator"); + + b.Property("Title") + .HasMaxLength(500) + .HasColumnType("character varying(500)") + .HasColumnName("title"); + + b.Property("UserBookId") + .HasColumnType("uuid") + .HasColumnName("user_book_id"); + + b.HasKey("Id") + .HasName("pk_user_book_bookmarks"); + + b.HasIndex("ChapterId") + .HasDatabaseName("ix_user_book_bookmarks_chapter_id"); + + b.HasIndex("UserBookId") + .HasDatabaseName("ix_user_book_bookmarks_user_book_id"); + + b.ToTable("user_book_bookmarks", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.UserBookFile", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("FileSize") + .HasColumnType("bigint") + .HasColumnName("file_size"); + + b.Property("Format") + .HasColumnType("integer") + .HasColumnName("format"); + + b.Property("OriginalFileName") + .IsRequired() + .HasMaxLength(500) + .HasColumnType("character varying(500)") + .HasColumnName("original_file_name"); + + b.Property("Sha256") + .HasMaxLength(64) + .HasColumnType("character varying(64)") + .HasColumnName("sha256"); + + b.Property("StoragePath") + .IsRequired() + .HasMaxLength(500) + .HasColumnType("character varying(500)") + .HasColumnName("storage_path"); + + b.Property("UploadedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("uploaded_at"); + + b.Property("UserBookId") + .HasColumnType("uuid") + .HasColumnName("user_book_id"); + + b.HasKey("Id") + 
.HasName("pk_user_book_files"); + + b.HasIndex("Sha256") + .HasDatabaseName("ix_user_book_files_sha256"); + + b.HasIndex("UserBookId") + .HasDatabaseName("ix_user_book_files_user_book_id"); + + b.ToTable("user_book_files", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.UserChapter", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("ChapterNumber") + .HasColumnType("integer") + .HasColumnName("chapter_number"); + + b.Property("ContentQualityScore") + .HasColumnType("integer") + .HasColumnName("content_quality_score"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("Html") + .IsRequired() + .HasColumnType("text") + .HasColumnName("html"); + + b.Property("PlainText") + .IsRequired() + .HasColumnType("text") + .HasColumnName("plain_text"); + + b.Property("Slug") + .HasMaxLength(255) + .HasColumnType("character varying(255)") + .HasColumnName("slug"); + + b.Property("Title") + .IsRequired() + .HasMaxLength(500) + .HasColumnType("character varying(500)") + .HasColumnName("title"); + + b.Property("UserBookId") + .HasColumnType("uuid") + .HasColumnName("user_book_id"); + + b.Property("WordCount") + .HasColumnType("integer") + .HasColumnName("word_count"); + + b.HasKey("Id") + .HasName("pk_user_chapters"); + + b.HasIndex("UserBookId") + .HasDatabaseName("ix_user_chapters_user_book_id"); + + b.HasIndex("UserBookId", "ChapterNumber") + .IsUnique() + .HasDatabaseName("ix_user_chapters_user_book_id_chapter_number"); + + b.HasIndex("UserBookId", "Slug") + .IsUnique() + .HasDatabaseName("ix_user_chapters_user_book_id_slug"); + + b.ToTable("user_chapters", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.UserIngestionJob", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("AttemptCount") + .HasColumnType("integer") + .HasColumnName("attempt_count"); 
+ + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("Error") + .HasColumnType("text") + .HasColumnName("error"); + + b.Property("FinishedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("finished_at"); + + b.Property("SourceFormat") + .HasMaxLength(50) + .HasColumnType("character varying(50)") + .HasColumnName("source_format"); + + b.Property("StartedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("started_at"); + + b.Property("Status") + .HasColumnType("integer") + .HasColumnName("status"); + + b.Property("UnitsCount") + .HasColumnType("integer") + .HasColumnName("units_count"); + + b.Property("UserBookFileId") + .HasColumnType("uuid") + .HasColumnName("user_book_file_id"); + + b.Property("UserBookId") + .HasColumnType("uuid") + .HasColumnName("user_book_id"); + + b.HasKey("Id") + .HasName("pk_user_ingestion_jobs"); + + b.HasIndex("CreatedAt") + .HasDatabaseName("ix_user_ingestion_jobs_created_at"); + + b.HasIndex("Status") + .HasDatabaseName("ix_user_ingestion_jobs_status"); + + b.HasIndex("UserBookFileId") + .HasDatabaseName("ix_user_ingestion_jobs_user_book_file_id"); + + b.HasIndex("UserBookId") + .HasDatabaseName("ix_user_ingestion_jobs_user_book_id"); + + b.ToTable("user_ingestion_jobs", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.UserLibrary", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("EditionId") + .HasColumnType("uuid") + .HasColumnName("edition_id"); + + b.Property("UserId") + .HasColumnType("uuid") + .HasColumnName("user_id"); + + b.HasKey("Id") + .HasName("pk_user_libraries"); + + b.HasIndex("EditionId") + .HasDatabaseName("ix_user_libraries_edition_id"); + + b.HasIndex("UserId", "EditionId") + .IsUnique() + 
.HasDatabaseName("ix_user_libraries_user_id_edition_id"); + + b.ToTable("user_libraries", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.UserRefreshToken", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("ExpiresAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("expires_at"); + + b.Property("Token") + .IsRequired() + .HasColumnType("text") + .HasColumnName("token"); + + b.Property("UserId") + .HasColumnType("uuid") + .HasColumnName("user_id"); + + b.HasKey("Id") + .HasName("pk_user_refresh_tokens"); + + b.HasIndex("ExpiresAt") + .HasDatabaseName("ix_user_refresh_tokens_expires_at"); + + b.HasIndex("Token") + .IsUnique() + .HasDatabaseName("ix_user_refresh_tokens_token"); + + b.HasIndex("UserId") + .HasDatabaseName("ix_user_refresh_tokens_user_id"); + + b.ToTable("user_refresh_tokens", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.UserVocabularySettings", b => + { + b.Property("UserId") + .HasColumnType("uuid") + .HasColumnName("user_id"); + + b.Property("SiteId") + .HasColumnType("uuid") + .HasColumnName("site_id"); + + b.Property("AutoRetireEnabled") + .HasColumnType("boolean") + .HasColumnName("auto_retire_enabled"); + + b.Property("ClusteringEnabled") + .HasColumnType("boolean") + .HasColumnName("clustering_enabled"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("DailyNewCap") + .HasColumnType("integer") + .HasColumnName("daily_new_cap"); + + b.Property("FrequencyFilterEnabled") + .HasColumnType("boolean") + .HasColumnName("frequency_filter_enabled"); + + b.Property("UpdatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("updated_at"); + + b.Property("WeeklyReviewBudget") + .HasColumnType("integer") + .HasColumnName("weekly_review_budget"); + + 
b.HasKey("UserId", "SiteId") + .HasName("pk_user_vocabulary_settings"); + + b.HasIndex("SiteId") + .HasDatabaseName("ix_user_vocabulary_settings_site_id"); + + b.ToTable("user_vocabulary_settings", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.VocabularyReview", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("IsCorrect") + .HasColumnType("boolean") + .HasColumnName("is_correct"); + + b.Property("ResponseTimeMs") + .HasColumnType("integer") + .HasColumnName("response_time_ms"); + + b.Property("ReviewMode") + .IsRequired() + .HasMaxLength(30) + .HasColumnType("character varying(30)") + .HasColumnName("review_mode"); + + b.Property("SiteId") + .HasColumnType("uuid") + .HasColumnName("site_id"); + + b.Property("StageAfter") + .HasColumnType("integer") + .HasColumnName("stage_after"); + + b.Property("StageBefore") + .HasColumnType("integer") + .HasColumnName("stage_before"); + + b.Property("UserId") + .HasColumnType("uuid") + .HasColumnName("user_id"); + + b.Property("VocabularyWordId") + .HasColumnType("uuid") + .HasColumnName("vocabulary_word_id"); + + b.HasKey("Id") + .HasName("pk_vocabulary_reviews"); + + b.HasIndex("SiteId") + .HasDatabaseName("ix_vocabulary_reviews_site_id"); + + b.HasIndex("VocabularyWordId") + .HasDatabaseName("ix_vocabulary_reviews_vocabulary_word_id"); + + b.HasIndex("UserId", "SiteId", "CreatedAt") + .HasDatabaseName("ix_vocabulary_reviews_user_id_site_id_created_at"); + + b.ToTable("vocabulary_reviews", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.VocabularyWord", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("ActivatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("activated_at"); + + b.Property("BookTitle") + .HasMaxLength(500) + 
.HasColumnType("character varying(500)") + .HasColumnName("book_title"); + + b.Property("ChapterId") + .HasColumnType("uuid") + .HasColumnName("chapter_id"); + + b.Property("ClusterId") + .HasColumnType("uuid") + .HasColumnName("cluster_id"); + + b.Property("ConsecutiveCorrect") + .HasColumnType("integer") + .HasColumnName("consecutive_correct"); + + b.Property("CorrectReviews") + .HasColumnType("integer") + .HasColumnName("correct_reviews"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("Definition") + .HasMaxLength(2000) + .HasColumnType("character varying(2000)") + .HasColumnName("definition"); + + b.Property("Distractors") + .HasColumnType("text") + .HasColumnName("distractors"); + + b.Property("EditionId") + .HasColumnType("uuid") + .HasColumnName("edition_id"); + + b.Property("Explanation") + .HasMaxLength(1000) + .HasColumnType("character varying(1000)") + .HasColumnName("explanation"); + + b.Property("Hint") + .HasMaxLength(500) + .HasColumnType("character varying(500)") + .HasColumnName("hint"); + + b.Property("IntervalDays") + .HasColumnType("double precision") + .HasColumnName("interval_days"); + + b.Property("IsRetired") + .HasColumnType("boolean") + .HasColumnName("is_retired"); + + b.Property("Language") + .IsRequired() + .HasMaxLength(8) + .HasColumnType("character varying(8)") + .HasColumnName("language"); + + b.Property("LastReviewedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("last_reviewed_at"); + + b.Property("NextReviewAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("next_review_at"); + + b.Property("Priority") + .HasColumnType("double precision") + .HasColumnName("priority"); + + b.Property("RetiredAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("retired_at"); + + b.Property("RetiredReason") + .HasMaxLength(60) + .HasColumnType("character varying(60)") + .HasColumnName("retired_reason"); + + 
b.Property("Sentence") + .HasMaxLength(1000) + .HasColumnType("character varying(1000)") + .HasColumnName("sentence"); + + b.Property("SiteId") + .HasColumnType("uuid") + .HasColumnName("site_id"); + + b.Property("Source") + .IsRequired() + .HasMaxLength(40) + .HasColumnType("character varying(40)") + .HasColumnName("source"); + + b.Property("Stage") + .HasColumnType("integer") + .HasColumnName("stage"); + + b.Property("TotalReviews") + .HasColumnType("integer") + .HasColumnName("total_reviews"); + + b.Property("Translation") + .HasMaxLength(500) + .HasColumnType("character varying(500)") + .HasColumnName("translation"); + + b.Property("UpdatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("updated_at"); + + b.Property("UserBookId") + .HasColumnType("uuid") + .HasColumnName("user_book_id"); + + b.Property("UserId") + .HasColumnType("uuid") + .HasColumnName("user_id"); + + b.Property("Word") + .IsRequired() + .HasMaxLength(200) + .HasColumnType("character varying(200)") + .HasColumnName("word"); + + b.Property("ZipfRank") + .HasColumnType("integer") + .HasColumnName("zipf_rank"); + + b.Property("ZipfScore") + .HasColumnType("double precision") + .HasColumnName("zipf_score"); + + b.HasKey("Id") + .HasName("pk_vocabulary_words"); + + b.HasIndex("ChapterId") + .HasDatabaseName("ix_vocabulary_words_chapter_id"); + + b.HasIndex("ClusterId") + .HasDatabaseName("ix_vocabulary_words_cluster_id"); + + b.HasIndex("EditionId") + .HasDatabaseName("ix_vocabulary_words_edition_id"); + + b.HasIndex("SiteId") + .HasDatabaseName("ix_vocabulary_words_site_id"); + + b.HasIndex("UserBookId") + .HasDatabaseName("ix_vocabulary_words_user_book_id"); + + b.HasIndex("UserId", "SiteId") + .HasDatabaseName("ix_vocabulary_words_user_id_site_id"); + + b.HasIndex("UserId", "SiteId", "IsRetired", "NextReviewAt") + .HasDatabaseName("ix_vocabulary_words_user_id_site_id_is_retired_next_review_at"); + + b.HasIndex("UserId", "SiteId", "Word", "Language") + .IsUnique() + 
.HasDatabaseName("ix_vocabulary_words_user_id_site_id_word_language"); + + b.ToTable("vocabulary_words", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.WordCluster", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("BookTitle") + .HasMaxLength(500) + .HasColumnType("character varying(500)") + .HasColumnName("book_title"); + + b.Property("CohesionScore") + .HasColumnType("double precision") + .HasColumnName("cohesion_score"); + + b.Property("CompletedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("completed_at"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("DismissedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("dismissed_at"); + + b.Property("EditionId") + .HasColumnType("uuid") + .HasColumnName("edition_id"); + + b.Property("IsConfirmed") + .HasColumnType("boolean") + .HasColumnName("is_confirmed"); + + b.Property("IsDismissed") + .HasColumnType("boolean") + .HasColumnName("is_dismissed"); + + b.Property("MemberCount") + .HasColumnType("integer") + .HasColumnName("member_count"); + + b.Property("SiteId") + .HasColumnType("uuid") + .HasColumnName("site_id"); + + b.Property("Theme") + .HasMaxLength(100) + .HasColumnType("character varying(100)") + .HasColumnName("theme"); + + b.Property("Title") + .IsRequired() + .HasMaxLength(200) + .HasColumnType("character varying(200)") + .HasColumnName("title"); + + b.Property("UserBookId") + .HasColumnType("uuid") + .HasColumnName("user_book_id"); + + b.Property("UserId") + .HasColumnType("uuid") + .HasColumnName("user_id"); + + b.HasKey("Id") + .HasName("pk_word_clusters"); + + b.HasIndex("EditionId") + .HasDatabaseName("ix_word_clusters_edition_id"); + + b.HasIndex("SiteId") + .HasDatabaseName("ix_word_clusters_site_id"); + + b.HasIndex("UserBookId") + .HasDatabaseName("ix_word_clusters_user_book_id"); + + b.HasIndex("UserId", 
"SiteId", "IsDismissed", "CreatedAt") + .IsDescending(false, false, false, true) + .HasDatabaseName("ix_word_clusters_user_id_site_id_is_dismissed_created_at"); + + b.ToTable("word_clusters", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.WordFrequency", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("Language") + .IsRequired() + .HasMaxLength(8) + .HasColumnType("character varying(8)") + .HasColumnName("language"); + + b.Property("Pos") + .HasMaxLength(20) + .HasColumnType("character varying(20)") + .HasColumnName("pos"); + + b.Property("Rank") + .HasColumnType("integer") + .HasColumnName("rank"); + + b.Property("Word") + .IsRequired() + .HasMaxLength(200) + .HasColumnType("character varying(200)") + .HasColumnName("word"); + + b.Property("Zipf") + .HasColumnType("double precision") + .HasColumnName("zipf"); + + b.HasKey("Id") + .HasName("pk_word_frequencies"); + + b.HasIndex("Language", "Rank") + .HasDatabaseName("ix_word_frequencies_language_rank"); + + b.HasIndex("Language", "Word") + .IsUnique() + .HasDatabaseName("ix_word_frequencies_language_word"); + + b.ToTable("word_frequencies", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.WordLookup", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("BookTitle") + .HasMaxLength(500) + .HasColumnType("character varying(500)") + .HasColumnName("book_title"); + + b.Property("ChapterId") + .HasColumnType("uuid") + .HasColumnName("chapter_id"); + + b.Property("EditionId") + .HasColumnType("uuid") + .HasColumnName("edition_id"); + + b.Property("FirstTappedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("first_tapped_at"); + + b.Property("Language") + .IsRequired() + .HasMaxLength(8) + .HasColumnType("character varying(8)") + .HasColumnName("language"); + + b.Property("LastTappedAt") + .HasColumnType("timestamp with time zone") + 
.HasColumnName("last_tapped_at"); + + b.Property("LastTranslation") + .HasMaxLength(500) + .HasColumnType("character varying(500)") + .HasColumnName("last_translation"); + + b.Property("Sentence") + .HasMaxLength(1000) + .HasColumnType("character varying(1000)") + .HasColumnName("sentence"); + + b.Property("SiteId") + .HasColumnType("uuid") + .HasColumnName("site_id"); + + b.Property("TapCount") + .HasColumnType("integer") + .HasColumnName("tap_count"); + + b.Property("UserBookId") + .HasColumnType("uuid") + .HasColumnName("user_book_id"); + + b.Property("UserId") + .HasColumnType("uuid") + .HasColumnName("user_id"); + + b.Property("Word") + .IsRequired() + .HasMaxLength(200) + .HasColumnType("character varying(200)") + .HasColumnName("word"); + + b.Property("ZipfRank") + .HasColumnType("integer") + .HasColumnName("zipf_rank"); + + b.HasKey("Id") + .HasName("pk_word_lookups"); + + b.HasIndex("ChapterId") + .HasDatabaseName("ix_word_lookups_chapter_id"); + + b.HasIndex("EditionId") + .HasDatabaseName("ix_word_lookups_edition_id"); + + b.HasIndex("SiteId") + .HasDatabaseName("ix_word_lookups_site_id"); + + b.HasIndex("UserBookId") + .HasDatabaseName("ix_word_lookups_user_book_id"); + + b.HasIndex("UserId", "SiteId", "LastTappedAt") + .IsDescending(false, false, true) + .HasDatabaseName("ix_word_lookups_user_id_site_id_last_tapped_at"); + + b.HasIndex("UserId", "SiteId", "Word", "Language") + .IsUnique() + .HasDatabaseName("ix_word_lookups_user_id_site_id_word_language"); + + b.ToTable("word_lookups", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.Work", b => + { + b.Property("Id") + .ValueGeneratedOnAdd() + .HasColumnType("uuid") + .HasColumnName("id"); + + b.Property("CreatedAt") + .HasColumnType("timestamp with time zone") + .HasColumnName("created_at"); + + b.Property("SiteId") + .HasColumnType("uuid") + .HasColumnName("site_id"); + + b.Property("Slug") + .IsRequired() + .HasColumnType("text") + .HasColumnName("slug"); + + b.HasKey("Id") + 
.HasName("pk_works"); + + b.HasIndex("SiteId") + .HasDatabaseName("ix_works_site_id"); + + b.HasIndex("SiteId", "Slug") + .IsUnique() + .HasDatabaseName("ix_works_site_id_slug"); + + b.ToTable("works", (string)null); + }); + + modelBuilder.Entity("edition_genres", b => + { + b.Property("EditionsId") + .HasColumnType("uuid") + .HasColumnName("editions_id"); + + b.Property("GenresId") + .HasColumnType("uuid") + .HasColumnName("genres_id"); + + b.HasKey("EditionsId", "GenresId") + .HasName("pk_edition_genres"); + + b.HasIndex("GenresId") + .HasDatabaseName("ix_edition_genres_genres_id"); + + b.ToTable("edition_genres", (string)null); + }); + + modelBuilder.Entity("Domain.Entities.AdminRefreshToken", b => + { + b.HasOne("Domain.Entities.AdminUser", "AdminUser") + .WithMany("RefreshTokens") + .HasForeignKey("AdminUserId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_admin_refresh_tokens_admin_users_admin_user_id"); + + b.Navigation("AdminUser"); + }); + + modelBuilder.Entity("Domain.Entities.Author", b => + { + b.HasOne("Domain.Entities.Site", "Site") + .WithMany() + .HasForeignKey("SiteId") + .OnDelete(DeleteBehavior.Restrict) + .IsRequired() + .HasConstraintName("fk_authors_sites_site_id"); + + b.Navigation("Site"); + }); + + modelBuilder.Entity("Domain.Entities.AutoPublishJob", b => + { + b.HasOne("Domain.Entities.Edition", "Edition") + .WithMany() + .HasForeignKey("EditionId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_auto_publish_jobs_editions_edition_id"); + + b.HasOne("Domain.Entities.Site", "Site") + .WithMany() + .HasForeignKey("SiteId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_auto_publish_jobs_sites_site_id"); + + b.Navigation("Edition"); + + b.Navigation("Site"); + }); + + modelBuilder.Entity("Domain.Entities.BookAsset", b => + { + b.HasOne("Domain.Entities.Edition", "Edition") + .WithMany("Assets") + .HasForeignKey("EditionId") + 
.OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_book_assets_editions_edition_id"); + + b.Navigation("Edition"); + }); + + modelBuilder.Entity("Domain.Entities.BookCollection", b => + { + b.HasOne("Domain.Entities.Collection", "Collection") + .WithMany("Books") + .HasForeignKey("CollectionId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_book_collections_collections_collection_id"); + + b.Navigation("Collection"); + }); + + modelBuilder.Entity("Domain.Entities.BookFile", b => + { + b.HasOne("Domain.Entities.Edition", "Edition") + .WithMany("BookFiles") + .HasForeignKey("EditionId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_book_files_editions_edition_id"); + + b.Navigation("Edition"); + }); + + modelBuilder.Entity("Domain.Entities.BookQualityJob", b => + { + b.HasOne("Domain.Entities.Edition", "Edition") + .WithMany() + .HasForeignKey("EditionId") + .OnDelete(DeleteBehavior.Cascade) + .HasConstraintName("fk_book_quality_jobs_editions_edition_id"); + + b.HasOne("Domain.Entities.UserBook", "UserBook") + .WithMany() + .HasForeignKey("UserBookId") + .OnDelete(DeleteBehavior.Cascade) + .HasConstraintName("fk_book_quality_jobs_user_books_user_book_id"); + + b.Navigation("Edition"); + + b.Navigation("UserBook"); + }); + + modelBuilder.Entity("Domain.Entities.Bookmark", b => + { + b.HasOne("Domain.Entities.Chapter", "Chapter") + .WithMany("Bookmarks") + .HasForeignKey("ChapterId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_bookmarks_chapters_chapter_id"); + + b.HasOne("Domain.Entities.Edition", "Edition") + .WithMany() + .HasForeignKey("EditionId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_bookmarks_editions_edition_id"); + + b.HasOne("Domain.Entities.Site", "Site") + .WithMany() + .HasForeignKey("SiteId") + .OnDelete(DeleteBehavior.Restrict) + .IsRequired() + .HasConstraintName("fk_bookmarks_sites_site_id"); + + 
b.HasOne("Domain.Entities.User", "User") + .WithMany("Bookmarks") + .HasForeignKey("UserId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_bookmarks_users_user_id"); + + b.Navigation("Chapter"); + + b.Navigation("Edition"); + + b.Navigation("Site"); + + b.Navigation("User"); + }); + + modelBuilder.Entity("Domain.Entities.Chapter", b => + { + b.HasOne("Domain.Entities.Edition", "Edition") + .WithMany("Chapters") + .HasForeignKey("EditionId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_chapters_editions_edition_id"); + + b.Navigation("Edition"); + }); + + modelBuilder.Entity("Domain.Entities.Collection", b => + { + b.HasOne("Domain.Entities.User", "User") + .WithMany() + .HasForeignKey("UserId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_collections_users_user_id"); + + b.Navigation("User"); + }); + + modelBuilder.Entity("Domain.Entities.Edition", b => + { + b.HasOne("Domain.Entities.Site", "Site") + .WithMany() + .HasForeignKey("SiteId") + .OnDelete(DeleteBehavior.Restrict) + .IsRequired() + .HasConstraintName("fk_editions_sites_site_id"); + + b.HasOne("Domain.Entities.Edition", "SourceEdition") + .WithMany("TranslatedEditions") + .HasForeignKey("SourceEditionId") + .OnDelete(DeleteBehavior.SetNull) + .HasConstraintName("fk_editions_editions_source_edition_id"); + + b.HasOne("Domain.Entities.Work", "Work") + .WithMany("Editions") + .HasForeignKey("WorkId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_editions_works_work_id"); + + b.Navigation("Site"); + + b.Navigation("SourceEdition"); + + b.Navigation("Work"); + }); + + modelBuilder.Entity("Domain.Entities.EditionAuthor", b => + { + b.HasOne("Domain.Entities.Author", "Author") + .WithMany("EditionAuthors") + .HasForeignKey("AuthorId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_edition_authors_authors_author_id"); + + b.HasOne("Domain.Entities.Edition", 
"Edition") + .WithMany("EditionAuthors") + .HasForeignKey("EditionId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_edition_authors_editions_edition_id"); + + b.Navigation("Author"); + + b.Navigation("Edition"); + }); + + modelBuilder.Entity("Domain.Entities.Genre", b => + { + b.HasOne("Domain.Entities.Site", "Site") + .WithMany() + .HasForeignKey("SiteId") + .OnDelete(DeleteBehavior.Restrict) + .IsRequired() + .HasConstraintName("fk_genres_sites_site_id"); + + b.Navigation("Site"); + }); + + modelBuilder.Entity("Domain.Entities.Highlight", b => + { + b.HasOne("Domain.Entities.Chapter", "Chapter") + .WithMany() + .HasForeignKey("ChapterId") + .OnDelete(DeleteBehavior.SetNull) + .HasConstraintName("fk_highlights_chapters_chapter_id"); + + b.HasOne("Domain.Entities.Edition", "Edition") + .WithMany() + .HasForeignKey("EditionId") + .OnDelete(DeleteBehavior.SetNull) + .HasConstraintName("fk_highlights_editions_edition_id"); + + b.HasOne("Domain.Entities.Site", "Site") + .WithMany() + .HasForeignKey("SiteId") + .OnDelete(DeleteBehavior.Restrict) + .IsRequired() + .HasConstraintName("fk_highlights_sites_site_id"); + + b.HasOne("Domain.Entities.UserBook", "UserBook") + .WithMany() + .HasForeignKey("UserBookId") + .OnDelete(DeleteBehavior.SetNull) + .HasConstraintName("fk_highlights_user_books_user_book_id"); + + b.HasOne("Domain.Entities.UserChapter", "UserChapter") + .WithMany() + .HasForeignKey("UserChapterId") + .OnDelete(DeleteBehavior.SetNull) + .HasConstraintName("fk_highlights_user_chapters_user_chapter_id"); + + b.HasOne("Domain.Entities.User", "User") + .WithMany("Highlights") + .HasForeignKey("UserId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_highlights_users_user_id"); + + b.Navigation("Chapter"); + + b.Navigation("Edition"); + + b.Navigation("Site"); + + b.Navigation("User"); + + b.Navigation("UserBook"); + + b.Navigation("UserChapter"); + }); + + 
modelBuilder.Entity("Domain.Entities.IngestionJob", b => + { + b.HasOne("Domain.Entities.BookFile", "BookFile") + .WithMany("IngestionJobs") + .HasForeignKey("BookFileId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_ingestion_jobs_book_files_book_file_id"); + + b.HasOne("Domain.Entities.Edition", "Edition") + .WithMany("IngestionJobs") + .HasForeignKey("EditionId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_ingestion_jobs_editions_edition_id"); + + b.HasOne("Domain.Entities.Edition", "SourceEdition") + .WithMany() + .HasForeignKey("SourceEditionId") + .OnDelete(DeleteBehavior.SetNull) + .HasConstraintName("fk_ingestion_jobs_editions_source_edition_id"); + + b.HasOne("Domain.Entities.Work", "Work") + .WithMany() + .HasForeignKey("WorkId") + .OnDelete(DeleteBehavior.SetNull) + .HasConstraintName("fk_ingestion_jobs_works_work_id"); + + b.Navigation("BookFile"); + + b.Navigation("Edition"); + + b.Navigation("SourceEdition"); + + b.Navigation("Work"); + }); + + modelBuilder.Entity("Domain.Entities.LintResult", b => + { + b.HasOne("Domain.Entities.Edition", "Edition") + .WithMany() + .HasForeignKey("EditionId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_lint_results_editions_edition_id"); + + b.Navigation("Edition"); + }); + + modelBuilder.Entity("Domain.Entities.Note", b => + { + b.HasOne("Domain.Entities.Chapter", "Chapter") + .WithMany("Notes") + .HasForeignKey("ChapterId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_notes_chapters_chapter_id"); + + b.HasOne("Domain.Entities.Edition", "Edition") + .WithMany() + .HasForeignKey("EditionId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_notes_editions_edition_id"); + + b.HasOne("Domain.Entities.Highlight", "Highlight") + .WithOne("Note") + .HasForeignKey("Domain.Entities.Note", "HighlightId") + .OnDelete(DeleteBehavior.SetNull) + 
.HasConstraintName("fk_notes_highlights_highlight_id"); + + b.HasOne("Domain.Entities.Site", "Site") + .WithMany() + .HasForeignKey("SiteId") + .OnDelete(DeleteBehavior.Restrict) + .IsRequired() + .HasConstraintName("fk_notes_sites_site_id"); + + b.HasOne("Domain.Entities.User", "User") + .WithMany("Notes") + .HasForeignKey("UserId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_notes_users_user_id"); + + b.Navigation("Chapter"); + + b.Navigation("Edition"); + + b.Navigation("Highlight"); + + b.Navigation("Site"); + + b.Navigation("User"); + }); + + modelBuilder.Entity("Domain.Entities.PasswordResetToken", b => + { + b.HasOne("Domain.Entities.User", "User") + .WithMany() + .HasForeignKey("UserId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_password_reset_tokens_users_user_id"); + + b.Navigation("User"); + }); + + modelBuilder.Entity("Domain.Entities.PendingVocabularyWord", b => + { + b.HasOne("Domain.Entities.Chapter", "Chapter") + .WithMany() + .HasForeignKey("ChapterId") + .OnDelete(DeleteBehavior.SetNull) + .HasConstraintName("fk_pending_vocabulary_words_chapters_chapter_id"); + + b.HasOne("Domain.Entities.Edition", "Edition") + .WithMany() + .HasForeignKey("EditionId") + .OnDelete(DeleteBehavior.SetNull) + .HasConstraintName("fk_pending_vocabulary_words_editions_edition_id"); + + b.HasOne("Domain.Entities.Site", "Site") + .WithMany() + .HasForeignKey("SiteId") + .OnDelete(DeleteBehavior.Restrict) + .IsRequired() + .HasConstraintName("fk_pending_vocabulary_words_sites_site_id"); + + b.HasOne("Domain.Entities.UserBook", "UserBook") + .WithMany() + .HasForeignKey("UserBookId") + .OnDelete(DeleteBehavior.SetNull) + .HasConstraintName("fk_pending_vocabulary_words_user_books_user_book_id"); + + b.HasOne("Domain.Entities.User", "User") + .WithMany() + .HasForeignKey("UserId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_pending_vocabulary_words_users_user_id"); + + 
b.Navigation("Chapter"); + + b.Navigation("Edition"); + + b.Navigation("Site"); + + b.Navigation("User"); + + b.Navigation("UserBook"); + }); + + modelBuilder.Entity("Domain.Entities.ReadingGoal", b => + { + b.HasOne("Domain.Entities.Site", "Site") + .WithMany() + .HasForeignKey("SiteId") + .OnDelete(DeleteBehavior.Restrict) + .IsRequired() + .HasConstraintName("fk_reading_goals_sites_site_id"); + + b.HasOne("Domain.Entities.User", "User") + .WithMany() + .HasForeignKey("UserId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_reading_goals_users_user_id"); + + b.Navigation("Site"); + + b.Navigation("User"); + }); + + modelBuilder.Entity("Domain.Entities.ReadingProgress", b => + { + b.HasOne("Domain.Entities.Chapter", "Chapter") + .WithMany("ReadingProgresses") + .HasForeignKey("ChapterId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_reading_progresses_chapters_chapter_id"); + + b.HasOne("Domain.Entities.Edition", "Edition") + .WithMany() + .HasForeignKey("EditionId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_reading_progresses_editions_edition_id"); + + b.HasOne("Domain.Entities.Site", "Site") + .WithMany() + .HasForeignKey("SiteId") + .OnDelete(DeleteBehavior.Restrict) + .IsRequired() + .HasConstraintName("fk_reading_progresses_sites_site_id"); + + b.HasOne("Domain.Entities.User", "User") + .WithMany("ReadingProgresses") + .HasForeignKey("UserId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_reading_progresses_users_user_id"); + + b.Navigation("Chapter"); + + b.Navigation("Edition"); + + b.Navigation("Site"); + + b.Navigation("User"); + }); + + modelBuilder.Entity("Domain.Entities.ReadingSession", b => + { + b.HasOne("Domain.Entities.Edition", "Edition") + .WithMany() + .HasForeignKey("EditionId") + .OnDelete(DeleteBehavior.SetNull) + .HasConstraintName("fk_reading_sessions_editions_edition_id"); + + b.HasOne("Domain.Entities.Site", 
"Site") + .WithMany() + .HasForeignKey("SiteId") + .OnDelete(DeleteBehavior.Restrict) + .IsRequired() + .HasConstraintName("fk_reading_sessions_sites_site_id"); + + b.HasOne("Domain.Entities.UserBook", "UserBook") + .WithMany() + .HasForeignKey("UserBookId") + .OnDelete(DeleteBehavior.SetNull) + .HasConstraintName("fk_reading_sessions_user_books_user_book_id"); + + b.HasOne("Domain.Entities.User", "User") + .WithMany() + .HasForeignKey("UserId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_reading_sessions_users_user_id"); + + b.Navigation("Edition"); + + b.Navigation("Site"); + + b.Navigation("User"); + + b.Navigation("UserBook"); + }); + + modelBuilder.Entity("Domain.Entities.SiteDomain", b => + { + b.HasOne("Domain.Entities.Site", "Site") + .WithMany("Domains") + .HasForeignKey("SiteId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_site_domains_sites_site_id"); + + b.Navigation("Site"); + }); + + modelBuilder.Entity("Domain.Entities.SsgRebuildJob", b => + { + b.HasOne("Domain.Entities.Site", "Site") + .WithMany() + .HasForeignKey("SiteId") + .OnDelete(DeleteBehavior.Restrict) + .IsRequired() + .HasConstraintName("fk_ssg_rebuild_jobs_sites_site_id"); + + b.Navigation("Site"); + }); + + modelBuilder.Entity("Domain.Entities.SsgRebuildResult", b => + { + b.HasOne("Domain.Entities.SsgRebuildJob", "Job") + .WithMany("Results") + .HasForeignKey("JobId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_ssg_rebuild_results_ssg_rebuild_jobs_job_id"); + + b.Navigation("Job"); + }); + + modelBuilder.Entity("Domain.Entities.TextStackImport", b => + { + b.HasOne("Domain.Entities.Edition", "Edition") + .WithMany() + .HasForeignKey("EditionId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_text_stack_imports_editions_edition_id"); + + b.HasOne("Domain.Entities.Site", "Site") + .WithMany() + .HasForeignKey("SiteId") + .OnDelete(DeleteBehavior.Restrict) + 
.IsRequired() + .HasConstraintName("fk_text_stack_imports_sites_site_id"); + + b.Navigation("Edition"); + + b.Navigation("Site"); + }); + + modelBuilder.Entity("Domain.Entities.UserAchievement", b => + { + b.HasOne("Domain.Entities.Site", "Site") + .WithMany() + .HasForeignKey("SiteId") + .OnDelete(DeleteBehavior.Restrict) + .IsRequired() + .HasConstraintName("fk_user_achievements_sites_site_id"); + + b.HasOne("Domain.Entities.User", "User") + .WithMany() + .HasForeignKey("UserId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_user_achievements_users_user_id"); + + b.Navigation("Site"); + + b.Navigation("User"); + }); + + modelBuilder.Entity("Domain.Entities.UserBook", b => + { + b.HasOne("Domain.Entities.User", "User") + .WithMany("UserBooks") + .HasForeignKey("UserId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_user_books_users_user_id"); + + b.Navigation("User"); + }); + + modelBuilder.Entity("Domain.Entities.UserBookBookmark", b => + { + b.HasOne("Domain.Entities.UserChapter", "Chapter") + .WithMany() + .HasForeignKey("ChapterId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_user_book_bookmarks_user_chapters_chapter_id"); + + b.HasOne("Domain.Entities.UserBook", "UserBook") + .WithMany() + .HasForeignKey("UserBookId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_user_book_bookmarks_user_books_user_book_id"); + + b.Navigation("Chapter"); + + b.Navigation("UserBook"); + }); + + modelBuilder.Entity("Domain.Entities.UserBookFile", b => + { + b.HasOne("Domain.Entities.UserBook", "UserBook") + .WithMany("BookFiles") + .HasForeignKey("UserBookId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_user_book_files_user_books_user_book_id"); + + b.Navigation("UserBook"); + }); + + modelBuilder.Entity("Domain.Entities.UserChapter", b => + { + b.HasOne("Domain.Entities.UserBook", "UserBook") + .WithMany("Chapters") + 
.HasForeignKey("UserBookId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_user_chapters_user_books_user_book_id"); + + b.Navigation("UserBook"); + }); + + modelBuilder.Entity("Domain.Entities.UserIngestionJob", b => + { + b.HasOne("Domain.Entities.UserBookFile", "UserBookFile") + .WithMany() + .HasForeignKey("UserBookFileId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_user_ingestion_jobs_user_book_files_user_book_file_id"); + + b.HasOne("Domain.Entities.UserBook", "UserBook") + .WithMany("IngestionJobs") + .HasForeignKey("UserBookId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_user_ingestion_jobs_user_books_user_book_id"); + + b.Navigation("UserBook"); + + b.Navigation("UserBookFile"); + }); + + modelBuilder.Entity("Domain.Entities.UserLibrary", b => + { + b.HasOne("Domain.Entities.Edition", "Edition") + .WithMany() + .HasForeignKey("EditionId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_user_libraries_editions_edition_id"); + + b.HasOne("Domain.Entities.User", "User") + .WithMany("UserLibraries") + .HasForeignKey("UserId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_user_libraries_users_user_id"); + + b.Navigation("Edition"); + + b.Navigation("User"); + }); + + modelBuilder.Entity("Domain.Entities.UserRefreshToken", b => + { + b.HasOne("Domain.Entities.User", "User") + .WithMany() + .HasForeignKey("UserId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_user_refresh_tokens_users_user_id"); + + b.Navigation("User"); + }); + + modelBuilder.Entity("Domain.Entities.UserVocabularySettings", b => + { + b.HasOne("Domain.Entities.Site", "Site") + .WithMany() + .HasForeignKey("SiteId") + .OnDelete(DeleteBehavior.Restrict) + .IsRequired() + .HasConstraintName("fk_user_vocabulary_settings_sites_site_id"); + + b.HasOne("Domain.Entities.User", "User") + .WithMany() + 
.HasForeignKey("UserId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_user_vocabulary_settings_users_user_id"); + + b.Navigation("Site"); + + b.Navigation("User"); + }); + + modelBuilder.Entity("Domain.Entities.VocabularyReview", b => + { + b.HasOne("Domain.Entities.Site", "Site") + .WithMany() + .HasForeignKey("SiteId") + .OnDelete(DeleteBehavior.Restrict) + .IsRequired() + .HasConstraintName("fk_vocabulary_reviews_sites_site_id"); + + b.HasOne("Domain.Entities.User", "User") + .WithMany() + .HasForeignKey("UserId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_vocabulary_reviews_users_user_id"); + + b.HasOne("Domain.Entities.VocabularyWord", "VocabularyWord") + .WithMany("Reviews") + .HasForeignKey("VocabularyWordId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_vocabulary_reviews_vocabulary_words_vocabulary_word_id"); + + b.Navigation("Site"); + + b.Navigation("User"); + + b.Navigation("VocabularyWord"); + }); + + modelBuilder.Entity("Domain.Entities.VocabularyWord", b => + { + b.HasOne("Domain.Entities.Chapter", "Chapter") + .WithMany() + .HasForeignKey("ChapterId") + .OnDelete(DeleteBehavior.SetNull) + .HasConstraintName("fk_vocabulary_words_chapters_chapter_id"); + + b.HasOne("Domain.Entities.WordCluster", null) + .WithMany("Words") + .HasForeignKey("ClusterId") + .OnDelete(DeleteBehavior.SetNull) + .HasConstraintName("fk_vocabulary_words_word_clusters_cluster_id"); + + b.HasOne("Domain.Entities.Edition", "Edition") + .WithMany() + .HasForeignKey("EditionId") + .OnDelete(DeleteBehavior.SetNull) + .HasConstraintName("fk_vocabulary_words_editions_edition_id"); + + b.HasOne("Domain.Entities.Site", "Site") + .WithMany() + .HasForeignKey("SiteId") + .OnDelete(DeleteBehavior.Restrict) + .IsRequired() + .HasConstraintName("fk_vocabulary_words_sites_site_id"); + + b.HasOne("Domain.Entities.UserBook", "UserBook") + .WithMany() + .HasForeignKey("UserBookId") + 
.OnDelete(DeleteBehavior.SetNull) + .HasConstraintName("fk_vocabulary_words_user_books_user_book_id"); + + b.HasOne("Domain.Entities.User", "User") + .WithMany() + .HasForeignKey("UserId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_vocabulary_words_users_user_id"); + + b.Navigation("Chapter"); + + b.Navigation("Edition"); + + b.Navigation("Site"); + + b.Navigation("User"); + + b.Navigation("UserBook"); + }); + + modelBuilder.Entity("Domain.Entities.WordCluster", b => + { + b.HasOne("Domain.Entities.Edition", "Edition") + .WithMany() + .HasForeignKey("EditionId") + .OnDelete(DeleteBehavior.SetNull) + .HasConstraintName("fk_word_clusters_editions_edition_id"); + + b.HasOne("Domain.Entities.Site", "Site") + .WithMany() + .HasForeignKey("SiteId") + .OnDelete(DeleteBehavior.Restrict) + .IsRequired() + .HasConstraintName("fk_word_clusters_sites_site_id"); + + b.HasOne("Domain.Entities.UserBook", "UserBook") + .WithMany() + .HasForeignKey("UserBookId") + .OnDelete(DeleteBehavior.SetNull) + .HasConstraintName("fk_word_clusters_user_books_user_book_id"); + + b.HasOne("Domain.Entities.User", "User") + .WithMany() + .HasForeignKey("UserId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_word_clusters_users_user_id"); + + b.Navigation("Edition"); + + b.Navigation("Site"); + + b.Navigation("User"); + + b.Navigation("UserBook"); + }); + + modelBuilder.Entity("Domain.Entities.WordLookup", b => + { + b.HasOne("Domain.Entities.Chapter", "Chapter") + .WithMany() + .HasForeignKey("ChapterId") + .OnDelete(DeleteBehavior.SetNull) + .HasConstraintName("fk_word_lookups_chapters_chapter_id"); + + b.HasOne("Domain.Entities.Edition", "Edition") + .WithMany() + .HasForeignKey("EditionId") + .OnDelete(DeleteBehavior.SetNull) + .HasConstraintName("fk_word_lookups_editions_edition_id"); + + b.HasOne("Domain.Entities.Site", "Site") + .WithMany() + .HasForeignKey("SiteId") + .OnDelete(DeleteBehavior.Restrict) + .IsRequired() + 
.HasConstraintName("fk_word_lookups_sites_site_id"); + + b.HasOne("Domain.Entities.UserBook", "UserBook") + .WithMany() + .HasForeignKey("UserBookId") + .OnDelete(DeleteBehavior.SetNull) + .HasConstraintName("fk_word_lookups_user_books_user_book_id"); + + b.HasOne("Domain.Entities.User", "User") + .WithMany() + .HasForeignKey("UserId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_word_lookups_users_user_id"); + + b.Navigation("Chapter"); + + b.Navigation("Edition"); + + b.Navigation("Site"); + + b.Navigation("User"); + + b.Navigation("UserBook"); + }); + + modelBuilder.Entity("Domain.Entities.Work", b => + { + b.HasOne("Domain.Entities.Site", "Site") + .WithMany("Works") + .HasForeignKey("SiteId") + .OnDelete(DeleteBehavior.Restrict) + .IsRequired() + .HasConstraintName("fk_works_sites_site_id"); + + b.Navigation("Site"); + }); + + modelBuilder.Entity("edition_genres", b => + { + b.HasOne("Domain.Entities.Edition", null) + .WithMany() + .HasForeignKey("EditionsId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_edition_genres_editions_editions_id"); + + b.HasOne("Domain.Entities.Genre", null) + .WithMany() + .HasForeignKey("GenresId") + .OnDelete(DeleteBehavior.Cascade) + .IsRequired() + .HasConstraintName("fk_edition_genres_genres_genres_id"); + }); + + modelBuilder.Entity("Domain.Entities.AdminUser", b => + { + b.Navigation("RefreshTokens"); + }); + + modelBuilder.Entity("Domain.Entities.Author", b => + { + b.Navigation("EditionAuthors"); + }); + + modelBuilder.Entity("Domain.Entities.BookFile", b => + { + b.Navigation("IngestionJobs"); + }); + + modelBuilder.Entity("Domain.Entities.Chapter", b => + { + b.Navigation("Bookmarks"); + + b.Navigation("Notes"); + + b.Navigation("ReadingProgresses"); + }); + + modelBuilder.Entity("Domain.Entities.Collection", b => + { + b.Navigation("Books"); + }); + + modelBuilder.Entity("Domain.Entities.Edition", b => + { + b.Navigation("Assets"); + + 
b.Navigation("BookFiles"); + + b.Navigation("Chapters"); + + b.Navigation("EditionAuthors"); + + b.Navigation("IngestionJobs"); + + b.Navigation("TranslatedEditions"); + }); + + modelBuilder.Entity("Domain.Entities.Highlight", b => + { + b.Navigation("Note"); + }); + + modelBuilder.Entity("Domain.Entities.Site", b => + { + b.Navigation("Domains"); + + b.Navigation("Works"); + }); + + modelBuilder.Entity("Domain.Entities.SsgRebuildJob", b => + { + b.Navigation("Results"); + }); + + modelBuilder.Entity("Domain.Entities.User", b => + { + b.Navigation("Bookmarks"); + + b.Navigation("Highlights"); + + b.Navigation("Notes"); + + b.Navigation("ReadingProgresses"); + + b.Navigation("UserBooks"); + + b.Navigation("UserLibraries"); + }); + + modelBuilder.Entity("Domain.Entities.UserBook", b => + { + b.Navigation("BookFiles"); + + b.Navigation("Chapters"); + + b.Navigation("IngestionJobs"); + }); + + modelBuilder.Entity("Domain.Entities.VocabularyWord", b => + { + b.Navigation("Reviews"); + }); + + modelBuilder.Entity("Domain.Entities.WordCluster", b => + { + b.Navigation("Words"); + }); + + modelBuilder.Entity("Domain.Entities.Work", b => + { + b.Navigation("Editions"); + }); +#pragma warning restore 612, 618 + } + } +} diff --git a/backend/src/Infrastructure/Migrations/20260522170250_AddChapterContentQualityScore.cs b/backend/src/Infrastructure/Migrations/20260522170250_AddChapterContentQualityScore.cs new file mode 100644 index 00000000..4d34dc53 --- /dev/null +++ b/backend/src/Infrastructure/Migrations/20260522170250_AddChapterContentQualityScore.cs @@ -0,0 +1,68 @@ +using Microsoft.EntityFrameworkCore.Migrations; + +#nullable disable + +namespace Infrastructure.Migrations +{ + /// <inheritdoc /> + public partial class AddChapterContentQualityScore : Migration + { + /// <inheritdoc /> + protected override void Up(MigrationBuilder migrationBuilder) + { + migrationBuilder.AddColumn<int>( + name: "content_quality_score", + table: "user_chapters", + type: "integer", + nullable: true); + +
migrationBuilder.AddColumn<int>( + name: "content_quality_score", + table: "chapters", + type: "integer", + nullable: true); + + migrationBuilder.AddColumn<int>( + name: "content_chapters_cleaned", + table: "book_quality_jobs", + type: "integer", + nullable: true); + + migrationBuilder.AddColumn<int>( + name: "content_chapters_rejected", + table: "book_quality_jobs", + type: "integer", + nullable: true); + + migrationBuilder.AddColumn<int>( + name: "content_chapters_skipped", + table: "book_quality_jobs", + type: "integer", + nullable: true); + } + + /// <inheritdoc /> + protected override void Down(MigrationBuilder migrationBuilder) + { + migrationBuilder.DropColumn( + name: "content_quality_score", + table: "user_chapters"); + + migrationBuilder.DropColumn( + name: "content_quality_score", + table: "chapters"); + + migrationBuilder.DropColumn( + name: "content_chapters_cleaned", + table: "book_quality_jobs"); + + migrationBuilder.DropColumn( + name: "content_chapters_rejected", + table: "book_quality_jobs"); + + migrationBuilder.DropColumn( + name: "content_chapters_skipped", + table: "book_quality_jobs"); + } + } +} diff --git a/backend/src/Infrastructure/Migrations/AppDbContextModelSnapshot.cs b/backend/src/Infrastructure/Migrations/AppDbContextModelSnapshot.cs index b0917ce5..0ddf82c4 100644 --- a/backend/src/Infrastructure/Migrations/AppDbContextModelSnapshot.cs +++ b/backend/src/Infrastructure/Migrations/AppDbContextModelSnapshot.cs @@ -18,7 +18,7 @@ protected override void BuildModel(ModelBuilder modelBuilder) { #pragma warning disable 612, 618 modelBuilder - .HasAnnotation("ProductVersion", "10.0.0") + .HasAnnotation("ProductVersion", "10.0.8") .HasAnnotation("Relational:MaxIdentifierLength", 63); NpgsqlModelBuilderExtensions.UseIdentityByDefaultColumns(modelBuilder); @@ -418,6 +418,18 @@ protected override void BuildModel(ModelBuilder modelBuilder) .HasColumnType("uuid") .HasColumnName("id"); + b.Property<int?>("ContentChaptersCleaned") + .HasColumnType("integer") +
.HasColumnName("content_chapters_cleaned"); + + b.Property<int?>("ContentChaptersRejected") + .HasColumnType("integer") + .HasColumnName("content_chapters_rejected"); + + b.Property<int?>("ContentChaptersSkipped") + .HasColumnType("integer") + .HasColumnName("content_chapters_skipped"); + b.Property<DateTimeOffset>("CreatedAt") .HasColumnType("timestamp with time zone") .HasColumnName("created_at"); @@ -542,6 +554,10 @@ protected override void BuildModel(ModelBuilder modelBuilder) .HasColumnType("integer") .HasColumnName("chapter_number"); + b.Property<int?>("ContentQualityScore") + .HasColumnType("integer") + .HasColumnName("content_quality_score"); + b.Property<DateTimeOffset>("CreatedAt") .HasColumnType("timestamp with time zone") .HasColumnName("created_at"); @@ -2454,6 +2470,10 @@ protected override void BuildModel(ModelBuilder modelBuilder) .HasColumnType("integer") .HasColumnName("chapter_number"); + b.Property<int?>("ContentQualityScore") + .HasColumnType("integer") + .HasColumnName("content_quality_score"); + b.Property<DateTimeOffset>("CreatedAt") .HasColumnType("timestamp with time zone") .HasColumnName("created_at"); diff --git a/backend/src/Worker/Services/UserIngestionService.cs b/backend/src/Worker/Services/UserIngestionService.cs index cf0a3f06..dbac68f9 100644 --- a/backend/src/Worker/Services/UserIngestionService.cs +++ b/backend/src/Worker/Services/UserIngestionService.cs @@ -8,6 +8,7 @@ using Microsoft.Extensions.Logging; using TextStack.Extraction.Contracts; using TextStack.Extraction.Enums; +using TextStack.Extraction.Quality; using TextStack.Extraction.Registry; namespace Worker.Services; @@ -188,7 +189,8 @@ public async Task ProcessJobAsync(Guid jobId, CancellationToken ct) // Create chapters foreach (var unit in result.Units) { - var html = Application.Common.ImageProcessingHelper.RewriteImageSrcs(unit.Html ?? string.Empty, imageMap); + var html = SanitizeText( + Application.Common.ImageProcessingHelper.RewriteImageSrcs(unit.Html ?? string.Empty, imageMap)); var chapterTitle = SanitizeText(unit.Title ?? 
$"Chapter {unit.OrderIndex + 1}"); var chapter = new UserChapter { @@ -197,9 +199,10 @@ public async Task ProcessJobAsync(Guid jobId, CancellationToken ct) ChapterNumber = unit.OrderIndex + 1, Slug = SlugGenerator.GenerateChapterSlug(chapterTitle, unit.OrderIndex), Title = chapterTitle, - Html = SanitizeText(html), + Html = html, PlainText = SanitizeText(unit.PlainText), WordCount = unit.WordCount, + ContentQualityScore = ChapterContentQualityAnalyzer.Analyze(html).Score, CreatedAt = DateTimeOffset.UtcNow }; db.UserChapters.Add(chapter); From ef5c6e1056e66aa0e110c14052ebb781e8b9026f Mon Sep 17 00:00:00 2001 From: Vasyl Vdovychenko Date: Fri, 22 May 2026 13:13:26 -0400 Subject: [PATCH 4/7] =?UTF-8?q?fix(test):=20SsgRouteProvider=20=E2=80=94?= =?UTF-8?q?=20test=20matched=20pre-noindex-fix=20behavior?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit GetRoutesAsync_ExcludesNonIndexableContent asserted non-indexable books are dropped from SSG routes. But AddBookRoutesAsync intentionally routes every Published edition (renderer emits noindex meta; filtering would 404 the slug) — the test was never updated when that fix landed. 
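The routing rule described above can be sketched as follows. This is an illustrative Python version, not the project's C#: `Entry`, `build_routes`, and the `/books/`, `/authors/`, `/genres/` URL prefixes are hypothetical stand-ins, and only the filtering behaviour is taken from this message.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for an edition, author, or genre row."""
    slug: str
    indexable: bool
    published: bool = True


def build_routes(
    editions: Iterable[Entry],
    authors: Iterable[Entry],
    genres: Iterable[Entry],
) -> List[Tuple[str, str]]:
    """Return (route_type, path) pairs; the paths are made up for illustration."""
    routes: List[Tuple[str, str]] = []
    # Book pages: every published edition gets a route, indexable or not.
    # Search visibility is left to the renderer's noindex meta tag.
    routes += [("book", f"/books/{e.slug}") for e in editions if e.published]
    # Listing pages: entries hidden from listings are dropped here.
    routes += [("author", f"/authors/{a.slug}") for a in authors if a.indexable]
    routes += [("genre", f"/genres/{g.slug}") for g in genres if g.indexable]
    return routes
```

Under these assumptions a non-indexable book still yields a book route, while a non-indexable author or genre yields none, which is the split the two tests below assert.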
Split into two tests reflecting actual design: - NonIndexableBook_StillRouted — book route emitted regardless of Indexable - ExcludesNonIndexableAuthorsAndGenres — author/genre listings DO filter it Co-Authored-By: Claude Opus 4.7 (1M context) --- .../SsgRebuild/SsgRouteProviderTests.cs | 44 ++++++++++++++++--- 1 file changed, 38 insertions(+), 6 deletions(-) diff --git a/tests/TextStack.UnitTests/SsgRebuild/SsgRouteProviderTests.cs b/tests/TextStack.UnitTests/SsgRebuild/SsgRouteProviderTests.cs index 13835068..8bb0850a 100644 --- a/tests/TextStack.UnitTests/SsgRebuild/SsgRouteProviderTests.cs +++ b/tests/TextStack.UnitTests/SsgRebuild/SsgRouteProviderTests.cs @@ -72,9 +72,12 @@ public async Task GetRoutesAsync_SpecificMode_OnlyReturnsSpecifiedBooks() } [Fact] - public async Task GetRoutesAsync_ExcludesNonIndexableContent() + public async Task GetRoutesAsync_NonIndexableBook_StillRouted() { - // Arrange + // Book detail pages are rendered for every Published edition regardless + // of Indexable — the renderer emits + // instead. Filtering here would leave nginx serving a hard 404 for the + // slug, so AddBookRoutesAsync intentionally does NOT filter on Indexable. 
var site = CreateSite(); var editions = CreateEditions([ ("indexable-book", "Indexable Book", true, EditionStatus.Published), @@ -83,14 +86,43 @@ public async Task GetRoutesAsync_ExcludesNonIndexableContent() SetupMockDbSets(site, editions, new List().AsQueryable(), new List().AsQueryable()); - // Act var routes = await _provider.GetRoutesAsync( TestSiteId, SsgRebuildMode.Full, null, null, null, CancellationToken.None); - // Assert var bookRoutes = routes.Where(r => r.RouteType == "book").ToList(); - Assert.Single(bookRoutes); - Assert.Contains(bookRoutes, r => r.Route.Contains("indexable-book")); + Assert.Equal(2, bookRoutes.Count); + Assert.Contains(bookRoutes, r => r.Route.Contains("non-indexable-book")); + } + + [Fact] + public async Task GetRoutesAsync_ExcludesNonIndexableAuthorsAndGenres() + { + // Author/genre listing pages — unlike book detail pages — ARE filtered + // by Indexable (the admin "hide from listings" override). + var site = CreateSite(); + var editions = CreateEditions([("book-1", "Book 1", true, EditionStatus.Published)]); + var edition = editions.First(); + var authors = CreateAuthors([ + ("shown-author", "Shown Author", true), + ("hidden-author", "Hidden Author", false) + ], edition); + var genres = CreateGenres([ + ("shown-genre", "Shown Genre", true), + ("hidden-genre", "Hidden Genre", false) + ], edition); + + SetupMockDbSets(site, editions, authors, genres); + + var routes = await _provider.GetRoutesAsync( + TestSiteId, SsgRebuildMode.Full, null, null, null, CancellationToken.None); + + var authorRoutes = routes.Where(r => r.RouteType == "author").ToList(); + Assert.Single(authorRoutes); + Assert.Contains(authorRoutes, r => r.Route.Contains("shown-author")); + + var genreRoutes = routes.Where(r => r.RouteType == "genre").ToList(); + Assert.Single(genreRoutes); + Assert.Contains(genreRoutes, r => r.Route.Contains("shown-genre")); } [Fact] From 637dad9209dea01017769105d5a41af71a4695ee Mon Sep 17 00:00:00 2001 From: Vasyl Vdovychenko Date: 
Fri, 22 May 2026 13:19:30 -0400 Subject: [PATCH 5/7] feat(pdf-quality) [slice 3]: Claude content-cleanup phase MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit quality-poll.sh Phase 3 — after structure fixes, for each chapter scoring below CONTENT_QUALITY_THRESHOLD: fetch HTML → Claude CLI (fix structure, preserve content verbatim) → pdf-cleanup-gate.py (word-multiset diff: reject hallucination / over-deletion) → PUT cleaned HTML → log (messy→clean) pair. - pdf-cleanup-gate.py: deterministic preservation gate, stdlib-only. Joins line-wrap hyphens before tokenizing so legit merges don't read as new words. 3% novel-token tolerance, 70% retention floor. - InternalEndpoints: UpdateQualityJobRequest + handler carry the three ContentChapters* counters; GetQualityJob returns them. - Off by default — CONTENT_CLEANUP_ENABLED in .env gates the whole phase. Pairs land in data/pdf-cleanup-dataset/ for the slice-5 ratchet. Co-Authored-By: Claude Opus 4.7 (1M context) --- .env.example | 6 + .../src/Api/Endpoints/InternalEndpoints.cs | 9 + infra/scripts/pdf-cleanup-gate.py | 87 +++++++++ infra/scripts/quality-poll.sh | 173 +++++++++++++++++- 4 files changed, 272 insertions(+), 3 deletions(-) create mode 100755 infra/scripts/pdf-cleanup-gate.py diff --git a/.env.example b/.env.example index 3952e99b..660df6df 100644 --- a/.env.example +++ b/.env.example @@ -39,3 +39,9 @@ RESEND_API_KEY= RESEND_FROM_EMAIL=noreply@textstack.app # Where SEO backfill / ops failure alerts go. Empty = alerts disabled. ADMIN_ALERT_EMAIL= + +# PDF content cleanup (quality-poll.sh Phase 3 — feat-0007). +# When true, chapters scoring below the threshold get an LLM cleanup pass +# via Claude CLI. Off by default — leaves the poller at structure-only. 
+CONTENT_CLEANUP_ENABLED=false +CONTENT_QUALITY_THRESHOLD=60 diff --git a/backend/src/Api/Endpoints/InternalEndpoints.cs b/backend/src/Api/Endpoints/InternalEndpoints.cs index 8582a58b..1c7c87a0 100644 --- a/backend/src/Api/Endpoints/InternalEndpoints.cs +++ b/backend/src/Api/Endpoints/InternalEndpoints.cs @@ -435,6 +435,9 @@ private static async Task GetQualityJob( job.IssuesJson, job.IssuesFound, job.IssuesFixed, + job.ContentChaptersCleaned, + job.ContentChaptersRejected, + job.ContentChaptersSkipped, job.Error, job.LogOutput, job.CreatedAt, @@ -458,6 +461,9 @@ private static async Task UpdateQualityJob( if (req.IssuesFixed.HasValue) job.IssuesFixed = req.IssuesFixed; if (req.Error is not null) job.Error = req.Error; if (req.LogOutput is not null) job.LogOutput = req.LogOutput; + if (req.ContentChaptersCleaned.HasValue) job.ContentChaptersCleaned = req.ContentChaptersCleaned; + if (req.ContentChaptersRejected.HasValue) job.ContentChaptersRejected = req.ContentChaptersRejected; + if (req.ContentChaptersSkipped.HasValue) job.ContentChaptersSkipped = req.ContentChaptersSkipped; if (req.SetStartedAt) job.StartedAt = DateTimeOffset.UtcNow; if (req.SetFinishedAt) job.FinishedAt = DateTimeOffset.UtcNow; @@ -506,5 +512,8 @@ public record UpdateQualityJobRequest( int? IssuesFixed = null, string? Error = null, string? LogOutput = null, + int? ContentChaptersCleaned = null, + int? ContentChaptersRejected = null, + int? ContentChaptersSkipped = null, bool SetStartedAt = false, bool SetFinishedAt = false); diff --git a/infra/scripts/pdf-cleanup-gate.py b/infra/scripts/pdf-cleanup-gate.py new file mode 100755 index 00000000..7c783d4d --- /dev/null +++ b/infra/scripts/pdf-cleanup-gate.py @@ -0,0 +1,87 @@ +#!/usr/bin/env python3 +"""Preservation gate for the PDF content-cleanup pass (quality-poll.sh Phase 3). + +The LLM is asked to fix *structure* only — never to reword, summarize, or add +content. 
This gate verifies that deterministically before the cleaned HTML is +allowed to overwrite the original. + +Usage: + pdf-cleanup-gate.py +Prints exactly one line: + ACCEPT + REJECT: + +Checks: + * hallucination — tokens present in the cleaned text but not the original + (beyond a 3% tolerance for hyphen-merge / punctuation edge cases) + * over-deletion — cleaned kept < 70% of the original word count + (running headers / page numbers are a few %; 30% headroom is generous) + +Line-wrap hyphens are joined before tokenizing, so a legitimate +"chal lenges" -> "challenges" merge is not seen as a new word. +Stdlib only. +""" +import html +import re +import sys +from collections import Counter + +# A hyphen (ASCII, U+2010, U+00AD soft, U+2011 non-breaking) followed by +# whitespace = a line-wrap split. Remove both to rejoin the word. +_HYPHEN_WRAP = re.compile(r"[-‐­‑]\s+") +_TAG = re.compile(r"<[^>]+>") +_WORD = re.compile(r"[^\W_]+", re.UNICODE) + +# Tolerances. +HALLUCINATION_MAX = 0.03 # ≤3% novel tokens allowed +RETENTION_MIN = 0.70 # must keep ≥70% of original tokens + + +def tokenize(raw: str) -> Counter: + text = html.unescape(_TAG.sub(" ", raw)).lower() + text = _HYPHEN_WRAP.sub("", text) + return Counter(_WORD.findall(text)) + + +def verdict(original: str, cleaned: str) -> str: + orig = tokenize(original) + clean = tokenize(cleaned) + n_orig = sum(orig.values()) + n_clean = sum(clean.values()) + + if n_clean == 0: + return "REJECT: cleaned output has no text" + if n_orig == 0: + return "ACCEPT" # nothing to preserve + + novel = sum((clean - orig).values()) + if novel / n_clean > HALLUCINATION_MAX: + return (f"REJECT: {novel}/{n_clean} novel tokens " + f"(>{HALLUCINATION_MAX:.0%}) — possible hallucination") + + if n_clean < RETENTION_MIN * n_orig: + return (f"REJECT: kept {n_clean}/{n_orig} tokens " + f"(<{RETENTION_MIN:.0%}) — over-deletion") + + return "ACCEPT" + + +def main() -> int: + if len(sys.argv) != 3: + print("REJECT: usage: pdf-cleanup-gate.py ") + 
return 2 + try: + with open(sys.argv[1], encoding="utf-8") as f: + original = f.read() + with open(sys.argv[2], encoding="utf-8") as f: + cleaned = f.read() + except OSError as exc: + print(f"REJECT: cannot read input — {exc}") + return 2 + + print(verdict(original, cleaned)) + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/infra/scripts/quality-poll.sh b/infra/scripts/quality-poll.sh index f9e0e39d..e411bebc 100755 --- a/infra/scripts/quality-poll.sh +++ b/infra/scripts/quality-poll.sh @@ -11,6 +11,14 @@ REPO_DIR="$(cd "$(dirname "$0")/../.." && pwd)" POLL_INTERVAL=30 API_BASE="http://localhost:8080" +# Phase 3 (per-chapter content cleanup). Off unless CONTENT_CLEANUP_ENABLED=true +# in .env. Chapters scoring below the threshold (ChapterContentQualityAnalyzer) +# get an LLM cleanup pass; output is verified by pdf-cleanup-gate.py. +CONTENT_CLEANUP_ENABLED="${CONTENT_CLEANUP_ENABLED:-false}" +CONTENT_QUALITY_THRESHOLD="${CONTENT_QUALITY_THRESHOLD:-60}" +DATASET_DIR="$REPO_DIR/data/pdf-cleanup-dataset" +GATE_SCRIPT="$REPO_DIR/infra/scripts/pdf-cleanup-gate.py" + # Status codes (match BookQualityJobStatus enum) STATUS_QUEUED=0 STATUS_VALIDATING=1 @@ -129,6 +137,157 @@ apply_merge() { fi } +# Overwrite a chapter's HTML (PUT recomputes plainText server-side). +# $4 is a JSON-encoded string (quotes included). +apply_content() { + local type="$1" id="$2" chapter_num="$3" esc_html="$4" + if [ "$type" = "edition" ]; then + curl -s -X PUT "$API_BASE/internal/editions/$id/chapters/$chapter_num" \ + -H "Content-Type: application/json" -d "{\"html\":$esc_html}" >/dev/null + else + curl -s -X PUT "$API_BASE/internal/user-books/$id/chapters/$chapter_num" \ + -H "Content-Type: application/json" -d "{\"html\":$esc_html}" >/dev/null + fi +} + +# Phase 3 — per-chapter content cleanup. For each chapter scoring below the +# quality threshold, ask Claude to fix structure only, verify the output with +# the preservation gate, then overwrite the chapter HTML. 
Every (messy→clean) +# pair is logged to the dataset dir for the heuristic ratchet (feat-0007). +run_content_cleanup() { + local job_id="$1" target_type="$2" target_id="$3" book_title="$4" + + if [ "$CONTENT_CLEANUP_ENABLED" != "true" ]; then + return 0 + fi + + local chapter_table id_col + if [ "$target_type" = "edition" ]; then + chapter_table="chapters"; id_col="edition_id" + else + chapter_table="user_chapters"; id_col="user_book_id" + fi + + local flagged + flagged=$(db_query "SELECT chapter_number FROM $chapter_table \ + WHERE $id_col='$target_id' AND content_quality_score IS NOT NULL \ + AND content_quality_score < $CONTENT_QUALITY_THRESHOLD ORDER BY chapter_number") + + if [ -z "$flagged" ]; then + log "Phase 3: no chapters below quality threshold ($CONTENT_QUALITY_THRESHOLD)" + return 0 + fi + + mkdir -p "$DATASET_DIR" + local cleaned=0 rejected=0 skipped=0 + + for num in $flagged; do + local content_json html + content_json=$(fetch_chapter_sample "$target_type" "$target_id" "$num") + html=$(echo "$content_json" | python3 -c \ + "import sys,json; print(json.load(sys.stdin).get('html','') or '')" 2>/dev/null) + + if [ -z "$html" ]; then + log "Phase 3: chapter $num — no HTML, skipped" + skipped=$((skipped + 1)); continue + fi + + local tmp_orig tmp_prompt tmp_out tmp_clean + tmp_orig=$(mktemp); tmp_prompt=$(mktemp); tmp_out=$(mktemp); tmp_clean=$(mktemp) + printf '%s' "$html" > "$tmp_orig" + + { + cat <<'PROMPT_HEAD' +You are a book content cleanup tool for TextStack, an online reader. +Below is one chapter's HTML, extracted from a PDF. PDF extraction leaves +structural defects. Fix ONLY structure: + +- Remove running headers/footers and stray page numbers that leaked into the + body (e.g. "
<p>Chapter 1: Introduction | 4</p>", "<p>2</p>", "<p>|</p>"). +- Rejoin paragraphs fragmented into many one or two word <p> elements. +- Merge words split by a line-wrap hyphen ("chal- lenges" -> "challenges"). +- Keep genuine headings as <h2>/<h3>, body text as <p>, lists as <ul>/<ol>. + +ABSOLUTE RULES: +- Preserve every word of real content verbatim. Do not summarize, reword, + translate, correct spelling, or add anything. +- Preserve all <img> tags exactly, src attribute unchanged. +- Preserve code / monospace content character-for-character. +- Output raw HTML only — no markdown, no code fences, no commentary. + +CHAPTER_HTML: +PROMPT_HEAD + cat "$tmp_orig" + cat <<'PROMPT_TAIL' + +Output exactly, with nothing before or after: +CLEANED_HTML_START + +CLEANED_HTML_END +PROMPT_TAIL + } > "$tmp_prompt" + + if ! timeout 300 claude -p --model claude-sonnet-4-6 --permission-mode default \ + < "$tmp_prompt" > "$tmp_out" 2>/dev/null; then + log "Phase 3: chapter $num — Claude CLI failed, skipped" + skipped=$((skipped + 1)) + rm -f "$tmp_orig" "$tmp_prompt" "$tmp_out" "$tmp_clean"; continue + fi + + # Extract between markers, drop any stray code-fence lines. + sed -n '/^CLEANED_HTML_START$/,/^CLEANED_HTML_END$/p' "$tmp_out" \ + | sed '1d;$d' | sed '/^```/d' > "$tmp_clean" + + if [ ! 
-s "$tmp_clean" ]; then + log "Phase 3: chapter $num — no cleaned HTML parsed, skipped" + skipped=$((skipped + 1)) + rm -f "$tmp_orig" "$tmp_prompt" "$tmp_out" "$tmp_clean"; continue + fi + + local gate_verdict + gate_verdict=$(python3 "$GATE_SCRIPT" "$tmp_orig" "$tmp_clean" 2>/dev/null || echo "REJECT: gate error") + + local pair_file="$DATASET_DIR/${target_id}-ch${num}-$(date +%s).json" + if [[ "$gate_verdict" == ACCEPT* ]]; then + local esc_html + esc_html=$(python3 -c "import sys,json; print(json.dumps(open(sys.argv[1],encoding='utf-8').read()))" "$tmp_clean") + apply_content "$target_type" "$target_id" "$num" "$esc_html" + cleaned=$((cleaned + 1)) + log "Phase 3: chapter $num cleaned" + log_pair "$pair_file" "$target_id" "$num" true "$tmp_orig" "$tmp_clean" "$gate_verdict" + else + rejected=$((rejected + 1)) + log "Phase 3: chapter $num rejected — $gate_verdict" + log_pair "$pair_file" "$target_id" "$num" false "$tmp_orig" "$tmp_clean" "$gate_verdict" + fi + + rm -f "$tmp_orig" "$tmp_prompt" "$tmp_out" "$tmp_clean" + done + + update_job_api "$job_id" \ + "{\"contentChaptersCleaned\":$cleaned,\"contentChaptersRejected\":$rejected,\"contentChaptersSkipped\":$skipped}" + log "Phase 3 done for $book_title: $cleaned cleaned, $rejected rejected, $skipped skipped" +} + +# Append a (messy → cleaned) pair to the dataset log — fuel for the heuristic +# ratchet. Best-effort: a logging failure must not fail the cleanup. 
+log_pair() { + local file="$1" book_id="$2" chapter="$3" accepted="$4" orig="$5" clean="$6" verdict="$7" + python3 - "$file" "$book_id" "$chapter" "$accepted" "$orig" "$clean" "$verdict" <<'PY' 2>/dev/null || true +import json, sys, time +file, book_id, chapter, accepted, orig, clean, verdict = sys.argv[1:8] +json.dump({ + "bookId": book_id, + "chapterNumber": int(chapter), + "accepted": accepted == "true", + "verdict": verdict, + "original": open(orig, encoding="utf-8").read(), + "cleaned": open(clean, encoding="utf-8").read(), + "timestamp": time.time(), +}, open(file, "w", encoding="utf-8"), ensure_ascii=False) +PY +} + process_job() { local job_id="$1" [[ "$job_id" =~ ^[0-9a-f-]{36}$ ]] || { log "Invalid job ID: $job_id"; return 1; } @@ -264,8 +423,11 @@ ISSUES_END") || { esc_issues=$(echo "$issues_json" | python3 -c "import sys,json; print(json.dumps(sys.stdin.read().strip()))" 2>/dev/null) if [ "$issues_count" = "0" ]; then - update_job_api "$job_id" "{\"status\":3,\"issuesJson\":$esc_issues,\"issuesFound\":0,\"issuesFixed\":0,\"setFinishedAt\":true}" - log "No issues found for: $book_title" + update_job_api "$job_id" "{\"status\":2,\"issuesJson\":$esc_issues,\"issuesFound\":0,\"issuesFixed\":0}" + # Structure is fine — but chapter content can still be poorly extracted. + run_content_cleanup "$job_id" "$target_type" "$target_id" "$book_title" + update_job_api "$job_id" "{\"status\":3,\"setFinishedAt\":true}" + log "No structure issues for: $book_title" return 0 fi @@ -343,13 +505,18 @@ for num, title in renames: fi done <<< "$rename_entries" + update_job_api "$job_id" "{\"issuesFixed\":$fixed_count}" + + # Phase 3 — content cleanup, after structure fixes have renumbered chapters. + run_content_cleanup "$job_id" "$target_type" "$target_id" "$book_title" + # Store log local log_text log_text="Validated $chapter_count chapters. Found $issues_count issues, fixed $fixed_count." 
local esc_log esc_log=$(echo "$log_text" | python3 -c "import sys,json; print(json.dumps(sys.stdin.read().strip()))") - update_job_api "$job_id" "{\"status\":3,\"issuesFixed\":$fixed_count,\"logOutput\":$esc_log,\"setFinishedAt\":true}" + update_job_api "$job_id" "{\"status\":3,\"logOutput\":$esc_log,\"setFinishedAt\":true}" log "Completed: $book_title — $issues_count issues found, $fixed_count fixed" } From d89209d3f0de2272436b5ea45f14fa30c8157b3b Mon Sep 17 00:00:00 2001 From: Vasyl Vdovychenko Date: Fri, 22 May 2026 14:06:14 -0400 Subject: [PATCH 6/7] feat(pdf-quality) [slice 4]: surface Phase 3 results in admin MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - AdminBookQualityEndpoints: QualityJobDetailDto carries the three ContentChapters* counters; GetJob returns them; RetryJob resets them. - Admin BookQualityPage detail panel shows "Content cleanup — cleaned / rejected / skipped" when the job ran Phase 3. - Worker logs a content-quality score distribution per book at ingest (count, avg, how many below 60) — both edition and user-book paths. 
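The per-book score distribution mentioned in the last bullet can be sketched as follows. This is an illustrative Python version rather than the worker's C#; `summarize_scores` is a hypothetical helper, the default threshold of 60 is taken from the wording above, and the truncating average is an assumption modelled on an integer cast of the mean.

```python
from typing import Optional, Sequence


def summarize_scores(scores: Sequence[int], threshold: int = 60) -> Optional[dict]:
    """Summarize per-chapter quality scores for one book (hypothetical helper)."""
    if not scores:
        return None  # no chapters ingested, so there is nothing to log
    return {
        "count": len(scores),
        # Truncate rather than round, like an integer cast of the mean.
        "avg": int(sum(scores) / len(scores)),
        # Chapters that score below the threshold.
        "below": sum(1 for s in scores if s < threshold),
    }


if __name__ == "__main__":
    print(summarize_scores([82, 47, 91, 58]))
```

A score exactly equal to the threshold is not counted as below it, matching a strict `<` comparison.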
Co-Authored-By: Claude Opus 4.7 (1M context) --- apps/admin/src/api/client.ts | 3 +++ apps/admin/src/pages/BookQualityPage.tsx | 10 ++++++++++ .../Api/Endpoints/AdminBookQualityEndpoints.cs | 5 +++++ .../Application/Ingestion/IngestionService.cs | 17 +++++++++++++++-- backend/src/Worker/Services/IngestionService.cs | 9 ++++++--- .../src/Worker/Services/UserIngestionService.cs | 13 ++++++++++++- 6 files changed, 51 insertions(+), 6 deletions(-) diff --git a/apps/admin/src/api/client.ts b/apps/admin/src/api/client.ts index bb1f2b6d..69445bda 100644 --- a/apps/admin/src/api/client.ts +++ b/apps/admin/src/api/client.ts @@ -471,6 +471,9 @@ export interface BookQualityJobListItem { export interface BookQualityJobDetail extends BookQualityJobListItem { issuesJson: string | null logOutput: string | null + contentChaptersCleaned: number | null + contentChaptersRejected: number | null + contentChaptersSkipped: number | null } export interface BookQualitySettings { diff --git a/apps/admin/src/pages/BookQualityPage.tsx b/apps/admin/src/pages/BookQualityPage.tsx index dfac4177..819c1d9a 100644 --- a/apps/admin/src/pages/BookQualityPage.tsx +++ b/apps/admin/src/pages/BookQualityPage.tsx @@ -187,6 +187,16 @@ export function BookQualityPage() {

      Issues found: {selectedJob.issuesFound} | Fixed: {selectedJob.issuesFixed ?? 0}

      )} + {(selectedJob.contentChaptersCleaned != null + || selectedJob.contentChaptersRejected != null + || selectedJob.contentChaptersSkipped != null) && ( +

      + Content cleanup — cleaned: {selectedJob.contentChaptersCleaned ?? 0} + {' | '}rejected: {selectedJob.contentChaptersRejected ?? 0} + {' | '}skipped: {selectedJob.contentChaptersSkipped ?? 0} +

      + )} + {selectedJob.issuesJson && (
      Issues: diff --git a/backend/src/Api/Endpoints/AdminBookQualityEndpoints.cs b/backend/src/Api/Endpoints/AdminBookQualityEndpoints.cs index abd43e83..6a40cd68 100644 --- a/backend/src/Api/Endpoints/AdminBookQualityEndpoints.cs +++ b/backend/src/Api/Endpoints/AdminBookQualityEndpoints.cs @@ -82,6 +82,7 @@ private static async Task GetJob(Guid id, IAppDbContext db, Cancellatio return Results.Ok(new QualityJobDetailDto( job.Id, job.EditionId, job.UserBookId, job.Status.ToString(), job.IssuesJson, job.IssuesFound, job.IssuesFixed, + job.ContentChaptersCleaned, job.ContentChaptersRejected, job.ContentChaptersSkipped, job.Error, job.LogOutput, job.CreatedAt, job.StartedAt, job.FinishedAt, job.Edition?.Title, job.UserBook?.Title @@ -134,6 +135,9 @@ private static async Task RetryJob(Guid id, IAppDbContext db, Cancellat job.IssuesJson = null; job.IssuesFound = null; job.IssuesFixed = null; + job.ContentChaptersCleaned = null; + job.ContentChaptersRejected = null; + job.ContentChaptersSkipped = null; job.StartedAt = null; job.FinishedAt = null; await db.SaveChangesAsync(ct); @@ -177,6 +181,7 @@ public record QualityJobListDto( public record QualityJobDetailDto( Guid Id, Guid? EditionId, Guid? UserBookId, string Status, string? IssuesJson, int? IssuesFound, int? IssuesFixed, + int? ContentChaptersCleaned, int? ContentChaptersRejected, int? ContentChaptersSkipped, string? Error, string? LogOutput, DateTimeOffset CreatedAt, DateTimeOffset? StartedAt, DateTimeOffset? FinishedAt, string? EditionTitle, string? 
UserBookTitle); diff --git a/backend/src/Application/Ingestion/IngestionService.cs b/backend/src/Application/Ingestion/IngestionService.cs index 2d6e058c..7ff69602 100644 --- a/backend/src/Application/Ingestion/IngestionService.cs +++ b/backend/src/Application/Ingestion/IngestionService.cs @@ -4,6 +4,7 @@ using Domain.Enums; using Domain.Utilities; using Microsoft.EntityFrameworkCore; +using Microsoft.Extensions.Logging; using TextStack.Extraction.Quality; namespace Application.Ingestion; @@ -30,7 +31,8 @@ List Warnings public record ExtractionWarningDto(int Code, string Message); -public class IngestionService(IAppDbContext db, IFileStorageService storage) +public class IngestionService( + IAppDbContext db, IFileStorageService storage, ILogger logger) { private static readonly TimeSpan StuckJobTimeout = TimeSpan.FromMinutes(10); @@ -87,10 +89,13 @@ public async Task ProcessParsedBookAsync( db.Chapters.RemoveRange(existingChapters); // Create new chapters + var qualityScores = new List(); foreach (var ch in parsed.Chapters) { var chapterSlug = SlugGenerator.GenerateChapterSlug(ch.Title, ch.Order); var chapterHtml = SanitizeText(ch.Html); + var score = ChapterContentQualityAnalyzer.Analyze(chapterHtml).Score; + qualityScores.Add(score); var chapter = new Chapter { Id = Guid.NewGuid(), @@ -101,7 +106,7 @@ public async Task ProcessParsedBookAsync( Html = chapterHtml, PlainText = SanitizeText(ch.PlainText), WordCount = ch.WordCount, - ContentQualityScore = ChapterContentQualityAnalyzer.Analyze(chapterHtml).Score, + ContentQualityScore = score, OriginalChapterNumber = ch.OriginalChapterNumber, PartNumber = ch.PartNumber, TotalParts = ch.TotalParts, @@ -111,6 +116,14 @@ public async Task ProcessParsedBookAsync( db.Chapters.Add(chapter); } + if (qualityScores.Count > 0) + { + logger.LogInformation( + "Content quality for edition {EditionId}: {Count} chapters, avg score {Avg}, {Below} below 60", + job.EditionId, qualityScores.Count, (int)qualityScores.Average(), + 
qualityScores.Count(s => s < 60)); + } + // Publish the edition job.Edition.Status = EditionStatus.Published; job.Edition.PublishedAt = DateTimeOffset.UtcNow; diff --git a/backend/src/Worker/Services/IngestionService.cs b/backend/src/Worker/Services/IngestionService.cs index bcf91b20..d1690479 100644 --- a/backend/src/Worker/Services/IngestionService.cs +++ b/backend/src/Worker/Services/IngestionService.cs @@ -26,6 +26,7 @@ public class IngestionWorkerService private readonly ISearchIndexer _searchIndexer; private readonly IImageOptimizer _imageOptimizer; private readonly ILogger _logger; + private readonly ILogger _ingestionLogger; public IngestionWorkerService( IDbContextFactory dbFactory, @@ -33,7 +34,8 @@ public IngestionWorkerService( IExtractorRegistry extractorRegistry, ISearchIndexer searchIndexer, IImageOptimizer imageOptimizer, - ILogger logger) + ILogger logger, + ILogger ingestionLogger) { _dbFactory = dbFactory; _storage = storage; @@ -41,6 +43,7 @@ public IngestionWorkerService( _searchIndexer = searchIndexer; _imageOptimizer = imageOptimizer; _logger = logger; + _ingestionLogger = ingestionLogger; } public async Task GetNextJobAsync(CancellationToken ct) @@ -48,7 +51,7 @@ public IngestionWorkerService( using var activity = IngestionActivitySource.Source.StartActivity("ingestion.job.pick"); await using var db = await _dbFactory.CreateDbContextAsync(ct); - var service = new AppIngestion.IngestionService(db, _storage); + var service = new AppIngestion.IngestionService(db, _storage, _ingestionLogger); var job = await service.GetNextJobAsync(ct); activity?.SetTag("job.found", job is not null); @@ -89,7 +92,7 @@ public async Task ProcessJobAsync(Guid jobId, CancellationToken ct) string? 
failureReason = null; await using var db = await _dbFactory.CreateDbContextAsync(ct); - var service = new AppIngestion.IngestionService(db, _storage); + var service = new AppIngestion.IngestionService(db, _storage, _ingestionLogger); var job = await service.GetJobWithDetailsAsync(jobId, ct); diff --git a/backend/src/Worker/Services/UserIngestionService.cs b/backend/src/Worker/Services/UserIngestionService.cs index dbac68f9..8834ff88 100644 --- a/backend/src/Worker/Services/UserIngestionService.cs +++ b/backend/src/Worker/Services/UserIngestionService.cs @@ -187,11 +187,14 @@ public async Task ProcessJobAsync(Guid jobId, CancellationToken ct) db.UserChapters.RemoveRange(existingChapters); // Create chapters + var qualityScores = new List(); foreach (var unit in result.Units) { var html = SanitizeText( Application.Common.ImageProcessingHelper.RewriteImageSrcs(unit.Html ?? string.Empty, imageMap)); var chapterTitle = SanitizeText(unit.Title ?? $"Chapter {unit.OrderIndex + 1}"); + var score = ChapterContentQualityAnalyzer.Analyze(html).Score; + qualityScores.Add(score); var chapter = new UserChapter { Id = Guid.NewGuid(), @@ -202,12 +205,20 @@ public async Task ProcessJobAsync(Guid jobId, CancellationToken ct) Html = html, PlainText = SanitizeText(unit.PlainText), WordCount = unit.WordCount, - ContentQualityScore = ChapterContentQualityAnalyzer.Analyze(html).Score, + ContentQualityScore = score, CreatedAt = DateTimeOffset.UtcNow }; db.UserChapters.Add(chapter); } + if (qualityScores.Count > 0) + { + _logger.LogInformation( + "Content quality for user book {BookId}: {Count} chapters, avg score {Avg}, {Below} below 60", + job.UserBookId, qualityScores.Count, (int)qualityScores.Average(), + qualityScores.Count(s => s < 60)); + } + // Update book metadata if (string.IsNullOrEmpty(job.UserBook.Description) && !string.IsNullOrEmpty(result.Metadata.Description)) job.UserBook.Description = StripHtml(result.Metadata.Description); From 53a67cda0d7426bd72ea631dec03624aa4849892 
Mon Sep 17 00:00:00 2001 From: Vasyl Vdovychenko Date: Fri, 22 May 2026 15:11:40 -0400 Subject: [PATCH 7/7] changelog: PDF content quality pipeline (feat-0007 slices 1-4) Co-Authored-By: Claude Opus 4.7 (1M context) --- CHANGELOG.md | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+) diff --git a/CHANGELOG.md b/CHANGELOG.md index 70110691..aefc8498 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -2,6 +2,32 @@ ## [Unreleased] +### PDF content quality — Claude cleanup pipeline (2026-05-22) + +Slices 1-4 of feat-0007 (`docs/05-features/feat-0007-pdf-content-quality.md`). +Makes PDF-extracted books readable: heuristics get ~70-75%, the gap to ~90% is +semantic (running headers in body, fragmented paragraphs, hyphenation, inlined +footnotes). Closes it with a gated Claude cleanup pass, and logs every fix so +the deterministic heuristics can ratchet up over time. Marker (ML PDF pipeline) +was evaluated and shelved — the prod GPU's 4 GB VRAM can't hold its model set. + +- **`ChapterContentQualityAnalyzer`** — deterministic 0-100 content-quality + score + issue codes (fragmented paragraphs, running headers in body, + unmerged hyphenation, orphan page numbers, inlined footnotes) for extracted + chapter HTML. Pure C#, 12 unit tests. The gate that decides which chapters + warrant an LLM pass. +- **Score persisted at ingest** — `ContentQualityScore` column on `Chapter` + + `UserChapter`, set in both ingestion paths; `BookQualityJob` carries Phase 3 + tracking counters. Worker logs a per-book score distribution. +- **`quality-poll.sh` Phase 3** — for each chapter below the quality threshold, + Claude CLI fixes structure (preserving content verbatim); a stdlib-only + preservation gate (`pdf-cleanup-gate.py`) rejects hallucination or + over-deletion via word-multiset diff before the cleaned HTML is written back. + Every (messy → clean) pair is logged to `data/pdf-cleanup-dataset/` as fuel + for the future heuristic ratchet. Off by default — `CONTENT_CLEANUP_ENABLED`. 
+- **Admin observability** — the Book Quality job detail panel shows Phase 3 + results (chapters cleaned / rejected / skipped). + ### Mobile reader — autosave restore (2026-05-13) - **WordCard parity with web WordPopup** — single-word tap on mobile