Problem
The current ingestion pipeline is too fine-grained and slow:
Current approach:
Parse markdown into sections -> paragraphs hierarchy
Per paragraph: 1 embedding + 1 keyword extraction (Claude) + 3-5 keyword embeddings
For agency-2.md (87 paragraphs): ~500+ API calls, very slow
Issues identified:
Keywords extracted per-paragraph are often redundant or nonsensical (e.g., "Content:", "agency-2")
The section/paragraph hierarchy doesn't fit all document types (survey found 54% have no headings, some are dialogs)
Too many small nodes in the database
Proposed Solution: Semantic Chunking
Use Claude Haiku to identify semantic boundaries in text, creating ~500-1500 token chunks based on meaning rather than arbitrary paragraph splits.
Prototype results on agency-2.md:
31 chunks instead of 87 paragraphs
Average chunk: ~756 chars (~189 tokens)
Chunk types detected: exposition, argument, list, dialog-turn, notes, section, conclusion
Processing time: 26s (for chunking only, before embeddings)
New simplified schema:
documents (source file metadata)
-> chunks (semantic units, ~500-1500 tokens each)
-> keywords (extracted per chunk, not per paragraph)
Implementation Plan
Survey document formats to understand variety (done - see survey-documents.ts)
Prototype SemanticChunker with Haiku (done - see src/lib/chunker.ts)
Refine chunker prompts based on testing
Create new ingestion pipeline using chunks
Migrate database schema (chunks table, remove section/paragraph distinction)
Update search to work with chunks
Re-import all documents with new pipeline
Files
src/lib/chunker.ts - Semantic chunker prototype
scripts/test-chunker.ts - Test script for chunker
scripts/survey-documents.ts - Document format survey
Related
Closes #1 (the original hierarchy bug becomes moot with new chunking approach)
Problem
The current ingestion pipeline is too fine-grained and slow:
Current approach:
Issues identified:
Proposed Solution: Semantic Chunking
Use Claude Haiku to identify semantic boundaries in text, creating ~500-1500 token chunks based on meaning rather than arbitrary paragraph splits.
Prototype results on agency-2.md:
New simplified schema:
Implementation Plan
Files
Related
Closes #1 (the original hierarchy bug becomes moot with new chunking approach)