Refactor ingestion to use semantic chunking

## Problem

The current ingestion pipeline is too fine-grained and slow:

**Current approach:**
- Parse markdown into sections -> paragraphs hierarchy
- Per paragraph: 1 embedding + 1 keyword extraction (Claude) + 3-5 keyword embeddings
- For agency-2.md (87 paragraphs): 5-7 calls per paragraph, or roughly 435-609 API calls in total, which is very slow
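
To make the call count concrete, here is a small sketch of the arithmetic using the per-paragraph costs listed above. The function and type names are illustrative only and do not come from the codebase.

```typescript
// Per-paragraph cost taken from the list above: 1 paragraph embedding,
// 1 keyword extraction call, and 3-5 keyword embeddings.
interface CallRange {
  min: number;
  max: number;
}

// Hypothetical helper: estimates total API calls for a document
// under the current per-paragraph pipeline.
function estimateApiCalls(paragraphs: number): CallRange {
  const fixedCalls = 2; // paragraph embedding + keyword extraction
  const minKeywordEmbeddings = 3;
  const maxKeywordEmbeddings = 5;
  return {
    min: paragraphs * (fixedCalls + minKeywordEmbeddings),
    max: paragraphs * (fixedCalls + maxKeywordEmbeddings),
  };
}

const agency2Calls = estimateApiCalls(87);
console.log(agency2Calls); // { min: 435, max: 609 }
```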

**Issues identified:**
1. Keywords extracted per-paragraph are often redundant or nonsensical (e.g., "Content:", "agency-2")
2. The section/paragraph hierarchy doesn't fit all document types (the survey found that 54% have no headings, and some are dialogs)
3. Too many small nodes in the database

## Proposed Solution: Semantic Chunking

Use Claude Haiku to identify semantic boundaries in text, creating ~500-1500 token chunks based on meaning rather than arbitrary paragraph splits.
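
One possible way to structure this, sketched under assumptions: the model is asked only for the paragraph indices where a new chunk should start, and deterministic code merges paragraphs between those boundaries. This is not necessarily how src/lib/chunker.ts works; `splitParagraphs`, `assembleChunks`, and `AssembledChunk` are hypothetical names, and the Haiku call itself is left out.

```typescript
// Hypothetical output shape for the deterministic assembly step.
interface AssembledChunk {
  index: number;
  text: string;
}

// Split raw text on blank lines and drop empty paragraphs.
function splitParagraphs(text: string): string[] {
  return text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
}

// Merge paragraphs into chunks. `chunkStarts` is assumed to be the list of
// paragraph indices the model marked as the start of a new chunk. Invalid,
// duplicate, or out-of-range indices are ignored, and index 0 always starts
// the first chunk.
function assembleChunks(paragraphs: string[], chunkStarts: number[]): AssembledChunk[] {
  if (paragraphs.length === 0) {
    return [];
  }
  const valid = chunkStarts.filter(
    (i) => Number.isInteger(i) && i > 0 && i < paragraphs.length
  );
  const starts = Array.from(new Set([0, ...valid])).sort((a, b) => a - b);
  return starts.map((start, k) => {
    const end = k + 1 < starts.length ? starts[k + 1] : paragraphs.length;
    return { index: k, text: paragraphs.slice(start, end).join("\n\n") };
  });
}
```

Keeping the model's output to a list of indices means a malformed response can be sanitized without losing any source text, since every paragraph always ends up in exactly one chunk.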

**Prototype results on agency-2.md:**
- **31 chunks** instead of 87 paragraphs
- Average chunk: ~756 chars (~189 tokens), well below the ~500-1500 token target
- Chunk types detected: exposition, argument, list, dialog-turn, notes, section, conclusion
- Processing time: 26s (for chunking only, before embeddings)
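
The ~189 token figure is consistent with a rough 4-characters-per-token heuristic (756 / 4 = 189). Assuming that heuristic, a sketch like the following could report how many chunks actually land in the ~500-1500 token target. `summarizeChunks` and `CHARS_PER_TOKEN` are hypothetical names, and a real tokenizer would give different counts.

```typescript
// Assumption: ~4 characters per token, a rough heuristic rather than a tokenizer.
const CHARS_PER_TOKEN = 4;

interface ChunkStats {
  count: number;
  avgChars: number;
  avgTokens: number;
  inTargetRange: number; // chunks whose estimated tokens fall within [minTokens, maxTokens]
}

// Hypothetical helper: summarize chunk sizes against the target token range.
function summarizeChunks(chunks: string[], minTokens = 500, maxTokens = 1500): ChunkStats {
  if (chunks.length === 0) {
    return { count: 0, avgChars: 0, avgTokens: 0, inTargetRange: 0 };
  }
  const totalChars = chunks.reduce((sum, c) => sum + c.length, 0);
  const avgChars = totalChars / chunks.length;
  const inTargetRange = chunks.filter((c) => {
    const tokens = c.length / CHARS_PER_TOKEN;
    return tokens >= minTokens && tokens <= maxTokens;
  }).length;
  return {
    count: chunks.length,
    avgChars,
    avgTokens: avgChars / CHARS_PER_TOKEN,
    inTargetRange,
  };
}
```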

**New simplified schema:**
```
documents (source file metadata)
    -> chunks (semantic units, ~500-1500 tokens each)
        -> keywords (extracted per chunk, not per paragraph)
```
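
As a rough illustration, the hierarchy above could map to types like these. All field names are assumptions made for the sketch, not the actual schema; the chunk types are the ones the prototype detected, and the sample keywords are placeholders.

```typescript
// Chunk types reported by the prototype run.
type ChunkType =
  | "exposition"
  | "argument"
  | "list"
  | "dialog-turn"
  | "notes"
  | "section"
  | "conclusion";

// Hypothetical record shapes for documents -> chunks -> keywords.
interface KeywordRecord {
  id: string;
  chunkId: string;
  term: string;
}

interface ChunkRecord {
  id: string;
  documentId: string;
  position: number; // order of the chunk within its document
  type: ChunkType;
  text: string;
  keywords: KeywordRecord[];
}

interface DocumentRecord {
  id: string;
  sourcePath: string;
  chunks: ChunkRecord[];
}

// Count keywords across all chunks of a document.
function countKeywords(doc: DocumentRecord): number {
  return doc.chunks.reduce((sum, chunk) => sum + chunk.keywords.length, 0);
}

// Placeholder data, for illustration only.
const exampleDocument: DocumentRecord = {
  id: "doc-1",
  sourcePath: "agency-2.md",
  chunks: [
    {
      id: "chunk-1",
      documentId: "doc-1",
      position: 0,
      type: "exposition",
      text: "Placeholder chunk text.",
      keywords: [
        { id: "kw-1", chunkId: "chunk-1", term: "placeholder keyword one" },
        { id: "kw-2", chunkId: "chunk-1", term: "placeholder keyword two" },
      ],
    },
  ],
};

console.log(countKeywords(exampleDocument)); // 2
```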

## Implementation Plan

1. [x] Survey document formats to understand variety (done - see scripts/survey-documents.ts)
2. [x] Prototype SemanticChunker with Haiku (done - see src/lib/chunker.ts)
3. [ ] Refine chunker prompts based on testing
4. [ ] Create new ingestion pipeline using chunks
5. [ ] Migrate database schema (chunks table, remove section/paragraph distinction)
6. [ ] Update search to work with chunks
7. [ ] Re-import all documents with new pipeline

## Files

- src/lib/chunker.ts - Semantic chunker prototype
- scripts/test-chunker.ts - Test script for chunker
- scripts/survey-documents.ts - Document format survey

## Related

Closes #1 (the original hierarchy bug becomes moot with the new chunking approach)

