Refactor ingestion to use semantic chunking #2

@fronx

Problem

The current ingestion pipeline is too fine-grained and slow:

Current approach:

  • Parse markdown into sections -> paragraphs hierarchy
  • Per paragraph: 1 embedding + 1 keyword extraction (Claude) + 3-5 keyword embeddings
  • For agency-2.md (87 paragraphs): roughly 500 API calls (5-7 per paragraph), very slow
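
The call count follows from the per-paragraph costs listed above. A small sketch of that arithmetic, assuming each embedding and each keyword extraction is exactly one API call (the function and type names are illustrative, not from the codebase):

```typescript
// Estimated API calls for the current per-paragraph pipeline.
// Assumption: every embedding and every keyword extraction is one API call.
interface CallEstimate {
  min: number;
  max: number;
}

function estimateCalls(
  units: number,
  minKeywords: number,
  maxKeywords: number,
): CallEstimate {
  // 1 embedding for the unit + 1 keyword extraction, plus 1 embedding per keyword
  const fixedPerUnit = 2;
  return {
    min: units * (fixedPerUnit + minKeywords),
    max: units * (fixedPerUnit + maxKeywords),
  };
}

const current = estimateCalls(87, 3, 5);
console.log(current); // { min: 435, max: 609 }
```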

Issues identified:

  1. Keywords extracted per-paragraph are often redundant or nonsensical (e.g., "Content:", "agency-2")
  2. The section/paragraph hierarchy doesn't fit all document types (the survey found that 54% of documents have no headings, and some are dialogs)
  3. Too many small nodes in the database

Proposed Solution: Semantic Chunking

Use Claude Haiku to identify semantic boundaries in text, creating ~500-1500 token chunks based on meaning rather than arbitrary paragraph splits.
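
One way to structure this is to have the model return only boundary positions and to assemble the chunks locally. The sketch below assumes the model is shown numbered paragraphs and answers with the indices where new chunks start. `assembleChunks` and that response shape are hypothetical and not taken from src/lib/chunker.ts. Model output is treated as untrusted, so indices are validated before use.

```typescript
// Hypothetical helper: turn model-proposed boundaries into chunks.
// `starts` holds paragraph indices where a new chunk begins, as a model
// such as Haiku might return them (this response shape is an assumption).
function assembleChunks(paragraphs: string[], starts: number[]): string[] {
  if (paragraphs.length === 0) return [];

  // Keep only valid, unique indices; a chunk always starts at paragraph 0.
  const valid = new Set<number>([0]);
  for (const s of starts) {
    if (Number.isInteger(s) && s > 0 && s < paragraphs.length) valid.add(s);
  }
  const sorted = Array.from(valid).sort((a, b) => a - b);

  // Each chunk spans from its start index up to the next start (exclusive).
  return sorted.map((start, i) => {
    const end = i + 1 < sorted.length ? sorted[i + 1] : paragraphs.length;
    return paragraphs.slice(start, end).join("\n\n");
  });
}

// Duplicate (2, 2) and out-of-range (9) indices are ignored:
// the result is two chunks, paragraphs 0-1 and paragraphs 2-3.
const demo = assembleChunks(["a", "b", "c", "d"], [2, 2, 9]);
console.log(demo.length);
```

Keeping the model's job limited to picking boundaries means the original text is never rewritten by the model, only regrouped.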

Prototype results on agency-2.md:

  • 31 chunks instead of 87 paragraphs
  • Average chunk: ~756 chars (~189 tokens)
  • Chunk types detected: exposition, argument, list, dialog-turn, notes, section, conclusion
  • Processing time: 26s (for chunking only, before embeddings)
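
The token figure above matches the common rule of thumb of about 4 characters per token, which is an approximation and not a real tokenizer count. A quick check of the prototype average against the ~500-1500 token target, with illustrative names:

```typescript
// Rough token estimate. Assumption: ~4 characters per token, a common
// heuristic for English text; a real tokenizer will give different counts.
const CHARS_PER_TOKEN = 4;

function estimateTokens(charCount: number): number {
  return Math.round(charCount / CHARS_PER_TOKEN);
}

// Target range taken from the proposal (~500-1500 tokens per chunk).
function inTargetRange(tokens: number, min = 500, max = 1500): boolean {
  return tokens >= min && tokens <= max;
}

const avgTokens = estimateTokens(756);
console.log(avgTokens, inTargetRange(avgTokens)); // 189 false
```

By this estimate the prototype's average chunk falls well under the stated target, so the target size or the chunker prompt may need revisiting during prompt refinement.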

New simplified schema:

documents (source file metadata)
    -> chunks (semantic units, ~500-1500 tokens each)
        -> keywords (extracted per chunk, not per paragraph)
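
As a sketch, the three levels could map to types like the following. Only the documents -> chunks -> keywords nesting and the chunk type names come from this issue; every field name and the `keywordsByChunk` helper are assumptions for illustration.

```typescript
// Chunk types detected in the prototype run on agency-2.md.
type ChunkType =
  | "exposition" | "argument" | "list" | "dialog-turn"
  | "notes" | "section" | "conclusion";

// All field names below are hypothetical.
interface DocumentRecord {
  id: string;
  sourcePath: string; // source file metadata
}

interface ChunkRecord {
  id: string;
  documentId: string; // parent document
  position: number;   // order within the document
  type: ChunkType;
  content: string;
}

interface KeywordRecord {
  id: string;
  chunkId: string; // extracted per chunk, not per paragraph
  keyword: string;
}

// Group keywords by their chunk, e.g. when rendering search results.
function keywordsByChunk(keywords: KeywordRecord[]): Map<string, string[]> {
  const grouped = new Map<string, string[]>();
  for (const k of keywords) {
    const list = grouped.get(k.chunkId) ?? [];
    list.push(k.keyword);
    grouped.set(k.chunkId, list);
  }
  return grouped;
}
```

With sections and paragraphs collapsed into a single chunk level, a lookup such as `keywordsByChunk` only has one parent key to deal with.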

Implementation Plan

  1. Survey document formats to understand variety (done - see scripts/survey-documents.ts)
  2. Prototype SemanticChunker with Haiku (done - see src/lib/chunker.ts)
  3. Refine chunker prompts based on testing
  4. Create new ingestion pipeline using chunks
  5. Migrate database schema (chunks table, remove section/paragraph distinction)
  6. Update search to work with chunks
  7. Re-import all documents with new pipeline

Files

  • src/lib/chunker.ts - Semantic chunker prototype
  • scripts/test-chunker.ts - Test script for chunker
  • scripts/survey-documents.ts - Document format survey

Related

Closes #1 (the original hierarchy bug becomes moot with the new chunking approach)
