Size-aware chunking: merge small blocks, split large blocks, add breadcrumb context #63

Description

@ergut

Tracking issue for the chunking improvements proposed in #39 by @Brikas.

The current chunker (chunker.py) maps each top-level block plus its children to one chunk. Its only size control is min_chunk_length, and blocks shorter than that are dropped rather than merged.

Scope:

  1. Merge small blocks. Combine consecutive blocks that fall under a target size instead of dropping them at min_chunk_length. Fixes silent data loss for short blocks.
  2. Split large blocks. Add a max_chunk_length and split oversized blocks so they stay within the embedder's context window and retrieval stays sharp.
  3. Breadcrumb context. Attach the block's hierarchical path to each chunk, with a (+N) notation for omitted siblings. Keep it out of the embedded text and only on the raw/output side.
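To make items 1 and 2 concrete, here is a minimal sketch of a merge-then-split pass over plain-text blocks. It is not the code in chunker.py: the function names (`merge_small`, `split_large`, `chunk_blocks`), the `target_len` parameter, measuring size in characters, and joining merged blocks with a newline are all assumptions made for illustration. Only `max_chunk_length` (here `max_len`) comes from the proposal itself.

```python
from typing import List


def merge_small(blocks: List[str], target_len: int) -> List[str]:
    """Greedily combine consecutive blocks while the result fits target_len.

    Hypothetical helper: short blocks are merged with their neighbours
    instead of being dropped. A block longer than target_len is passed
    through unchanged (splitting is handled separately).
    """
    merged: List[str] = []
    buf = ""
    for block in blocks:
        block = block.strip()
        if not block:
            continue  # skip empty blocks only; nothing with content is dropped
        candidate = f"{buf}\n{block}" if buf else block
        if buf and len(candidate) > target_len:
            merged.append(buf)  # buffer is full, start a new one
            buf = block
        else:
            buf = candidate
    if buf:
        merged.append(buf)
    return merged


def split_large(text: str, max_len: int) -> List[str]:
    """Split text into pieces of at most max_len characters.

    Hypothetical helper: prefers to cut at a space, and falls back to a
    hard cut when a run of max_len characters contains no space.
    """
    pieces: List[str] = []
    rest = text.strip()
    while len(rest) > max_len:
        cut = rest.rfind(" ", 0, max_len + 1)
        if cut <= 0:
            cut = max_len  # no usable space: hard cut
        pieces.append(rest[:cut].rstrip())
        rest = rest[cut:].lstrip()
    if rest:
        pieces.append(rest)
    return pieces


def chunk_blocks(blocks: List[str], target_len: int, max_len: int) -> List[str]:
    """Merge small blocks first, then split anything still over max_len."""
    chunks: List[str] = []
    for text in merge_small(blocks, target_len):
        chunks.extend(split_large(text, max_len))
    return chunks
```

A real implementation would probably measure size in tokens rather than characters and split on sentence or child-block boundaries, but the control flow would look similar. Note that a trailing short block still becomes its own chunk here; it is kept, just not merged forward.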

Implementation note: chunk ids are {page}::{block_index}, and block_index is also used for sync state. Merging and splitting break that uniqueness (a split block yields several chunks with the same block_index), so we need a sub-index or a revised id scheme, which also touches the sync/state layer.
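One possible shape for a sub-index scheme is sketched below. The `{page}::{block_index}::{sub}` format, the `make_chunk_ids` name, and the convention that a merged chunk takes the index of its first block are assumptions for illustration, not a decided design.

```python
from typing import Dict, List, Tuple


def make_chunk_ids(page: str, pieces: List[Tuple[int, str]]) -> List[str]:
    """Assign a unique id to each chunk on a page.

    pieces is a list of (block_index, text) in document order, where
    block_index is the index of the first source block the chunk came
    from. Chunks that share a block_index (the result of a split) get
    increasing sub-indices starting at 0.
    """
    next_sub: Dict[int, int] = {}
    ids: List[str] = []
    for block_index, _text in pieces:
        sub = next_sub.get(block_index, 0)
        next_sub[block_index] = sub + 1
        ids.append(f"{page}::{block_index}::{sub}")
    return ids
```

For example, `make_chunk_ids("Page", [(0, "a"), (2, "b1"), (2, "b2")])` gives `Page::0::0`, `Page::2::0`, `Page::2::1`. Always appending the sub-index keeps the format uniform but changes every existing id, so stored sync state would need a migration; omitting the suffix when it is 0 would avoid that at the cost of two id shapes.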

Suggested order: start with (2) since it's lowest risk, then (1), then (3) as an optional separate pass.

@Brikas has offered to implement via a fork. If no PR lands in the next few weeks we'll pick it up.
