Tracking issue for the chunking improvements proposed in #39 by @Brikas.
The current chunker (chunker.py) maps each top-level block plus its children to one chunk. It has no size awareness beyond min_chunk_length, which drops short blocks rather than merging them.
Scope:
- Merge small blocks. Combine consecutive blocks that fall under a target size instead of dropping them at min_chunk_length. Fixes silent data loss for short blocks.
- Split large blocks. Add a max_chunk_length and split oversized blocks so they stay within the embedder's context window and retrieval stays sharp.
- Breadcrumb context. Attach the block's hierarchical path to each chunk, with a (+N) notation for omitted siblings. Keep it out of the embedded text and only on the raw/output side.
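A rough sketch of how the merge and split passes could look, assuming blocks are plain strings and sizes are measured in characters. The function names, the target_length parameter, and the newline join are hypothetical and not taken from chunker.py; a real version would likely split on sentence or block boundaries rather than spaces.

```python
from typing import List


def merge_small_blocks(blocks: List[str], target_length: int) -> List[str]:
    """Greedily join consecutive blocks while the joined text fits target_length.

    Hypothetical helper: nothing is dropped, short blocks are carried forward
    and emitted together with their neighbours.
    """
    merged: List[str] = []
    buffer = ""
    for block in blocks:
        candidate = f"{buffer}\n{block}" if buffer else block
        if buffer and len(candidate) > target_length:
            # Adding this block would overshoot, so flush what we have.
            merged.append(buffer)
            buffer = block
        else:
            buffer = candidate
    if buffer:
        merged.append(buffer)
    return merged


def split_block(text: str, max_chunk_length: int) -> List[str]:
    """Split text into pieces of at most max_chunk_length characters.

    Hypothetical helper: prefers to break at the last space inside the limit
    and falls back to a hard cut when there is none.
    """
    pieces: List[str] = []
    remaining = text.strip()
    while len(remaining) > max_chunk_length:
        cut = remaining.rfind(" ", 0, max_chunk_length + 1)
        if cut <= 0:
            cut = max_chunk_length
        pieces.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        pieces.append(remaining)
    return pieces
```

Running the split pass first and the merge pass second would keep every output chunk under max_chunk_length as long as target_length does not exceed it.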
Implementation note: chunk ids are {page}::{block_index}, and block_index is also used for sync state. Merging and splitting break the one-chunk-per-block assumption behind that scheme, so ids are no longer guaranteed unique. We need a sub-index or a revised id scheme, which also touches the sync/state layer.
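One possible shape for a sub-index scheme, as a sketch only. The ChunkId class, the ids_for_split helper, and the `.{sub_index}` suffix are hypothetical; the actual format would have to be agreed with whatever the sync/state layer stores today.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ChunkId:
    """Hypothetical id: {page}::{block_index}.{sub_index}."""

    page: str
    block_index: int  # index of the first source block covered by the chunk
    sub_index: int = 0  # position among the pieces of a split block

    def __str__(self) -> str:
        return f"{self.page}::{self.block_index}.{self.sub_index}"


def ids_for_split(page: str, block_index: int, piece_count: int) -> List[ChunkId]:
    """Give each piece of a split block its own sub_index."""
    return [ChunkId(page, block_index, i) for i in range(piece_count)]
```

Under this assumption a merged chunk would take the block_index of its first block with sub_index 0, so ids stay unique as long as no two chunks start at the same block.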
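Suggested order, numbering the scope items from top to bottom: start with (2), splitting large blocks, since it's lowest risk, then (1), merging small blocks, then (3), breadcrumb context, as an optional separate pass.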
@Brikas has offered to implement via a fork. If no PR lands in the next few weeks we'll pick it up.