Chunking upgrade - Size aware chunks with breadcrumb context #39

Brikas · 2026-03-29T15:32:12Z

Brikas
Mar 29, 2026

I appreciate the more LogSeq-native chunking by top-level blocks. I think it puts LogSeq's logical note structure to good use.

I think there could be a further upgrade. Just putting down some thoughts for now.

Size aware chunking
If there are many small top-level blocks, each one can easily lose its context when it stands alone. Define an OPTIMAL_CHUNK_SIZE (or reuse MIN_CHUNK_SIZE for simplicity). If a top-level block does not reach that size, recursively swallow adjacent blocks too.
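A minimal sketch of the merge step, assuming top-level blocks arrive as already-rendered strings in page order. The function name `merge_small_blocks` and its parameters are hypothetical and not part of the current chunker; sizes are measured in characters.

```python
def merge_small_blocks(blocks: list[str], min_size: int, optimal_size: int) -> list[str]:
    """Greedily merge adjacent blocks until each chunk reaches optimal_size."""
    chunks: list[str] = []
    pending: list[str] = []
    for block in blocks:
        pending.append(block)
        # Close the chunk as soon as the merged text reaches the target size.
        if len("\n".join(pending)) >= optimal_size:
            chunks.append("\n".join(pending))
            pending = []
    if pending:
        leftover = "\n".join(pending)
        if len(leftover) >= min_size or not chunks:
            # Best fit: below optimal, but large enough to stand alone
            # (or there is nothing earlier to attach it to).
            chunks.append(leftover)
        else:
            # Too small to stand alone: fold it into the previous chunk
            # instead of dropping it.
            chunks[-1] = chunks[-1] + "\n" + leftover
    return chunks
```

This greedy pass never discards a block; a real implementation would also need to respect MAX_CHUNK_SIZE when merging.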

Define a MAX_CHUNK_SIZE. If a top-level block is really large, split its downstream content into chunks that also aim for OPTIMAL_CHUNK_SIZE. The even more interesting part: to avoid losing the context of where the content belongs, include a breadcrumb at the top of each chunk indicating its origin. It may even include a smart ellipsis to indicate that there are x more nodes at this level. Keep it recursive.
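A rough sketch of the split step, under simplifying assumptions: the `Block` type, `render`, `marker` and `split_block` are all hypothetical names, each child subtree becomes its own chunk (no regrouping of siblings toward OPTIMAL_CHUNK_SIZE), and the breadcrumb lines are not counted against the size limit.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    text: str
    children: list["Block"] = field(default_factory=list)


def render(block: Block, depth: int = 0) -> str:
    """Render a block and its subtree as an indented outline."""
    lines = ["  " * depth + "- " + block.text]
    lines.extend(render(child, depth + 1) for child in block.children)
    return "\n".join(lines)


def marker(count: int, side: str) -> str:
    """Ellipsis line telling the reader how many siblings were left out."""
    noun = "block" if count == 1 else "blocks"
    return f"  (... {count} omitted sibling {noun} {side})"


def split_block(block: Block, path: list[str], max_size: int) -> list[str]:
    """Split an oversized block into one chunk per child, with breadcrumbs."""
    if len(render(block)) <= max_size or not block.children:
        return [render(block)]  # fits (or cannot be split): keep as-is
    here = path + [block.text]
    crumb = "[breadcrumb] " + " > ".join(here)
    chunks: list[str] = []
    last = len(block.children) - 1
    for i, child in enumerate(block.children):
        if len(render(child)) > max_size and child.children:
            # Child is still too large: recurse one level deeper.
            chunks.extend(split_block(child, here, max_size))
            continue
        lines = [crumb]
        if i > 0:
            lines.append(marker(i, "before"))
        lines.append(render(child, depth=1))
        if i < last:
            lines.append(marker(last - i, "after"))
        chunks.append("\n".join(lines))
    return chunks
```

Calling `split_block(block, ["Project Phoenix.md"], max_size=260)` on a tree built from an oversized top-level block would yield one breadcrumbed chunk per child. Chunks produced by the recursive call only carry sibling markers for their own level, which a fuller version would probably want to extend to every ancestor level.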

I said this very simply, but I know this requires a somewhat sophisticated tree algo to get the optimal chunks.

Below is an example

Example parameters. Not actually calculated, just illustrative. Small for brevity of explanation. Normally, would be bigger.

MIN_CHUNK_SIZE = 50
OPTIMAL_CHUNK_SIZE = 180
MAX_CHUNK_SIZE = 260

Original File

Project Phoenix.md
- Overview
  - Status: on track
  - Owner: Dana
  - Next milestone: prototype on Friday

- Risks
  - Vendor delay
    - Waiting on API quota increase
    - Backup vendor identified
  - Hiring
    - Need one more frontend contractor

- Meeting notes
  - Monday sync
    - Product asked for narrower MVP
    - Engineering said auth is the main constraint
    - Decision
      - Drop team sharing from MVP
      - Keep export
      - Revisit permissions later
  - Customer calls
    - ACME
      - Wants SSO
      - Okay with manual provisioning for pilot
    - BetaCorp
      - Cares mostly about audit logs
      - Security review expected next week
  - Open questions
    - Should we expose admin analytics in v1?
    - Do we support CSV import at launch?
    - How much onboarding can be manual?

- Links
  - Spec: /docs/phoenix-spec
  - Board: /boards/phoenix

Chunks

Chunk 1 - merged small top-level blocks

- Overview
  - Status: on track
  - Owner: Dana
  - Next milestone: prototype on Friday

- Risks
  - Vendor delay
    - Waiting on API quota increase
    - Backup vendor identified
  - Hiring
    - Need one more frontend contractor

Chunk 2 - large top-level block split, first subtree

[breadcrumb] Project Phoenix.md > Meeting notes
  - Monday sync
    - Product asked for narrower MVP
    - Engineering said auth is the main constraint
    - Decision
      - Drop team sharing from MVP
      - Keep export
      - Revisit permissions later
  (... 2 omitted sibling blocks after)

Chunk 3

[breadcrumb] Project Phoenix.md > Meeting notes
  (... 1 omitted sibling block before)
  - Customer calls
    - ACME
      - Wants SSO
      - Okay with manual provisioning for pilot
    - BetaCorp
      - Cares mostly about audit logs
      - Security review expected next week
  (... 1 omitted sibling block after)

Chunk 4

[breadcrumb] Project Phoenix.md > Meeting notes
  (... 2 omitted sibling blocks before)
  - Open questions
    - Should we expose admin analytics in v1?
    - Do we support CSV import at launch?
    - How much onboarding can be manual?

Chunk 5 - tiny trailing top-level block (best-fit, can't do optimal, but meets the minimum)

- Links
  - Spec: /docs/phoenix-spec
  - Board: /boards/phoenix

The tokens added by breadcrumbs can be a concern, especially when chunk sizes are tiny, since the overhead can inflate the chunks considerably. That is a reason to make them optional.
But they could also be shorter: (... 2 omitted sibling blocks before) -> (+2), with an explanation to the agent at the MCP level, e.g. "(+N) inside the retrieved data indicates omitted sibling blocks for context".
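The two marker styles could be produced by one small formatter. This is only an illustration; `sibling_marker` and its `compact` flag are hypothetical names.

```python
def sibling_marker(count: int, side: str, compact: bool = False) -> str:
    """Format an omitted-sibling marker in verbose or compact (+N) form."""
    if count <= 0:
        return ""  # nothing omitted, so no marker is needed
    if compact:
        # Compact form relies on its position in the chunk (before or after
        # the content) and on a one-time legend given to the agent.
        return f"(+{count})"
    noun = "block" if count == 1 else "blocks"
    return f"(... {count} omitted sibling {noun} {side})"


verbose = sibling_marker(2, "before")
short = sibling_marker(2, "before", compact=True)
print(len(verbose), len(short))  # character cost of each style
```

The compact form drops the words "before" and "after", so the marker's position relative to the content is what tells the agent which side the omitted siblings are on.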

With this breadcrumb + sibling data, the agent can then use get_page_content, search or query to selectively retrieve the full context, since it now sees the hierarchy.

What's your view on this? I may give it a shot myself with a fork.

nonfuntoke · 2026-05-04T20:14:33Z

nonfuntoke
May 4, 2026

Appreciate the discussion here — the agent angle stood out as especially practical.

A small pattern that may help:

- separate discovery, routing, and publish state so the workflow can adapt to issue/comment/discussion surfaces
- keep a compact evidence block for why a surface was selected

If useful, we keep a structured index of similar workflows here: https://skillslookup.com

Happy to share more detail if it would help.


ergut · 2026-06-14T02:39:47Z

ergut
Jun 14, 2026
Maintainer

Hey @Brikas, sorry for the very late reply on this. Thanks for the thoughtful proposal, I finally sat down and looked at it properly against the current chunker.

I think you're right on the main points. The current chunker is deliberately simple: each top level block plus its children becomes one chunk, and the only size handling is min_chunk_length, which actually drops anything under 50 chars instead of merging it. So short but meaningful blocks (single line notes, links, journal entries) silently never make it into the index. Your merge idea fixes a real data loss problem, that's the highest value part for me.

The large block splitting is also well justified. There's no upper bound today, so a big block with many children gets flattened into one giant chunk, which can blow past the embedder's context limit and waters down retrieval quality since one vector ends up representing too many topics.

On the breadcrumb context I mostly agree, with one nuance: I'd lean toward keeping the breadcrumb out of the embedded text and only attaching it to the raw/output, so it doesn't shift every chunk's vector. The (+2) sibling notation is a nice touch.
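To make the nuance concrete, here is a small sketch of keeping the two texts apart. The `Chunk` class and its field names are hypothetical and do not reflect the current data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    body: str             # block content only
    breadcrumb: str = ""  # e.g. "[breadcrumb] Project Phoenix.md > Meeting notes"

    @property
    def embed_text(self) -> str:
        """Text sent to the embedder: content only, so the vector is unaffected."""
        return self.body

    @property
    def display_text(self) -> str:
        """Text returned to the agent: breadcrumb first, then the content."""
        if not self.breadcrumb:
            return self.body
        return f"{self.breadcrumb}\n{self.body}"


chunk = Chunk(
    body="  - Open questions\n    - Do we support CSV import at launch?",
    breadcrumb="[breadcrumb] Project Phoenix.md > Meeting notes",
)
print(chunk.display_text)
```

With this split, changing the breadcrumb format later would not require re-embedding anything, since only `display_text` changes.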

One implementation gotcha worth flagging: the chunk id is currently {page}::{block_index}, and block_index is used both there and for sync state tracking. Once a block can produce multiple chunks (split) or several blocks collapse into one (merge), block_index alone stops being unique, so we'll need a sub index or a revised id scheme, and that touches the sync/state layer too.
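One possible shape for a revised id, purely as a sketch: encode the range of top-level blocks a chunk covers plus a part number for splits. The function `make_chunk_id` and the `{page}::{first}-{last}::{part}` layout are assumptions for illustration, not a decided scheme.

```python
def make_chunk_id(page: str, first_block: int, last_block: int, part: int = 0) -> str:
    """Build an id covering a block range (merge) and a part number (split)."""
    if last_block < first_block or part < 0:
        raise ValueError("invalid block range or part number")
    return f"{page}::{first_block}-{last_block}::{part}"


page = "Project Phoenix.md"
ids = [
    make_chunk_id(page, 0, 1),          # blocks 0 and 1 merged into one chunk
    make_chunk_id(page, 2, 2, part=0),  # block 2 split into three parts
    make_chunk_id(page, 2, 2, part=1),
    make_chunk_id(page, 2, 2, part=2),
    make_chunk_id(page, 3, 3),          # block 3 kept as a single chunk
]
print(ids)
```

Whatever scheme is chosen, the sync/state layer would need to track the same composite key rather than block_index alone.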

If you'd still like to take a crack at it, a PR would be very welcome. I've opened #63 to track the work so it doesn't get lost. No pressure on timing, but if you don't get to it in the next few weeks I'll pick it up myself.

