Replies: 2 comments
Appreciate the discussion here — the agent angle stood out as especially practical. A small pattern that may help:
If useful, we keep a structured index of similar workflows here: https://skillslookup.com
Happy to share more detail if it would help.
Hey @Brikas, sorry for the very late reply on this. Thanks for the thoughtful proposal. I finally sat down and looked at it properly against the current chunker.
I think you're right on the main points. The current chunker is deliberately simple: each top-level block plus its children becomes one chunk, and the only size handling is […]
The large-block splitting is also well justified. There's no upper bound today, so a big block with many children gets flattened into one giant chunk, which can blow past the embedder's context limit and waters down retrieval quality, since one vector ends up representing too many topics.
On the breadcrumb context I mostly agree, with one nuance: I'd lean toward keeping the breadcrumb out of the embedded text and only attaching it to the raw/output, so it doesn't shift every chunk's vector. The […]
One implementation gotcha worth flagging: the chunk id is currently […]
If you'd still like to take a crack at it, a PR would be very welcome. I've opened #63 to track the work so it doesn't get lost. No pressure on timing, but if you don't get to it in the next few weeks I'll pick it up myself.
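To make the breadcrumb nuance concrete, here is a minimal sketch of a chunk that carries two text views: one for the embedder and one for the raw/output side. The class and method names (`RetrievedChunk`, `embed_text`, `output_text`) and the ` > ` separator are assumptions for illustration, not the project's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievedChunk:
    """Hypothetical chunk record; field names are illustrative only."""

    body: str                    # the block text that gets embedded
    breadcrumb: tuple[str, ...]  # ancestor block texts, outermost first

    def embed_text(self) -> str:
        # Only the body is embedded, so the breadcrumb cannot shift the vector.
        return self.body

    def output_text(self) -> str:
        # The breadcrumb is attached only to what is returned to the caller.
        if not self.breadcrumb:
            return self.body
        return " > ".join(self.breadcrumb) + "\n" + self.body
```

With this split, two chunks under different parents still embed purely on their own content, while the returned text shows where each one came from.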
I appreciate the more LogSeq-native chunking by top-level blocks. I think it puts LogSeq's logical note structure to good use.
I think there could be a further upgrade, though. Just jotting down some thoughts for now.
Size-aware chunking
If there are many small top-level blocks, they can easily lose their context when left on their own. Define an OPTIMAL_CHUNK_SIZE (or reuse MIN_CHUNK_SIZE for simplicity). If a top-level block does not reach it, recursively swallow adjacent blocks too.
Define a MAX_CHUNK_SIZE. If a top-level block is really large, split its downstream content into chunks that also aim for OPTIMAL_CHUNK_SIZE. The even more interesting part: to avoid losing the context of where a block belongs, include a breadcrumb at the top of the chunk indicating its origin. It may even include a smart ellipsis to indicate that there are x more nodes at this level. Keep it recursive.
I've described this very simply, but I know it requires a somewhat sophisticated tree algorithm to get optimal chunks.
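A rough sketch of the merge-and-split idea described above, assuming sizes are measured in characters and using a simple greedy pass instead of an optimal tree algorithm. The `Block` and `Chunk` types, the threshold values, and `chunk_siblings` are all hypothetical names, not taken from the existing chunker.

```python
from dataclasses import dataclass, field

# Hypothetical thresholds, in characters, kept tiny for illustration.
OPTIMAL_CHUNK_SIZE = 40
MAX_CHUNK_SIZE = 80


@dataclass
class Block:
    text: str
    children: list["Block"] = field(default_factory=list)


@dataclass
class Chunk:
    breadcrumb: list[str]  # ancestor block texts, outermost first
    lines: list[str]       # indented outline lines forming the chunk body


def render(block: Block, depth: int = 0) -> list[str]:
    """Flatten a block and its descendants into indented outline lines."""
    out = ["  " * depth + "- " + block.text]
    for child in block.children:
        out.extend(render(child, depth + 1))
    return out


def size(lines: list[str]) -> int:
    return sum(len(line) for line in lines)


def chunk_siblings(blocks: list[Block], breadcrumb: list[str]) -> list[Chunk]:
    """Greedily pack sibling blocks; recurse into blocks above MAX_CHUNK_SIZE."""
    chunks: list[Chunk] = []
    pending: list[str] = []

    def flush() -> None:
        if pending:
            chunks.append(Chunk(breadcrumb, list(pending)))
            pending.clear()

    for block in blocks:
        lines = render(block)
        if size(lines) > MAX_CHUNK_SIZE and block.children:
            # Too large: close the current chunk, then split the subtree and
            # record this block in the breadcrumb of every resulting piece.
            flush()
            chunks.extend(chunk_siblings(block.children, breadcrumb + [block.text]))
            continue
        if pending and size(pending) + size(lines) > MAX_CHUNK_SIZE:
            flush()
        pending.extend(lines)  # small blocks keep swallowing their neighbours
        if size(pending) >= OPTIMAL_CHUNK_SIZE:
            flush()
    flush()  # trailing leftovers become a best-fit chunk
    return chunks
```

This version does not enforce a minimum size on the trailing chunk and leaves an oversized leaf block (one with no children) as a single chunk; both would need a decision in a real implementation.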
Below is an example
Example parameters. Not actually calculated, just illustrative. Small for brevity of explanation. Normally, would be bigger.
Original File
Chunks
Chunk 1 - merged small top-level blocks
Chunk 2 - large top-level block split, first subtree
Chunk 3
Chunk 4
Chunk 5 - tiny trailing top-level block (best-fit, can't do optimal, but meets the minimum)
The tokens added by breadcrumbs can be a concern: especially when chunk sizes are tiny, they can inflate a chunk considerably. That's a reason to make them optional.
But they could also be shorter:
(... 2 omitted sibling blocks before) -> (+2), with an explanation to the agent at the MCP level: "(+N) inside the retrieved data indicates omitted sibling blocks for context".
With this breadcrumb + sibling data, the agent can then use get_page_content, search or query to selectively retrieve the full context, as it now sees the hierarchy.
What's your view on this? I may give it a shot myself with a fork.
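For reference, the long and short sibling markers could be produced by something like the sketch below. The function names and the `short` flag are hypothetical; only the two marker formats and the MCP-level hint come from the proposal above.

```python
# Hint that would be given to the agent once, at the MCP level (assumed wording).
AGENT_HINT = "(+N) inside the retrieved data indicates N omitted sibling blocks."


def sibling_marker(count: int, position: str, short: bool = True) -> str:
    """Return a marker for `count` omitted siblings, or '' when none are omitted."""
    if count <= 0:
        return ""
    if short:
        return f"(+{count})"
    noun = "block" if count == 1 else "blocks"
    return f"(... {count} omitted sibling {noun} {position})"


def annotate(
    lines: list[str],
    omitted_before: int = 0,
    omitted_after: int = 0,
    short: bool = True,
) -> list[str]:
    """Wrap chunk lines with markers for siblings left out on either side."""
    head = sibling_marker(omitted_before, "before", short)
    tail = sibling_marker(omitted_after, "after", short)
    return [m for m in (head,) if m] + list(lines) + [m for m in (tail,) if m]
```

The short form costs a handful of characters per marker, which matters most when chunks are small.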