fix: break content truncation at semantic boundaries #30

Open
ntegrals wants to merge 1 commit into master from fix/content-truncation

ntegrals commented Apr 2, 2026

Summary

When the output of extractMarkdown() exceeds maxLength, the previous logic broke at a paragraph boundary only if that boundary fell in the last 20% of the content. Otherwise it sliced mid-word or mid-sentence, producing broken markdown.

Before

# Some Article

This is a long paragraph about...  ← sliced here mid-sentence

[... content truncated, ~500 chars remaining]

After

The truncation now tries boundaries in priority order:

  1. Paragraph break ("\n\n"): cleanest cut
  2. Sentence ending (". ", ".\n", "? ", "! "): preserves complete thoughts
  3. Word boundary (space): avoids mid-word cuts
  4. Hard limit: only if no boundary is found past the 50% mark

All boundaries must be at least 50% into the content to avoid over-truncation.
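
The priority order above can be sketched roughly as follows. This is an illustrative reconstruction, not the code in content-extractor.ts: the helper name truncateAtBoundary, the MIN_KEEP_RATIO constant, and the exact suffix wording are hypothetical, and the sketch assumes the 50% threshold is measured against maxLength.

```typescript
// Hypothetical sketch; names and suffix format are assumptions,
// not the actual implementation in content-extractor.ts.
const MIN_KEEP_RATIO = 0.5; // assumed to be relative to maxLength

function truncateAtBoundary(content: string, maxLength: number): string {
  if (content.length <= maxLength) return content;

  const head = content.slice(0, maxLength);
  const minKeep = Math.floor(maxLength * MIN_KEEP_RATIO);

  // Delimiter groups in priority order. A lower-priority group is only
  // consulted when no delimiter in the higher-priority group qualifies.
  const groups: string[][] = [
    ["\n\n"],                  // paragraph break
    [". ", ".\n", "? ", "! "], // sentence ending
    [" "],                     // word boundary
  ];

  let cut = maxLength; // hard limit, used if nothing qualifies
  for (const delimiters of groups) {
    let best = -1;
    for (const d of delimiters) {
      const idx = head.lastIndexOf(d);
      // Keep sentence punctuation but drop the trailing whitespace.
      const end = idx === -1 ? -1 : idx + d.trimEnd().length;
      if (end > best) best = end;
    }
    if (best >= minKeep) {
      cut = best;
      break;
    }
  }

  const remaining = content.length - cut;
  return `${content.slice(0, cut).trimEnd()}\n\n[... content truncated, ~${remaining} chars remaining]`;
}
```

With a 100-character limit, a paragraph break at offset 60 wins over any later sentence or word boundary, while a break at offset 30 would be rejected by the 50% rule and the search would fall through to sentence endings.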

Code change

packages/core/src/page/content-extractor.ts lines 247-268

Test plan

  • bun run build — compiles clean
  • bun run test — all 364 tests pass

Improve markdown truncation to prefer paragraph breaks, then sentence
endings, then word boundaries instead of slicing mid-text. Uses a 50%
minimum keep ratio so short content isn't over-truncated.