fix: handle non-BMP UTF-16 characters in markdown formatting #63

Open
Arondy wants to merge 1 commit into MaxApiTeam:main from Arondy:fix/non-bmp-utf16-positions

Arondy commented Jun 11, 2026


Description

Fixes incorrect formatting of Markdown links when the text contains non-BMP characters, such as emoji.

Type of change

  • Bug fix
  • New feature
  • Documentation improvement
  • Refactoring

Related tasks / Issues

Closes #62.

Testing

from pymax.formatting.markdown import Formatter

clean, entities = Formatter.format_markdown("🔥 [a](https://x.com) 👍 [b](https://y.com)")
assert entities[0].from_ == 3   # 🔥(2) + space(1)
assert entities[1].from_ == 8   # 🔥(2) + space(1) + a(1) + space(1) + 👍(2) + space(1)
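
The expected offsets above can be cross-checked without pymax. The snippet below is a minimal sketch: `utf16_offset` is a hypothetical helper (not part of the library) that converts a Python string index into a UTF-16 code-unit offset, applied to the text that should remain once the link markup is stripped.

```python
def utf16_offset(text: str, index: int) -> int:
    """Return the UTF-16 code-unit offset of the character at Python index `index`.

    Hypothetical helper for illustration only; it is not part of pymax.
    """
    # Every UTF-16 code unit is 2 bytes, so the byte length of the prefix / 2
    # is the number of code units that precede the character.
    return len(text[:index].encode("utf-16-le")) // 2


# Assumed clean text after the two links are reduced to their labels.
clean = "🔥 a 👍 b"

offset_a = utf16_offset(clean, clean.index("a"))
offset_b = utf16_offset(clean, clean.index("b"))
print(offset_a, offset_b)  # 3 8
```

Python's own indices for `a` and `b` are 2 and 6, because `str` counts code points; the gap between those and 3 and 8 is the offset error described in this PR.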

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Improved handling of emoji and extended Unicode characters in markdown formatting.

coderabbitai Bot commented Jun 11, 2026

Walkthrough

The formatter now correctly handles non-BMP characters by tracking markdown positions in UTF-16 code units. A new infrastructure layer (BMP_MAX constant and _code_units_len helper) enables the parser to distinguish BMP characters (1 code unit) from non-BMP characters (2 code units), allowing accurate position advancement across LINK, HEADING, QUOTE, and character parsing paths.
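
The two ways of sizing text mentioned in the walkthrough can be illustrated as follows. This is a sketch based only on the description above, not the PR's actual code: `code_units_len` and `char_units` are illustrative names, and only the `BMP_MAX = 0xFFFF` value and the "encode as UTF-16LE and divide by 2" approach come from the summary.

```python
BMP_MAX = 0xFFFF  # highest code point in the Basic Multilingual Plane


def code_units_len(text: str) -> int:
    """UTF-16 length of a whole string: encode as UTF-16LE, 2 bytes per unit."""
    return len(text.encode("utf-16-le")) // 2


def char_units(ch: str) -> int:
    """UTF-16 length of one character: a surrogate pair (2) above the BMP, else 1."""
    return 2 if ord(ch) > BMP_MAX else 1


sample = "🔥 a 👍 b"

# Both approaches agree, and both differ from len(), which counts code points.
assert code_units_len(sample) == sum(char_units(ch) for ch in sample) == 9
assert len(sample) == 7
```

The per-character form suits a parser that advances one character at a time, while the encode-based form is convenient for sizing a whole substring such as a link label.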

Changes

UTF-16 Position Tracking

| Layer / File(s) | Summary |
| --- | --- |
| UTF-16 Infrastructure (src/pymax/formatting/markdown.py) | Formatter gains BMP_MAX = 0xFFFF constant and _code_units_len() helper to compute UTF-16 code-unit lengths by encoding text as UTF-16LE and dividing by 2. |
| UTF-16 Position Tracking in Parsing (src/pymax/formatting/markdown.py) | LINK label extraction, HEADING/QUOTE text loops, and default character advancement now increment clean_pos using UTF-16 code-unit sizing (2 for non-BMP, 1 for BMP) instead of fixed +1 increments. |
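
To make the position-tracking change concrete, here is a deliberately simplified sketch of a link-only parser that advances `clean_pos` in UTF-16 code units. It is not the pymax implementation: `strip_links`, `LinkEntity`, `units`, and the regular expression are assumptions made for illustration, and the real formatter also handles headings, quotes, and other markup.

```python
import re
from typing import NamedTuple

BMP_MAX = 0xFFFF
# Simplified pattern: [label](url), no nesting or escaping.
LINK_RE = re.compile(r"\[([^\]]*)\]\(([^)]*)\)")


class LinkEntity(NamedTuple):
    from_: int   # start offset in UTF-16 code units
    length: int  # label length in UTF-16 code units
    url: str


def units(text: str) -> int:
    """Count UTF-16 code units: 2 per non-BMP character, 1 otherwise."""
    return sum(2 if ord(ch) > BMP_MAX else 1 for ch in text)


def strip_links(source: str) -> tuple[str, list[LinkEntity]]:
    clean_parts: list[str] = []
    entities: list[LinkEntity] = []
    clean_pos = 0  # position in the clean text, in UTF-16 code units
    cursor = 0     # position in the source, as a Python index

    for match in LINK_RE.finditer(source):
        # Plain text before the link is copied through unchanged.
        before = source[cursor:match.start()]
        clean_parts.append(before)
        clean_pos += units(before)  # a fixed +1 per character would be wrong here

        label, url = match.group(1), match.group(2)
        entities.append(LinkEntity(clean_pos, units(label), url))
        clean_parts.append(label)
        clean_pos += units(label)
        cursor = match.end()

    clean_parts.append(source[cursor:])
    return "".join(clean_parts), entities


clean, entities = strip_links("🔥 [a](https://x.com) 👍 [b](https://y.com)")
assert clean == "🔥 a 👍 b"
assert [e.from_ for e in entities] == [3, 8]
```

Replacing `units` with `len` in this sketch yields offsets 2 and 6 instead, which matches the symptom reported in #62.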

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related issues

  • #62: The UTF-16 code-unit position tracking changes directly address the bug where non-BMP characters shifted LINK positions by fixing offset calculations throughout the markdown formatter.

Poem

🐰 In UTF-16's binary dance,
Where code units bloom and prance,
BMP bounds are now our guide,
Non-BMP marked with widened stride,
Positions tracked both true and bright!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: fixing handling of non-BMP UTF-16 characters in markdown formatting, which matches the core objective of the PR. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The pull request description follows the required template with all sections completed: description, change type, related issues, and testing example. |


Development

Successfully merging this pull request may close these issues.

Markdown links have incorrect positions when non-BMP characters are present
