✨ feat(annotation): surface-form and inline-tag exporters #253

Draft
gaborbernat wants to merge 5 commits into tox-dev:main from gaborbernat:feat/annotation-exporters

@gaborbernat

Summary

Port the inscriptis annotation output processors as pure transforms over the (text, spans) pair that
Node.to_annotated_text() returns. The extraction step landed in #220; this PR adds the rendering step:

  • turbohtml.annotation_surface(text, spans) -> dict[str, list[str]] — the surface-form extractor: group each label's
    matched substrings, in document order.
  • turbohtml.annotation_tags(text, spans) -> str — the inline-tagged (XML) exporter: weave the spans back into the text
    as <label>...</label> markup, innermost span closing first so nested spans stay well-formed.

Both touch only their str and sequence arguments — never a tree, node, or shared handle — so they need no critical
section and are free-threading safe by construction.
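
As a rough illustration of the surface-form extractor, here is a pure-Python sketch of the grouping step. It is not the C implementation: `surface_forms` is a hypothetical stand-in name, the `(start, end, label)` span shape is an assumption about what `Node.to_annotated_text()` returns, and sorting by start offset is only one reading of "document order".

```python
from collections.abc import Iterable

# Assumption: each span is a (start, end, label) triple of offsets into `text`.
# The real span shape is whatever Node.to_annotated_text() returns.
Span = tuple[int, int, str]


def surface_forms(text: str, spans: Iterable[Span]) -> dict[str, list[str]]:
    """Group each label's matched substrings, ordered by position in the text."""
    grouped: dict[str, list[str]] = {}
    # sorted() is stable, so spans with identical offsets keep their input order.
    for start, end, label in sorted(spans, key=lambda span: (span[0], span[1])):
        grouped.setdefault(label, []).append(text[start:end])
    return grouped


text = "Ada Lovelace met Charles Babbage in London."
spans = [(36, 42, "place"), (0, 12, "person"), (17, 32, "person")]
print(surface_forms(text, spans))
# {'person': ['Ada Lovelace', 'Charles Babbage'], 'place': ['London']}
```

The function reads only its two arguments and builds a fresh dict, which is the property the free-threading claim above relies on.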

Implementation

Implemented in src/turbohtml/annotation.c, bound through _htmlmodule.c / turbohtml.h / _html.pyi, exported from
turbohtml. The inline exporter expands each span into sorted open/close events whose comparator carries the other
endpoint and the original index, so the boundary order is total and the output deterministic even when spans coincide.
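
The event ordering can be sketched in Python as follows. This is a hedged approximation, not a transcription of `annotation.c`: `inline_tags`, the phase constants, and the `(start, end, label)` span shape are hypothetical, spans are assumed to nest or sit side by side (no partial overlap), text is copied without escaping, and the placement of zero-width spans is a choice made for this sketch.

```python
from collections.abc import Iterable

# Assumption: a span is a (start, end, label) triple with start <= end.
Span = tuple[int, int, str]

# Phase order at a single position: closes first, then empty spans, then opens.
CLOSE, EMPTY, OPEN = 0, 1, 2


def inline_tags(text: str, spans: Iterable[Span]) -> str:
    """Weave spans into text as <label>...</label>, innermost closing first."""
    events: list[tuple[int, int, int, int, str]] = []
    for seq, (start, end, label) in enumerate(spans):
        if start == end:
            # Zero-width span: emit the open/close pair as a single event.
            events.append((start, EMPTY, 0, seq, f"<{label}></{label}>"))
            continue
        # Opens: the span with the larger end opens first; ties by input order.
        events.append((start, OPEN, -end, seq, f"<{label}>"))
        # Closes: the span with the larger start closes first; ties in reverse
        # input order, mirroring the opens.
        events.append((end, CLOSE, -start, -seq, f"</{label}>"))
    # (position, phase, other endpoint, index) is unique per event, so the
    # order is total and does not depend on sort stability.
    events.sort(key=lambda event: event[:4])

    out: list[str] = []
    cursor = 0
    for pos, _phase, _rank, _seq, markup in events:
        out.append(text[cursor:pos])
        out.append(markup)
        cursor = pos
    out.append(text[cursor:])
    return "".join(out)


spans = [(0, 12, "person"), (0, 3, "given"), (4, 12, "family")]
print(inline_tags("Ada Lovelace met Charles Babbage", spans))
# <person><given>Ada</given> <family>Lovelace</family></person> met Charles Babbage
```

Because the key includes the original index as the last tiebreak, two spans covering the identical range still come out in one fixed order, for example `<a><b>...</b></a>` for input order `a`, `b`.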

Docs

How-to (both exporters over to_annotated_text), an explanation section on the extraction/rendering split and the
well-formed-nesting sort, the inscriptis migration row + example (replacing the prior "not ported" pitfall), and the
reference (auto-documented via automodule). Changelog fragment docs/changelog/251.feature.rst.

Coverage / gates

annotation.c is 100% line+branch under both clang/llvm-cov and gcc-16; only genuine alloc-failure arms carry
GCOVR_EXCL with written reasons. Gates green: fix, type, 3.14t, docs; Python coverage 100% (30686 passed). The
3.14 C gate aggregate sits at 99.9% branch from the known macOS llvm-cov phantom-branch noise — no source file has a
genuine uncovered branch.

closes #251

Finish the inscriptis annotation output processors: cover annotation.c
to 100% line+branch, relax the spans stub to Iterable to match the
PySequence_Fast input, and document the surface-form and inline-tagged
exporters across how-to, explanation, and the inscriptis migration row.

closes tox-dev#251

PySequence_Fast_GET_SIZE/GET_ITEM branch on list-vs-tuple internally; every existing test passed spans as a list (or an iterator PySequence_Fast materializes to a list), leaving the tuple path's 2 branches uncovered (annotation.c 58/60). A tuple-of-tuples case closes them: 60/60 under llvm-cov and gcc.

gcc counts each qsort-comparator ternary's two arms separately; with only 3 coincident events qsort never compared identical-range opens high-vs-low, leaving annotation.c at 73/74 branch on gcc (the 3.11-3.14 Linux gate). Six coincident identical-range spans plus three zero-width and a close/open boundary drive every comparator arm: annotation.c is 74/74 on gcc-16 and 60/60 on llvm-cov.

The phase/seq tiebreaks in annotation_event_cmp decide between events that agree on every earlier key. seq increases with the original span index, so for such a pair the comparator is only ever handed the lower-seq operand on one side under a given libc's qsort -- the opposite arm is correct but unreachable by any input (verified: glibc and macOS qsort flag different arms). Mark those four branches GCOVR_EXCL_BR_LINE with the rationale; pos and rank ties, which qsort compares both ways, stay covered. Verified 100% on llvm-cov and Ubuntu gcc (Docker).

Development

Successfully merging this pull request may close these issues.

✨ feat: annotation output processors (surface-form + inline-tagged export)
