✨ feat(annotation): surface-form and inline-tag exporters #253
Draft
gaborbernat wants to merge 5 commits into
Finish the inscriptis annotation output processors: cover annotation.c to 100% line+branch, relax the spans stub to Iterable to match the PySequence_Fast input, and document the surface-form and inline-tagged exporters across how-to, explanation, and the inscriptis migration row. closes tox-dev#251
PySequence_Fast_GET_SIZE/GET_ITEM branch on list-vs-tuple internally; every existing test passed spans as a list (or an iterator PySequence_Fast materializes to a list), leaving the tuple path's 2 branches uncovered (annotation.c 58/60). A tuple-of-tuples case closes them: 60/60 under llvm-cov and gcc.
gcc counts each qsort-comparator ternary's two arms separately; with only 3 coincident events qsort never compared identical-range opens high-vs-low, leaving annotation.c at 73/74 branch on gcc (the 3.11-3.14 Linux gate). Six coincident identical-range spans plus three zero-width and a close/open boundary drive every comparator arm: annotation.c is 74/74 on gcc-16 and 60/60 on llvm-cov.
The phase/seq tiebreaks in annotation_event_cmp decide between events that agree on every earlier key. seq increases with the original span index, so for such a pair the comparator is only ever handed the lower-seq operand on one side under a given libc's qsort -- the opposite arm is correct but unreachable by any input (verified: glibc and macOS qsort flag different arms). Mark those four branches GCOVR_EXCL_BR_LINE with the rationale; pos and rank ties, which qsort compares both ways, stay covered. Verified 100% on llvm-cov and Ubuntu gcc (Docker).
Summary
Port the inscriptis annotation output processors as pure transforms over the `(text, spans)` pair that `Node.to_annotated_text()` returns (#220 added the extraction step; this adds the rendering step):

- `turbohtml.annotation_surface(text, spans) -> dict[str, list[str]]`: the surface-form extractor. It groups each label's matched substrings, in document order.
- `turbohtml.annotation_tags(text, spans) -> str`: the inline-tagged (XML) exporter. It weaves the spans back into the text as `<label>...</label>` markup, innermost span closing first so nested spans stay well-formed.

Both touch only their str and sequence arguments (never a tree, node, or shared handle), so they need no critical section and are free-threading safe by construction.
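To make the surface-form grouping concrete, here is a minimal pure-Python model of the behaviour described above. It is a sketch, not the C implementation: the name `surface_forms` is hypothetical, and the span shape `(start, end, label)` with half-open offsets is an assumption, since the description does not spell out the layout of a span. Whether the real function sorts its input or trusts the order it is given is also not stated; this sketch sorts by offset to obtain document order.

```python
from collections.abc import Iterable

# Assumed span shape: (start, end, label) with half-open [start, end) offsets.
Span = tuple[int, int, str]


def surface_forms(text: str, spans: Iterable[Span]) -> dict[str, list[str]]:
    """Group the substring covered by each span under the span's label."""
    grouped: dict[str, list[str]] = {}
    # Sorting by (start, end) yields document order regardless of input order.
    for start, end, label in sorted(spans, key=lambda s: (s[0], s[1])):
        grouped.setdefault(label, []).append(text[start:end])
    return grouped


text = "Ada Lovelace met Charles Babbage"
# A tuple of tuples is accepted just like a list, since any iterable works.
spans = ((17, 32, "person"), (0, 12, "person"), (0, 3, "given"))
print(surface_forms(text, spans))
```

The function reads only its two arguments and builds a fresh dict, which mirrors the "pure transform" property the summary relies on for thread safety.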
Implementation
Implemented in `src/turbohtml/annotation.c`, bound through `_htmlmodule.c` / `turbohtml.h` / `_html.pyi`, exported from `turbohtml`. The inline exporter expands each span into sorted open/close events whose comparator carries the other endpoint and the original index, so the boundary order is total and the output deterministic even when spans coincide.
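The event-sort idea can be modelled in a few lines of Python. This is an illustrative sketch under assumptions, not a transcription of `annotation.c`: the function name `inline_tags`, the `(start, end, label)` span shape, the rank values, and the placement of zero-width spans are all hypothetical choices made for the example. It also assumes spans nest or are disjoint (crossing spans cannot be rendered as well-formed markup) and leaves text escaping out.

```python
from collections.abc import Iterable

# Assumed span shape: (start, end, label) with half-open [start, end) offsets.
Span = tuple[int, int, str]


def inline_tags(text: str, spans: Iterable[Span]) -> str:
    """Weave spans into text as <label>...</label>, innermost closing first."""
    events: list[tuple[int, int, int, int, str]] = []
    for seq, (start, end, label) in enumerate(spans):
        if start == end:
            # Zero-width span: open (rank 1) then close (rank 2) at one position;
            # closes use -seq so several zero-width spans nest in reverse order.
            events.append((start, 1, 0, seq, f"<{label}>"))
            events.append((end, 2, 0, -seq, f"</{label}>"))
        else:
            # Opens (rank 3): the span ending later opens first (outermost),
            # with the original index breaking ties between identical ranges.
            events.append((start, 3, -end, seq, f"<{label}>"))
            # Closes (rank 0, before opens at the same position): the span that
            # started later closes first (innermost), ties in reverse index order.
            events.append((end, 0, -start, -seq, f"</{label}>"))
    # The key (position, rank, other endpoint, index) never ties, so the order
    # is total and the output does not depend on the sort algorithm.
    events.sort(key=lambda e: e[:4])

    out: list[str] = []
    cursor = 0
    for pos, _rank, _other, _seq, tag in events:
        out.append(text[cursor:pos])
        out.append(tag)
        cursor = pos
    out.append(text[cursor:])
    return "".join(out)


text = "Ada Lovelace met Charles Babbage"
spans = [(0, 12, "person"), (0, 3, "given"), (17, 32, "person")]
print(inline_tags(text, spans))
```

Because the key includes the original index as the last component, two spans with identical ranges still sort deterministically: the earlier span opens first and closes last.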
Docs
How-to (both exporters over `to_annotated_text`), an explanation section on the extraction/rendering split and the well-formed-nesting sort, the inscriptis migration row + example (replacing the prior "not ported" pitfall), and the reference (auto-documented via `automodule`). Changelog fragment `docs/changelog/251.feature.rst`.
Coverage / gates
`annotation.c` is 100% line+branch under both clang/llvm-cov and gcc-16; only genuine alloc-failure arms carry `GCOVR_EXCL` with written reasons. Gates green: `fix`, `type`, `3.14t`, `docs`; Python coverage 100% (30686 passed). The `3.14` C gate aggregate sits at 99.9% branch from the known macOS llvm-cov phantom-branch noise; no source file has a genuine uncovered branch.
closes #251