feat(tree): parsel-style string extraction by gaborbernat · Pull Request #256 · tox-dev/turbohtml

gaborbernat · 2026-06-21T02:47:40Z

Adds parsel-style string-extraction primitives so scraping code can pull strings out of a parsed tree without bolting non-standard CSS pseudo-elements onto the selector engine.

API

Element.attr(name, /, default=None) -> str | None -- the raw attribute value as one string (class reads back as "a b c", a valueless attribute as "", an absent one as default).
Node.re(pattern, /, *, attr=None) -> list[str] -- run a str or compiled re.Pattern over the node's text (or an attribute value with attr=); yields the lone capturing group when the pattern has one, else the whole match.
Node.re_first(pattern, /, default=None, *, attr=None) -> str | None -- the first match with the same group rule, or default.
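The semantics listed above can be modelled in plain Python. This is a minimal sketch, not turbohtml's implementation: `attr_value`, `extract_all`, and `extract_first` are hypothetical helper names, attributes are represented as a plain dict, and the treatment of patterns with more than one capturing group (the whole match is returned) is an assumption read off the "lone capturing group" wording.

```python
import re
from typing import Dict, List, Optional, Union

PatternLike = Union[str, re.Pattern]


def attr_value(attrs: Dict[str, str], name: str, default: Optional[str] = None) -> Optional[str]:
    """Model of the described attr() rule over a plain dict (hypothetical helper).

    A valueless attribute is stored as "" and an absent one is simply missing.
    """
    return attrs.get(name, default)


def extract_all(pattern: PatternLike, source: str) -> List[str]:
    """Return every match, using the lone capturing group when there is exactly one."""
    compiled = re.compile(pattern) if isinstance(pattern, str) else pattern
    # Assumption: zero groups or several groups both fall back to the whole match.
    index = 1 if compiled.groups == 1 else 0
    # Note: an optional group that did not participate would yield None here;
    # this sketch does not try to guess how the real API treats that case.
    return [match.group(index) for match in compiled.finditer(source)]


def extract_first(pattern: PatternLike, source: str, default: Optional[str] = None) -> Optional[str]:
    """Return the first match under the same group rule, or default when nothing matches."""
    matches = extract_all(pattern, source)
    return matches[0] if matches else default


attrs = {"class": "a b c", "hidden": ""}
print(attr_value(attrs, "class"))               # raw value as one string
print(attr_value(attrs, "hidden"))              # valueless attribute
print(attr_value(attrs, "id", "n/a"))           # absent attribute falls back to default
print(extract_all(r"\$(\d+)", "cost $12 or $7"))
print(extract_first(r"x(\d)", "nothing here", default="?"))
```

In the PR's API the source string comes from the node's text or, with `attr=`, from an attribute value; here it is passed in directly so the group rule can be seen in isolation.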

The regex runs in Python's re; only the source string is produced in C under the per-tree critical section.

Coverage

100% line and branch coverage on tree_type.c under clang llvm-cov. The only excluded branches are the unforceable allocation-failure guards in node_re/node_re_first; the testable absent-attribute path is split out and covered.

Note: the gcc-16 cross-check could not be run in this environment (the permission layer denied env, rm, meson, and direct gcovr, so the build dir compiler could not be switched). Every new conditional is two-sided and exercised by tests.

Docs

how-to (Pull strings out of a page), reference (auto), explanation (Extracting strings), parsel migration table + pitfalls, and the changelog fragment.

closes #246

Complete Element.attr() and Node.re()/re_first(): cover the valueless-attribute branch in regex_source, split the absent path from the unforceable allocation-failure guard so only the latter is excluded, and reach 100% line and branch C coverage. Fix the _html.pyi stub so re.Pattern resolves inside class Node (the re method shadowed the module; import Pattern directly), switch the tests to ty: ignore directives, and add the how-to, reference, explanation, and parsel migration docs plus the changelog fragment. closes tox-dev#246
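The shadowing behind the stub fix can be reproduced at runtime with ordinary class-body name lookup. This is a sketch under assumptions: the `Node` class below is a hypothetical stand-in, not the contents of `_html.pyi`, and it only shows the scoping rule that makes `re.Pattern` unresolvable once a method named `re` exists in the class body.

```python
import re
from re import Pattern  # imported directly, so it does not depend on the name `re`


class Node:
    """Hypothetical stand-in for a class that defines a method called `re`."""

    def re(self, pattern):
        return []

    # From this point on, inside the class body, the bare name `re`
    # refers to the method defined above, not to the module.
    name_seen_in_class_body = re

    # A directly imported name is unaffected by the shadowing.
    pattern_type = Pattern


# The class-body name is the method, which has no `Pattern` attribute.
print(Node.name_seen_in_class_body is Node.re)
print(hasattr(Node.name_seen_in_class_body, "Pattern"))

# At module level `re` is still the module, and Pattern is the same object.
print(Node.pattern_type is re.Pattern)
```

Importing `Pattern` by name, as the commit message describes, sidesteps the lookup of `re` inside the class entirely.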

gaborbernat added 2 commits June 20, 2026 19:27

wip(feat/parsel-extract): partial implementation (agent handoff)

924b4e6

gaborbernat wants to merge 2 commits into tox-dev:main from gaborbernat:feat/parsel-extract
