feat(tree): parsel-style string extraction #256

Draft
gaborbernat wants to merge 2 commits into tox-dev:main from gaborbernat:feat/parsel-extract
@gaborbernat

Member

Adds parsel-style string-extraction primitives so scraping code can pull strings out of a parsed tree without bolting non-standard CSS pseudo-elements onto the selector engine.

API

  • Element.attr(name, /, default=None) -> str | None -- the raw attribute value as one string (class reads back as "a b c", a valueless attribute as "", an absent one as default).
  • Node.re(pattern, /, *, attr=None) -> list[str] -- run a str or compiled re.Pattern over the node's text (or an attribute value with attr=); yields the lone capturing group when the pattern has one, else the whole match.
  • Node.re_first(pattern, /, default=None, *, attr=None) -> str | None -- the first match with the same group rule, or default.
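The group rule shared by Node.re and Node.re_first can be sketched in plain Python. This is only an illustration of the rule as described above, not the code in this PR: extract_all and extract_first are hypothetical helper names that take the source string directly instead of a node, and mapping a non-participating group to "" (as re.findall does) is an assumption the description does not confirm.

```python
import re
from typing import List, Optional, Union

PatternLike = Union[str, re.Pattern]


def extract_all(pattern: PatternLike, source: str) -> List[str]:
    """Hypothetical stand-in for Node.re, applied to an already-extracted string."""
    compiled = re.compile(pattern) if isinstance(pattern, str) else pattern
    # Exactly one capturing group: return that group; otherwise the whole match.
    use_group = compiled.groups == 1
    results = []
    for match in compiled.finditer(source):
        value = match.group(1) if use_group else match.group(0)
        # Assumption: a group that did not participate becomes "".
        results.append(value if value is not None else "")
    return results


def extract_first(
    pattern: PatternLike, source: str, default: Optional[str] = None
) -> Optional[str]:
    """Hypothetical stand-in for Node.re_first: first result or the default."""
    matches = extract_all(pattern, source)
    return matches[0] if matches else default
```

Under this sketch, a pattern with two or more groups falls back to the whole match, which differs from re.findall (it would return tuples).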

The regex runs in Python's re; only the source string is produced in C under the per-tree critical section.
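The split between "produce the source string under the lock" and "match outside it" can be modelled roughly as follows. This is a pure-Python analogy, not the C implementation: TinyTree, its chunk list, and the threading.Lock standing in for the per-tree critical section are all assumptions made for illustration.

```python
import re
import threading
from typing import List


class TinyTree:
    """Hypothetical model of a tree whose text is guarded by a per-tree lock."""

    def __init__(self, chunks: List[str]) -> None:
        self._lock = threading.Lock()  # stands in for the per-tree critical section
        self._chunks = list(chunks)

    def snapshot_text(self) -> str:
        # Only building the source string happens while the lock is held.
        with self._lock:
            return "".join(self._chunks)

    def re(self, pattern: str) -> List[str]:
        source = self.snapshot_text()
        # The lock is already released here; the regex runs on the snapshot,
        # so a slow pattern does not block other users of the tree.
        return [match.group(0) for match in re.finditer(pattern, source)]
```

The design keeps the locked region short and independent of the pattern's cost.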

Coverage

100% line and branch coverage on tree_type.c under clang llvm-cov. The only excluded branches are the allocation-failure guards in node_re/node_re_first, which tests cannot force; the testable absent-attribute path is split out and covered.

Note: the gcc-16 cross-check could not be run in this environment (the permission layer denied env, rm, meson, and direct gcovr, so the build dir compiler could not be switched). Every new conditional is two-sided and exercised by tests.

Docs

how-to (Pull strings out of a page), reference (auto), explanation (Extracting strings), parsel migration table + pitfalls, and the changelog fragment.

closes #246

Complete Element.attr() and Node.re()/re_first(): cover the valueless-attribute
branch in regex_source, split the absent path from the unforceable
allocation-failure guard so only the latter is excluded, and reach 100% line and
branch C coverage.

Fix the _html.pyi stub so re.Pattern resolves inside class Node (the re method
shadowed the module; import Pattern directly), switch the tests to ty: ignore
directives, and add the how-to, reference, explanation, and parsel migration
docs plus the changelog fragment.

closes tox-dev#246
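The stub fix in the commit message above comes down to class-scope name lookup: once a class body defines a method called re, the name re inside that body refers to the method, not the module. A minimal runtime sketch of the same effect, using illustrative class names (Shadowed, Fixed) that are not part of the PR:

```python
import re
from re import Pattern


class Shadowed:
    def re(self, pattern):  # method named like the module
        return []

    # Inside the class body, `re` now names the method above, so
    # `re.Pattern` could no longer be resolved from here.
    resolved = re


class Fixed:
    def re(self, pattern: "str | Pattern[str]") -> "list[str]":
        return []

    # `Pattern` was imported directly, so the method name does not affect it.
    pattern_type = Pattern
```

A type checker reading the .pyi stub applies the same scoping, which is why importing Pattern directly sidesteps the problem.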

Development

Successfully merging this pull request may close these issues.

Parsel-style string extraction and regex over a selection
