
✨ feat(tree): add Node.prune to keep a CSS selector #258

Merged
gaborbernat merged 3 commits into tox-dev:main from gaborbernat:feat/parse-prune
Jun 21, 2026

@gaborbernat

Member

What

Adds Node.prune(selector), the post-parse equivalent of BeautifulSoup's SoupStrainer. The whole document is first parsed the WHATWG way; prune then keeps only the descendants that match a CSS selector, the whole subtree under each match, and the ancestors of each match up to the node prune is called on. Everything else is removed in place. It returns the node, so it chains off parse:

doc = turbohtml.parse(big_html).prune("article")

A selector that matches nothing empties the subtree.
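
To make the keep rule concrete, here is a small illustration on a toy model. It does not use turbohtml: a tree is a plain `(tag, children)` tuple, matching is by tag name only as a stand-in for a real CSS selector, and `keep_matching` is a hypothetical helper that returns a copy instead of editing in place.

```python
# Toy model, not turbohtml: a tree is (tag, [children]) and a "selector"
# is just a tag name. Both are assumptions made for illustration.

def keep_matching(tree, tag):
    """Return a copy of `tree` holding only what the keep rule retains."""
    name, children = tree
    kept = []
    for child in children:
        child_name, _ = child
        if child_name == tag:
            kept.append(child)  # a match keeps its whole subtree
        else:
            sub = keep_matching(child, tag)
            if sub[1]:  # a non-match survives only as an ancestor of a match
                kept.append(sub)
    return (name, kept)  # the node the call starts from always stays


doc = ("html", [
    ("head", [("title", [])]),
    ("body", [
        ("nav", [("a", [])]),
        ("main", [("article", [("p", [])]), ("aside", [])]),
    ]),
])

only_article = keep_matching(doc, "article")
nothing = keep_matching(doc, "video")
```

In this model `only_article` retains the `html > body > main > article > p` chain and drops `head`, `nav`, and `aside`, while `nothing` is the bare root with no children, mirroring the "matches nothing empties the subtree" rule.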

Design

  • All logic in C, in node_prune next to node_css_closest in tree_type.c, reusing the existing selector engine (selector.h) and th_node_remove.
  • Free-threading safe under one per-tree critical section. Pass 1 matches the selector and snapshots every match plus its ancestor chain; a regex/string filter can call back into Python, so no edit runs during this pass. Pass 2 is pure C and removes the nodes that were not kept. As a result, no structural pointer is dereferenced across a Python call or after a removal has rewired it.
  • Reuses the Bloom selector, interned atoms, and arena; the only allocation is one growable keep buffer.
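
The two-pass shape described above can be sketched in Python. This is a rough model under stated assumptions, not the C implementation: `Node`, `descendants`, `outline`, and `prune` are toy stand-ins, `matches` is an arbitrary Python predicate in place of the selector engine, and an `RLock` plays the role of the per-tree critical section.

```python
import threading


class Node:
    """Toy tree node (hypothetical; not the turbohtml Node)."""

    def __init__(self, tag, *children):
        self.tag = tag
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def descendants(node):
    """Yield every descendant of `node` in document order."""
    for child in node.children:
        yield child
        yield from descendants(child)


def outline(node):
    """Nested (tag, children) view, handy for inspecting the result."""
    return (node.tag, [outline(child) for child in node.children])


_tree_lock = threading.RLock()  # stand-in for the per-tree critical section


def prune(root, matches):
    with _tree_lock:
        # Pass 1: run the matcher and snapshot matches plus ancestor chains.
        # `matches` may be arbitrary Python, so the tree is not edited here.
        hits = {node for node in descendants(root) if matches(node)}
        keep = set(hits)
        for hit in hits:
            node = hit.parent
            while node is not root:
                keep.add(node)
                node = node.parent

        # Pass 2: no callbacks from here on; unlink whatever was not kept.
        stack = [root]
        while stack:
            node = stack.pop()
            survivors = []
            for child in node.children:
                if child in keep:
                    survivors.append(child)
                    if child not in hits:
                        stack.append(child)  # ancestor of a match: filter below it
                else:
                    child.parent = None
            node.children = survivors
    return root


doc = Node(
    "html",
    Node("head", Node("title")),
    Node("body", Node("nav", Node("a")), Node("main", Node("article", Node("p")), Node("aside"))),
)
result = prune(doc, lambda node: node.tag == "article")
```

Separating the snapshot from the removal means the predicate never observes a half-edited tree, and the removal loop never follows a pointer that an earlier removal changed.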

Surface

  • New Node.prune C method, registered on the shared node_methods table.
  • Typed stub in _html.pyi: def prune(self, selector: str, /) -> Node: ...
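
The `/` in the stub marks `selector` as positional-only. A minimal sketch of what that means for callers, using a placeholder pure-Python class with the same signature (the class body here is an assumption for illustration, not the C method):

```python
class Node:
    """Placeholder with the stub's signature; the body is illustrative only."""

    def prune(self, selector: str, /) -> "Node":
        return self  # returning self is what makes chaining possible


node = Node()
chained = node.prune("article").prune("p")  # positional calls are accepted

try:
    node.prune(selector="article")  # keyword form is rejected by the `/`
except TypeError as exc:
    keyword_error = exc
else:
    keyword_error = None
```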

Docs / tests

  • How-to section, an explanation paragraph, the reference (autodoc), and a BeautifulSoup SoupStrainer -> prune migration row plus the updated omissions note.
  • tests/test_tree_prune.py (12 cases) and a concurrency case in tests/test_tree_freethread.py.
  • Changelog fragment docs/changelog/252.feature.rst.

closes #252

gaborbernat and others added 3 commits June 21, 2026 01:28
Add a Node.prune(selector) C method that, after the normal WHATWG parse,
removes every descendant not matching the CSS selector and not an
ancestor or descendant of a match, trimming a large document to a small
tree. This is the post-parse equivalent of BeautifulSoup's SoupStrainer.

The match runs first into a snapshot of each match plus its ancestor
chain, then a pure-C pass removes the rest, so no structural pointer is
dereferenced across a Python call (a regex/string filter) or after a
removal rewired it. All work runs under one per-tree critical section,
reusing the existing selector engine, atoms, and arena.

closes tox-dev#252
@gaborbernat gaborbernat marked this pull request as ready for review June 21, 2026 08:32
@gaborbernat gaborbernat merged commit ffe8a79 into tox-dev:main Jun 21, 2026
33 of 35 checks passed

Development

Successfully merging this pull request may close these issues.

Node.prune(selector): keep only subtrees matching a CSS selector

1 participant