✨ feat(tree): add Node.prune to keep a CSS selector by gaborbernat · Pull Request #258 · tox-dev/turbohtml

gaborbernat · 2026-06-21T02:50:45Z

What

Adds Node.prune(selector), the post-parse equivalent of BeautifulSoup's SoupStrainer. The whole document is parsed the WHATWG way first; prune then keeps only the descendants matching a CSS selector, together with their ancestors up to the node it is called on and the whole subtree under each match, and removes everything else in place. It returns the node, so it chains off parse:

doc = turbohtml.parse(big_html).prune("article")

A selector that matches nothing empties the subtree.
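
To make the keep rule concrete, here is a minimal pure-Python model of the semantics described above. It is only a sketch: ToyNode, prune (as a free function), and outline are hypothetical names invented for illustration, matching is reduced to a tag-name comparison instead of a real CSS selector, and none of it reflects how turbohtml implements the method.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Hypothetical stand-in for a parsed element; not turbohtml's Node."""

    tag: str
    children: list[ToyNode] = field(default_factory=list)


def prune(node: ToyNode, tag: str) -> ToyNode:
    """Keep descendants named `tag`, their ancestors, and their full subtrees."""

    def keep(child: ToyNode) -> bool:
        if child.tag == tag:
            return True  # a match keeps its whole subtree untouched
        # A non-matching child survives only as an ancestor of some match.
        child.children = [c for c in child.children if keep(c)]
        return bool(child.children)

    node.children = [c for c in node.children if keep(c)]
    return node  # returning the node is what lets the call chain


def outline(node: ToyNode) -> str:
    """Serialize the toy tree as nested tags, for inspection only."""
    inner = "".join(outline(c) for c in node.children)
    return f"<{node.tag}>{inner}</{node.tag}>"


doc = ToyNode("html", [
    ToyNode("head", [ToyNode("title")]),
    ToyNode("body", [
        ToyNode("nav"),
        ToyNode("main", [ToyNode("article", [ToyNode("p")]), ToyNode("aside")]),
    ]),
])
print(outline(prune(doc, "article")))
```

In this model head, nav, and aside are dropped, body and main survive as ancestors, and the p under article survives as part of the matched subtree. A tag that matches nothing leaves the node it was called on with no children.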

Design

All logic is in C, in node_prune next to node_css_closest in tree_type.c, reusing the existing selector engine (selector.h) and th_node_remove.
Free-threading safe under one per-tree critical section. Pass 1 matches the selector and snapshots every match plus its ancestor chain; a regex/string filter can call back into Python, so no edit runs in this pass. Pass 2 is pure C and removes the un-kept nodes. As a result, no structural pointer is dereferenced across a Python call or after a removal has rewired it.
Reuses the Bloom selector, interned atoms, and arena; the only allocation is one growable keep buffer.
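
The two-pass shape can be sketched in Python as follows. This is an illustration under stated assumptions, not the C code in tree_type.c: ToyNode, walk, tree_lock, and prune_two_pass are hypothetical names, an RLock stands in for the per-tree critical section, and two Python sets stand in for the single keep buffer.

```python
from __future__ import annotations

import threading
from dataclasses import dataclass, field
from typing import Callable, Iterator


@dataclass(eq=False)  # identity-based hashing, so nodes can live in sets
class ToyNode:
    """Hypothetical stand-in for a tree node; not turbohtml's Node."""

    tag: str
    children: list[ToyNode] = field(default_factory=list)
    parent: ToyNode | None = field(default=None, repr=False)

    def append(self, child: ToyNode) -> ToyNode:
        child.parent = self
        self.children.append(child)
        return child


def walk(node: ToyNode) -> Iterator[ToyNode]:
    """Yield every descendant of `node` in document order."""
    for child in node.children:
        yield child
        yield from walk(child)


tree_lock = threading.RLock()  # assumed stand-in for the per-tree critical section


def prune_two_pass(root: ToyNode, matches: Callable[[ToyNode], bool]) -> ToyNode:
    with tree_lock:
        # Pass 1: match only. `matches` may run arbitrary Python, so the tree
        # is not edited here; the results are snapshotted for pass 2.
        matched: set[ToyNode] = set()
        ancestors: set[ToyNode] = set()
        for node in list(walk(root)):
            if matches(node):
                matched.add(node)
                anc = node.parent
                while anc is not None and anc is not root:
                    ancestors.add(anc)
                    anc = anc.parent

        # Pass 2: no callbacks from here on; only structural edits.
        stack = [root]
        while stack:
            node = stack.pop()
            survivors = []
            for child in node.children:
                if child in matched:
                    survivors.append(child)  # whole subtree stays as-is
                elif child in ancestors:
                    survivors.append(child)
                    stack.append(child)  # trim below this ancestor too
                else:
                    child.parent = None  # detached from the tree
            node.children = survivors
    return root


root = ToyNode("body")
root.append(ToyNode("nav"))
main = root.append(ToyNode("main"))
article = main.append(ToyNode("article"))
article.append(ToyNode("p"))
main.append(ToyNode("aside"))

prune_two_pass(root, lambda n: n.tag == "article")
print([n.tag for n in walk(root)])
```

Separating the passes means the callback never observes a half-edited tree, and the removal loop never follows a pointer that an earlier removal has already changed.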

Surface

New Node.prune C method, registered on the shared node_methods table.
Typed stub in _html.pyi: def prune(self, selector: str, /) -> Node: ...
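
The trailing `/` in the stub makes selector positional-only. A small sketch of what that calling convention implies, using a hypothetical ToyNode class that mirrors only the signature and does no pruning:

```python
class ToyNode:
    """Hypothetical class that copies the stub's signature, nothing more."""

    def prune(self, selector: str, /) -> "ToyNode":
        # A real implementation would trim the tree here; this sketch only
        # models the calling convention and the chaining return value.
        return self


node = ToyNode()
same = node.prune("article")  # positional call is accepted
print(same is node)

try:
    node.prune(selector="article")  # keyword call is rejected
except TypeError as exc:
    print("rejected keyword call:", type(exc).__name__)
```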

Docs / tests

How-to section, an explanation paragraph, the reference (autodoc), and a BeautifulSoup SoupStrainer -> prune migration row plus the updated omissions note.
tests/test_tree_prune.py (12 cases) and a concurrency case in tests/test_tree_freethread.py.
Changelog fragment docs/changelog/252.feature.rst.

closes #252

Add a Node.prune(selector) C method that, after the normal WHATWG parse, removes every descendant that neither matches the CSS selector nor is an ancestor or descendant of a match, trimming a large document to a small tree. This is the post-parse equivalent of BeautifulSoup's SoupStrainer. Matching runs first, into a snapshot of each match plus its ancestor chain; a pure-C pass then removes the rest, so no structural pointer is dereferenced across a Python call (a regex/string filter) or after a removal has rewired it. All work runs under one per-tree critical section, reusing the existing selector engine, atoms, and arena.

closes tox-dev#252

for more information, see https://pre-commit.ci

gaborbernat force-pushed the feat/parse-prune branch from 046f433 to f0df182 on June 21, 2026 08:22

gaborbernat and others added 3 commits June 21, 2026 01:28

[pre-commit.ci] auto fixes from pre-commit.com hooks

9e0cfcb

[pre-commit.ci] auto fixes from pre-commit.com hooks

4cfb54d

gaborbernat force-pushed the feat/parse-prune branch from 3be2dab to 4cfb54d on June 21, 2026 08:31

gaborbernat marked this pull request as ready for review June 21, 2026 08:32

gaborbernat merged commit ffe8a79 into tox-dev:main Jun 21, 2026
33 of 35 checks passed

gaborbernat mentioned this pull request Jun 21, 2026

SoupStrainer equivalent: keep only matching subtrees (parse keep=/Node.prune) #250

Closed

gaborbernat merged 3 commits into tox-dev:main from gaborbernat:feat/parse-prune
