
feat(node): readability main-content extraction#257

Draft
gaborbernat wants to merge 3 commits into
tox-dev:mainfrom
gaborbernat:feat/main-content

Conversation

@gaborbernat

Member

Adds readability-style main-content extraction, covering the role that resiliparse's main-content extractor fills.

  • Node.main_content() -> Element | None returns the dominant content element (article body, with nav/sidebar/ads/comment boilerplate scored out), or None when nothing reads as content.
  • Node.main_text() -> str renders that element with to_text(), or "".
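
A minimal sketch of the return contract described above, for callers who want a fallback. `FakeNode` and `article_text` are hypothetical stand-ins written for this illustration and are not part of the library; only the `main_content()` / `main_text()` names and their `None` / `""` results come from this PR.

```python
from __future__ import annotations


class FakeNode:
    """Hypothetical stand-in that mimics only the two methods added here."""

    def __init__(self, content: str | None) -> None:
        self._content = content

    def main_content(self) -> str | None:
        # The real method returns an Element; a string stands in for it here.
        return self._content

    def main_text(self) -> str:
        # Mirrors the documented contract: rendered text, or "" when nothing
        # reads as content.
        return self._content or ""


def article_text(node: FakeNode, fallback: str) -> str:
    """Return the main text, or `fallback` when extraction found nothing."""
    text = node.main_text()
    # main_text() always returns a str, so a truthiness check is enough.
    return text if text else fallback
```

Because `main_text()` never returns `None`, a plain `if not text` covers the empty case without comparing against `""`.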

Heuristic

A content-density score over the arena tree, entirely in C (treebuilder_readability.h):

  • Paragraph-like elements (<p>, <td>, <pre>) holding at least 25 characters contribute one base point, one point per comma (a clause proxy), and up to three points for length (one per 100 characters); the contribution rolls up to the parent in full and to the grandparent at half.
  • Containers get a tag weight (<div> +5, <blockquote>/<td>/<pre> +3, lists/<form> -3, headings/<th> -5) and a class/id weight (+/-25 for content vs boilerplate hints).
  • Boilerplate subtrees (<script>, <nav>, <aside>, foreign namespaces, and comment/modal/sidebar-class elements not rescued by a main/content hint) are pruned before counting.
  • Each survivor is discounted by its link density; the highest remaining score wins.
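
The bullets above can be sketched in pure Python to make the arithmetic concrete. This is an illustration only, not the C code in treebuilder_readability.h. Several details are assumptions: the `El` class, the `hint` field (+1 for a content-like class/id, -1 for a boilerplate-like one), the exact tag sets for "lists" and "headings", the `score * (1 - link_density)` form of the discount, and pruning by tag only (class-based pruning, the rescue rule, and foreign namespaces are left out).

```python
from __future__ import annotations

from dataclasses import dataclass, field

PARAGRAPH_TAGS = {"p", "td", "pre"}
PRUNED_TAGS = {"script", "nav", "aside"}  # assumption: tag-based pruning only
TAG_WEIGHT = {
    "div": 5, "blockquote": 3, "td": 3, "pre": 3,
    "ul": -3, "ol": -3, "form": -3,
    "h1": -5, "h2": -5, "h3": -5, "h4": -5, "h5": -5, "h6": -5, "th": -5,
}


@dataclass(eq=False)  # identity hashing, so elements can be used as dict keys
class El:
    tag: str
    text: str = ""
    children: list[El] = field(default_factory=list)
    hint: int = 0  # hypothetical: +1 content hint, -1 boilerplate hint


def paragraph_points(text: str) -> int:
    """One base point, one per comma, up to three for length."""
    if len(text) < 25:
        return 0
    return 1 + text.count(",") + min(len(text) // 100, 3)


def text_of(el: El, links_only: bool = False, in_link: bool = False) -> str:
    """Concatenate text below `el`, skipping pruned subtrees."""
    if el.tag in PRUNED_TAGS:
        return ""
    in_link = in_link or el.tag == "a"
    own = el.text if (in_link or not links_only) else ""
    return own + "".join(text_of(c, links_only, in_link) for c in el.children)


def link_density(el: El) -> float:
    total = len(text_of(el))
    return len(text_of(el, links_only=True)) / total if total else 0.0


def main_content(root: El) -> El | None:
    scores: dict[El, float] = {}

    def credit(el: El, points: float) -> None:
        # A container starts from its tag weight plus the class/id weight.
        base = TAG_WEIGHT.get(el.tag, 0) + 25 * el.hint
        scores[el] = scores.get(el, base) + points

    def walk(el: El, parent: El | None, grandparent: El | None) -> None:
        if el.tag in PRUNED_TAGS:
            return
        if el.tag in PARAGRAPH_TAGS:
            points = paragraph_points(text_of(el))
            if points:
                if parent is not None:
                    credit(parent, points)  # full share
                if grandparent is not None:
                    credit(grandparent, points / 2)  # half share
        for child in el.children:
            walk(child, el, parent)

    walk(root, None, None)
    if not scores:
        return None
    # Assumed discount: scale each score by the share of non-link text.
    return max(scores, key=lambda el: scores[el] * (1.0 - link_density(el)))
```

For a `<div>` with a content hint holding two qualifying paragraphs, the container collects its tag and hint weights plus both paragraphs' points in full, while the enclosing `<body>` only receives half of each, so the `<div>` wins.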

The scoring walk is pure C and free-threading safe: it touches no Python object until a winner is chosen, then the binding wraps (or renders) that one node under the per-tree critical section. A concurrent stress test covers it.
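
The two-phase shape described here (read-only scoring first, then wrapping a single winner under a per-tree lock) can be illustrated in Python. Everything below is a hypothetical stand-in: `Tree`, `Wrapper`, the tuple of precomputed scores, and the `> 0` cutoff are assumptions for this sketch, and `threading.Lock` merely plays the role of the per-tree critical section.

```python
from __future__ import annotations

import threading
from concurrent.futures import ThreadPoolExecutor


class Wrapper:
    """Hypothetical Python-side handle for one node of the tree."""

    def __init__(self, index: int) -> None:
        self.index = index


class Tree:
    def __init__(self, scores: tuple[float, ...]) -> None:
        self._scores = scores  # immutable stand-in for the arena tree
        self._lock = threading.Lock()  # stand-in for the per-tree critical section
        self._wrappers: dict[int, Wrapper] = {}

    def _pick_winner(self) -> int | None:
        # Phase 1: read-only walk, no shared mutable state is touched.
        if not self._scores:
            return None
        best = max(range(len(self._scores)), key=self._scores.__getitem__)
        return best if self._scores[best] > 0 else None  # assumed cutoff

    def main_content(self) -> Wrapper | None:
        index = self._pick_winner()
        if index is None:
            return None
        # Phase 2: only the winner is wrapped, and only under the lock.
        with self._lock:
            wrapper = self._wrappers.get(index)
            if wrapper is None:
                wrapper = self._wrappers[index] = Wrapper(index)
            return wrapper


def stress(tree: Tree, workers: int = 8, calls: int = 200) -> list[Wrapper | None]:
    """Call main_content() from many threads and collect every result."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda _: tree.main_content(), range(workers * calls)))
```

A stress run over such a tree should return the same wrapper object from every thread, which is the kind of property a concurrent test can assert.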

Scope

Language detection and WARC/archive handling that resiliparse bundles are out of scope; the migration guide points to a dedicated tool for those.

Docs & tests

  • Tutorial, how-to, reference (autodoc) and explanation coverage, plus the resiliparse migration row.
  • New tests/test_tree_main_content.py drives every scoring branch (100% C line+branch under both llvm-cov and gcc-16); concurrent stress in test_tree_freethread.py.

closes #249

gaborbernat and others added 3 commits June 20, 2026 19:47
main_text() returns str, so 'not …' replaces '== ""'; wrap the 126-col SVG fixture across two string literals.


Development

Successfully merging this pull request may close these issues.

Main-content / boilerplate extraction (readability)
