feat(node): readability main-content extraction #257
Draft
gaborbernat wants to merge 3 commits into
main_text() returns str, so 'not …' replaces '== ""'; wrap the 126-col SVG fixture across two string literals.
Adds readability-style main-content extraction, the role resiliparse's main-content extractor fills.
- Node.main_content() -> Element | None returns the dominant content element (the article body, with nav/sidebar/ads/comment boilerplate scored out), or None when nothing reads as content.
- Node.main_text() -> str renders that element with to_text(), or returns "" when there is none.
Heuristic
A content-density score over the arena tree, computed entirely in C (treebuilder_readability.h):
- Text blocks (<p>, <td>, <pre>, >=25 chars) contribute a base point, one point per comma (a clause proxy), and up to three for length (one per 100 chars); the contribution rolls up to the parent in full and to the grandparent at half.
- Candidates carry a tag weight (<div> +5, <blockquote>/<td>/<pre> +3, lists/<form> -3, headings/<th> -5) and a class/id weight (+/-25 for content vs boilerplate hints).
- Unlikely elements (<script>, <nav>, <aside>, foreign namespaces, and comment/modal/sidebar-class elements not rescued by a main/content hint) are pruned before counting.
The scoring walk is pure C and free-threading safe: it touches no Python object until a winner is chosen, then the binding wraps (or renders) that one node under the per-tree critical section. A concurrent stress test covers it.
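For reviewers who want to reason about the scoring rules without reading the C header, here is a rough pure-Python model of the heuristic as the description states it. It is a sketch, not the code in treebuilder_readability.h: ToyNode, pick_main, text_points and the integer hint field are hypothetical names. Expanding "lists" to ul/ol and "headings" to h1-h6, seeding a candidate's score on its first contribution, and leaving out foreign-namespace and class-based pruning are all assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, field

TEXT_TAGS = {"p", "td", "pre"}          # elements whose text is scored
MIN_CHARS = 25                          # shorter text blocks contribute nothing
PRUNED_TAGS = {"script", "nav", "aside"}  # subset of the pruning rules
CLASS_WEIGHT = 25                       # +/-25 for content vs boilerplate hints
TAG_WEIGHT = {
    "div": 5,
    "blockquote": 3, "td": 3, "pre": 3,
    "ul": -3, "ol": -3, "form": -3,     # "lists" expanded to ul/ol (assumption)
    "th": -5,
    **{f"h{i}": -5 for i in range(1, 7)},  # "headings" as h1-h6 (assumption)
}


@dataclass(eq=False)  # identity-based equality keeps nodes usable as dict keys
class ToyNode:
    tag: str
    text: str = ""
    hint: int = 0  # +1 content-like class/id, -1 boilerplate-like, 0 neutral
    children: list[ToyNode] = field(default_factory=list)
    parent: ToyNode | None = field(default=None, repr=False)

    def add(self, child: ToyNode) -> ToyNode:
        child.parent = self
        self.children.append(child)
        return child


def text_points(text: str) -> int:
    """Base point, one per comma, and up to three for length (one per 100 chars)."""
    stripped = text.strip()
    if len(stripped) < MIN_CHARS:
        return 0
    return 1 + stripped.count(",") + min(len(stripped) // 100, 3)


def pick_main(root: ToyNode) -> ToyNode | None:
    """Return the highest-scoring candidate, or None if nothing was scored."""
    scores: dict[ToyNode, float] = {}

    def credit(node: ToyNode | None, points: float) -> None:
        if node is None:
            return
        if node not in scores:
            # Seed a new candidate with its tag and class/id weights.
            scores[node] = TAG_WEIGHT.get(node.tag, 0) + CLASS_WEIGHT * node.hint
        scores[node] += points

    stack = [root]
    while stack:
        node = stack.pop()
        if node.tag in PRUNED_TAGS:
            continue  # drop the whole subtree before counting
        stack.extend(node.children)
        if node.tag in TEXT_TAGS:
            points = text_points(node.text)
            if points:
                credit(node.parent, points)  # parent gets the full contribution
                grandparent = node.parent.parent if node.parent else None
                credit(grandparent, points / 2)  # grandparent gets half

    if not scores:
        return None
    return max(scores, key=scores.get)
```

In this model, a div with a content-like hint that holds one long, comma-rich paragraph outranks its body ancestor, and a paragraph inside a nav is never counted because the nav subtree is skipped before scoring.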
Scope
Language detection and WARC/archive handling that resiliparse bundles are out of scope; the migration guide points to a dedicated tool for those.
Docs & tests
tests/test_tree_main_content.py drives every scoring branch (100% C line+branch coverage under both llvm-cov and gcc-16); the concurrent stress test is in test_tree_freethread.py.
Closes #249