feat: remove nodes from html by default and allow data-llm to choose whether to remove nodes forcibly or not #4

Merged
TheOutdoorProgrammer merged 1 commit into main from selective-removal on Dec 27, 2025
@TheOutdoorProgrammer
Contributor

Summary

Replaces the per-element boolean flags for HTML stripping with a list of element names, and adds per-element overrides through the data-llm attribute.

Motivation

The previous implementation required hardcoded boolean parameters for each element type we wanted to strip (nav, aside, script), which didn't scale well and made the API inflexible. Users had no way to selectively keep specific instances of elements that would otherwise be stripped, and adding new element types required API changes.

Changes

  • Breaking API change: Convert() now accepts elementsToStrip []*C.char in place of the separate stripNav, stripAside, and stripScript boolean parameters
  • Refactored StripConfig struct to use ElementsToStrip []string instead of individual boolean fields
  • Implemented data-llm="keep" attribute to preserve elements that would normally be stripped
  • Implemented data-llm="drop" attribute to remove elements that would normally be kept
  • Changed the default behavior to strip only header and footer elements; nav, aside, script, and style are no longer stripped unless they are listed explicitly
  • Added FFI test runner script (scripts/run-ffi-tests.sh)
  • Updated all existing tests to use new API

Features

Flexible Element Stripping

Pass any HTML element names you want stripped as an array instead of predefined flags.

Attribute-Based Overrides

  • data-llm="keep": Preserves an element even if its tag is in the strip list
  • data-llm="drop": Removes an element even if its tag is NOT in the strip list
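
The precedence between the strip list and the two overrides can be summarised as a small decision function. This is a minimal sketch, not the code from this PR: shouldStrip and its parameters are hypothetical names, and it assumes that tag names are compared case-insensitively and that any data-llm value other than "keep" or "drop" is ignored.

```go
package main

import "strings"

// shouldStrip is a hypothetical helper (not taken from the PR). It reports
// whether an element should be removed, given its tag name, the value of its
// data-llm attribute ("" when absent), and the configured strip list.
func shouldStrip(tag, dataLLM string, elementsToStrip []string) bool {
	switch dataLLM {
	case "keep":
		return false // explicit keep wins over the strip list
	case "drop":
		return true // explicit drop wins even if the tag is not listed
	}
	// No recognised override: fall back to the strip list.
	// ASSUMPTION: tag names match case-insensitively.
	for _, name := range elementsToStrip {
		if strings.EqualFold(name, tag) {
			return true
		}
	}
	return false
}
```

Under these assumptions, shouldStrip("nav", "keep", []string{"nav"}) reports false and shouldStrip("div", "drop", []string{"nav"}) reports true, matching the two bullets above.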

Simplified Defaults

Only header and footer are stripped by default. All other elements, including nav, aside, script, and style, must be listed explicitly in the strip list to be removed.
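
One way the defaults might interact with a caller-supplied list is sketched below. This is an assumption rather than something the PR confirms: it supposes that an empty ElementsToStrip falls back to header and footer, and that a non-empty list is used exactly as given. The names defaultElementsToStrip and effectiveStripList are hypothetical, and StripConfig is redeclared locally only so the snippet stands alone.

```go
package main

// StripConfig mirrors the struct described in this PR. It is redeclared
// here only so that the sketch compiles on its own.
type StripConfig struct {
	ElementsToStrip []string
}

// defaultElementsToStrip is a hypothetical name for the documented defaults.
var defaultElementsToStrip = []string{"header", "footer"}

// effectiveStripList returns the tags to strip for a given config.
// ASSUMPTION: an empty list means "use the defaults". The PR does not say
// whether a custom list replaces or extends the defaults.
func effectiveStripList(cfg StripConfig) []string {
	if len(cfg.ElementsToStrip) == 0 {
		// Copy so callers cannot mutate the shared default slice.
		return append([]string(nil), defaultElementsToStrip...)
	}
	return cfg.ElementsToStrip
}
```

If the real implementation extends the defaults instead of replacing them, the second branch would need to merge the two lists.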

Usage

Go:

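// Tags listed here are stripped unless an element carries data-llm="keep".
stripConfig := converter.StripConfig{
    ElementsToStrip: []string{"nav", "aside", "script", "style"},
}
processed, err := converter.ProcessHTML(htmlBytes, stripConfig)
if err != nil {
    log.Fatalf("ProcessHTML failed: %v", err)
}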

HTML with overrides:

<!-- This nav will be kept even if "nav" is in the strip list -->
<nav data-llm="keep">Important navigation</nav>

<!-- This div will be removed even though "div" is not in the strip list -->
<div data-llm="drop">Unnecessary content</div>

Python FFI:

from ctypes import c_char_p

# Build a C array of NUL-terminated byte strings, one per tag name.
elements_to_strip = [b"nav", b"aside", b"script"]
arr = (c_char_p * len(elements_to_strip))(*elements_to_strip)
result = lib.Convert(html_bytes, arr)

Testing

Added tests covering the new behavior:

  • TestProcessHTMLWithElementsToStrip - Verifies custom element stripping
  • TestProcessHTMLWithDataLLMKeep - Verifies keep attribute behavior
  • TestProcessHTMLWithDataLLMDrop - Verifies drop attribute behavior

@TheOutdoorProgrammer merged commit c98cc00 into main on Dec 27, 2025
3 checks passed
@TheOutdoorProgrammer deleted the selective-removal branch on December 27, 2025 04:31