added new converter + middleware config for cache #6

Merged
JakeNesler merged 1 commit into main from html-to-md on Dec 28, 2025

Conversation

JakeNesler commented Dec 27, 2025

Summary

Adds a high-performance HTML-to-Markdown converter, plus HTTP middleware that serves LLM-optimized content when a request carries the ?gremllm query parameter.

  • Single-pass HTML parsing (~28ms for 5000 lines, ~12ms cached)
  • Comprehensive HTML5 tag support (100+ tags)
  • Content control via data-llm attributes

Features

Converter

  • HTMLToMarkdown() - Single-pass HTML to markdown conversion
  • ProcessHTML() - HTML cleaning (returns HTML)
  • CondenseMarkdown() - Noise removal and whitespace normalization
  • Table-driven tag handling for maintainability
  • sync.Pool buffer reuse for reduced GC pressure
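For context on the sync.Pool bullet, here is a small sketch of the buffer-reuse pattern. renderHeading and bufPool are hypothetical names used for illustration only; the converter's real internals are not shown in this PR conversation.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers so each conversion does not have to
// allocate a fresh output buffer.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// renderHeading is a hypothetical stand-in for one converter step: it writes
// a markdown heading into a pooled buffer and returns the result.
func renderHeading(level int, text string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still hold data from an earlier use
	defer bufPool.Put(buf)

	for i := 0; i < level; i++ {
		buf.WriteByte('#')
	}
	buf.WriteByte(' ')
	buf.WriteString(text)
	return buf.String() // String copies the bytes, so returning the buffer to the pool is safe
}

func main() {
	fmt.Println(renderHeading(2, "Features"))
}
```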

Middleware

  • Content-hash based caching (MD5)
  • LRU eviction when cache limit reached
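The two caching bullets can be sketched roughly as follows: the key is the MD5 of the HTML body, and the least recently used entry is dropped once the limit is exceeded. The lruCache type and its methods are assumptions made for illustration, not the middleware's actual implementation.

```go
package main

import (
	"container/list"
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"sync"
)

type entry struct {
	key      string
	markdown string
}

// lruCache is a hypothetical fixed-size cache keyed by a content hash.
type lruCache struct {
	mu    sync.Mutex
	limit int
	order *list.List               // front = most recently used
	items map[string]*list.Element // key -> element in order
}

func newLRUCache(limit int) *lruCache {
	return &lruCache{limit: limit, order: list.New(), items: make(map[string]*list.Element)}
}

// hashKey derives the key from the content itself, so a changed page misses
// the cache automatically. MD5 is used as a fingerprint here, not for security.
func hashKey(html []byte) string {
	sum := md5.Sum(html)
	return hex.EncodeToString(sum[:])
}

func (c *lruCache) Get(html []byte) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.items[hashKey(html)]
	if !ok {
		return "", false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).markdown, true
}

func (c *lruCache) Put(html []byte, markdown string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	key := hashKey(html)
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).markdown = markdown
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key: key, markdown: markdown})
	if c.order.Len() > c.limit {
		oldest := c.order.Back() // least recently used entry
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

func main() {
	c := newLRUCache(2)
	c.Put([]byte("<h1>a</h1>"), "# a")
	c.Put([]byte("<h1>b</h1>"), "# b")
	c.Get([]byte("<h1>a</h1>"))        // touch a, so b becomes the oldest
	c.Put([]byte("<h1>c</h1>"), "# c") // exceeds the limit and evicts b
	_, ok := c.Get([]byte("<h1>b</h1>"))
	fmt.Println("b still cached:", ok)
}
```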

Content Control

  • data-llm="keep" - Preserve normally-stripped elements
  • data-llm="drop" - Hide elements from LLMs
  • data-llm-description="..." - Describe script functionality
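One way to read the keep/drop rules is as an override on top of the default-stripped tag set. The sketch below assumes exactly that precedence; shouldEmit is a hypothetical helper, and data-llm-description is left out because its output format is not described here.

```go
package main

import "fmt"

// defaultStripped lists the tags the PR description says are stripped by default.
var defaultStripped = map[string]bool{
	"nav": true, "aside": true, "footer": true, "header": true, "script": true,
	"style": true, "svg": true, "iframe": true, "noscript": true,
}

// shouldEmit is a hypothetical helper: an explicit data-llm attribute wins,
// otherwise the default-stripped set decides.
func shouldEmit(tag string, attrs map[string]string) bool {
	switch attrs["data-llm"] {
	case "drop":
		return false
	case "keep":
		return true
	}
	return !defaultStripped[tag]
}

func main() {
	fmt.Println(shouldEmit("nav", nil))                                  // stripped by default
	fmt.Println(shouldEmit("nav", map[string]string{"data-llm": "keep"})) // explicitly preserved
	fmt.Println(shouldEmit("p", map[string]string{"data-llm": "drop"}))   // explicitly hidden
}
```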

Tag Coverage

| Category         | Count | Examples                                                          |
|------------------|-------|-------------------------------------------------------------------|
| wrapRules        | 38    | h1-h6, strong, em, del, kbd, sub, sup, table elements             |
| passThroughTags  | 46    | div, span, section, form, fieldset, thead, tbody                  |
| skipTags         | 5     | canvas, embed, object, param, wbr                                 |
| Special handlers | 8     | a, img, ul, ol, li, pre, code, audio, video                       |
| Default stripped | 9     | nav, aside, footer, header, script, style, svg, iframe, noscript  |
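To illustrate what table-driven tag handling means in practice, here is a tiny sketch using a handful of the tags listed above. The wrapRule struct, the specific prefixes and suffixes, and the render function are assumptions for illustration; only the category names come from the PR description.

```go
package main

import "fmt"

// wrapRule holds the markdown emitted before and after an element's text.
type wrapRule struct{ prefix, suffix string }

// wrapRules is a small hypothetical subset of the 38 rules mentioned in the PR.
var wrapRules = map[string]wrapRule{
	"h1":     {"# ", "\n\n"},
	"h2":     {"## ", "\n\n"},
	"strong": {"**", "**"},
	"em":     {"*", "*"},
	"del":    {"~~", "~~"},
}

// passThroughTags contribute only their children's text; skipTags contribute nothing.
var passThroughTags = map[string]bool{"div": true, "span": true, "section": true}
var skipTags = map[string]bool{"canvas": true, "embed": true, "wbr": true}

// render looks the tag up in the tables instead of branching on each tag name,
// so supporting a new tag means adding a table row rather than new control flow.
func render(tag, text string) string {
	switch {
	case skipTags[tag]:
		return ""
	case passThroughTags[tag]:
		return text
	}
	if r, ok := wrapRules[tag]; ok {
		return r.prefix + text + r.suffix
	}
	return text // unknown tags fall back to their text in this sketch
}

func main() {
	fmt.Print(render("h2", "Features"))
	fmt.Println(render("strong", "fast"))
}
```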

Performance

| Scenario                       | Time  |
|--------------------------------|-------|
| index.html (500 lines), cold   | 21 ms |
| index.html, cached             | 11 ms |
| bigfile.html (5000 lines), cold| 28 ms |
| bigfile.html, cached           | 12 ms |

@JakeNesler

Added bigfile.html for speed testing. It's not really needed, but it's cool to see how fast the converter is.

Comment on lines +481 to +487
    // Result should be non-nil
    if result == "" && tt.name == "no html structure" {
        // Plain text should still be extracted
        if !strings.Contains(result, "plain text") {
            // This is acceptable - some parsers handle this differently
        }
    }

Should we have a check in here? We have ifs for nothing 😂
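A possible shape for a real check, sketched under assumptions: checkContains is a hypothetical helper, and convert stands in for the converter under test, whose exact signature is not shown in this conversation.

```go
package main

import (
	"fmt"
	"strings"
)

// checkContains returns an error when the converter output for input does not
// contain want, so the caller has something concrete to fail on.
func checkContains(convert func(string) string, input, want string) error {
	got := convert(input)
	if !strings.Contains(got, want) {
		return fmt.Errorf("convert(%q) = %q, want it to contain %q", input, got, want)
	}
	return nil
}

func main() {
	// identity is a placeholder converter used only to exercise the check.
	identity := func(s string) string { return s }
	if err := checkContains(identity, "just plain text", "plain text"); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("ok")
}
```

Inside the table-driven test, the returned error would be passed to t.Error, which replaces the empty if bodies with an assertion that can actually fail.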

Comment on lines +147 to +150
      // Return the processed markdown
      w.Header().Set("Content-Type", "text/markdown; charset=utf-8")
      w.WriteHeader(rw.statusCode)
    - w.Write(processed)
    + w.Write([]byte(markdown))

We need to update the Content-Length header too, since the markdown body is a different size from the original HTML. We can do that in another PR, though. Mostly just pointing it out.

TheOutdoorProgrammer left a comment

Just the small nits above. Otherwise, looks good!

JakeNesler merged commit 7529e86 into main on Dec 28, 2025
3 checks passed

2 participants