added new converter + middleware config for cache by JakeNesler · Pull Request #6 · gremllm/lib

JakeNesler · 2025-12-27T22:23:22Z

Summary

Adds a high-performance HTML-to-Markdown converter with HTTP middleware that serves LLM-optimized content when a request includes the ?gremllm query parameter.

Single-pass HTML parsing (~28ms for 5000 lines, ~12ms cached)
Comprehensive HTML5 tag support (100+ tags)
Content control via data-llm attributes

Features

Converter

HTMLToMarkdown() - Single-pass HTML to markdown conversion
ProcessHTML() - HTML cleaning (returns HTML)
CondenseMarkdown() - Noise removal and whitespace normalization
Table-driven tag handling for maintainability
sync.Pool buffer reuse for reduced GC pressure
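
The buffer reuse mentioned above can be sketched roughly as follows. This is a minimal illustration of the sync.Pool pattern, not the code from this PR: `bufPool` and `renderHeading` are hypothetical names, and the real converter presumably writes far more than a single heading per buffer.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
	"sync"
)

// bufPool hands out reusable *bytes.Buffer values so that each conversion
// does not have to allocate a fresh output buffer. (Hypothetical name.)
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// renderHeading is a hypothetical stand-in for one converter step: it writes
// a markdown heading into a pooled buffer and returns the result.
func renderHeading(level int, text string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still hold data from a previous use
	defer bufPool.Put(buf)

	buf.WriteString(strings.Repeat("#", level))
	buf.WriteByte(' ')
	buf.WriteString(text)
	buf.WriteByte('\n')

	// String() copies the bytes, so the result stays valid after the
	// buffer goes back to the pool.
	return buf.String()
}

func main() {
	fmt.Print(renderHeading(2, "Features"))
}
```

Resetting the buffer on checkout rather than on return keeps the sketch correct even if some other caller forgets to clean up before calling Put.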

Middleware

Content-hash based caching (MD5)
LRU eviction when cache limit reached
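
One way content-hash caching with LRU eviction could fit together is sketched below. The types and method names (`lruCache`, `Get`, `Put`) are assumptions for illustration, and the sketch is not thread-safe; the PR's actual cache layout is not shown in this thread.

```go
package main

import (
	"container/list"
	"crypto/md5"
	"fmt"
)

// entry is one cached conversion, keyed by the MD5 hash of the source HTML.
type entry struct {
	key      [md5.Size]byte
	markdown string
}

// lruCache is a hypothetical, non-thread-safe LRU cache. A real middleware
// would need a mutex around Get and Put.
type lruCache struct {
	limit int
	order *list.List // front = most recently used
	items map[[md5.Size]byte]*list.Element
}

func newLRUCache(limit int) *lruCache {
	return &lruCache{
		limit: limit,
		order: list.New(),
		items: make(map[[md5.Size]byte]*list.Element),
	}
}

// Get returns the cached markdown for html and marks it as recently used.
func (c *lruCache) Get(html []byte) (string, bool) {
	el, ok := c.items[md5.Sum(html)]
	if !ok {
		return "", false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).markdown, true
}

// Put stores markdown under the hash of html and evicts the least recently
// used entry once the cache holds more than limit entries.
func (c *lruCache) Put(html []byte, markdown string) {
	key := md5.Sum(html)
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).markdown = markdown
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key: key, markdown: markdown})
	if c.order.Len() > c.limit {
		oldest := c.order.Back()
		delete(c.items, oldest.Value.(*entry).key)
		c.order.Remove(oldest)
	}
}

func main() {
	cache := newLRUCache(2)
	cache.Put([]byte("<h1>a</h1>"), "# a")
	cache.Put([]byte("<h1>b</h1>"), "# b")
	cache.Get([]byte("<h1>a</h1>"))        // touch "a" so "b" becomes the oldest
	cache.Put([]byte("<h1>c</h1>"), "# c") // over the limit: evicts "b"
	_, ok := cache.Get([]byte("<h1>b</h1>"))
	fmt.Println(ok)
}
```

Keying on a hash of the content rather than the URL means an unchanged page is a cache hit regardless of how it was requested, and an edited page misses automatically.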

Content Control

data-llm="keep" - Preserve normally-stripped elements
data-llm="drop" - Hide elements from LLMs
data-llm-description="..." - Describe script functionality
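
As a rough sketch of how the keep and drop overrides might interact with the default stripping rules: `shouldEmit` and `strippedByDefault` are hypothetical names, the tag set is copied from the "Default stripped" row of the Tag Coverage table, and the handling of data-llm-description is left out because its output format is not described here.

```go
package main

import "fmt"

// strippedByDefault mirrors the "Default stripped" row of the tag table.
var strippedByDefault = map[string]bool{
	"nav": true, "aside": true, "footer": true, "header": true,
	"script": true, "style": true, "svg": true, "iframe": true,
	"noscript": true,
}

// shouldEmit is a hypothetical helper that decides whether an element's
// content reaches the markdown output. An explicit data-llm attribute wins
// over the default rule for the tag.
func shouldEmit(tag string, attrs map[string]string) bool {
	switch attrs["data-llm"] {
	case "keep":
		return true
	case "drop":
		return false
	}
	return !strippedByDefault[tag]
}

func main() {
	fmt.Println(shouldEmit("nav", nil))
	fmt.Println(shouldEmit("nav", map[string]string{"data-llm": "keep"}))
	fmt.Println(shouldEmit("div", map[string]string{"data-llm": "drop"}))
}
```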

Tag Coverage

Category	Count	Examples
wrapRules	38	h1-h6, strong, em, del, kbd, sub, sup, table elements
passThroughTags	46	div, span, section, form, fieldset, thead, tbody
skipTags	5	canvas, embed, object, param, wbr
Special handlers	8	a, img, ul, ol, li, pre, code, audio, video
Default stripped	9	nav, aside, footer, header, script, style, svg, iframe, noscript
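
The table-driven approach might look something like the sketch below. The category names wrapRules, passThroughTags, and skipTags come from the table above, but the `wrapRule` struct, the `renderElement` function, and the specific markdown strings are assumptions, and only a handful of tags are included.

```go
package main

import "fmt"

// wrapRule holds the markdown emitted before and after an element's content.
// (Hypothetical shape; the PR's real struct is not shown.)
type wrapRule struct{ prefix, suffix string }

// Small illustrative subsets of the three lookup tables.
var wrapRules = map[string]wrapRule{
	"h1":     {"# ", "\n\n"},
	"h2":     {"## ", "\n\n"},
	"strong": {"**", "**"},
	"em":     {"*", "*"},
	"del":    {"~~", "~~"},
}

var passThroughTags = map[string]bool{"div": true, "span": true, "section": true}

var skipTags = map[string]bool{"canvas": true, "embed": true, "wbr": true}

// renderElement converts one element by table lookup instead of a long
// switch statement, so supporting a new tag is a one-line table change.
func renderElement(tag, inner string) string {
	switch {
	case skipTags[tag]:
		return "" // element and its content are dropped
	case passThroughTags[tag]:
		return inner // container tags contribute only their content
	}
	if r, ok := wrapRules[tag]; ok {
		return r.prefix + inner + r.suffix
	}
	return inner // unknown tags fall back to pass-through in this sketch
}

func main() {
	fmt.Println(renderElement("strong", "fast"))
	fmt.Println(renderElement("div", "plain"))
}
```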

Performance

Scenario	Time
index.html (500 lines) cold	21ms
index.html cached	11ms
bigfile.html (5000 lines) cold	28ms
bigfile.html cached	12ms

JakeNesler · 2025-12-27T22:23:58Z

Added bigfile.html for speed testing. Not really needed, but cool to see how fast it is.

TheOutdoorProgrammer · 2025-12-28T00:00:29Z

+			// Result should be non-nil
+			if result == "" && tt.name == "no html structure" {
+				// Plain text should still be extracted
+				if !strings.Contains(result, "plain text") {
+					// This is acceptable - some parsers handle this differently
+				}
+			}


Should we have a check in here? We have ifs for nothing 😂

TheOutdoorProgrammer · 2025-12-28T00:02:15Z

+			// Return the processed markdown
+			w.Header().Set("Content-Type", "text/markdown; charset=utf-8")
 			w.WriteHeader(rw.statusCode)
-			w.Write(processed)
+			w.Write([]byte(markdown))


We need to update the Content-Length header too, since the body is now markdown rather than the original HTML. But we can do that in another PR. Mostly just pointing it out.

TheOutdoorProgrammer

Just the small nits above, otherwise looks good!

JakeNesler requested a review from TheOutdoorProgrammer December 27, 2025 22:23

squashing

d4e2714

JakeNesler force-pushed the html-to-md branch from b822dca to d4e2714 Compare December 27, 2025 23:13

TheOutdoorProgrammer reviewed Dec 28, 2025

TheOutdoorProgrammer approved these changes Dec 28, 2025

JakeNesler merged commit 7529e86 into main Dec 28, 2025
3 checks passed
