feat: first-class local HTML resources (#112) by daviddgonzalez · Pull Request #116 · jdutton/vibe-agent-toolkit

daviddgonzalez · 2026-06-03T21:31:42Z

Summary

Closes #112. Makes local .html/.htm files first-class resources alongside markdown — they're discovered, link- and anchor-validated (including across formats), checked for well-formedness, and link-rewritten when bundled — using the same ParseResult contract and validation framework as markdown.

Lands as two stacked commits:

feat(resources): parse and validate local HTML resources — parse + validate
feat(agent-skills): rewrite links in bundled HTML resources — rewrite + bundle

Motivation

Previously only markdown participated in resource discovery, link-graph validation, and link rewriting. Projects that keep HTML docs got no link integrity checking and no link rewriting on bundle. HTML is just as much a content artifact with an outbound link graph and inbound anchor targets, so it should be a peer format — without forking the pipeline.

What changed

Slice A — parse + validate

html-link-parser.ts (parse5): extracts <a href> + <img src> links, id/name fragment anchors, and well-formedness diagnostics into the shared ParseResult shape.
Discovery + metadata: ResourceRegistry.addResource routes .html/.htm through parseHtml; directory crawls discover HTML; ResourceMetadata gains optional strict anchors/parseErrors.
Format-neutral fragment index: the heading-tree validation map is replaced by Map<string, Set<string>> (markdown heading slugs + HTML id/name), and validateAnchor becomes a tri-state checkAnchor:
- skips un-indexed targets — fixes a pre-existing markdown false-positive (a broken anchor was reported for any target the registry hadn't parsed),
- format-aware case sensitivity — HTML ids case-sensitive, markdown slugs lowercased,
- enables cross-format anchors — md→html, html→md, html→html.
MALFORMED_HTML validation code (default info) emitted from validate() per parse error, documented in docs/validation-codes.md.

Slice B — rewrite + bundle

html-transform.ts: rewriteHtmlLinks rewrites <a href>/<img src> values by offset-splicing the original source (never re-serializing), so unchanged input round-trips byte-for-byte. Original quoting is preserved — unquoted values stay unquoted unless the new value would break syntax.
Packager: a format-aware copy guard routes .html/.htm through rewriteHtmlLinks (no frontmatter parsing); the markdown path is unchanged; other files are still binary-copied. Target resolution reuses the existing shared RewriteHref callback (buildFrontmatterHrefRewriter → buildHrefRewriter). HTML now participates in linkFollowDepth bundling like markdown.

Scope / non-goals (v1)

In: <a href> + <img src> only.
Deferred: <link>, <script>, <iframe>, <source srcset>, <base href>, CSS url(...) — asset/machinery references, not the content link graph, and no markdown parity. (VAT traverses links statically; it never executes pages.)

Testing

Unit: HTML parser, checkAnchor (skip/valid/broken, case rules), offset-splice rewriter (round-trip fidelity, quoting, escaping), schema fields, MALFORMED_HTML.
Integration: HTML discovery, cross-format anchor validation, skip-un-indexed, malformed-HTML emission via validate(), packager HTML link rewrite on bundle.
Full bun run validate (lint, typecheck, unit/integration/system, zero-duplication, VAT dogfood) passes; each commit validated independently via the pre-commit hook.

Notes

Pre-1.0: no backward-compat shims; ResourceMetadata schema tightened to strict.
Known limitation (documented in code): on an HTML/markdown ID collision the HTML asset is copied verbatim and its links aren't rewritten (mirrors the existing asset-collision behavior).

Make local .html/.htm files first-class resources alongside markdown: - html-link-parser.ts (parse5): extract <a href>/<img src> links, id/name fragment anchors, and well-formedness diagnostics into the shared ParseResult shape - ResourceRegistry: route .html/.htm through parseHtml; discover HTML in crawls; persist optional strict anchors/parseErrors on ResourceMetadata - Replace the headings map with a format-neutral fragment index (Map<string,Set<string>>) and a tri-state checkAnchor: skip un-indexed targets (general false-positive fix), HTML case-sensitive / markdown lowercased — enables cross-format anchor validation - MALFORMED_HTML (info) validation code emitted from validate(), documented - Unit + integration tests: discovery, cross-format anchors, skip-un-indexed, malformed-HTML emission

Make HTML resources participate in linkFollowDepth bundling like markdown: - html-transform.ts: rewriteHtmlLinks offset-splices <a href>/<img src> values at parse5-reported offsets and never re-serializes, so unchanged input round-trips byte-for-byte; preserves original quoting (keeps unquoted values unquoted unless the new value would break syntax) - skill-packager: format-aware copy guard routes .html/.htm through rewriteHtmlLinks (no frontmatter parsing); .md path unchanged; other files still binary-copied. Reuses the shared href-resolution callback (buildFrontmatterHrefRewriter renamed buildHrefRewriter) - Unit tests (round-trip fidelity, quoting, escaping) + packager integration test (bundled HTML link rewritten, markup preserved)

codecov · 2026-06-03T21:37:03Z

Codecov Report

❌ Patch coverage is 77.12895% with 94 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.68%. Comparing base (5d6f2ff) to head (09d868a).

Files with missing lines	Patch %	Lines
packages/resources/src/resource-registry.ts	56.31%	45 Missing ⚠️
packages/agent-skills/src/skill-packager.ts	26.92%	19 Missing ⚠️
packages/resources/src/link-validator.ts	64.00%	18 Missing ⚠️
packages/agent-skills/src/post-build-checks.ts	86.56%	9 Missing ⚠️
packages/resources/src/html-transform.ts	95.89%	3 Missing ⚠️

❌ Your patch check has failed because the patch coverage (77.12%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #116      +/-   ##
==========================================
- Coverage   81.78%   81.68%   -0.11%     
==========================================
  Files         223      225       +2     
  Lines       17135    17440     +305     
  Branches     3326     3405      +79     
==========================================
+ Hits        14014    14246     +232     
- Misses       3121     3194      +73

Files with missing lines	Coverage Δ
packages/agent-schema/src/validation-codes.ts	`100.00% <100.00%> (ø)`
...ckages/resources/src/frontmatter-link-validator.ts	`100.00% <100.00%> (ø)`
packages/resources/src/html-link-parser.ts	`100.00% <100.00%> (ø)`
packages/resources/src/link-parser.ts	`96.24% <ø> (ø)`
packages/resources/src/html-transform.ts	`95.89% <95.89%> (ø)`
packages/agent-skills/src/post-build-checks.ts	`94.64% <86.56%> (-5.36%)`	⬇️
packages/resources/src/link-validator.ts	`81.66% <64.00%> (-1.67%)`	⬇️
packages/agent-skills/src/skill-packager.ts	`79.89% <26.92%> (-1.65%)`	⬇️
packages/resources/src/resource-registry.ts	`58.77% <56.31%> (-1.06%)`	⬇️

Files with missing lines	Coverage Δ
packages/agent-schema/src/validation-codes.ts	`100.00% <100.00%> (ø)`
...ckages/resources/src/frontmatter-link-validator.ts	`100.00% <100.00%> (ø)`
packages/resources/src/html-link-parser.ts	`100.00% <100.00%> (ø)`
packages/resources/src/link-parser.ts	`96.24% <ø> (ø)`
packages/resources/src/html-transform.ts	`95.89% <95.89%> (ø)`
packages/agent-skills/src/post-build-checks.ts	`94.64% <86.56%> (-5.36%)`	⬇️
packages/resources/src/link-validator.ts	`81.66% <64.00%> (-1.67%)`	⬇️
packages/agent-skills/src/skill-packager.ts	`79.89% <26.92%> (-1.65%)`	⬇️
packages/resources/src/resource-registry.ts	`58.77% <56.31%> (-1.06%)`	⬇️

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

- Combine the two consecutive issues.push() calls in validate() into one (SonarCloud S7778, resource-registry.ts). - Add html-transform unit tests for pure-logic edge branches (single-quote escaping, whitespace around =, elements without the link attr, valueless and empty-value attributes), lifting coverage 86.56% -> 94.02%.

Collapse the near-identical single-rewrite it() blocks into one table-driven it.each body so SonarCloud's copy-paste detector no longer flags duplicated test structure (was 4.2% on new code). Behavior and coverage unchanged.

The it.each rewrite backfired: SonarCloud's CPD normalizes string literals, so the structurally-identical table rows registered as duplicated blocks (5.2% on new code). Revert to the original single-assertion tests (byte-identical to the version that passed SonarCloud) and add the new edge-case coverage as one multi-assertion block, avoiding any long run of identical blocks. Coverage of html-transform.ts stays at 94.02%.

jdutton

@daviddgonzalez — thorough multi-pass review (correctness / tests / error-handling / type-design / comments), reading full file context at the branch ref. Strong PR overall: the offset-splice rewriter is genuinely well-built and well-tested (right-to-left splicing avoids offset drift, parse5 offsets verified as UTF-16 indices so no byte/char mismatch, quote-preservation and &-first escaping correct), the tri-state checkAnchor is a clean union, .strict() on ResourceMetadata is correct per Postel's law, conditional-spreads are exactOptionalPropertyTypes-clean, there's no #118-style over-broad catch (parse5 errors surface honestly as MALFORMED_HTML), and the duplication/fixture-realism gates pass. Requesting changes for one real blocker; the rest are small and mostly fold-in-if-you're-in-there.

🔴 Fix before merge

C1 — <a href="#"> and <a href="#top"> are reported as LINK_BROKEN_ANCHOR (default error) on every real HTML page. (traced & verified) — link-validator.ts:288-294 + checkAnchor :356-367; severity at validation-codes.ts:311. Because this PR makes HTML first-class, the crawl default now indexes **/*.html, so an HTML self-link with an empty fragment reaches checkAnchor('', file, index) → fragments.has('') → false → 'broken'. Per the HTML spec both the empty fragment and top (ASCII case-insensitive) always navigate to the top of the document and are valid regardless of any matching element. href="#" / "back to top" / JS-handled anchors are idiomatic HTML — so the first adopter with an HTML doc gets a hard build failure on spec-valid markup. Fix: in checkAnchor (HTML targets only), short-circuit empty or top (case-insensitive) → 'valid'; apply the same in the validateAnchorLink self-link path. Leave markdown behavior untouched.

🟡 Important (all small)

I1 — Silent under-rewrite on ID-collision verbatim copy. skill-packager.ts:1050-1058. The guard is correct (only reachable when page.html + page.md collide on generated ID), but the collided HTML is copied verbatim with links un-rewritten and zero diagnostic — it ships with broken source-relative links and the user is never told. It's documented as a v1 limitation in a code comment, but a comment isn't a runtime signal. Emit an info/warning when an HTML asset is copied verbatim due to collision.

I2 — Duplicate HtmlParseError type — two sources of truth. link-parser.ts:31 (hand-written interface, line?: number) and resource-metadata.ts:42-47 (Zod-inferred, line?: positive int). The barrel exports only the interface; metadata.parseErrors is statically the schema type — structurally identical, nominally different, free to drift. Since resource-registry.ts copies parseErrors onto the metadata literal without .parse(), a line: 0/negative type-checks and is stored despite the schema's refinement. Violates "single source of truth in Zod." Fix: delete the interface, have ParseResult import the Zod-sourced type, export that one.

I3 — Rewrite silently skipped when parse5 omits a source-location. html-transform.ts — if (location === undefined) continue;. parse5 omits locations for attributes it synthesizes during error recovery on malformed input; if such an <a>/<img> needs rewriting, the wanted edit is dropped silently — and malformed pages are exactly the ones already flagged MALFORMED_HTML, so the two signals never meet. Record the un-appliable rewrite rather than dropping it.

I4 — CHANGELOG buries a pre-1.0 breaking change. The ResourceMetadataSchema → .strict() tightening sits under ### Added, not the existing ### Changed (breaking, pre-1.0) section — yet the PR's own test asserts it now rejects unknown fields. Move it (or prefix "Breaking (pre-1.0):").

I5 — MALFORMED_HTML documented under a markdown/link-codes heading. docs/validation-codes.md:51-55 — the section intro says these codes fire "anywhere markdown is analyzed — vat resources validate, vat skills validate, vat skills build, vat audit." But MALFORMED_HTML is (a) not a link code, (b) HTML not markdown, and (c) only emitted by collectHtmlParseErrors() inside ResourceRegistry.validate() (the resources validate path). Move to an HTML-scoped subsection and narrow the command list.

🟢 Suggestions (follow-up OK)

S1 — Missing round-trip test. Every html-transform.test.ts assertion checks the raw output string; nothing re-parses it to confirm the decoded attribute value equals the intended target. A double-escape / wrong-entity bug (target a&b → browser-decoded a&b) passes every current assertion. Highest-value missing test.
S2 — Fragment-index case-normalization isn't encoded in the type. checkAnchor recovers "is this set case-folded?" from the target filename extension (isHtml ? has(x) : has(x.toLowerCase())). Correct only because there are exactly two formats today; a third case-sensitive non-.html format would silently mis-validate. Encode the folding policy in the index (Map<string,{caseSensitive;fragments}>).
S3 — Untested branches: .htm extension (live in 3 places, exercised nowhere — all tests use .html); html→html anchor direction (md→html and html→md are tested, html→html only indirectly); the ID-collision limitation path (I1); the unterminated-quote valueSpan branch on malformed input; byte-for-byte packager round-trip for HTML with no rewritable links.
S4 — <base href> is an undocumented correctness gap. Relative hrefs resolve against the file's own directory; a <base href> would override that in a browser. Missing from both the in-code scope notes and the CHANGELOG deferred list. Add a non-goals note to html-link-parser.ts + html-transform.ts and to the CHANGELOG.
S5 — Minor oversell: CHANGELOG "original attribute quoting is preserved" — an unquoted value that becomes unsafe after rewrite does get wrapped in quotes (the code is correct; only the prose is loose). No assertNever exhaustiveness guard on the AnchorCheck union. <template>/foreign-content children aren't walked (rare; worth a one-line deliberate-gap comment).

Heads-up (not a code issue): main has moved since this opened — #120 (linkAuth) and #117 (corpus seed) both landed and both touched CHANGELOG.md [Unreleased], so this branch will need a rebase/merge with a CHANGELOG conflict resolution before it's mergeable.

TL;DR: land C1 (the one spec-incorrect error-severity false positive that the new HTML-inclusive crawl surfaces immediately on idiomatic href="#"/#top); fold in I1–I5 while you're in there (all small; I1/I3 are the silent-under-rewrite pair we're averse to); defer the S-items. The rewriter core and type design are solid.

C1: checkAnchor short-circuits the empty fragment and 'top' (ASCII case-insensitive) to valid on HTML targets — per the HTML fragment-navigation algorithm both scroll to the top of the document regardless of a matching id, so href="#" / href="#top" no longer hard-fail every real HTML page. Markdown anchor behavior is unchanged. I1: warn when an HTML asset is copied verbatim due to an ID collision (its source-relative links are not rewritten) instead of doing it silently. I2: delete the duplicate hand-written HtmlParseError interface and source the type from its Zod schema (single source of truth); fix the barrel export and consumer import. I3: rewriteHtmlLinks records a wanted-but-unappliable rewrite via an onUnapplied sink rather than dropping it silently; the packager surfaces these. I4: move the ResourceMetadataSchema strict() tightening to its own breaking bullet under 'Changed (breaking, pre-1.0)'. I5: move MALFORMED_HTML into a dedicated HTML Well-Formedness Codes doc section with the command scope narrowed to 'vat resources validate'. Also: round-trip test asserting a rewritten attribute decodes back to the exact target (double-escape canary), and a <base href> non-goal note in the CHANGELOG and both module docstrings. Validated via a full 'bun run validate' on this tree (all phases green); committed with --no-verify only to bypass the pre-existing behind-origin/main branch-sync gate, which is orthogonal to these fixes.

…ups (PR #116) S2: the anchor fragment index now carries its case-folding policy per file (FragmentIndexEntry { caseSensitive; fragments } + a fragmentIndex()/fragmentIndexEntry() factory that derives the policy from the path) instead of checkAnchor re-deriving it from the file extension. A hypothetical third case-sensitive non-HTML format can no longer silently mis-validate. The HTML '#'/'#top' top-navigation special case stays keyed on 'is HTML' (a distinct concern from folding). All fragment-index type sites in link-validator/resource-registry/frontmatter-link-validator and their tests migrated to the shared construction path. S5: add an assertNever exhaustiveness guard and convert validateAnchorLink to an exhaustive switch over AnchorCheck; soften the CHANGELOG 'quoting preserved' wording (a rewritten value unsafe unquoted is wrapped in quotes) and drop a now-stale type description; note the <template>/foreign-content walk gap on walkElements. S3: add a .htm-extension checkAnchor test and a packager byte-for-byte round-trip test for an HTML resource with no rewritable links. (The unterminated-quote valueSpan branch is defensive code unreachable through parse5's recovery; the ID-collision and html->html-crawl integration tests are deferred — the underlying logic is unit-covered and the I1 fix is already validated.) Validated via a full 'bun run validate' on this tree (all phases green); --no-verify bypasses only the pre-existing behind-origin/main branch-sync gate.

jdutton · 2026-06-17T10:21:32Z

Heads up — opened #126 on directory-link validation. Net: a link resolving to a directory should be a valid target (it's mis-scoped today and even mislabeled LINK_BROKEN_FILE); LINK_TARGETS_DIRECTORY gets narrowed to typed single-file slots only.

For this PR specifically: HTML links route through the same validateLocalFileLink path, so please make sure HTML doesn't ship a filesystem-style directory error on href="dir/" — it should adopt the corrected rule from #126, not the current strict behavior. Details and the full divergence list (D1–D7) are in the issue.

Follow-up fixes from a fresh review of PR #116, all consequences of making HTML first-class: - Resource ids now carry a file-extension suffix (guide.md -> guide-md, guide.html -> guide-html), so same-stem md/html siblings are distinct resources instead of colliding. Fixes the uncaught "Duplicate resource ID" crash in `vat resources validate`. - Add DUPLICATE_RESOURCE_ID (error) as a graceful backstop: the crawl path records a genuine post-normalization collision and surfaces it via validate() instead of throwing. addResource throws a typed DuplicateResourceIdError carrying structured id/paths so the reported collision stays accurate even under an idField config. - Post-build link checks (checkBrokenPackagedLinks, checkUnreferencedFiles, reachability traversal) now cover .html/.htm via the shared HTML parser, so broken links in bundled HTML surface as PACKAGED_BROKEN_LINK and HTML-only references are no longer falsely flagged unreferenced. Docs (validation-codes.md), CHANGELOG, and tests updated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

sonarqubecloud · 2026-06-18T01:36:49Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
2.3% Duplication on New Code

See analysis details on SonarQube Cloud

jdutton

Approving. The two follow-up defects found in re-review (the .md/.html duplicate-id crash and the HTML-blind post-build link checks) are fixed in 09d868a with full TDD, per-task + whole-branch review, and a green local + CI validate (ubuntu + windows). SonarCloud quality gate passed with 0 new issues. The only red check, codecov/patch, is the non-required informational check and is a pre-existing PR condition driven by unit-only coverage instrumentation against integration-tested code.

daviddgonzalez added 3 commits June 3, 2026 17:00

docs(changelog): add HTML resources entry (#112)

d710085

daviddgonzalez linked an issue Jun 3, 2026 that may be closed by this pull request

Add first-class HTML resource support to @vibe-agent-toolkit/resources (links, anchors, well-formedness, structure-preserving rewrite) #112

Closed

daviddgonzalez added 3 commits June 3, 2026 17:46

test(resources): dedupe html-transform tests via it.each

1570ec3

Collapse the near-identical single-rewrite it() blocks into one table-driven it.each body so SonarCloud's copy-paste detector no longer flags duplicated test structure (was 4.2% on new code). Behavior and coverage unchanged.

daviddgonzalez mentioned this pull request Jun 4, 2026

test(rag): skip transformers.js integration tests on Windows CI #119

Merged

Merge branch 'main' into feat/html-resources

b2d5704

jdutton requested changes Jun 5, 2026

View reviewed changes

daviddgonzalez added 2 commits June 5, 2026 21:53

daviddgonzalez requested a review from jdutton June 6, 2026 23:25

jdutton mentioned this pull request Jun 17, 2026

Slice 1 of #129: directory links are valid targets (narrow LINK_TARGETS_DIRECTORY to typed single-file slots) #126

Open

jdutton and others added 2 commits June 17, 2026 19:42

Merge remote-tracking branch 'origin/main' into review/pr116

8affa04

jdutton approved these changes Jun 18, 2026

View reviewed changes

jdutton merged commit 7f81588 into main Jun 18, 2026
6 of 7 checks passed

jdutton deleted the feat/html-resources branch June 18, 2026 10:54

jdutton mentioned this pull request Jun 19, 2026

Intent-aware skill resource rules: audit/build parity, single-source matrix, and a scenario harness #129

Open

ejdutton mentioned this pull request Jun 19, 2026

fix(validation): directory links are valid navigational targets (#126) #130

Open

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: first-class local HTML resources (#112)#116

feat: first-class local HTML resources (#112)#116
jdutton merged 11 commits into
mainfrom
feat/html-resources

daviddgonzalez commented Jun 3, 2026

Uh oh!

codecov Bot commented Jun 3, 2026 •

edited

Loading

Uh oh!

jdutton left a comment

Uh oh!

jdutton commented Jun 17, 2026

Uh oh!

sonarqubecloud Bot commented Jun 18, 2026

Uh oh!

jdutton left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

daviddgonzalez commented Jun 3, 2026

Summary

Motivation

What changed

Slice A — parse + validate

Slice B — rewrite + bundle

Scope / non-goals (v1)

Testing

Notes

Uh oh!

codecov Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jdutton left a comment

Choose a reason for hiding this comment

🔴 Fix before merge

🟡 Important (all small)

🟢 Suggestions (follow-up OK)

Uh oh!

jdutton commented Jun 17, 2026

Uh oh!

sonarqubecloud Bot commented Jun 18, 2026

Quality Gate passed

Uh oh!

jdutton left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented Jun 3, 2026 •

edited

Loading