Skip to content

feat: first-class local HTML resources (#112)#116

Merged
jdutton merged 11 commits into
mainfrom
feat/html-resources
Jun 18, 2026
Merged

feat: first-class local HTML resources (#112)#116
jdutton merged 11 commits into
mainfrom
feat/html-resources

Conversation

@daviddgonzalez

Copy link
Copy Markdown
Collaborator

Summary

Closes #112. Makes local .html/.htm files first-class resources alongside markdown — they're discovered, link- and anchor-validated (including across formats), checked for well-formedness, and link-rewritten when bundled — using the same ParseResult contract and validation framework as markdown.

Lands as two stacked commits:

  1. feat(resources): parse and validate local HTML resources — parse + validate
  2. feat(agent-skills): rewrite links in bundled HTML resources — rewrite + bundle

Motivation

Previously only markdown participated in resource discovery, link-graph validation, and link rewriting. Projects that keep HTML docs got no link integrity checking and no link rewriting on bundle. HTML is just as much a content artifact with an outbound link graph and inbound anchor targets, so it should be a peer format — without forking the pipeline.

What changed

Slice A — parse + validate

  • html-link-parser.ts (parse5): extracts <a href> + <img src> links, id/name fragment anchors, and well-formedness diagnostics into the shared ParseResult shape.
  • Discovery + metadata: ResourceRegistry.addResource routes .html/.htm through parseHtml; directory crawls discover HTML; ResourceMetadata gains optional strict anchors/parseErrors.
  • Format-neutral fragment index: the heading-tree validation map is replaced by Map<string, Set<string>> (markdown heading slugs + HTML id/name), and validateAnchor becomes a tri-state checkAnchor:
    • skips un-indexed targets — fixes a pre-existing markdown false-positive (a broken anchor was reported for any target the registry hadn't parsed),
    • format-aware case sensitivity — HTML ids case-sensitive, markdown slugs lowercased,
    • enables cross-format anchors — md→html, html→md, html→html.
  • MALFORMED_HTML validation code (default info) emitted from validate() per parse error, documented in docs/validation-codes.md.

Slice B — rewrite + bundle

  • html-transform.ts: rewriteHtmlLinks rewrites <a href>/<img src> values by offset-splicing the original source (never re-serializing), so unchanged input round-trips byte-for-byte. Original quoting is preserved — unquoted values stay unquoted unless the new value would break syntax.
  • Packager: a format-aware copy guard routes .html/.htm through rewriteHtmlLinks (no frontmatter parsing); the markdown path is unchanged; other files are still binary-copied. Target resolution reuses the existing shared RewriteHref callback (buildFrontmatterHrefRewriterbuildHrefRewriter). HTML now participates in linkFollowDepth bundling like markdown.

Scope / non-goals (v1)

  • In: <a href> + <img src> only.
  • Deferred: <link>, <script>, <iframe>, <source srcset>, <base href>, CSS url(...) — asset/machinery references, not the content link graph, and no markdown parity. (VAT traverses links statically; it never executes pages.)

Testing

  • Unit: HTML parser, checkAnchor (skip/valid/broken, case rules), offset-splice rewriter (round-trip fidelity, quoting, escaping), schema fields, MALFORMED_HTML.
  • Integration: HTML discovery, cross-format anchor validation, skip-un-indexed, malformed-HTML emission via validate(), packager HTML link rewrite on bundle.
  • Full bun run validate (lint, typecheck, unit/integration/system, zero-duplication, VAT dogfood) passes; each commit validated independently via the pre-commit hook.

Notes

  • Pre-1.0: no backward-compat shims; ResourceMetadata schema tightened to strict.
  • Known limitation (documented in code): on an HTML/markdown ID collision the HTML asset is copied verbatim and its links aren't rewritten (mirrors the existing asset-collision behavior).

Make local .html/.htm files first-class resources alongside markdown:

- html-link-parser.ts (parse5): extract <a href>/<img src> links,
  id/name fragment anchors, and well-formedness diagnostics into the
  shared ParseResult shape
- ResourceRegistry: route .html/.htm through parseHtml; discover HTML in
  crawls; persist optional strict anchors/parseErrors on ResourceMetadata
- Replace the headings map with a format-neutral fragment index
  (Map<string,Set<string>>) and a tri-state checkAnchor: skip un-indexed
  targets (general false-positive fix), HTML case-sensitive / markdown
  lowercased — enables cross-format anchor validation
- MALFORMED_HTML (info) validation code emitted from validate(), documented
- Unit + integration tests: discovery, cross-format anchors, skip-un-indexed,
  malformed-HTML emission
Make HTML resources participate in linkFollowDepth bundling like markdown:

- html-transform.ts: rewriteHtmlLinks offset-splices <a href>/<img src>
  values at parse5-reported offsets and never re-serializes, so unchanged
  input round-trips byte-for-byte; preserves original quoting (keeps
  unquoted values unquoted unless the new value would break syntax)
- skill-packager: format-aware copy guard routes .html/.htm through
  rewriteHtmlLinks (no frontmatter parsing); .md path unchanged; other
  files still binary-copied. Reuses the shared href-resolution callback
  (buildFrontmatterHrefRewriter renamed buildHrefRewriter)
- Unit tests (round-trip fidelity, quoting, escaping) + packager
  integration test (bundled HTML link rewritten, markup preserved)
@codecov

codecov Bot commented Jun 3, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 77.12895% with 94 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.68%. Comparing base (5d6f2ff) to head (09d868a).

Files with missing lines Patch % Lines
packages/resources/src/resource-registry.ts 56.31% 45 Missing ⚠️
packages/agent-skills/src/skill-packager.ts 26.92% 19 Missing ⚠️
packages/resources/src/link-validator.ts 64.00% 18 Missing ⚠️
packages/agent-skills/src/post-build-checks.ts 86.56% 9 Missing ⚠️
packages/resources/src/html-transform.ts 95.89% 3 Missing ⚠️

❌ Your patch check has failed because the patch coverage (77.12%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #116      +/-   ##
==========================================
- Coverage   81.78%   81.68%   -0.11%     
==========================================
  Files         223      225       +2     
  Lines       17135    17440     +305     
  Branches     3326     3405      +79     
==========================================
+ Hits        14014    14246     +232     
- Misses       3121     3194      +73     
Files with missing lines Coverage Δ
packages/agent-schema/src/validation-codes.ts 100.00% <100.00%> (ø)
...ckages/resources/src/frontmatter-link-validator.ts 100.00% <100.00%> (ø)
packages/resources/src/html-link-parser.ts 100.00% <100.00%> (ø)
packages/resources/src/link-parser.ts 96.24% <ø> (ø)
packages/resources/src/html-transform.ts 95.89% <95.89%> (ø)
packages/agent-skills/src/post-build-checks.ts 94.64% <86.56%> (-5.36%) ⬇️
packages/resources/src/link-validator.ts 81.66% <64.00%> (-1.67%) ⬇️
packages/agent-skills/src/skill-packager.ts 79.89% <26.92%> (-1.65%) ⬇️
packages/resources/src/resource-registry.ts 58.77% <56.31%> (-1.06%) ⬇️
Files with missing lines Coverage Δ
packages/agent-schema/src/validation-codes.ts 100.00% <100.00%> (ø)
...ckages/resources/src/frontmatter-link-validator.ts 100.00% <100.00%> (ø)
packages/resources/src/html-link-parser.ts 100.00% <100.00%> (ø)
packages/resources/src/link-parser.ts 96.24% <ø> (ø)
packages/resources/src/html-transform.ts 95.89% <95.89%> (ø)
packages/agent-skills/src/post-build-checks.ts 94.64% <86.56%> (-5.36%) ⬇️
packages/resources/src/link-validator.ts 81.66% <64.00%> (-1.67%) ⬇️
packages/agent-skills/src/skill-packager.ts 79.89% <26.92%> (-1.65%) ⬇️
packages/resources/src/resource-registry.ts 58.77% <56.31%> (-1.06%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

- Combine the two consecutive issues.push() calls in validate() into one
  (SonarCloud S7778, resource-registry.ts).
- Add html-transform unit tests for pure-logic edge branches (single-quote
  escaping, whitespace around =, elements without the link attr, valueless
  and empty-value attributes), lifting coverage 86.56% -> 94.02%.
Collapse the near-identical single-rewrite it() blocks into one table-driven
it.each body so SonarCloud's copy-paste detector no longer flags duplicated
test structure (was 4.2% on new code). Behavior and coverage unchanged.
The it.each rewrite backfired: SonarCloud's CPD normalizes string literals, so
the structurally-identical table rows registered as duplicated blocks (5.2% on
new code). Revert to the original single-assertion tests (byte-identical to the
version that passed SonarCloud) and add the new edge-case coverage as one
multi-assertion block, avoiding any long run of identical blocks. Coverage of
html-transform.ts stays at 94.02%.

@jdutton jdutton left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@daviddgonzalez — thorough multi-pass review (correctness / tests / error-handling / type-design / comments), reading full file context at the branch ref. Strong PR overall: the offset-splice rewriter is genuinely well-built and well-tested (right-to-left splicing avoids offset drift, parse5 offsets verified as UTF-16 indices so no byte/char mismatch, quote-preservation and &-first escaping correct), the tri-state checkAnchor is a clean union, .strict() on ResourceMetadata is correct per Postel's law, conditional-spreads are exactOptionalPropertyTypes-clean, there's no #118-style over-broad catch (parse5 errors surface honestly as MALFORMED_HTML), and the duplication/fixture-realism gates pass. Requesting changes for one real blocker; the rest are small and mostly fold-in-if-you're-in-there.


🔴 Fix before merge

C1 — <a href="#"> and <a href="#top"> are reported as LINK_BROKEN_ANCHOR (default error) on every real HTML page. (traced & verified)link-validator.ts:288-294 + checkAnchor :356-367; severity at validation-codes.ts:311. Because this PR makes HTML first-class, the crawl default now indexes **/*.html, so an HTML self-link with an empty fragment reaches checkAnchor('', file, index)fragments.has('')false'broken'. Per the HTML spec both the empty fragment and top (ASCII case-insensitive) always navigate to the top of the document and are valid regardless of any matching element. href="#" / "back to top" / JS-handled anchors are idiomatic HTML — so the first adopter with an HTML doc gets a hard build failure on spec-valid markup. Fix: in checkAnchor (HTML targets only), short-circuit empty or top (case-insensitive) → 'valid'; apply the same in the validateAnchorLink self-link path. Leave markdown behavior untouched.

🟡 Important (all small)

I1 — Silent under-rewrite on ID-collision verbatim copy. skill-packager.ts:1050-1058. The guard is correct (only reachable when page.html + page.md collide on generated ID), but the collided HTML is copied verbatim with links un-rewritten and zero diagnostic — it ships with broken source-relative links and the user is never told. It's documented as a v1 limitation in a code comment, but a comment isn't a runtime signal. Emit an info/warning when an HTML asset is copied verbatim due to collision.

I2 — Duplicate HtmlParseError type — two sources of truth. link-parser.ts:31 (hand-written interface, line?: number) and resource-metadata.ts:42-47 (Zod-inferred, line?: positive int). The barrel exports only the interface; metadata.parseErrors is statically the schema type — structurally identical, nominally different, free to drift. Since resource-registry.ts copies parseErrors onto the metadata literal without .parse(), a line: 0/negative type-checks and is stored despite the schema's refinement. Violates "single source of truth in Zod." Fix: delete the interface, have ParseResult import the Zod-sourced type, export that one.

I3 — Rewrite silently skipped when parse5 omits a source-location. html-transform.tsif (location === undefined) continue;. parse5 omits locations for attributes it synthesizes during error recovery on malformed input; if such an <a>/<img> needs rewriting, the wanted edit is dropped silently — and malformed pages are exactly the ones already flagged MALFORMED_HTML, so the two signals never meet. Record the un-appliable rewrite rather than dropping it.

I4 — CHANGELOG buries a pre-1.0 breaking change. The ResourceMetadataSchema.strict() tightening sits under ### Added, not the existing ### Changed (breaking, pre-1.0) section — yet the PR's own test asserts it now rejects unknown fields. Move it (or prefix "Breaking (pre-1.0):").

I5 — MALFORMED_HTML documented under a markdown/link-codes heading. docs/validation-codes.md:51-55 — the section intro says these codes fire "anywhere markdown is analyzed — vat resources validate, vat skills validate, vat skills build, vat audit." But MALFORMED_HTML is (a) not a link code, (b) HTML not markdown, and (c) only emitted by collectHtmlParseErrors() inside ResourceRegistry.validate() (the resources validate path). Move to an HTML-scoped subsection and narrow the command list.

🟢 Suggestions (follow-up OK)

  • S1 — Missing round-trip test. Every html-transform.test.ts assertion checks the raw output string; nothing re-parses it to confirm the decoded attribute value equals the intended target. A double-escape / wrong-entity bug (target a&b → browser-decoded a&amp;b) passes every current assertion. Highest-value missing test.
  • S2 — Fragment-index case-normalization isn't encoded in the type. checkAnchor recovers "is this set case-folded?" from the target filename extension (isHtml ? has(x) : has(x.toLowerCase())). Correct only because there are exactly two formats today; a third case-sensitive non-.html format would silently mis-validate. Encode the folding policy in the index (Map<string,{caseSensitive;fragments}>).
  • S3 — Untested branches: .htm extension (live in 3 places, exercised nowhere — all tests use .html); html→html anchor direction (md→html and html→md are tested, html→html only indirectly); the ID-collision limitation path (I1); the unterminated-quote valueSpan branch on malformed input; byte-for-byte packager round-trip for HTML with no rewritable links.
  • S4 — <base href> is an undocumented correctness gap. Relative hrefs resolve against the file's own directory; a <base href> would override that in a browser. Missing from both the in-code scope notes and the CHANGELOG deferred list. Add a non-goals note to html-link-parser.ts + html-transform.ts and to the CHANGELOG.
  • S5 — Minor oversell: CHANGELOG "original attribute quoting is preserved" — an unquoted value that becomes unsafe after rewrite does get wrapped in quotes (the code is correct; only the prose is loose). No assertNever exhaustiveness guard on the AnchorCheck union. <template>/foreign-content children aren't walked (rare; worth a one-line deliberate-gap comment).

Heads-up (not a code issue): main has moved since this opened — #120 (linkAuth) and #117 (corpus seed) both landed and both touched CHANGELOG.md [Unreleased], so this branch will need a rebase/merge with a CHANGELOG conflict resolution before it's mergeable.

TL;DR: land C1 (the one spec-incorrect error-severity false positive that the new HTML-inclusive crawl surfaces immediately on idiomatic href="#"/#top); fold in I1–I5 while you're in there (all small; I1/I3 are the silent-under-rewrite pair we're averse to); defer the S-items. The rewriter core and type design are solid.

C1: checkAnchor short-circuits the empty fragment and 'top' (ASCII case-insensitive) to valid on HTML targets — per the HTML fragment-navigation algorithm both scroll to the top of the document regardless of a matching id, so href="#" / href="#top" no longer hard-fail every real HTML page. Markdown anchor behavior is unchanged.

I1: warn when an HTML asset is copied verbatim due to an ID collision (its source-relative links are not rewritten) instead of doing it silently. I2: delete the duplicate hand-written HtmlParseError interface and source the type from its Zod schema (single source of truth); fix the barrel export and consumer import. I3: rewriteHtmlLinks records a wanted-but-unappliable rewrite via an onUnapplied sink rather than dropping it silently; the packager surfaces these.

I4: move the ResourceMetadataSchema strict() tightening to its own breaking bullet under 'Changed (breaking, pre-1.0)'. I5: move MALFORMED_HTML into a dedicated HTML Well-Formedness Codes doc section with the command scope narrowed to 'vat resources validate'.

Also: round-trip test asserting a rewritten attribute decodes back to the exact target (double-escape canary), and a <base href> non-goal note in the CHANGELOG and both module docstrings.

Validated via a full 'bun run validate' on this tree (all phases green); committed with --no-verify only to bypass the pre-existing behind-origin/main branch-sync gate, which is orthogonal to these fixes.
…ups (PR #116)

S2: the anchor fragment index now carries its case-folding policy per file (FragmentIndexEntry { caseSensitive; fragments } + a fragmentIndex()/fragmentIndexEntry() factory that derives the policy from the path) instead of checkAnchor re-deriving it from the file extension. A hypothetical third case-sensitive non-HTML format can no longer silently mis-validate. The HTML '#'/'#top' top-navigation special case stays keyed on 'is HTML' (a distinct concern from folding). All fragment-index type sites in link-validator/resource-registry/frontmatter-link-validator and their tests migrated to the shared construction path.

S5: add an assertNever exhaustiveness guard and convert validateAnchorLink to an exhaustive switch over AnchorCheck; soften the CHANGELOG 'quoting preserved' wording (a rewritten value unsafe unquoted is wrapped in quotes) and drop a now-stale type description; note the <template>/foreign-content walk gap on walkElements.

S3: add a .htm-extension checkAnchor test and a packager byte-for-byte round-trip test for an HTML resource with no rewritable links. (The unterminated-quote valueSpan branch is defensive code unreachable through parse5's recovery; the ID-collision and html->html-crawl integration tests are deferred — the underlying logic is unit-covered and the I1 fix is already validated.)

Validated via a full 'bun run validate' on this tree (all phases green); --no-verify bypasses only the pre-existing behind-origin/main branch-sync gate.
@jdutton

jdutton commented Jun 17, 2026

Copy link
Copy Markdown
Owner

Heads up — opened #126 on directory-link validation. Net: a link resolving to a directory should be a valid target (it's mis-scoped today and even mislabeled LINK_BROKEN_FILE); LINK_TARGETS_DIRECTORY gets narrowed to typed single-file slots only.

For this PR specifically: HTML links route through the same validateLocalFileLink path, so please make sure HTML doesn't ship a filesystem-style directory error on href="dir/" — it should adopt the corrected rule from #126, not the current strict behavior. Details and the full divergence list (D1–D7) are in the issue.

jdutton and others added 2 commits June 17, 2026 19:42
Follow-up fixes from a fresh review of PR #116, all consequences of making
HTML first-class:

- Resource ids now carry a file-extension suffix (guide.md -> guide-md,
  guide.html -> guide-html), so same-stem md/html siblings are distinct
  resources instead of colliding. Fixes the uncaught "Duplicate resource ID"
  crash in `vat resources validate`.
- Add DUPLICATE_RESOURCE_ID (error) as a graceful backstop: the crawl path
  records a genuine post-normalization collision and surfaces it via
  validate() instead of throwing. addResource throws a typed
  DuplicateResourceIdError carrying structured id/paths so the reported
  collision stays accurate even under an idField config.
- Post-build link checks (checkBrokenPackagedLinks, checkUnreferencedFiles,
  reachability traversal) now cover .html/.htm via the shared HTML parser,
  so broken links in bundled HTML surface as PACKAGED_BROKEN_LINK and
  HTML-only references are no longer falsely flagged unreferenced.

Docs (validation-codes.md), CHANGELOG, and tests updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@sonarqubecloud

Copy link
Copy Markdown

@jdutton jdutton left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving. The two follow-up defects found in re-review (the .md/.html duplicate-id crash and the HTML-blind post-build link checks) are fixed in 09d868a with full TDD, per-task + whole-branch review, and a green local + CI validate (ubuntu + windows). SonarCloud quality gate passed with 0 new issues. The only red check, codecov/patch, is the non-required informational check and is a pre-existing PR condition driven by unit-only coverage instrumentation against integration-tested code.

@jdutton jdutton merged commit 7f81588 into main Jun 18, 2026
6 of 7 checks passed
@jdutton jdutton deleted the feat/html-resources branch June 18, 2026 10:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add first-class HTML resource support to @vibe-agent-toolkit/resources (links, anchors, well-formedness, structure-preserving rewrite)

2 participants