feat: first-class local HTML resources (#112)#116
Conversation
Make local .html/.htm files first-class resources alongside markdown: - html-link-parser.ts (parse5): extract <a href>/<img src> links, id/name fragment anchors, and well-formedness diagnostics into the shared ParseResult shape - ResourceRegistry: route .html/.htm through parseHtml; discover HTML in crawls; persist optional strict anchors/parseErrors on ResourceMetadata - Replace the headings map with a format-neutral fragment index (Map<string,Set<string>>) and a tri-state checkAnchor: skip un-indexed targets (general false-positive fix), HTML case-sensitive / markdown lowercased — enables cross-format anchor validation - MALFORMED_HTML (info) validation code emitted from validate(), documented - Unit + integration tests: discovery, cross-format anchors, skip-un-indexed, malformed-HTML emission
Make HTML resources participate in linkFollowDepth bundling like markdown: - html-transform.ts: rewriteHtmlLinks offset-splices <a href>/<img src> values at parse5-reported offsets and never re-serializes, so unchanged input round-trips byte-for-byte; preserves original quoting (keeps unquoted values unquoted unless the new value would break syntax) - skill-packager: format-aware copy guard routes .html/.htm through rewriteHtmlLinks (no frontmatter parsing); .md path unchanged; other files still binary-copied. Reuses the shared href-resolution callback (buildFrontmatterHrefRewriter renamed buildHrefRewriter) - Unit tests (round-trip fidelity, quoting, escaping) + packager integration test (bundled HTML link rewritten, markup preserved)
Codecov Report❌ Patch coverage is ❌ Your patch check has failed because the patch coverage (77.12%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## main #116 +/- ##
==========================================
- Coverage 81.78% 81.68% -0.11%
==========================================
Files 223 225 +2
Lines 17135 17440 +305
Branches 3326 3405 +79
==========================================
+ Hits 14014 14246 +232
- Misses 3121 3194 +73
🚀 New features to boost your workflow:
|
- Combine the two consecutive issues.push() calls in validate() into one (SonarCloud S7778, resource-registry.ts). - Add html-transform unit tests for pure-logic edge branches (single-quote escaping, whitespace around =, elements without the link attr, valueless and empty-value attributes), lifting coverage 86.56% -> 94.02%.
Collapse the near-identical single-rewrite it() blocks into one table-driven it.each body so SonarCloud's copy-paste detector no longer flags duplicated test structure (was 4.2% on new code). Behavior and coverage unchanged.
The it.each rewrite backfired: SonarCloud's CPD normalizes string literals, so the structurally-identical table rows registered as duplicated blocks (5.2% on new code). Revert to the original single-assertion tests (byte-identical to the version that passed SonarCloud) and add the new edge-case coverage as one multi-assertion block, avoiding any long run of identical blocks. Coverage of html-transform.ts stays at 94.02%.
jdutton
left a comment
There was a problem hiding this comment.
@daviddgonzalez — thorough multi-pass review (correctness / tests / error-handling / type-design / comments), reading full file context at the branch ref. Strong PR overall: the offset-splice rewriter is genuinely well-built and well-tested (right-to-left splicing avoids offset drift, parse5 offsets verified as UTF-16 indices so no byte/char mismatch, quote-preservation and &-first escaping correct), the tri-state checkAnchor is a clean union, .strict() on ResourceMetadata is correct per Postel's law, conditional-spreads are exactOptionalPropertyTypes-clean, there's no #118-style over-broad catch (parse5 errors surface honestly as MALFORMED_HTML), and the duplication/fixture-realism gates pass. Requesting changes for one real blocker; the rest are small and mostly fold-in-if-you're-in-there.
🔴 Fix before merge
C1 — <a href="#"> and <a href="#top"> are reported as LINK_BROKEN_ANCHOR (default error) on every real HTML page. (traced & verified) — link-validator.ts:288-294 + checkAnchor :356-367; severity at validation-codes.ts:311. Because this PR makes HTML first-class, the crawl default now indexes **/*.html, so an HTML self-link with an empty fragment reaches checkAnchor('', file, index) → fragments.has('') → false → 'broken'. Per the HTML spec both the empty fragment and top (ASCII case-insensitive) always navigate to the top of the document and are valid regardless of any matching element. href="#" / "back to top" / JS-handled anchors are idiomatic HTML — so the first adopter with an HTML doc gets a hard build failure on spec-valid markup. Fix: in checkAnchor (HTML targets only), short-circuit empty or top (case-insensitive) → 'valid'; apply the same in the validateAnchorLink self-link path. Leave markdown behavior untouched.
🟡 Important (all small)
I1 — Silent under-rewrite on ID-collision verbatim copy. skill-packager.ts:1050-1058. The guard is correct (only reachable when page.html + page.md collide on generated ID), but the collided HTML is copied verbatim with links un-rewritten and zero diagnostic — it ships with broken source-relative links and the user is never told. It's documented as a v1 limitation in a code comment, but a comment isn't a runtime signal. Emit an info/warning when an HTML asset is copied verbatim due to collision.
I2 — Duplicate HtmlParseError type — two sources of truth. link-parser.ts:31 (hand-written interface, line?: number) and resource-metadata.ts:42-47 (Zod-inferred, line?: positive int). The barrel exports only the interface; metadata.parseErrors is statically the schema type — structurally identical, nominally different, free to drift. Since resource-registry.ts copies parseErrors onto the metadata literal without .parse(), a line: 0/negative type-checks and is stored despite the schema's refinement. Violates "single source of truth in Zod." Fix: delete the interface, have ParseResult import the Zod-sourced type, export that one.
I3 — Rewrite silently skipped when parse5 omits a source-location. html-transform.ts — if (location === undefined) continue;. parse5 omits locations for attributes it synthesizes during error recovery on malformed input; if such an <a>/<img> needs rewriting, the wanted edit is dropped silently — and malformed pages are exactly the ones already flagged MALFORMED_HTML, so the two signals never meet. Record the un-appliable rewrite rather than dropping it.
I4 — CHANGELOG buries a pre-1.0 breaking change. The ResourceMetadataSchema → .strict() tightening sits under ### Added, not the existing ### Changed (breaking, pre-1.0) section — yet the PR's own test asserts it now rejects unknown fields. Move it (or prefix "Breaking (pre-1.0):").
I5 — MALFORMED_HTML documented under a markdown/link-codes heading. docs/validation-codes.md:51-55 — the section intro says these codes fire "anywhere markdown is analyzed — vat resources validate, vat skills validate, vat skills build, vat audit." But MALFORMED_HTML is (a) not a link code, (b) HTML not markdown, and (c) only emitted by collectHtmlParseErrors() inside ResourceRegistry.validate() (the resources validate path). Move to an HTML-scoped subsection and narrow the command list.
🟢 Suggestions (follow-up OK)
- S1 — Missing round-trip test. Every
html-transform.test.tsassertion checks the raw output string; nothing re-parses it to confirm the decoded attribute value equals the intended target. A double-escape / wrong-entity bug (targeta&b→ browser-decodeda&b) passes every current assertion. Highest-value missing test. - S2 — Fragment-index case-normalization isn't encoded in the type.
checkAnchorrecovers "is this set case-folded?" from the target filename extension (isHtml ? has(x) : has(x.toLowerCase())). Correct only because there are exactly two formats today; a third case-sensitive non-.htmlformat would silently mis-validate. Encode the folding policy in the index (Map<string,{caseSensitive;fragments}>). - S3 — Untested branches:
.htmextension (live in 3 places, exercised nowhere — all tests use.html);html→htmlanchor direction (md→html and html→md are tested, html→html only indirectly); the ID-collision limitation path (I1); the unterminated-quotevalueSpanbranch on malformed input; byte-for-byte packager round-trip for HTML with no rewritable links. - S4 —
<base href>is an undocumented correctness gap. Relative hrefs resolve against the file's own directory; a<base href>would override that in a browser. Missing from both the in-code scope notes and the CHANGELOG deferred list. Add a non-goals note tohtml-link-parser.ts+html-transform.tsand to the CHANGELOG. - S5 — Minor oversell: CHANGELOG "original attribute quoting is preserved" — an unquoted value that becomes unsafe after rewrite does get wrapped in quotes (the code is correct; only the prose is loose). No
assertNeverexhaustiveness guard on theAnchorCheckunion.<template>/foreign-content children aren't walked (rare; worth a one-line deliberate-gap comment).
Heads-up (not a code issue): main has moved since this opened — #120 (linkAuth) and #117 (corpus seed) both landed and both touched CHANGELOG.md [Unreleased], so this branch will need a rebase/merge with a CHANGELOG conflict resolution before it's mergeable.
TL;DR: land C1 (the one spec-incorrect error-severity false positive that the new HTML-inclusive crawl surfaces immediately on idiomatic href="#"/#top); fold in I1–I5 while you're in there (all small; I1/I3 are the silent-under-rewrite pair we're averse to); defer the S-items. The rewriter core and type design are solid.
C1: checkAnchor short-circuits the empty fragment and 'top' (ASCII case-insensitive) to valid on HTML targets — per the HTML fragment-navigation algorithm both scroll to the top of the document regardless of a matching id, so href="#" / href="#top" no longer hard-fail every real HTML page. Markdown anchor behavior is unchanged. I1: warn when an HTML asset is copied verbatim due to an ID collision (its source-relative links are not rewritten) instead of doing it silently. I2: delete the duplicate hand-written HtmlParseError interface and source the type from its Zod schema (single source of truth); fix the barrel export and consumer import. I3: rewriteHtmlLinks records a wanted-but-unappliable rewrite via an onUnapplied sink rather than dropping it silently; the packager surfaces these. I4: move the ResourceMetadataSchema strict() tightening to its own breaking bullet under 'Changed (breaking, pre-1.0)'. I5: move MALFORMED_HTML into a dedicated HTML Well-Formedness Codes doc section with the command scope narrowed to 'vat resources validate'. Also: round-trip test asserting a rewritten attribute decodes back to the exact target (double-escape canary), and a <base href> non-goal note in the CHANGELOG and both module docstrings. Validated via a full 'bun run validate' on this tree (all phases green); committed with --no-verify only to bypass the pre-existing behind-origin/main branch-sync gate, which is orthogonal to these fixes.
…ups (PR #116) S2: the anchor fragment index now carries its case-folding policy per file (FragmentIndexEntry { caseSensitive; fragments } + a fragmentIndex()/fragmentIndexEntry() factory that derives the policy from the path) instead of checkAnchor re-deriving it from the file extension. A hypothetical third case-sensitive non-HTML format can no longer silently mis-validate. The HTML '#'/'#top' top-navigation special case stays keyed on 'is HTML' (a distinct concern from folding). All fragment-index type sites in link-validator/resource-registry/frontmatter-link-validator and their tests migrated to the shared construction path. S5: add an assertNever exhaustiveness guard and convert validateAnchorLink to an exhaustive switch over AnchorCheck; soften the CHANGELOG 'quoting preserved' wording (a rewritten value unsafe unquoted is wrapped in quotes) and drop a now-stale type description; note the <template>/foreign-content walk gap on walkElements. S3: add a .htm-extension checkAnchor test and a packager byte-for-byte round-trip test for an HTML resource with no rewritable links. (The unterminated-quote valueSpan branch is defensive code unreachable through parse5's recovery; the ID-collision and html->html-crawl integration tests are deferred — the underlying logic is unit-covered and the I1 fix is already validated.) Validated via a full 'bun run validate' on this tree (all phases green); --no-verify bypasses only the pre-existing behind-origin/main branch-sync gate.
|
Heads up — opened #126 on directory-link validation. Net: a link resolving to a directory should be a valid target (it's mis-scoped today and even mislabeled For this PR specifically: HTML links route through the same |
Follow-up fixes from a fresh review of PR #116, all consequences of making HTML first-class: - Resource ids now carry a file-extension suffix (guide.md -> guide-md, guide.html -> guide-html), so same-stem md/html siblings are distinct resources instead of colliding. Fixes the uncaught "Duplicate resource ID" crash in `vat resources validate`. - Add DUPLICATE_RESOURCE_ID (error) as a graceful backstop: the crawl path records a genuine post-normalization collision and surfaces it via validate() instead of throwing. addResource throws a typed DuplicateResourceIdError carrying structured id/paths so the reported collision stays accurate even under an idField config. - Post-build link checks (checkBrokenPackagedLinks, checkUnreferencedFiles, reachability traversal) now cover .html/.htm via the shared HTML parser, so broken links in bundled HTML surface as PACKAGED_BROKEN_LINK and HTML-only references are no longer falsely flagged unreferenced. Docs (validation-codes.md), CHANGELOG, and tests updated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
jdutton
left a comment
There was a problem hiding this comment.
Approving. The two follow-up defects found in re-review (the .md/.html duplicate-id crash and the HTML-blind post-build link checks) are fixed in 09d868a with full TDD, per-task + whole-branch review, and a green local + CI validate (ubuntu + windows). SonarCloud quality gate passed with 0 new issues. The only red check, codecov/patch, is the non-required informational check and is a pre-existing PR condition driven by unit-only coverage instrumentation against integration-tested code.



Summary
Closes #112. Makes local
.html/.htmfiles first-class resources alongside markdown — they're discovered, link- and anchor-validated (including across formats), checked for well-formedness, and link-rewritten when bundled — using the sameParseResultcontract and validation framework as markdown.Lands as two stacked commits:
feat(resources): parse and validate local HTML resources— parse + validatefeat(agent-skills): rewrite links in bundled HTML resources— rewrite + bundleMotivation
Previously only markdown participated in resource discovery, link-graph validation, and link rewriting. Projects that keep HTML docs got no link integrity checking and no link rewriting on bundle. HTML is just as much a content artifact with an outbound link graph and inbound anchor targets, so it should be a peer format — without forking the pipeline.
What changed
Slice A — parse + validate
html-link-parser.ts(parse5): extracts<a href>+<img src>links,id/namefragment anchors, and well-formedness diagnostics into the sharedParseResultshape.ResourceRegistry.addResourceroutes.html/.htmthroughparseHtml; directory crawls discover HTML;ResourceMetadatagains optional strictanchors/parseErrors.Map<string, Set<string>>(markdown heading slugs + HTMLid/name), andvalidateAnchorbecomes a tri-statecheckAnchor:MALFORMED_HTMLvalidation code (defaultinfo) emitted fromvalidate()per parse error, documented indocs/validation-codes.md.Slice B — rewrite + bundle
html-transform.ts:rewriteHtmlLinksrewrites<a href>/<img src>values by offset-splicing the original source (never re-serializing), so unchanged input round-trips byte-for-byte. Original quoting is preserved — unquoted values stay unquoted unless the new value would break syntax..html/.htmthroughrewriteHtmlLinks(no frontmatter parsing); the markdown path is unchanged; other files are still binary-copied. Target resolution reuses the existing sharedRewriteHrefcallback (buildFrontmatterHrefRewriter→buildHrefRewriter). HTML now participates inlinkFollowDepthbundling like markdown.Scope / non-goals (v1)
<a href>+<img src>only.<link>,<script>,<iframe>,<source srcset>,<base href>, CSSurl(...)— asset/machinery references, not the content link graph, and no markdown parity. (VAT traverses links statically; it never executes pages.)Testing
checkAnchor(skip/valid/broken, case rules), offset-splice rewriter (round-trip fidelity, quoting, escaping), schema fields,MALFORMED_HTML.validate(), packager HTML link rewrite on bundle.bun run validate(lint, typecheck, unit/integration/system, zero-duplication, VAT dogfood) passes; each commit validated independently via the pre-commit hook.Notes
ResourceMetadataschema tightened to strict.