Contributions welcome. This issue contains a complete, approved design spec for first-class HTML
resource support in @vibe-agent-toolkit/resources. The full spec is inline below so you can drop it
straight into Claude Code (or work it by hand) and implement against it. All decisions are locked; §12 has
a file touch-list and §10 has the test plan. The spec has been independently reviewed twice against the
live codebase (file:line citations verified). Please follow the repo's CLAUDE.md workflow
(bun run validate, zero-duplication policy, test pyramid) and open a PR.
Design Spec: First-Class HTML Resources in @vibe-agent-toolkit/resources
Status: Approved design — ready for implementation. Contributions welcome.
Date: 2026-06-01
Package: packages/resources (with touchpoints in packages/agent-skills and packages/cli)
Orientation for implementers — read this first. The live validation/packaging pipeline is built on
the ResourceMetadata model (packages/resources/src/schemas/resource-metadata.ts), produced by
parseMarkdown() (link-parser.ts:52) and assembled in ResourceRegistry.addResource()
(resource-registry.ts:300). That is where HTML plugs in.
There is a second, unrelated type system in packages/resources/src/types/resources.ts — the
ResourceType enum, the Resource/MarkdownResource discriminated union, and
detectResourceType/parse*Resource. Do not build this feature there. Those symbols are consumed
only by packages/resource-compiler (markdown→TypeScript compilation) and by the types.ts barrel — the
registry, link validator, and skill packager never touch them. Adding an HtmlResource to that union
would compile but would not make HTML files validate or bundle. (Teaching resource-compiler about HTML
is a separate, out-of-scope enhancement.)
⚠️ Updated for the post-#114 validation framework (2026-06-02). PR #114 (validation-code
consolidation) landed after this spec was first written. It (a) replaced the resources package's old
free-form ValidationIssue.type (z.string()) model with the unified, registry-backed ValidationIssue
from @vibe-agent-toolkit/agent-schema, (b) promoted every resources code into the canonical
CODE_REGISTRY (LINK_BROKEN_FILE, LINK_BROKEN_ANCHOR, LINK_UNKNOWN, FRONTMATTER_*,
EXTERNAL_URL_DEAD/TIMEOUT/ERROR), and (c) made vat resources validate derive its exit code purely from
the framework's severity-based hasErrors. §6 and §12 below are rewritten for this model; HTML now
adds exactly one new registry code (MALFORMED_HTML) and writes no exit-code plumbing. All :line
citations elsewhere in this spec predate #114 (which rewrote link-validator.ts, resource-registry.ts,
validate.ts, and validation-result.ts) — treat them as approximate and re-locate symbols by name.
1. Summary
Today the resources package is markdown-first. .html/.htm files are not resources: they are not
discovered, parsed, validated, link-checked, or rewritten on bundle. This spec makes local HTML files
first-class resources — files that produce a ResourceMetadata just like markdown does — so they
participate fully in:
- Discovery (collection include globs / path-arg scan)
- Link validation (links inside HTML, validated through the existing link validator)
- Anchor/fragment integrity (cross-format link graph: md → html → md)
- Well-formedness reporting (parse errors)
- Link rewriting on bundle (vat skills build / linkFollowDepth), byte-for-byte structure-preserving
HTML metadata/schema validation is explicitly deferred, but the design reserves a pluggable seam so it
can land later without rework.
Non-goals (v1)
- Remote HTML (fetching http(s) HTML pages as resources). Documented as a future extension; the link
  graph still checks remote URLs that HTML links point at, via the existing --check-external-urls
  mechanism, which is unchanged.
- HTML metadata/schema validation (<meta>, JSON-LD, OpenGraph, microdata). Deferred; seam reserved.
- Non-<a>/<img> URL-bearing elements (<link>, <script>, <iframe>, <source>, media). Deferred.
- HTML → text extraction for RAG. Out of scope.
- HTML in resource-compiler (the Resource union / ResourceType enum subsystem). Out of scope.
- On-demand parsing of out-of-scope link targets. Anchor validation only runs against files that were
  discovered and parsed into the index (see §5).
- HTML reformatting / prettifying. We never re-serialize HTML.
2. Scope Decisions (locked)
| Decision | Choice | Rationale |
| --- | --- | --- |
| Resource source | Local files only in v1; remote documented as future | Keeps v1 tractable; the primary use case is local HTML docs in a KB |
| Resource model | ResourceMetadata (the live pipeline model), via a parseHtml branch in addResource | Not the resource-compiler Resource union (see Orientation) |
| Metadata validation | Deferred, seam reserved | HTML has many metadata surfaces; premature to pick one. Must not preclude pluggable extractors |
| v1 validation | Links + anchors/fragments + well-formedness | Well-formedness falls out by necessity (we must parse to extract) |
| HTML parser | parse5 | WHATWG-spec, pure-JS (no native deps), exposes per-attribute source locations + onParseError |
| Link elements extracted | <a href> + <img src> | Navigational graph + broken-image detection; matches a docs/KB use case |
| Link rewriting mechanism | Offset splicing, never AST serialize | parse5 serialize is lossy; offset splice preserves bytes |
3. Architecture
HTML files become a second input format that produces the same ResourceMetadata shape markdown does.
The integration point is a single extension branch in ResourceRegistry.addResource.
3.1 The model: ResourceMetadata (unchanged shape, two new optional fields)
ResourceMetadata (schemas/resource-metadata.ts) is what the registry indexes and what the validator and
packager consume. Its relevant fields:
// existing (schemas/resource-metadata.ts)
ResourceMetadata = {
id: string;
filePath: string;
links: ResourceLink[]; // rich objects: { href, type, line, text, nodeType, ... }
headings: HeadingNode[]; // markdown headings (slug-based)
frontmatter?: ...;
sizeBytes: number;
estimatedTokenCount: number;
modifiedAt: Date;
checksum: ...;
collections?: string[];
// ...
}
Two additive changes (both optional, so markdown is unaffected):
/** Fragment identifiers an HTML file exposes: element `id`s + <a name> values. Undefined for markdown. */
anchors?: string[];
/** Well-formedness errors from the HTML parser. Undefined for markdown. */
parseErrors?: HtmlParseError[];
There is no type discriminator on ResourceMetadata and we do not add one. Format-dependent
behavior (e.g. anchor case-sensitivity in §5) keys off the file extension, consistent with how the
rest of the pipeline already works.
A reserved future seam — metadata?: Record<string, unknown> for pluggable HTML metadata extractors — is
described in §8 but not added in v1.
3.2 html-parser.ts (new module, parallels parseMarkdown in link-parser.ts)
parseHtml returns the same field shape addResource already destructures from parseMarkdown, so
addResource can build a ResourceMetadata uniformly regardless of format. Uses parse5 with
sourceCodeLocationInfo: true and an onParseError hook.
export interface HtmlParseError {
code: string; // parse5 error code, e.g. 'missing-end-tag'
line: number;
col: number;
}
export interface ParsedHtml {
content: string; // raw source, unmodified
links: ResourceLink[]; // SAME type as parseMarkdown — { href, type, line, ... }
headings: HeadingNode[]; // empty for HTML (no markdown headings); kept for shape parity
anchors: string[]; // element ids + <a name> → ResourceMetadata.anchors
frontmatter: undefined; // HTML has no frontmatter (v1)
sizeBytes: number;
estimatedTokenCount: number; // content.length / 4, matching markdown
parseErrors: HtmlParseError[];
}
export async function parseHtml(absolutePath: string): Promise<ParsedHtml>;
Extraction rules:
- Links → ResourceLink[]: walk the parse5 AST; collect <a href> and <img src>. For each, build a
  ResourceLink whose type is assigned by the existing classifyLink(href) (link-parser.ts:190)
  and whose line comes from parse5's location info. An HTML <a href="../foo.md"> therefore produces the
  same kind of ResourceLink a markdown link does (local_file/anchor/external/email/unknown) and flows
  through validateLink and the external-URL collector with no changes to either.
- Anchors → string[]: every element with an id attribute, plus <a name="...">. Raw strings, case
  preserved (see §5).
- Parse errors: captured from parse5's onParseError, filtered to a meaningful set (see §6).
Note: ParsedHtml.links carries no byte offsets. The rewriter (§7) does its own parse5 pass over
the source to recover attribute locations — exactly as the markdown rewriter re-scans the body rather than
threading offsets through ResourceLink. This keeps ResourceLink/ResourceMetadata unchanged.
3.3 Integration point: extension branch in addResource
ResourceRegistry.addResource (resource-registry.ts:300) currently calls parseMarkdown(absolutePath)
unconditionally. Change it to dispatch by extension:
// `path` is already imported in resource-registry.ts (:12); `path.extname` is the
// established pattern here (:1333) — it is NOT one of the safePath-enforced fns.
const ext = path.extname(absolutePath).toLowerCase();
const parseResult = (ext === '.html' || ext === '.htm')
? await parseHtml(absolutePath)
: await parseMarkdown(absolutePath);
The rest of addResource is shape-compatible: parseResult.links, .headings, .frontmatter,
.sizeBytes, .estimatedTokenCount already feed the ResourceMetadata literal (resource-registry.ts:323-339).
Add anchors and parseErrors to that literal when present. ID generation falls back to the path/filename
stem when frontmatter is undefined (existing behavior), which is correct for HTML.
4. Data Flow
crawl (include globs) / addResource(path)
→ extension branch: .html/.htm → parseHtml(path) (else parseMarkdown)
→ ResourceMetadata { links: ResourceLink[], anchors, parseErrors, ... }
→ indexResource() → fragment index gets this file's anchors (§5)
→ validate(): each ResourceLink → existing validateLink:
local_file → existence + git-ignore safety + anchor (via generalized index)
anchor → fragment check against THIS file's anchors
external → existing --check-external-urls pipeline (unchanged)
email → ok
unknown → LINK_UNKNOWN warning
→ parseErrors → MALFORMED_HTML issues (§6)
classifyLink and validateLink require no changes — HTML links are ResourceLinks from a different
parser.
5. The Cross-Format Core: Generalized Fragment Index
Today validateAnchor (link-validator.ts:344) consults headingsByFile: Map<string, HeadingNode[]>
(defined by buildHeadingsByFileMap at resource-registry.ts:1245, called at :672, and threaded into
validateLink at :458) and matches a heading slug case-insensitively (link-validator.ts:355, :377-381).
To make md → html → md work, generalize the fragment set a file exposes:
fragmentsByFile: Map<string /* absolute file path */, Set<string> /* valid fragment ids */>
- Markdown contributes heading slugs (existing behavior via github-slugger).
- HTML contributes ResourceMetadata.anchors (element ids + <a name>).
validateAnchor(fragment, targetPath, index) becomes a set-membership check, choosing case semantics
by the target file's extension:
- .md target → case-insensitive match (preserve current behavior; slugs are already lowercased).
- .html/.htm target → case-sensitive match (HTML fragment id matching is case-sensitive per the
  HTML standard in no-quirks mode).
Out-of-scope targets. If a link's target file exists but was not discovered/parsed into the index
(e.g. an HTML file outside any collection, or a markdown file the scan didn't include), anchor validation is
skipped — it does not emit a false LINK_BROKEN_ANCHOR. (Today validateAnchor returns false
for a target absent from the map, which yields a false LINK_BROKEN_ANCHOR. Fixing this to skip also
smooths that latent markdown sharp-edge. #114 rewrote link-validator.ts, so re-locate the exact branch by
name rather than line.) File-existence and git-ignore safety still apply.
Implementation latitude: the plan may change the value type of the existing map to Set<string>, or
derive a parallel fragmentsByFile from both markdown headings and HTML anchors. The contract is one
format-neutral fragment-set index consulted by validateAnchor. ResourceMetadata.headings stays as-is
(consumed elsewhere).
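A minimal sketch of the generalized check described in this section, under stated assumptions: FragmentIndex, AnchorCheck, isHtmlPath, and checkAnchor are hypothetical names chosen for illustration, not symbols from the repo, and the real validateAnchor signature may differ. It shows the three outcomes the section calls for: case-sensitive matching for HTML targets, case-insensitive matching for markdown targets, and a skip (rather than a failure) when the target was never indexed.

```typescript
// Hypothetical names throughout; this is an illustration, not the repo's API.
type FragmentIndex = Map<string, Set<string>>; // absolute file path -> fragment ids
type AnchorCheck = 'ok' | 'broken' | 'skipped';

function isHtmlPath(filePath: string): boolean {
  const lower = filePath.toLowerCase();
  return lower.endsWith('.html') || lower.endsWith('.htm');
}

function checkAnchor(fragment: string, targetPath: string, index: FragmentIndex): AnchorCheck {
  const fragments = index.get(targetPath);
  // Target was not discovered/parsed into the index: skip instead of
  // reporting a false LINK_BROKEN_ANCHOR.
  if (fragments === undefined) return 'skipped';

  if (isHtmlPath(targetPath)) {
    // HTML ids and <a name> values are compared case-sensitively.
    return fragments.has(fragment) ? 'ok' : 'broken';
  }

  // Markdown heading slugs are compared case-insensitively.
  const wanted = fragment.toLowerCase();
  const found = Array.from(fragments).some((f) => f.toLowerCase() === wanted);
  return found ? 'ok' : 'broken';
}

// Illustrative index with one markdown file and one HTML file.
const index: FragmentIndex = new Map([
  ['/kb/guide.md', new Set(['getting-started'])],
  ['/kb/api.html', new Set(['Install', 'usage'])],
]);
```

With this index, a link to /kb/api.html#install would be reported as broken (the id is Install), while the same case mismatch against a markdown heading slug would still match.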
6. Error Handling & Issue Types
Post-#114 model. The resources package no longer has a free-form ValidationIssue.type (z.string()).
Every validator now emits the unified ValidationIssue from @vibe-agent-toolkit/agent-schema, carrying a
code from the canonical CODE_REGISTRY and a resolved severity. Adding a finding = adding one registry
code, not documenting a string.
New code: MALFORMED_HTML — add it to CODE_REGISTRY in
packages/agent-schema/src/validation-codes.ts (alongside the already-present LINK_BROKEN_*,
FRONTMATTER_*, and EXTERNAL_URL_* entries).
- Emitted during validate() from each resource's parseErrors (a curated set of meaningful parse5 codes,
  e.g. unexpected/missing tags and duplicate attributes, not every HTML5 recovery quirk). Carries line,
  the parse5 code in the message, and the file path.
- Default severity: info (advisory; warning is also acceptable). Stays user-overridable like every
  other registry code.
Exit-code wiring — already fixed by #114, no work here. The old brittle filter
(issue.type !== 'external_url') is gone. vat resources validate now derives its exit code purely from
the framework's severity-based hasErrors (process.exit(hasErrors ? 1 : 0) in validate.ts): exit 1 iff
some emitted issue resolves to error. So MALFORMED_HTML at info/warning is automatically non-fatal —
there is no predicate to write and no bug to fix. The external-URL codes are likewise already registry
codes at warning default. The HTML implementer only needs to (a) add the MALFORMED_HTML registry entry
and (b) emit it from parseErrors during validation.
Reuses unchanged (already registry codes post-#114): LINK_BROKEN_FILE, LINK_BROKEN_ANCHOR, LINK_UNKNOWN,
and the EXTERNAL_URL_* family.
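The severity-based exit rule can be illustrated with a small stand-in. This is a sketch only: Severity, Issue, DEFAULT_SEVERITY, makeIssue, and the local hasErrors below are hypothetical simplifications, not the actual CODE_REGISTRY or framework types from @vibe-agent-toolkit/agent-schema. The default severities mirror the ones stated in this spec (MALFORMED_HTML at info, broken links at error).

```typescript
// Hypothetical stand-ins for the registry and framework types.
type Severity = 'error' | 'warning' | 'info';

interface Issue {
  code: string;
  severity: Severity;
}

// Assumed defaults, taken from this spec's text rather than the real registry.
const DEFAULT_SEVERITY: Record<string, Severity> = {
  LINK_BROKEN_FILE: 'error',
  LINK_BROKEN_ANCHOR: 'error',
  MALFORMED_HTML: 'info',
};

function makeIssue(code: string): Issue {
  // Unknown codes fall back to 'warning' in this sketch only.
  return { code, severity: DEFAULT_SEVERITY[code] ?? 'warning' };
}

// Exit 1 iff some emitted issue resolves to 'error'.
function hasErrors(issues: Issue[]): boolean {
  return issues.some((issue) => issue.severity === 'error');
}

function exitCodeFor(issues: Issue[]): number {
  return hasErrors(issues) ? 1 : 0;
}
```

Under these assumptions a run that emits only MALFORMED_HTML exits 0, and adding a single LINK_BROKEN_FILE flips it to 1, with no HTML-specific predicate involved.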
7. Link Rewriting & Round-Trip Fidelity (in-scope)
HTML resources must support the same deterministic, structure-preserving link rewriting markdown has, so
HTML files can be bundled into skills (vat skills build, linkFollowDepth) with their relative link
targets remapped.
7.1 The fidelity trap
The markdown rewriter (content-transform.ts, transformContent at :398, replace loop at :430) uses
String.replaceAll(regex, callback) and returns the original match for any link without a rewrite rule —
everything except the changed target is byte-identical. Frontmatter uses FrontmatterEditor, whose contract
is openFrontmatter(x).toString() === x byte-for-byte (packages/resources/src/frontmatter-editor.ts:7).
The obvious HTML approach — parse5 → mutate AST → serialize() — violates this contract. parse5's
serializer normalizes whitespace, attribute quoting, void elements, comments, and the doctype. A
parse→serialize round-trip is lossy. We must never re-serialize the document.
7.2 The mechanism: offset splicing
New module html-transform.ts (analog of content-transform.ts):
export function rewriteHtmlLinks(
source: string,
rules: LinkRewriteRules,
ctx: TransformContext,
): string;
It re-parses source with parse5 (sourceCodeLocationInfo: true) to obtain, for each <a href> / <img src>,
the attribute's source location.
Load-bearing detail — value sub-range. parse5's element.sourceCodeLocation.attrs[name] gives
{ startOffset, endOffset, startLine, ... } for the entire attribute (href="value" / src='value' /
href=value), not the value alone. The rewriter must compute the value sub-range itself:
- Slice the attribute span from source.
- Locate = after the attribute name, then the first non-whitespace char: if it's " or ', the value is
  between that quote and its match; otherwise the value is unquoted and runs to the attribute endOffset.
- Record (valueStart, valueEnd) (JS string indices, not bytes) and the quote char.
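The three steps above can be sketched as a pure string function. This is an illustration under assumptions: ValueRange and findAttrValueRange are hypothetical names, and attrStart/attrEnd stand in for the startOffset/endOffset that parse5 reports for the whole attribute. It does not call parse5 itself.

```typescript
// Hypothetical helper; names are illustrative, not from the repo.
interface ValueRange {
  valueStart: number; // index of the first value character in `source`
  valueEnd: number;   // index one past the last value character
  quote: string;      // '"', "'", or '' for an unquoted value
}

function findAttrValueRange(
  source: string,
  attrStart: number,
  attrEnd: number,
): ValueRange | undefined {
  const span = source.slice(attrStart, attrEnd);
  const eq = span.indexOf('=');
  if (eq === -1) return undefined; // valueless attribute, nothing to rewrite

  // Skip whitespace between '=' and the value.
  let i = eq + 1;
  while (i < span.length && /\s/.test(span.charAt(i))) i++;
  if (i >= span.length) return undefined;

  const first = span.charAt(i);
  if (first === '"' || first === "'") {
    const close = span.indexOf(first, i + 1);
    if (close === -1) return undefined; // unterminated quote: leave untouched
    return { valueStart: attrStart + i + 1, valueEnd: attrStart + close, quote: first };
  }
  // Unquoted value runs to the end of the attribute span.
  return { valueStart: attrStart + i, valueEnd: attrEnd, quote: '' };
}
```

Returning undefined for valueless or unterminated attributes is a choice made for this sketch; it keeps pathological input on the "leave untouched" path rather than guessing at a range.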
Algorithm:
- For each link, compute the new target via the same rule/template machinery markdown uses
(linkRewriteRules, resourceRegistry, relative-path computation). Links with no applicable rule are
left untouched.
- Collect (valueStart, valueEnd, newValue) edits.
- Apply edits to the original source string in descending valueStart order (so an edit that has
  already been applied never shifts the offsets of the edits still pending). Replace only the value
  characters, preserving the original quote char. HTML-escape the written value minimally (&, and the
  active quote char); computed relative paths normally need no escaping.
- The document is never re-serialized: comments, whitespace, indentation, and every untouched
  attribute remain byte-identical.
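The splice step of the algorithm above might look like the following sketch. ValueEdit, escapeAttrValue, and applyValueEdits are hypothetical names, and the handling of unquoted values (left open in §7.5) is deliberately not decided here: an unquoted edit is written with only & escaped.

```typescript
// Hypothetical names; a sketch of descending-offset splicing only.
interface ValueEdit {
  valueStart: number;
  valueEnd: number;
  newValue: string;
  quote: string; // the ORIGINAL quote char, which is preserved
}

// Minimal escaping: '&' plus whichever quote char delimits the value.
function escapeAttrValue(value: string, quote: string): string {
  let out = value.replace(/&/g, '&amp;');
  if (quote === '"') out = out.replace(/"/g, '&quot;');
  if (quote === "'") out = out.replace(/'/g, '&#39;');
  return out;
}

function applyValueEdits(source: string, edits: ValueEdit[]): string {
  // Highest offset first, so every pending edit's offsets stay valid.
  const ordered = [...edits].sort((a, b) => b.valueStart - a.valueStart);
  let result = source;
  for (const edit of ordered) {
    result =
      result.slice(0, edit.valueStart) +
      escapeAttrValue(edit.newValue, edit.quote) +
      result.slice(edit.valueEnd);
  }
  return result;
}
```

Because only the value characters are replaced, an empty edit list returns the input string unchanged, which is the no-op identity the round-trip contract relies on.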
7.3 Round-trip identity contract (tested)
- rewriteHtmlLinks(src, /* no matching rules */, ctx) === src, byte-for-byte.
- Rewriting a single link changes only that target; surrounding comments, whitespace, and other
attributes are unchanged (mirror packager-frontmatter-rewrite.integration.test.ts).
7.4 Packager integration
The real gate is the early-return guard at skill-packager.ts:1031, not the markdown rewrite block
below it:
// skill-packager.ts:1030-1034 — today this binary-copies EVERY non-.md file,
// so .html never reaches the rewrite path (the openFrontmatter/transformContent
// block at :1053-1056 is markdown-only).
if (!sourcePath.endsWith('.md') || !ctx.rewriteLinks) {
await copyFile(sourcePath, targetPath);
return;
}
Make this guard format-aware so HTML reaches a rewrite path instead of being binary-copied:
- .html/.htm (when ctx.rewriteLinks): read the file, look up the resource in ctx.fromRegistry
  (as the markdown path does at :1041), then writeFile(targetPath, rewriteHtmlLinks(content, rules, ctx)).
  Do not call openFrontmatter: HTML has no frontmatter and must not go through the frontmatter split.
- .md: existing openFrontmatter + transformContent path (:1053-1056), unchanged.
- everything else / rewriteLinks disabled: binary copy (unchanged).
This makes HTML resources participate in linkFollowDepth bundling exactly like markdown.
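One way to picture the format-aware guard is as a small pure classifier that the packager consults before doing any I/O. This is a sketch only: CopyStrategy and chooseCopyStrategy are hypothetical names, and the actual change may simply restructure the existing if statement in skill-packager.ts.

```typescript
// Hypothetical classifier; illustrates the three branches listed above.
type CopyStrategy = 'rewrite-html' | 'rewrite-markdown' | 'binary-copy';

function chooseCopyStrategy(sourcePath: string, rewriteLinks: boolean): CopyStrategy {
  // Rewriting disabled: everything is binary-copied, as today.
  if (!rewriteLinks) return 'binary-copy';

  // HTML extension check is case-insensitive, matching the §3.3 dispatch.
  const lower = sourcePath.toLowerCase();
  if (lower.endsWith('.html') || lower.endsWith('.htm')) return 'rewrite-html';

  // Markdown check mirrors the existing guard's case-sensitive endsWith('.md').
  if (sourcePath.endsWith('.md')) return 'rewrite-markdown';

  return 'binary-copy';
}
```

Keeping the decision separate from the file operations would make the three branches easy to unit-test without touching the filesystem.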
7.5 Known limitations (documented)
- Duplicate or malformed attributes use parse5's reported location; pathological hand-authored HTML is
best-effort.
- Unquoted attribute values: write the new value with double quotes (deterministic), or preserve unquoted
when the new value needs no quoting — the plan picks one rule and documents it.
8. Future-Proofing Seam: Pluggable Metadata (documented, not built)
A future ResourceMetadata.metadata?: Record<string, unknown> field (do not add in v1) would be
populated by a pluggable extractor registry:
interface HtmlMetadataExtractor {
name: string; // e.g. 'meta-tags', 'json-ld', 'opengraph'
extract(parsed: ParsedHtml, raw: string): Record<string, unknown>;
}
A future config surface would select/compose extractors per collection, populate metadata, and run the
existing collection frontmatterSchema validation against that object. v1 ships zero extractors.
v1 behavior to document explicitly: a collection's frontmatterSchema does not apply to HTML files
in v1. HTML files in such a collection are link/anchor/well-formedness-checked but not schema-validated.
State this in user docs to avoid surprise.
9. Discovery & CLI
- Discovery stays glob-driven. A user opts HTML in via a collection include such as
  ["docs/**/*.{md,html}"], or via the path-arg recursive scan. The default crawl include is
  ['**/*.md'] (resource-registry.ts:391); leave the default markdown-only, since HTML is opt-in by glob.
- The real dispatch fix is in addResource (resource-registry.ts:300), which today calls
  parseMarkdown for every file (§3.3). Once it branches by extension, any .html/.htm that the crawl
  yields is parsed correctly. Audit crawl/indexResource for any other hardcoded .md assumption and
  ensure unsupported extensions are skipped silently (not errored).
- No new CLI flags. vat resources validate picks up HTML automatically; --check-external-urls
  (cli/.../resources/index.ts:74) already covers HTML external links because the external-URL collector
  iterates resource.links filtering link.type === 'external' across all resources
  (resource-registry.ts:786-787), so it is not markdown-gated.
- No config schema change required for v1.
10. Testing
Follow the repo test pyramid (unit > integration > system) and the duplication policy. All fixtures are
generic synthetic documents — no proprietary or organization-specific content.
Unit (packages/resources/test/):
- parseHtml: link extraction (<a href>, <img src>) into ResourceLink[] with correct type/line,
  anchor extraction (id, <a name>), parse-error capture.
- Generalized fragment matching: case-insensitive for .md targets, case-sensitive for .html targets;
  skip (not fail) for out-of-scope targets.
- addResource extension branch: .html produces a ResourceMetadata with anchors populated and
  frontmatter undefined.
- rewriteHtmlLinks: value sub-range computation across double-quoted/single-quoted/unquoted attributes;
  round-trip identity (no-op === byte-identical); single-link rewrite preserves surroundings.
- Severity wiring: MALFORMED_HTML resolves to a non-error severity (so it never flips the exit code),
  while LINK_BROKEN_FILE/LINK_BROKEN_ANCHOR resolve to error; assert via the framework's hasErrors.
Integration (packages/resources/test/integration/):
- A synthetic fixture dir exercising md → html → md with: a valid cross-format link, a broken local-file
  link, a broken fragment (case-mismatched HTML id), and a malformed HTML file. Assert the exact issue set
  and that exit code reflects only the real errors.
Integration (packages/agent-skills/test/integration/):
- Bundle a skill whose resource graph includes an HTML file via linkFollowDepth; assert HTML links are
  rewritten to bundled-relative targets and comments/whitespace are preserved (mirror
  packager-frontmatter-rewrite).
11. Dependency
Add parse5 to packages/resources/package.json. Pure-JS, no native deps, MIT. It is the candidate
parser exposing per-attribute source locations (required by §7) plus onParseError (required by §6). Verify
the exact sourceCodeLocation.attrs[name] field shape against the installed version during implementation.
12. File Touch List (for the plan)
| File | Change |
| --- | --- |
| packages/resources/src/html-parser.ts | NEW: parseHtml (parse5): ResourceLink[] links, anchors, parseErrors, shape-parity with parseMarkdown |
| packages/resources/src/html-transform.ts | NEW: rewriteHtmlLinks (re-parse for offsets, value sub-range computation, offset splice, never serialize) |
| packages/resources/src/resource-registry.ts | addResource (~:300) extension branch → parseHtml/parseMarkdown; add anchors/parseErrors to the ResourceMetadata literal; generalize the fragment index (buildHeadingsByFileMap, ~:672) to include HTML anchors |
| packages/resources/src/schemas/resource-metadata.ts | Add optional anchors?: string[] and parseErrors?: HtmlParseError[] to ResourceMetadata; export HtmlParseError type |
| packages/resources/src/link-validator.ts | validateAnchor (:344) → set-membership w/ per-format case rules; skip (don't fail) out-of-scope targets |
| packages/agent-schema/src/validation-codes.ts | Add MALFORMED_HTML to CODE_REGISTRY (default severity info). Post-#114 this is the home for all resources/link codes; the old free-form type-string model no longer exists. |
| packages/resources/src/schemas/validation-result.ts | No code-list change; it now re-exports ValidationIssue/ValidationIssueSchema from @vibe-agent-toolkit/agent-schema. |
| packages/cli/src/commands/resources/validate.ts | Emit MALFORMED_HTML from each resource's parseErrors. No exit-code change: #114 made the exit code purely severity-based (hasErrors); the external_url filter is gone and info/warning codes are already non-fatal. |
| packages/agent-skills/src/skill-packager.ts | Make the .md-only early-return guard (:1031) format-aware: .html/.htm → read + lookup + rewriteHtmlLinks (no openFrontmatter); .md unchanged (:1053-1056); else binary copy |
| packages/resources/package.json | Add parse5 dependency |
| docs | Document HTML support + "frontmatterSchema does not apply to HTML in v1" caveat |
| docs/validation-codes.md | Add a #malformed_html heading section. A test (packages/agent-schema/test/docs/validation-codes.test.ts) asserts every CODE_REGISTRY entry has a matching doc anchor; a new registry code without its doc section fails CI. |
13. Open Questions
None blocking. Resolved during design:
- Resource model: ResourceMetadata via addResource branch, not the resource-compiler Resource union (locked).
- Parser: parse5 (locked).
- Link elements: <a href> + <img src> (locked).
- Rewriting: offset splice with computed value sub-range, never serialize (locked).
- Metadata: deferred with reserved seam (locked).
Design Spec: First-Class HTML Resources in
@vibe-agent-toolkit/resourcesStatus: Approved design — ready for implementation. Contributions welcome.
Date: 2026-06-01
Package:
packages/resources(with touchpoints inpackages/agent-skillsandpackages/cli)1. Summary
Today the resources package is markdown-first.
.html/.htmfiles are not resources: they are notdiscovered, parsed, validated, link-checked, or rewritten on bundle. This spec makes local HTML files
first-class resources — files that produce a
ResourceMetadatajust like markdown does — so theyparticipate fully in:
includeglobs / path-arg scan)md → html → md)vat skills build/linkFollowDepth), byte-for-byte structure-preservingHTML metadata/schema validation is explicitly deferred, but the design reserves a pluggable seam so it
can land later without rework.
Non-goals (v1)
http(s)HTML pages as resources). Documented as a future extension; the linkgraph still checks remote URLs that HTML links point at, via the existing
--check-external-urlsmechanism — that is unchanged.
<meta>, JSON-LD, OpenGraph, microdata). Deferred; seam reserved.<a>/<img>URL-bearing elements (<link>,<script>,<iframe>,<source>, media). Deferred.resource-compiler(theResourceunion /ResourceTypeenum subsystem). Out of scope.discovered and parsed into the index (see §5).
2. Scope Decisions (locked)
ResourceMetadata(the live pipeline model), via aparseHtmlbranch inaddResourceresource-compilerResourceunion (see Orientation)parse5onParseError<a href>+<img src>3. Architecture
HTML files become a second input format that produces the same
ResourceMetadatashape markdown does.The integration point is a single extension branch in
ResourceRegistry.addResource.3.1 The model:
ResourceMetadata(unchanged shape, one new optional field)ResourceMetadata(schemas/resource-metadata.ts) is what the registry indexes and what the validator andpackager consume. Its relevant fields:
Two additive changes (both optional, so markdown is unaffected):
3.2
html-parser.ts(new module, parallelsparseMarkdowninlink-parser.ts)parseHtmlreturns the same field shapeaddResourcealready destructures fromparseMarkdown, soaddResourcecan build aResourceMetadatauniformly regardless of format. Uses parse5 withsourceCodeLocationInfo: trueand anonParseErrorhook.Extraction rules:
ResourceLink[]: walk the parse5 AST; collect<a href>and<img src>. For each, build aResourceLinkwhosetypeis assigned by the existingclassifyLink(href)(link-parser.ts:190)and whose
linecomes from parse5's location info — so an HTML<a href="../foo.md">produces the samekind of
ResourceLinka markdown link does (local_file/anchor/external/email/unknown) and flowsthrough
validateLinkand the external-URL collector with no changes to either.string[]: every element with anidattribute, plus<a name="...">. Raw strings, casepreserved (see §5).
onParseError, filtered to a meaningful set (see §6).3.3 Integration point: extension branch in
addResourceResourceRegistry.addResource(resource-registry.ts:300) currently callsparseMarkdown(absolutePath)unconditionally. Change it to dispatch by extension:
The rest of
addResourceis shape-compatible:parseResult.links,.headings,.frontmatter,.sizeBytes,.estimatedTokenCountalready feed theResourceMetadataliteral (resource-registry.ts:323-339).Add
anchorsandparseErrorsto that literal when present. ID generation falls back to the path/filenamestem when
frontmatteris undefined (existing behavior), which is correct for HTML.4. Data Flow
classifyLinkandvalidateLinkrequire no changes — HTML links areResourceLinks from a differentparser.
5. The Cross-Format Core: Generalized Fragment Index
Today
validateAnchor(link-validator.ts:344) consultsheadingsByFile: Map<string, HeadingNode[]>(defined by
buildHeadingsByFileMapatresource-registry.ts:1245, called at:672, and threaded intovalidateLinkat:458) and matches a heading slug case-insensitively (link-validator.ts:355, :377-381).To make
md → html → mdwork, generalize the fragment set a file exposes:github-slugger).ResourceMetadata.anchors(elementids +<a name>).validateAnchor(fragment, targetPath, index)becomes a set-membership check, choosing case semanticsby the target file's extension:
.mdtarget → case-insensitive match (preserve current behavior; slugs are already lowercased)..html/.htmtarget → case-sensitive match (HTML fragmentidmatching is case-sensitive per theHTML standard in no-quirks mode).
Out-of-scope targets. If a link's target file exists but was not discovered/parsed into the index
(e.g. an HTML file outside any collection, or a markdown file the scan didn't include), anchor validation is
skipped — it does not emit a false
LINK_BROKEN_ANCHOR. (TodayvalidateAnchorreturnsfalsefor a target absent from the map, which yields a false
LINK_BROKEN_ANCHOR. Fixing this to skip alsosmooths that latent markdown sharp-edge. #114 rewrote
link-validator.ts, so re-locate the exact branch byname rather than line.) File-existence and git-ignore safety still apply.
6. Error Handling & Issue Types
New code:
MALFORMED_HTML— add it toCODE_REGISTRYinpackages/agent-schema/src/validation-codes.ts(alongside the already-presentLINK_BROKEN_*,FRONTMATTER_*, andEXTERNAL_URL_*entries).validate()from each resource'sparseErrors(curated set of meaningful parse5 codes —e.g. unexpected/missing tags, duplicate attributes — not every HTML5 recovery quirk). Carries
line,the parse5
codein the message, and the file path.info(advisory;warningis also acceptable). Stays user-overridable like everyother registry code.
Exit-code wiring — already fixed by #114, no work here. The old brittle filter
(
issue.type !== 'external_url') is gone.vat resources validatenow derives its exit code purely fromthe framework's severity-based
hasErrors(process.exit(hasErrors ? 1 : 0)invalidate.ts): exit 1 iffsome emitted issue resolves to
error. SoMALFORMED_HTMLatinfo/warningis automatically non-fatal —there is no predicate to write and no bug to fix. The external-URL codes are likewise already registry
codes at
warningdefault. The HTML implementer only needs to (a) add theMALFORMED_HTMLregistry entryand (b) emit it from
parseErrorsduring validation.Reuses unchanged (already registry codes post-#114):
LINK_BROKEN_FILE,LINK_BROKEN_ANCHOR,LINK_UNKNOWN,and the
EXTERNAL_URL_*family.7. Link Rewriting & Round-Trip Fidelity (in-scope)
HTML resources must support the same deterministic, structure-preserving link rewriting markdown has, so
HTML files can be bundled into skills (
vat skills build,linkFollowDepth) with their relative linktargets remapped.
7.1 The fidelity trap
The markdown rewriter (
content-transform.ts,transformContentat:398, replace loop at:430) usesString.replaceAll(regex, callback)and returns the original match for any link without a rewrite rule —everything except the changed target is byte-identical. Frontmatter uses
FrontmatterEditor, whose contractis
openFrontmatter(x).toString() === xbyte-for-byte (packages/resources/src/frontmatter-editor.ts:7).The obvious HTML approach — parse5 → mutate AST →
serialize()— violates this contract. parse5'sserializer normalizes whitespace, attribute quoting, void elements, comments, and the doctype. A
parse→serialize round-trip is lossy. We must never re-serialize the document.
7.2 The mechanism: offset splicing
New module html-transform.ts (analog of content-transform.ts):
It re-parses source with parse5 (sourceCodeLocationInfo: true) to obtain, for each <a href> / <img src>, the attribute's source location.
Load-bearing detail: value sub-range. parse5's element.sourceCodeLocation.attrs[name] gives { startOffset, endOffset, startLine, ... } for the entire attribute (href="value" / src='value' / href=value), not the value alone. The rewriter must compute the value sub-range itself:
1. Scan the attribute's range in source.
2. Find the = after the attribute name, then the first non-whitespace char: if it's " or ', the value is between that quote and its match; otherwise the value is unquoted and runs to the attribute endOffset.
3. Record (valueStart, valueEnd) (JS string indices, not bytes) and the quote char.
Algorithm:
1. Resolve each link's new target with the existing machinery (linkRewriteRules, resourceRegistry, relative-path computation). Links with no applicable rule are left untouched.
2. Collect (valueStart, valueEnd, newValue) edits.
3. Apply the edits in descending valueStart order (so earlier edits don't shift later offsets). Replace only the value characters, preserving the original quote char.
4. HTML-escape the written value minimally (&, and the active quote char); computed relative paths normally need no escaping.
5. All bytes outside a rewritten attribute remain byte-identical.
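The sub-range computation and the splice can be sketched as below. This is an illustration under stated assumptions, not the final html-transform.ts: AttrRange stands in for the whole-attribute offsets parse5 reports, the rule callback stands in for the linkRewriteRules machinery, and all names are hypothetical. The parse5 call itself is omitted so the block stays dependency-free.

```typescript
// Hypothetical input: whole-attribute offsets (name, "=", quotes, value).
interface AttrRange { name: string; startOffset: number; endOffset: number; }
interface ValueRange { valueStart: number; valueEnd: number; quote: string | null; }
interface Edit { valueStart: number; valueEnd: number; newValue: string; }

// Locate the value inside a whole-attribute range (JS string indices).
function valueRange(source: string, attr: AttrRange): ValueRange | null {
  const eq = source.indexOf("=", attr.startOffset + attr.name.length);
  if (eq === -1 || eq >= attr.endOffset) return null; // bare attribute
  let i = eq + 1;
  while (i < attr.endOffset && /\s/.test(source.charAt(i))) i++;
  const ch = source.charAt(i);
  if (ch === '"' || ch === "'") {
    const close = source.indexOf(ch, i + 1);
    if (close === -1 || close >= attr.endOffset) return null;
    return { valueStart: i + 1, valueEnd: close, quote: ch };
  }
  // Unquoted value runs to the end of the attribute.
  return { valueStart: i, valueEnd: attr.endOffset, quote: null };
}

// Minimal escaping: "&" plus the active quote character.
// Unquoted values are not re-quoted here (see the known limitations).
function escapeAttr(value: string, quote: string | null): string {
  const amp = value.replace(/&/g, "&amp;");
  if (quote === '"') return amp.replace(/"/g, "&quot;");
  if (quote === "'") return amp.replace(/'/g, "&#39;");
  return amp;
}

// Apply edits back-to-front so pending offsets stay valid.
function applyEdits(source: string, edits: Edit[]): string {
  const ordered = [...edits].sort((a, b) => b.valueStart - a.valueStart);
  let out = source;
  for (const e of ordered) {
    out = out.slice(0, e.valueStart) + e.newValue + out.slice(e.valueEnd);
  }
  return out;
}

// Tie the steps together; `rule` returns undefined when no rewrite applies.
function rewriteAttrValues(
  source: string,
  attrs: AttrRange[],
  rule: (oldValue: string) => string | undefined,
): string {
  const edits: Edit[] = [];
  for (const attr of attrs) {
    const range = valueRange(source, attr);
    if (range === null) continue;
    const next = rule(source.slice(range.valueStart, range.valueEnd));
    if (next === undefined) continue; // leave the link untouched
    edits.push({
      valueStart: range.valueStart,
      valueEnd: range.valueEnd,
      newValue: escapeAttr(next, range.quote),
    });
  }
  return applyEdits(source, edits);
}
```

With no edits collected, applyEdits returns the input string itself, which is what makes the no-op identity contract cheap to guarantee.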
7.3 Round-trip identity contract (tested)
- No-op identity: rewriteHtmlLinks(src, /* no matching rules */, ctx) === src, byte-for-byte.
- After a single-link rewrite, the surrounding content and all other attributes are unchanged (mirror packager-frontmatter-rewrite.integration.test.ts).
7.4 Packager integration
The real gate is the early-return guard at skill-packager.ts:1031, not the markdown rewrite block below it:
Make this guard format-aware so HTML reaches a rewrite path instead of being binary-copied:
- .html/.htm (when ctx.rewriteLinks): read the file, look up the resource in ctx.fromRegistry (as the markdown path does at :1041), then writeFile(targetPath, rewriteHtmlLinks(content, rules, ctx)). Do not call openFrontmatter: HTML has no frontmatter and must not go through the frontmatter split.
- .md: existing openFrontmatter + transformContent path (:1053-1056), unchanged.
- Anything else, or rewriteLinks disabled: binary copy (unchanged).
This makes HTML resources participate in linkFollowDepth bundling exactly like markdown.
7.5 Known limitations (documented)
best-effort.
when the new value needs no quoting — the plan picks one rule and documents it.
8. Future-Proofing Seam: Pluggable Metadata (documented, not built)
A future ResourceMetadata.metadata?: Record<string, unknown> field (do not add in v1) would be populated by a pluggable extractor registry:
A future config surface would select/compose extractors per collection, populate metadata, and run the existing collection frontmatterSchema validation against that object. v1 ships zero extractors.
v1 behavior to document explicitly: a collection's frontmatterSchema does not apply to HTML files in v1. HTML files in such a collection are link/anchor/well-formedness-checked but not schema-validated. State this in user docs to avoid surprise.
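For orientation only, one way such an extractor seam might look. Every name below (MetadataExtractor, ExtractorRegistry, titleExtractor) is hypothetical and none of it is part of v1. The regex-based title extractor is a placeholder to show composition, not a recommended parsing strategy.

```typescript
// Hypothetical extractor contract; not part of the v1 design.
interface MetadataExtractor {
  name: string;
  extensions: string[]; // lowercase, with leading dot
  extract(source: string): Record<string, unknown>;
}

class ExtractorRegistry {
  private readonly extractors: MetadataExtractor[] = [];

  register(extractor: MetadataExtractor): void {
    this.extractors.push(extractor);
  }

  // Returns undefined when no extractor applies, which is always the
  // case while zero extractors are registered (the v1 situation).
  extract(filePath: string, source: string): Record<string, unknown> | undefined {
    const lower = filePath.toLowerCase();
    const matching = this.extractors.filter((x) =>
      x.extensions.some((ext) => lower.endsWith(ext)),
    );
    if (matching.length === 0) return undefined;
    let merged: Record<string, unknown> = {};
    for (const x of matching) {
      merged = { ...merged, ...x.extract(source) }; // later extractors win
    }
    return merged;
  }
}

// Placeholder extractor used only to demonstrate registration.
const titleExtractor: MetadataExtractor = {
  name: "html-title",
  extensions: [".html", ".htm"],
  extract(source: string): Record<string, unknown> {
    const m = /<title>([^<]*)<\/title>/i.exec(source);
    return m ? { title: m[1] } : {};
  },
};
```

The object returned by extract is what a future frontmatterSchema check would validate, which is why an empty registry leaves HTML files unvalidated against the schema.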
9. Discovery & CLI
- HTML files are discovered by an include such as ["docs/**/*.{md,html}"], or via the path-arg recursive scan. The default crawl include is ['**/*.md'] (resource-registry.ts:391). Leave the default markdown-only; HTML is opt-in by glob.
- Files reach the parser through addResource (resource-registry.ts:300), which today calls parseMarkdown for every file (§3.3). Once it branches by extension, any .html/.htm that the crawl yields is parsed correctly. Audit crawl/indexResource for any other hardcoded .md assumption and ensure unsupported extensions are skipped silently (not errored).
- vat resources validate picks up HTML automatically; --check-external-urls (cli/.../resources/index.ts:74) already covers HTML external links because the external-URL collector iterates resource.links filtering link.type === 'external' across all resources (resource-registry.ts:786-787), so it is not markdown-gated.
10. Testing
Follow the repo test pyramid (unit > integration > system) and the duplication policy. All fixtures are
generic synthetic documents — no proprietary or organization-specific content.
Unit (packages/resources/test/):
- parseHtml: link extraction (<a href>, <img src>) into ResourceLink[] with correct type/line, anchor extraction (id, <a name>), parse-error capture.
- validateAnchor: set-membership with per-format case rules (.md targets vs. case-sensitive .html targets); skip (not fail) for out-of-scope targets.
- addResource extension branch: .html produces a ResourceMetadata with anchors populated and frontmatter undefined.
- rewriteHtmlLinks: value sub-range computation across quoted/single-quoted/unquoted attributes; round-trip identity (no-op === byte-identical); single-link rewrite preserves surroundings.
- MALFORMED_HTML resolves to a non-error severity (so it never flips the exit code), while LINK_BROKEN_FILE/LINK_BROKEN_ANCHOR resolve to error; assert via the framework's hasErrors.
Integration (packages/resources/test/integration/):
- An md → html → md fixture with: a valid cross-format link, a broken local-file link, a broken fragment (case-mismatched HTML id), and a malformed HTML file. Assert the exact issue set and that exit code reflects only the real errors.
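The severity assertions above can be pictured with a small model of severity-based exit codes. This is a sketch, not the framework's API: the DEFAULT_SEVERITY table, resolveSeverity, and the fallback of unknown codes to error are assumptions made for illustration; only the hasErrors ? 1 : 0 rule comes from the spec.

```typescript
type Severity = "error" | "warning" | "info";

// Assumed defaults mirroring what the spec states for these codes.
const DEFAULT_SEVERITY: Partial<Record<string, Severity>> = {
  LINK_BROKEN_FILE: "error",
  LINK_BROKEN_ANCHOR: "error",
  MALFORMED_HTML: "info",
};

// User overrides win over registry defaults; falling back to "error"
// for unknown codes is an assumption of this sketch.
function resolveSeverity(
  code: string,
  overrides: Partial<Record<string, Severity>> = {},
): Severity {
  return overrides[code] ?? DEFAULT_SEVERITY[code] ?? "error";
}

function hasErrors(
  codes: string[],
  overrides: Partial<Record<string, Severity>> = {},
): boolean {
  return codes.some((c) => resolveSeverity(c, overrides) === "error");
}

// Mirrors process.exit(hasErrors ? 1 : 0) without exiting the process.
function exitCodeFor(
  codes: string[],
  overrides: Partial<Record<string, Severity>> = {},
): number {
  return hasErrors(codes, overrides) ? 1 : 0;
}
```

Under this model a run that emits only MALFORMED_HTML exits 0, while adding a broken anchor flips it to 1. A user override that raises MALFORMED_HTML to error also flips it, which is the overridability the registry is meant to keep.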
Integration (packages/agent-skills/test/integration/):
- Bundle an HTML resource through linkFollowDepth; assert HTML links are rewritten to bundled-relative targets and comments/whitespace are preserved (mirror packager-frontmatter-rewrite).
11. Dependency
Add parse5 to packages/resources/package.json. Pure-JS, no native deps, MIT. It is the candidate parser exposing per-attribute source locations (required by §7) plus onParseError (required by §6). Verify the exact sourceCodeLocation.attrs[name] field shape against the installed version during implementation.
12. File Touch List (for the plan)
| File | Change |
| --- | --- |
| packages/resources/src/html-parser.ts | New: parseHtml (parse5): ResourceLink[] links, anchors, parseErrors, shape-parity with parseMarkdown |
| packages/resources/src/html-transform.ts | New: rewriteHtmlLinks (re-parse for offsets, value sub-range computation, offset splice, never serialize) |
| packages/resources/src/resource-registry.ts | addResource (~:300) extension branch → parseHtml/parseMarkdown; add anchors/parseErrors to the ResourceMetadata literal; generalize the fragment index (buildHeadingsByFileMap, ~:672) to include HTML anchors |
| packages/resources/src/schemas/resource-metadata.ts | Add anchors?: string[] and parseErrors?: HtmlParseError[] to ResourceMetadata; export HtmlParseError type |
| packages/resources/src/link-validator.ts | validateAnchor (:344) → set-membership w/ per-format case rules; skip (don't fail) out-of-scope targets |
| packages/agent-schema/src/validation-codes.ts | Add MALFORMED_HTML to CODE_REGISTRY (default severity info). Post-#114 this is the home for all resources/link codes; the old free-form type-string model no longer exists. |
| packages/resources/src/schemas/validation-result.ts | ValidationIssue/ValidationIssueSchema from @vibe-agent-toolkit/agent-schema. |
| packages/cli/src/commands/resources/validate.ts | Emit MALFORMED_HTML from each resource's parseErrors. No exit-code change: #114 made the exit code purely severity-based (hasErrors); the external_url filter is gone and info/warning codes are already non-fatal. |
| packages/agent-skills/src/skill-packager.ts | Make the .md-only early-return guard (:1031) format-aware: .html/.htm → read + lookup + rewriteHtmlLinks (no openFrontmatter); .md unchanged (:1053-1056); else binary copy |
| packages/resources/package.json | Add parse5 dependency |
| docs/validation-codes.md | Add a #malformed_html heading section. A test (packages/agent-schema/test/docs/validation-codes.test.ts) asserts every CODE_REGISTRY entry has a matching doc anchor; a new registry code without its doc section fails CI. |

13. Open Questions
None blocking. Resolved during design:
- ResourceMetadata via addResource branch, not the resource-compiler Resource union (locked).
- <a href> + <img src> (locked).