Modernize HTML entity decoding to the WHATWG named character references #444

Merged
xroche merged 2 commits into master from entities-html5-fnv1a on Jun 28, 2026
xroche (Owner) commented Jun 28, 2026

The entity table was generated from the 1998 HTML 4.0 spec, so the decoder recognized only 252 named entities and left every HTML5 name untouched. This regenerates htsentities.h from the WHATWG entities.json (2032 single-codepoint names) and reworks the lookup.

The dispatch hash moves from a 32-bit LCG to 64-bit FNV-1a. The old code relied on the 32-bit hash being collision-free "statistically"; the generator now proves it, aborting if any two names share a (hash, len) key, so the hash-only switch stays correct with no runtime name compare. The consumer name-length cap grows from 10 to 31, the length of the longest name (CounterClockwiseContourIntegral); without that change, long names would be rejected outright. Multi-codepoint references (~93 obscure math entities such as fj) cannot fit the single-codepoint return value, so they are skipped and left verbatim, as before. This also fixes the dead ftp://ftp.unicode.org URLs in htsbasiccharsets.sh.
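For readers unfamiliar with the scheme, here is a minimal C sketch of a 64-bit FNV-1a hash combined with a hash-only lookup and the 31-byte name cap. It is an illustration under stated assumptions, not the code in htsentities.h: `fnv1a64`, `lookup_entity`, `entity_def`, `sample_entities`, and `ENTITY_NAME_MAX` are hypothetical names, the "0 means not found" convention is assumed, and a small runtime-hashed table stands in for the generated switch on precomputed constants so that the example stays self-contained.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed cap: length of the longest name, CounterClockwiseContourIntegral. */
#define ENTITY_NAME_MAX 31

/* 64-bit FNV-1a over the first len bytes of s. */
uint64_t fnv1a64(const char *s, size_t len) {
  uint64_t h = UINT64_C(0xcbf29ce484222325); /* FNV-1a 64-bit offset basis */
  size_t i;
  for (i = 0; i < len; i++) {
    h ^= (unsigned char) s[i];               /* xor the byte in first... */
    h *= UINT64_C(0x100000001b3);            /* ...then multiply by the FNV prime */
  }
  return h;
}

/* Hypothetical stand-in for a few rows of the generated table. */
typedef struct {
  const char *name;
  uint32_t codepoint;
} entity_def;

static const entity_def sample_entities[] = {
  { "amp", 0x26 },
  { "copy", 0xA9 },
  { "hellip", 0x2026 },
  { "CounterClockwiseContourIntegral", 0x2233 },
  { "Afr", 0x1D504 } /* astral: outside the Basic Multilingual Plane */
};

/* Returns the codepoint for name[0..len), or 0 when the name is unknown,
 * empty, or longer than the cap. Names are matched by hash alone, which is
 * only safe if the table has been checked for collisions beforehand. */
uint32_t lookup_entity(const char *name, size_t len) {
  size_t i;
  uint64_t h;
  if (len == 0 || len > ENTITY_NAME_MAX)
    return 0;
  h = fnv1a64(name, len);
  for (i = 0; i < sizeof sample_entities / sizeof sample_entities[0]; i++) {
    const entity_def *e = &sample_entities[i];
    if (fnv1a64(e->name, strlen(e->name)) == h)
      return e->codepoint;
  }
  return 0;
}
```

With a cap of 10, a call such as `lookup_entity("CounterClockwiseContourIntegral", 31)` would return early at the length check; with the cap at 31 it reaches the hash comparison. A name that is absent from the table, such as the skipped multi-codepoint `fj`, falls through and is reported as not found.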

The 01_engine-entities self-test gains HTML5 names, the long-name boundary, an astral codepoint, and a skipped multi-codepoint case.

Closes #443

xroche and others added 2 commits June 28, 2026 15:07
Regenerate htsentities.h from the WHATWG entities.json (2032 single-codepoint
names) instead of the 1998 HTML 4.0 set (252 names). The dispatch hash moves
from a 32-bit LCG to 64-bit FNV-1a; the generator now aborts on any (hash,len)
collision, so the hash-only switch stays correct without a runtime name compare.
Bump the consumer name-length cap from 10 to 31, the longest name
(CounterClockwiseContourIntegral), or long names would be rejected outright.
Multi-codepoint references (~93 obscure math entities) can't fit the
single-codepoint return and are skipped, left verbatim as before.

Also fix the dead ftp://ftp.unicode.org URLs in htsbasiccharsets.sh.

Closes #443

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
Review follow-up. The switch keys on the hash alone, so check hash-alone
uniqueness among emitted names (a same-hash/different-len pair would otherwise
slip the old (hash,len) check and surface only as a cryptic duplicate-case
compile error). Also hash the ~93 skipped multi-codepoint names and abort if any
aliases an emitted hash, so "skipped means verbatim" is enforced rather than
assumed on future regens.

Add a runtime sweep of common HTML4 names (copy/reg/trade/mdash/ndash/alpha/beta)
to 01_engine-entities.test: a regression guard against accidental drops and a
generator-vs-consumer hash cross-check on names beyond the handful already
probed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
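
The two generator-side checks described in the review follow-up can be sketched in C as follows. This is an assumption-laden illustration rather than the project's generator, which is not shown on this page: `hashes_are_unique` and `skipped_aliases_emitted` are hypothetical names, and the sketch takes the hashes as already computed. The first function checks hash-alone uniqueness among emitted names; the second checks that no skipped name shares a hash with an emitted one.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Comparator for qsort/bsearch over uint64_t. It compares instead of
 * subtracting, since a 64-bit difference does not fit in an int. */
static int cmp_u64(const void *a, const void *b) {
  const uint64_t x = *(const uint64_t *) a;
  const uint64_t y = *(const uint64_t *) b;
  return (x > y) - (x < y);
}

/* Sorts hashes in place and returns 1 if all n values are distinct,
 * 0 if any value occurs twice. */
int hashes_are_unique(uint64_t *hashes, size_t n) {
  size_t i;
  if (n < 2)
    return 1;
  qsort(hashes, n, sizeof hashes[0], cmp_u64);
  for (i = 1; i < n; i++) {
    if (hashes[i] == hashes[i - 1])
      return 0; /* equal neighbours after sorting mean a collision */
  }
  return 1;
}

/* Returns 1 if any skipped hash equals an emitted hash, else 0.
 * emitted_sorted must already be in ascending order, for example after
 * a call to hashes_are_unique(). */
int skipped_aliases_emitted(const uint64_t *skipped, size_t n_skipped,
                            const uint64_t *emitted_sorted, size_t n_emitted) {
  size_t i;
  if (n_emitted == 0)
    return 0;
  for (i = 0; i < n_skipped; i++) {
    if (bsearch(&skipped[i], emitted_sorted, n_emitted,
                sizeof emitted_sorted[0], cmp_u64) != NULL)
      return 1;
  }
  return 0;
}
```

A generator built this way would abort when `hashes_are_unique` returns 0 or `skipped_aliases_emitted` returns 1, so a bad regeneration fails with a clear message instead of a duplicate-case compile error.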
xroche merged commit cca83e5 into master on Jun 28, 2026
14 checks passed
Development

Successfully merging this pull request may close these issues.

HTML entity decoder is stuck on the HTML 4.0 set (252 names)
