Modernize HTML entity decoding to the WHATWG named character references #444
Merged
Regenerate htsentities.h from the WHATWG entities.json (2032 single-codepoint names) instead of the 1998 HTML 4.0 set (252 names). The dispatch hash moves from a 32-bit LCG to 64-bit FNV-1a; the generator now aborts on any (hash,len) collision, so the hash-only switch stays correct without a runtime name compare. Bump the consumer name-length cap from 10 to 31, the length of the longest name (CounterClockwiseContourIntegral); otherwise long names would be rejected outright. Multi-codepoint references (~93 obscure math entities) can't fit the single-codepoint return and are skipped, left verbatim as before. Also fix the dead ftp://ftp.unicode.org URLs in htsbasiccharsets.sh.

Closes #443

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
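For reference, 64-bit FNV-1a itself is short. The sketch below uses the standard published offset basis and prime; the function name `fnv1a64` and the choice to hash the raw name bytes (without `&` or `;`) are assumptions for illustration, not taken from the generator or from htsentities.h.

```c
#include <stddef.h>
#include <stdint.h>

/* 64-bit FNV-1a over the first len bytes of name.
 * Hypothetical helper: the real generator and consumer may feed the
 * hash differently (for example, with a different byte range). */
static uint64_t fnv1a64(const char *name, size_t len) {
  uint64_t h = UINT64_C(0xcbf29ce484222325); /* FNV-1a 64-bit offset basis */
  size_t i;

  for (i = 0; i < len; i++) {
    h ^= (unsigned char) name[i];    /* xor the byte in first (the "1a" order) */
    h *= UINT64_C(0x100000001b3);    /* then multiply by the 64-bit FNV prime */
  }
  return h;
}
```

A generated decoder would then `switch` on `fnv1a64(name, len)` with one `case` label per emitted name, which is why the generator has to guarantee that no two labels share a value.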
Review follow-up.

The switch keys on the hash alone, so check hash-alone uniqueness among emitted names (a same-hash/different-len pair would otherwise slip the old (hash,len) check and surface only as a cryptic duplicate-case compile error). Also hash the ~93 skipped multi-codepoint names and abort if any aliases an emitted hash, so "skipped means verbatim" is enforced rather than assumed on future regens.

Add a runtime sweep of common HTML4 names (copy/reg/trade/mdash/ndash/alpha/beta) to 01_engine-entities.test: a regression guard against accidental drops and a generator-vs-consumer hash cross-check on names beyond the handful already probed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Roche <roche@httrack.com>
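The two generator-side checks described in this follow-up can be sketched as below. This is an illustration in C under stated assumptions, not the project's generator: `hashes_are_safe`, `cmp_u64`, and the string-array interface are hypothetical names, and the hash is assumed to be plain 64-bit FNV-1a over the name bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* 64-bit FNV-1a over a NUL-terminated name (assumed hashing scheme). */
static uint64_t fnv1a64_str(const char *s) {
  uint64_t h = UINT64_C(0xcbf29ce484222325);
  size_t i, len = strlen(s);

  for (i = 0; i < len; i++) {
    h ^= (unsigned char) s[i];
    h *= UINT64_C(0x100000001b3);
  }
  return h;
}

/* Three-way comparison of two uint64_t values for qsort/bsearch. */
static int cmp_u64(const void *a, const void *b) {
  uint64_t x = *(const uint64_t *) a, y = *(const uint64_t *) b;
  return (x > y) - (x < y);
}

/* Returns 1 when (a) all emitted names have pairwise distinct hashes and
 * (b) no skipped name hashes to the value of an emitted name.
 * Returns 0 on any collision, or if memory cannot be allocated. */
static int hashes_are_safe(const char *const *emitted, size_t n_emitted,
                           const char *const *skipped, size_t n_skipped) {
  uint64_t *h;
  size_t i;
  int ok = 1;

  if (n_emitted == 0)
    return 1;
  h = malloc(n_emitted * sizeof *h);
  if (h == NULL)
    return 0;

  for (i = 0; i < n_emitted; i++)
    h[i] = fnv1a64_str(emitted[i]);
  qsort(h, n_emitted, sizeof *h, cmp_u64);

  /* (a) after sorting, equal hashes are adjacent */
  for (i = 1; i < n_emitted; i++)
    if (h[i] == h[i - 1])
      ok = 0;

  /* (b) a skipped name must not alias any emitted case label */
  for (i = 0; ok && i < n_skipped; i++) {
    uint64_t k = fnv1a64_str(skipped[i]);
    if (bsearch(&k, h, n_emitted, sizeof *h, cmp_u64) != NULL)
      ok = 0;
  }

  free(h);
  return ok;
}
```

A generator built this way would abort when `hashes_are_safe` returns 0, so a collision is reported at regeneration time instead of as a duplicate-case compile error or a silently mis-decoded skipped name.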
The entity table was generated from the 1998 HTML 4.0 spec, so the decoder recognized only 252 named entities and left every HTML5 name untouched. This regenerates htsentities.h from the WHATWG entities.json (2032 single-codepoint names) and reworks the lookup.

The dispatch hash moves from a 32-bit LCG to 64-bit FNV-1a. The old code relied on the 32-bit hash being collision-free "statistically"; the generator now proves it, aborting if any two names share a (hash, len) key, so the hash-only switch stays correct with no runtime name compare. The consumer name-length cap grows from 10 to 31 (the longest name, CounterClockwiseContourIntegral), otherwise long names would be rejected outright. Multi-codepoint references (~93 obscure math entities like fj) can't fit the single-codepoint return and are skipped, left verbatim as before. Also fixes the dead ftp://ftp.unicode.org URLs in htsbasiccharsets.sh.

The 01_engine-entities self-test gains HTML5 names, the long-name boundary, an astral codepoint, and a skipped multi-codepoint case.

Closes #443
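To illustrate why the name-length cap matters, here is a minimal sketch of a consumer-side scan. It is not the HTTrack parser: `entity_name_len` and `ENTITY_NAME_MAX` are hypothetical names, and the sketch assumes names are ASCII alphanumerics terminated by `;`.

```c
#include <ctype.h>
#include <stddef.h>

/* Length of CounterClockwiseContourIntegral, the longest name kept. */
#define ENTITY_NAME_MAX 31

/* s points just past the '&'. Returns the name length when a name of
 * 1..cap alphanumeric bytes is immediately followed by ';'. Returns 0
 * otherwise, meaning the caller should leave the text verbatim. */
static size_t entity_name_len(const char *s, size_t cap) {
  size_t n = 0;

  while (n < cap && isalnum((unsigned char) s[n]))
    n++;
  return (n > 0 && s[n] == ';') ? n : 0;
}
```

With a cap of 10, the scan stops inside `CounterClockwiseContourIntegral` before reaching the `;` and the reference is rejected; with a cap of 31 the full name is accepted and can be passed on to the hash lookup.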