Skip to content

Forbidden numeric character references appear in sanitized HTML #223

@simon-greatrix

Description

@simon-greatrix

The HTML living standard ( https://html.spec.whatwg.org/multipage/syntax.html#character-references ) states:

The numeric character reference forms described above are allowed to reference any code point excluding U+000D CR, noncharacters, and controls other than ASCII whitespace.

However, non-characters from the supplemental planes are encoded numerically.

The code:

    StringBuilder builder = new StringBuilder();
    Encoding.encodeRcdataOnto(Character.toString(0x5fffe), builder);
    System.out.println(builder.toString());

Produces: ""

I see two possible simple possible solutions, but I am loathe to recommend either one:

First the characters could be elided in line with the elision of U+FFFE and U+FFFF. This produces a strange botch that is not required by the rules of HTML nor XML, and I don't like it.

Alternatively, the character is allowed if it is not numerically escaped and the noncharacters U+FDD0 to U+FDEF are presented unescaped - so consistency with other non-characters would produce legal HTML. However, all other supplemental code points are represented by numeric escapes to avoid corruption when converting between unicode encodings. I am not happy with introducing special cases for these supplemental code points.

More complex solutions would be to introduce a policy for handling of the "discouraged" characters defined in https://www.w3.org/TR/2008/REC-xml-20081126/#charsets.

I am happy to put the time into creating a fix and test cases, but I need guidance as to what is the "correct" solution.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions