Forbidden numeric character references appear in sanitized HTML #223

The HTML Living Standard (https://html.spec.whatwg.org/multipage/syntax.html#character-references) states:

> The numeric character reference forms described above are allowed to reference any code point excluding U+000D CR, noncharacters, and controls other than ASCII whitespace.

However, noncharacters from the supplementary planes are emitted as numeric character references.

The code:

```
// U+5FFFE is a noncharacter in a supplementary plane (plane 5).
StringBuilder builder = new StringBuilder();
Encoding.encodeRcdataOnto(Character.toString(0x5fffe), builder);
System.out.println(builder.toString());
```

Produces: "&#x5fffe;"
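
For reference, the noncharacters the spec excludes are U+FDD0 to U+FDEF plus the last two code points of every plane (U+FFFE, U+FFFF, U+1FFFE, ... U+10FFFF). A minimal sketch of a predicate for them follows; the class and method names are hypothetical and not part of the library:

```java
// Hypothetical helper for illustration only; not the sanitizer's API.
final class Noncharacters {
    private Noncharacters() {}

    /** Returns true if cp is one of the 66 Unicode noncharacters. */
    static boolean isNoncharacter(int cp) {
        // U+FDD0..U+FDEF: the contiguous run of 32 noncharacters in the BMP.
        if (cp >= 0xFDD0 && cp <= 0xFDEF) {
            return true;
        }
        // The last two code points of each plane (U+xFFFE and U+xFFFF),
        // for planes 0 through 16. Masking out bit 0 matches both at once.
        return cp >= 0 && cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE;
    }
}
```

With this, `Noncharacters.isNoncharacter(0x5fffe)` is true, while an ordinary supplementary code point such as U+1F600 is not flagged.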

I see two possible simple solutions, but I am loath to recommend either one:

First, the characters could be elided, in line with the elision of U+FFFE and U+FFFF. This produces a strange botch that is required by the rules of neither HTML nor XML, and I don't like it.

Alternatively, the character is allowed if it is not numerically escaped, and the noncharacters U+FDD0 to U+FDEF are emitted unescaped, so emitting the supplementary noncharacters unescaped as well would be consistent with the other noncharacters and would produce legal HTML. However, all other supplementary code points are represented by numeric escapes to avoid corruption when converting between Unicode encodings. I am not happy with introducing special cases for these supplementary code points.
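
To make the two simple options concrete, here is a rough sketch of how each might treat supplementary code points. Everything in it (`SupplementaryEncodingSketch`, `NoncharPolicy`, `encodeOnto`) is hypothetical and is not how the library is structured. It also copies BMP characters through unchanged, so it is not a real HTML encoder:

```java
// Illustrative sketch only: hypothetical names, not the library's API.
final class SupplementaryEncodingSketch {
    /** The two simple options discussed above. */
    enum NoncharPolicy { ELIDE, EMIT_RAW }

    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    static void encodeOnto(String input, NoncharPolicy policy, StringBuilder out) {
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isSupplementaryCodePoint(cp)) {
                // ASSUMPTION: BMP handling (escaping '<', '&', etc.) is out of
                // scope for this sketch, so BMP characters are copied as-is.
                out.appendCodePoint(cp);
            } else if (!isNoncharacter(cp)) {
                // Ordinary supplementary code points keep their numeric escape.
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            } else if (policy == NoncharPolicy.EMIT_RAW) {
                // Option two: emit the noncharacter unescaped.
                out.appendCodePoint(cp);
            }
            // Option one (ELIDE): append nothing for the noncharacter.
        }
    }
}
```

Either branch avoids the forbidden reference, but the `EMIT_RAW` branch is exactly the special case for supplementary code points that I would rather not introduce.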

A more complex solution would be to introduce a policy for handling the "discouraged" characters defined in https://www.w3.org/TR/2008/REC-xml-20081126/#charsets.

I am happy to put the time into creating a fix and test cases, but I need guidance as to what the "correct" solution is.



 


