“Invalid Unicode code points” in numeric character references #614

wooorm · 2019-10-01T18:36:22Z

The section on decimal numeric character references mentions “Invalid Unicode code points”, but nowhere is it defined what those are.

Hexadecimal numeric character references do not mention this limitation, but I guess imply it (with “They too are parsed as the corresponding Unicode character”).

The HTML spec defines several limitations on numerical character references: https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state, so I’m guessing some or all of that applies to CM as well.

However, HTML defines that some “invalid” references map to other characters (the table at the bottom of the linked section).
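To make the HTML behaviour concrete, here is a small sketch using Python's standard `html.unescape`, which as far as I know follows the HTML rules for numeric references. The sample references are my own picks for illustration and do not come from either spec's examples.

```python
import html

# Sample numeric character references (chosen for illustration only).
samples = ["&#x80;", "&#150;", "&#0;", "&#xD800;", "&#x110000;"]

# html.unescape applies the HTML remapping table and the
# replacement rules for NUL, surrogates, and out-of-range values.
resolved = {ref: html.unescape(ref) for ref in samples}

for ref, char in resolved.items():
    print(f"{ref} -> U+{ord(char):04X}")
```

The first two samples land on other characters via the remapping table rather than being rejected, which is the part that has no obvious counterpart in the CM spec.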

Why mention code points instead of characters? Is it just surrogates?

jgm · 2019-10-02T04:48:50Z

See https://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme

wooorm · 2019-10-02T06:20:13Z

Thanks! That’s a good read but a) doesn’t answer the “invalid code points” part, and b) the CM spec already defines “A character is a Unicode code point [...] all code points count as characters for purposes of this spec”, so I’m not sure why not to use the word “character” in references.

nwellnhof · 2024-04-05T14:09:42Z

The invalid code points are:

- U+0000 (this is mentioned explicitly)
- Surrogates
- Code points larger than 0x10FFFF
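The three cases above can be sketched as a single check. This is a minimal illustration, not code from any CommonMark implementation: the function names are hypothetical, the digits are assumed to have been validated by the parser already, and the replacement with U+FFFD follows what the CM spec says about invalid code points and U+0000.

```python
REPLACEMENT_CHARACTER = "\uFFFD"


def is_invalid_code_point(cp: int) -> bool:
    """Return True for U+0000, surrogates, and values above U+10FFFF."""
    return cp == 0 or 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF


def resolve_numeric_reference(digits: str, base: int = 10) -> str:
    """Hypothetical helper: map the digits of a reference to a character.

    `digits` is assumed to contain only valid digits for `base`
    (no sign, no prefix), as a parser would have guaranteed.
    """
    cp = int(digits, base)
    if is_invalid_code_point(cp):
        return REPLACEMENT_CHARACTER
    return chr(cp)
```

For example, `resolve_numeric_reference("35")` gives `#`, while `resolve_numeric_reference("D800", 16)` and `resolve_numeric_reference("0")` both give U+FFFD.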

wooorm mentioned this issue Apr 5, 2024

Numeric character references: Should HTML spec be followed for codes mapping to control characters #765

Open

dpk mentioned this issue Sep 29, 2024

Code points, scalar values, and validity #778

Open
