Code points, scalar values, and validity #778

dpk · 2024-09-29T12:18:41Z

A character is defined as a ‘Unicode code point’. This means (unpaired) surrogates are allowed in input and, by implication, in output. If this is not intended (which is what I glean from the answer to “Invalid Unicode code points” in numeric character references #614) the definition should be changed to ‘Unicode scalar value’. Changing ‘invalid Unicode code points’ to ‘invalid Unicode scalar values’ would also resolve “Invalid Unicode code points” in numeric character references #614.
It is not explicitly stated that every possible sequence of Unicode scalar values (or code points?) is a valid CommonMark input text for which some HTML output must be produced, although I also believe that this is the intention. If so, it should be made explicit that a processor which fails to parse any input document is non-conforming.
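To make the distinction concrete, here is a minimal sketch of the two definitions as I understand them from the Unicode Standard: a code point is any integer in the range 0 to 0x10FFFF, and a scalar value is any code point outside the surrogate range 0xD800 to 0xDFFF. The function names (`is_code_point`, `is_scalar_value`, `first_non_scalar`) are hypothetical and only for illustration; they are not taken from the spec or from any CommonMark implementation.

```python
from typing import Optional

# Range boundaries as defined by the Unicode Standard.
MAX_CODE_POINT = 0x10FFFF
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF


def is_code_point(n: int) -> bool:
    """True for any Unicode code point, surrogates included."""
    return 0 <= n <= MAX_CODE_POINT


def is_scalar_value(n: int) -> bool:
    """True for any code point that is not a surrogate."""
    return is_code_point(n) and not (SURROGATE_FIRST <= n <= SURROGATE_LAST)


def first_non_scalar(text: str) -> Optional[int]:
    """Return the index of the first surrogate in text, or None if there is none.

    Hypothetical helper: a processor that defines a character as a scalar
    value could use a check like this to decide how to treat its input.
    """
    for index, char in enumerate(text):
        if not is_scalar_value(ord(char)):
            return index
    return None
```

Under the ‘code point’ wording, an input such as `"a\ud800b"` consists entirely of characters; under the ‘scalar value’ wording, `first_non_scalar` would flag index 1. This matters for output as well: in Python, for example, `"\ud800".encode("utf-8")` raises `UnicodeEncodeError`, so a processor that accepts unpaired surrogates still has to decide what to emit for them.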

dbuenzli · 2024-09-29T13:23:54Z
