I am trying to generate every character that a particular encoding supports, in order to produce test files for quick-xml. Using the encoding_rs crate, I found that some code points listed for the Big5 encoding in https://github.com/whatwg/encoding/blob/main/indexes.json are actually emitted as HTML numeric character references (&#...;). Digging into this, I realized that such output is generated when a character is unmappable in the encoding.
So the question is: what is the rationale for including characters in the index that are unmappable in the encoding? I cannot find the answer at https://encoding.spec.whatwg.org/. It describes how to deal with this strange index, but does not explain why the index is so strange.
The Big5 encoder and decoder are asymmetric (like the EUC-JP encoder and decoder). The visualizations visualize what can be decoded. The spec excludes part of the decoding space from round-tripping via the encoder in order for HTML form submission not to generate extension-range bytes that some server-side recipients may not support.
For EUC-JP, the asymmetry is based on historical experience. For Big5, it is by prudent analogy with the problem initially seen with EUC-JP. Also, the Big5 exclusion is questionable and possibly, by accident, excludes less than what was intended: the encoder only excludes the extension part below the original Big5 range but doesn't exclude the other extension part above the original Big5 range.
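To make the asymmetry concrete, here is a minimal sketch of the pointer arithmetic only, with no index lookup and no use of encoding_rs. It assumes the rule from the Encoding Standard that the Big5 encoder ignores index entries whose pointer is less than (0xA1 - 0x81) × 157, which corresponds to lead bytes 0x81 through 0xA0. The function and constant names are made up for illustration.

```rust
/// First index pointer the encoder may use: (0xA1 - 0x81) * 157 = 5024.
/// Everything below it is decode-only.
const FIRST_ENCODABLE_POINTER: u32 = (0xA1 - 0x81) * 157;

/// One past the largest pointer reachable with lead bytes 0x81..=0xFE.
const POINTER_LIMIT: u32 = (0xFE - 0x81 + 1) * 157;

/// Decoder direction: map a lead/trail byte pair to an index pointer.
/// Returns None for bytes outside the Big5 lead/trail ranges.
fn big5_pointer(lead: u8, trail: u8) -> Option<u32> {
    if !(0x81..=0xFE).contains(&lead) {
        return None;
    }
    if !((0x40..=0x7E).contains(&trail) || (0xA1..=0xFE).contains(&trail)) {
        return None;
    }
    let offset: u32 = if trail < 0x7F { 0x40 } else { 0x62 };
    Some((lead as u32 - 0x81) * 157 + (trail as u32 - offset))
}

/// Whether the encoder is allowed to produce this pointer at all.
fn encoder_may_use(pointer: u32) -> bool {
    pointer >= FIRST_ENCODABLE_POINTER && pointer < POINTER_LIMIT
}

/// Encoder direction: map a pointer back to a lead/trail byte pair,
/// refusing the decode-only part of the index.
fn big5_bytes(pointer: u32) -> Option<(u8, u8)> {
    if !encoder_may_use(pointer) {
        return None;
    }
    let lead = pointer / 157 + 0x81;
    let trail = pointer % 157;
    let offset = if trail < 0x3F { 0x40 } else { 0x62 };
    Some((lead as u8, (trail + offset) as u8))
}
```

For example, `big5_pointer(0x87, 0x40)` yields a pointer that can be decoded, but `big5_bytes` returns `None` for it, which is the situation where an encoder falls back to a numeric character reference. The extension range above the original Big5 range still passes `encoder_may_use`, matching the observation above that only the lower part is excluded.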