fix(parser): store lone surrogates as escape sequence #10041
CodSpeed Instrumentation Performance Report: Merging #10041 will not alter performance.
cc @andreubotella for a quick review. |
IIUC, if That said, I worry that other users of oxc_ast might not be aware of this change, and might use |
Something is wrong. This PR should make no difference to result of |
Nope. Snapshot was just out of date. I just need to add some tests, and this should be good to go. |
Closes #3526.

Fully parse strings containing lone surrogates and encode the string in `value`.

The encoding scheme represents a lone surrogate as the lossy replacement character (U+FFFD), followed by the surrogate's code unit in hex. For example, `"\uD800"` in source code is encoded as `\u{FFFD}d800` in `value`. The lossy replacement character itself is encoded as `\u{FFFD}fffd`.

There's nothing special about the lossy replacement character. I just had to choose *some* valid Unicode character to be the escape marker, and that seemed like a reasonable choice because it is likely to be rare in real-world code.

All the ESTree-Test262 test cases related to lone surrogates now pass.

WIP. A bit of tidying up to do yet.
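The encoding scheme described above can be sketched roughly as follows. This is a minimal illustration, not the parser's actual code: the function name `encode_lone_surrogates` and the choice of a UTF-16 code unit slice as input are assumptions made for the example, and the lowercase 4-digit hex form follows the `d800` / `fffd` examples in the description.

```rust
use std::fmt::Write;

/// Hypothetical helper (not part of oxc): encode UTF-16 code units into a
/// Rust `String`, using U+FFFD as the escape marker for lone surrogates.
fn encode_lone_surrogates(units: &[u16]) -> String {
    let mut out = String::new();
    for result in char::decode_utf16(units.iter().copied()) {
        match result {
            // A literal U+FFFD must itself be escaped, so that decoding is unambiguous.
            Ok('\u{FFFD}') => out.push_str("\u{FFFD}fffd"),
            // Any other valid character (including paired surrogates) is stored as-is.
            Ok(c) => out.push(c),
            // A lone surrogate becomes the marker followed by its code unit in hex.
            Err(e) => {
                write!(out, "\u{FFFD}{:04x}", e.unpaired_surrogate()).unwrap();
            }
        }
    }
    out
}
```

Under these assumptions, the code unit `0xD800` on its own yields `\u{FFFD}d800`, while a properly paired surrogate sequence is stored as the single character it represents.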
…ithout reference to `raw` (#10044)

#10041 changed how lone surrogates are handled in `StringLiteral`s. `StringLiteral`s which include lone surrogates now have the `lone_surrogates` flag set, and `value` encodes lone surrogates as `\u{FFFD}XXXX`, where `XXXX` is the code unit encoded as hex. Codegen checks the `lone_surrogates` flag and decodes the lone surrogates if they're present.

This means that:

1. A `StringLiteral` no longer needs to have its `raw` field populated, so you can (if you choose to for some reason) create a new `StringLiteral` containing lone surrogates.
2. `StringLiteral`s containing lone surrogates now have any other characters escaped in the same way as `StringLiteral`s without lone surrogates.
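As a rough illustration of the decoding step that codegen performs, the sketch below turns an encoded `value` back into JavaScript source text. It is not the real `oxc_codegen` implementation: the function name `print_string_value` is hypothetical, the `lone_surrogates` parameter stands in for the flag on the AST node, printing the escape as uppercase `\uXXXX` is an assumption, and escaping of quotes and other characters is deliberately left out.

```rust
/// Hypothetical helper (not the actual oxc_codegen code): print the contents
/// of a string literal, decoding `\u{FFFD}XXXX` sequences when the flag is set.
fn print_string_value(value: &str, lone_surrogates: bool) -> String {
    // Without the flag, U+FFFD is an ordinary character and nothing is decoded.
    if !lone_surrogates {
        return value.to_string();
    }
    let mut out = String::new();
    let mut chars = value.chars();
    while let Some(c) = chars.next() {
        if c != '\u{FFFD}' {
            out.push(c);
            continue;
        }
        // In this scheme the marker is followed by 4 hex digits.
        let hex: String = chars.by_ref().take(4).collect();
        if hex == "fffd" {
            // An escaped replacement character decodes to itself.
            out.push('\u{FFFD}');
        } else {
            // A lone surrogate is printed as a JavaScript `\uXXXX` escape.
            out.push_str("\\u");
            out.push_str(&hex.to_uppercase());
        }
    }
    out
}
```

Because the output depends only on `value` and the flag, a sketch like this has no need for the `raw` field, which is the point made in item 1 above.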
|
@overlookmotel Note to self: Add some tests. Especially for strings containing both lone surrogates and lossy replacement characters. |
|
Tests added in #10175. |
…e sequence (#10182)

Encode lone surrogates in the `cooked` property of `TemplateElementValue` using the same encoding scheme as for `StringLiteral`s. In fact, they were already being encoded like this after #10041, but this adds a `lone_surrogates` flag to `TemplateLiteral` so they can be decoded correctly in the ESTree AST. `oxc_codegen` ignores `cooked` and just prints `raw`, so it needs no alteration.
