fix(parser): store lone surrogates as escape sequence by overlookmotel · Pull Request #10041 · oxc-project/oxc

overlookmotel · 2025-03-25T18:19:07Z

Closes #3526.

Fully parse strings containing lone surrogates and encode the string in value.

The encoding scheme represents a lone surrogate as the lossy replacement character (U+FFFD), followed by the surrogate's code unit in hex. For example, "\uD800" in source code is encoded as \u{FFFD}d800 in value. The lossy replacement character itself is encoded as \u{FFFD}fffd.

There's nothing special about the lossy replacement character. Just had to choose some valid Unicode character to be the escape marker, and that seemed like a reasonable choice of a character which is likely to be rare in real-world code.
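
A minimal sketch of the encoding described above, for illustration only. `encode_value` is a hypothetical helper and not oxc's lexer code. It takes the string's UTF-16 code units and assumes, following the reviewer's reading later in this thread, that U+FFFD is only escaped when the string actually contains a lone surrogate.

```rust
/// Hypothetical helper (not part of oxc): encode UTF-16 code units into a
/// valid Rust `String`, returning the encoded value and a flag that says
/// whether any lone surrogate was found.
fn encode_value(units: &[u16]) -> (String, bool) {
    // First pass: does the input contain any unpaired surrogate?
    let has_lone = char::decode_utf16(units.iter().copied()).any(|r| r.is_err());

    let mut out = String::new();
    for result in char::decode_utf16(units.iter().copied()) {
        match result {
            // Assumption: a real U+FFFD is escaped only when the flag is set,
            // so that it cannot be mistaken for the escape marker.
            Ok('\u{FFFD}') if has_lone => out.push_str("\u{FFFD}fffd"),
            Ok(c) => out.push(c),
            // Lone surrogate: escape marker followed by 4 lowercase hex digits.
            Err(e) => {
                out.push('\u{FFFD}');
                out.push_str(&format!("{:04x}", e.unpaired_surrogate()));
            }
        }
    }
    (out, has_lone)
}
```

With this sketch, `encode_value(&[0xD800])` yields `\u{FFFD}d800` with the flag set, while a string with no lone surrogates is returned unchanged with the flag unset.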

All the ESTree-Test262 test cases related to lone surrogates now pass.

WIP. A bit of tidying up to do yet.


codspeed-hq · 2025-03-25T18:29:28Z

CodSpeed Instrumentation Performance Report

Merging #10041 will not alter performance

Comparing 03-25-fix_parser_store_lone_surrogates_as_escape_sequence (f0e1510) with 03-28-test_minifier_update_cargo_minsize_snapshot (fe8625d)

Summary

✅ 33 untouched benchmarks

Boshen · 2025-03-26T00:56:59Z

cc @andreubotella for a quick review.

andreubotella · 2025-03-26T01:16:37Z

> cc @andreubotella for a quick review.

IIUC, if lone_surrogates is false, then any replacement characters in the value are true replacement characters; and if true, then the value string would need decoding, right? I think we can work with that in Nova. (cc @aapoalas)

That said, I worry that other users of oxc_ast might not be aware of this change, and might use value without checking lone_surrogates. You could say that no AsRef<str> would accurately represent the JS string, but I'd say that decoding any lone surrogate lossily to the replacement character (as String::from_utf16_lossy does) would be a lot more useful for most use cases. If the extra memory use is not an issue, I'd suggest that value should be this lossy decode, and lone_surrogate_string would be an Option<Atom<'a>> that is only Some if there are lone surrogates.
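
To make the two options in this comment concrete, here is a rough sketch of how a consumer might decode `value`. Both function names are hypothetical and not part of oxc_ast; the sketch assumes the escape marker is always followed by exactly four hex digits when the flag is set.

```rust
/// Hypothetical consumer-side helper (not an oxc API): recover the exact
/// UTF-16 code units of the JS string from `value` and the flag.
fn decode_to_utf16(value: &str, lone_surrogates: bool) -> Vec<u16> {
    if !lone_surrogates {
        // Flag unset: any U+FFFD in `value` is a genuine replacement character.
        return value.encode_utf16().collect();
    }
    let mut units = Vec::new();
    let mut chars = value.chars();
    while let Some(c) = chars.next() {
        if c == '\u{FFFD}' {
            // Escape marker: the next 4 chars are the code unit in hex
            // (`fffd` stands for a literal U+FFFD).
            let hex: String = chars.by_ref().take(4).collect();
            let unit = u16::from_str_radix(&hex, 16).expect("malformed escape");
            units.push(unit);
        } else {
            let mut buf = [0u16; 2];
            units.extend_from_slice(c.encode_utf16(&mut buf));
        }
    }
    units
}

/// The lossy alternative suggested above: every lone surrogate becomes U+FFFD.
fn decode_lossy(value: &str, lone_surrogates: bool) -> String {
    String::from_utf16_lossy(&decode_to_utf16(value, lone_surrogates))
}
```

The first function preserves the JS string exactly but cannot return a `&str`; the second returns valid UTF-8 at the cost of losing which surrogate was present.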

overlookmotel · 2025-03-28T12:11:33Z

Something is wrong. This PR should make no difference to result of cargo minsize, because codegen is only using the value of raw when lone_surrogates flag is set (prior to #10044). I'll investigate.

overlookmotel · 2025-03-28T13:00:09Z

> Something is wrong. This PR should make no difference to result of cargo minsize, because codegen is only using the value of raw when lone_surrogates flag is set (prior to #10044). I'll investigate.

Nope. Snapshot was just out of date.

I just need to add some tests, and this should be good to go.

graphite-app · 2025-03-29T12:29:20Z

Merge activity

Mar 29, 8:29 AM EDT: Boshen added this pull request to the Graphite merge queue.
Mar 29, 9:04 AM EDT: A user merged this pull request with the Graphite merge queue.


…ithout reference to `raw` (#10044)

#10041 changed how lone surrogates are handled in `StringLiteral`s. `StringLiteral`s which include lone surrogates now have the `lone_surrogates` flag set, and `value` encodes lone surrogates as `\u{FFFD}XXXX`, where `XXXX` is the code unit encoded as hex. Codegen checks the `lone_surrogates` flag and decodes the lone surrogates if they're present. This means that:

1. A `StringLiteral` no longer needs to have its `raw` field populated, so you can (if you choose to for some reason) create a new `StringLiteral` containing lone surrogates.
2. `StringLiteral`s containing lone surrogates now have any other characters escaped in the same way as `StringLiteral`s without lone surrogates.
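
As an illustration of the codegen behaviour described in this commit message, the sketch below prints a double-quoted JS string literal from `value` and the flag alone, without consulting `raw`. `print_string_literal` is a hypothetical function, not the `oxc_codegen` implementation, and it handles only a few escapes for brevity.

```rust
/// Hypothetical printer (not oxc_codegen's code): emit a double-quoted JS
/// string literal, turning encoded lone surrogates back into `\uXXXX` escapes.
fn print_string_literal(value: &str, lone_surrogates: bool) -> String {
    let mut out = String::from("\"");
    let mut chars = value.chars();
    while let Some(c) = chars.next() {
        match c {
            // Only treat U+FFFD as an escape marker when the flag is set.
            '\u{FFFD}' if lone_surrogates => {
                let hex: String = chars.by_ref().take(4).collect();
                if hex == "fffd" {
                    // Escaped literal replacement character.
                    out.push('\u{FFFD}');
                } else {
                    // Lone surrogate: print it as a JS unicode escape.
                    out.push_str("\\u");
                    out.push_str(&hex);
                }
            }
            // All other characters are escaped the same way regardless of the flag.
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            _ => out.push(c),
        }
    }
    out.push('"');
    out
}
```

Because the surrogate is recovered from `value`, a `StringLiteral` built by hand with no `raw` can still be printed faithfully under this scheme.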

overlookmotel · 2025-03-31T08:11:28Z

@overlookmotel Note to self: Add some tests. Especially for strings containing both lone surrogates and lossy replacement characters.

overlookmotel · 2025-04-02T06:53:13Z

Tests added in #10175.

…10175) Add tests for correct parsing of `StringLiteral`s containing lone surrogates and lossy replacement characters, after #10041. The tests live in `oxc_codegen`, but they primarily exercise the logic in the parser.

…e sequence (#10182) Encode lone surrogates in the `cooked` property of `TemplateElementValue` using the same encoding scheme as for `StringLiteral`s. In fact, they were already being encoded like this after #10041, but this adds a `lone_surrogates` flag to `TemplateLiteral` so they are decoded correctly in the ESTree AST. `oxc_codegen` ignores `cooked` and just prints `raw`, so it needs no alteration.

overlookmotel mentioned this pull request Mar 25, 2025

fix(ast/estree): fix StringLiterals containing lone surrogates #10036

Merged

This was referenced Mar 25, 2025

refactor(lexer): simplify macros for string parsing + correct comment #10039

Merged

refactor(lexer): clarify and reformat comments #10040

Merged

github-actions bot added the A-parser (Area - Parser), A-ast (Area - AST), A-codegen (Area - Code Generation), and C-bug (Category - Bug) labels Mar 25, 2025

overlookmotel force-pushed the 03-25-fix_parser_store_lone_surrogates_as_escape_sequence branch 2 times, most recently from 45ce03d to 77ed9da Compare March 25, 2025 18:39

overlookmotel changed the base branch from 03-25-refactor_lexer_clarify_and_reformat_comments to graphite-base/10041 March 25, 2025 18:48

overlookmotel force-pushed the 03-25-fix_parser_store_lone_surrogates_as_escape_sequence branch from 77ed9da to dc171cd Compare March 25, 2025 18:48

overlookmotel changed the base branch from graphite-base/10041 to 03-26-refactor_lexer_remove_unnecessary_line March 25, 2025 18:48

overlookmotel changed the base branch from 03-26-refactor_lexer_remove_unnecessary_line to graphite-base/10041 March 27, 2025 07:26

overlookmotel changed the base branch from graphite-base/10041 to 03-26-refactor_lexer_remove_unnecessary_line March 27, 2025 07:26

overlookmotel changed the base branch from 03-26-refactor_lexer_remove_unnecessary_line to graphite-base/10041 March 27, 2025 07:26

overlookmotel mentioned this pull request Mar 27, 2025

refactor(lexer): shorten code for parsing hex digit #10072

Merged

overlookmotel force-pushed the graphite-base/10041 branch from 76e6337 to 68f53e0 Compare March 27, 2025 07:28

overlookmotel force-pushed the 03-25-fix_parser_store_lone_surrogates_as_escape_sequence branch from dc171cd to 025be46 Compare March 27, 2025 07:28

overlookmotel changed the base branch from graphite-base/10041 to 03-27-perf_lexer_faster_decoding_unicode_escape_sequences March 27, 2025 07:28

overlookmotel mentioned this pull request Mar 27, 2025

perf(lexer): faster decoding unicode escape sequences #10073

Merged

overlookmotel marked this pull request as ready for review March 27, 2025 07:30

overlookmotel marked this pull request as draft March 27, 2025 07:32

This was referenced Mar 28, 2025

fix(codegen): prevent arithmetic overflow calculating quote for StringLiterals #10102

Merged

fix(codegen): do not escape $ in strings unless using backtick as quote #10103

Merged

overlookmotel force-pushed the 03-25-fix_parser_store_lone_surrogates_as_escape_sequence branch from 1162843 to 0aa89a2 Compare March 28, 2025 12:07

overlookmotel changed the base branch from main to graphite-base/10041 March 28, 2025 12:58

overlookmotel changed the base branch from graphite-base/10041 to main March 28, 2025 12:58

overlookmotel changed the base branch from main to graphite-base/10041 March 28, 2025 12:59

overlookmotel force-pushed the 03-25-fix_parser_store_lone_surrogates_as_escape_sequence branch from 0aa89a2 to d822f65 Compare March 28, 2025 12:59

overlookmotel changed the base branch from graphite-base/10041 to 03-28-test_minifier_update_cargo_minsize_snapshot March 28, 2025 12:59

This was referenced Mar 28, 2025

test(minifier): update cargo minsize snapshot #10105

Merged

ci(codegen): add benchmark for minified printing #10109

Closed

Boshen marked this pull request as ready for review March 29, 2025 12:24

graphite-app bot force-pushed the 03-28-test_minifier_update_cargo_minsize_snapshot branch from 7e15588 to a3c1dd9 Compare March 29, 2025 12:48

graphite-app bot force-pushed the 03-25-fix_parser_store_lone_surrogates_as_escape_sequence branch from d822f65 to f0e1510 Compare March 29, 2025 12:48

Base automatically changed from 03-28-test_minifier_update_cargo_minsize_snapshot to main March 29, 2025 13:01

graphite-app bot removed the 0-merge (Merge with Graphite Merge Queue) label Mar 29, 2025

graphite-app bot merged commit f0e1510 into main Mar 29, 2025
27 checks passed

graphite-app bot deleted the 03-25-fix_parser_store_lone_surrogates_as_escape_sequence branch March 29, 2025 13:04

oxc-bot mentioned this pull request Apr 1, 2025

release(crates): v0.62.0 #10158

Merged

overlookmotel mentioned this pull request Apr 2, 2025

test(parser): tests for lone surrogates and lossy escape characters #10175

Merged

overlookmotel mentioned this pull request Apr 2, 2025

fix(parser): store lone surrogates in TemplateElementValue as escape sequence #10182

Merged

Boshen mentioned this pull request Aug 6, 2025

Non-Unicode Escape sequence incorrectly generated in string literals swc-project/swc#10978

Closed
