Skip to content

fix(parser): store lone surrogates as escape sequence#10041

Merged
graphite-app[bot] merged 1 commit intomainfrom
03-25-fix_parser_store_lone_surrogates_as_escape_sequence
Mar 29, 2025
Merged

fix(parser): store lone surrogates as escape sequence#10041
graphite-app[bot] merged 1 commit intomainfrom
03-25-fix_parser_store_lone_surrogates_as_escape_sequence

Conversation

@overlookmotel
Copy link
Member

@overlookmotel overlookmotel commented Mar 25, 2025

Closes #3526.

Fully parse strings containing lone surrogates and encode the string in value.

Encoding schema is to encode a lone surrogate as the lossy replacement character, followed by the code point in hex. i.e. "\uD800" in source code would be encoded as \u{FFFD}d800 in value. The lossy replacement character itself is encoded as \u{FFFD}fffd.

There's nothing special about the lossy replacement character. Just had to choose some valid Unicode character to be the escape marker, and that seemed like a reasonable choice of a character which is likely to be rare in real-world code.

All the ESTree-Test262 test cases related to lone surrogates now pass.

WIP. A bit of tidying up to do yet.

Copy link
Member Author

overlookmotel commented Mar 25, 2025


How to use the Graphite Merge Queue

Add either label to this PR to merge it via the merge queue:

  • 0-merge - adds this PR to the back of the merge queue
  • hotfix - for urgent hot fixes, skip the queue and merge this PR next

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@codspeed-hq
Copy link

codspeed-hq bot commented Mar 25, 2025

CodSpeed Instrumentation Performance Report

Merging #10041 will not alter performance

Comparing 03-25-fix_parser_store_lone_surrogates_as_escape_sequence (f0e1510) with 03-28-test_minifier_update_cargo_minsize_snapshot (fe8625d)

Summary

✅ 33 untouched benchmarks

@overlookmotel overlookmotel force-pushed the 03-25-fix_parser_store_lone_surrogates_as_escape_sequence branch 2 times, most recently from 45ce03d to 77ed9da Compare March 25, 2025 18:39
@overlookmotel overlookmotel changed the base branch from 03-25-refactor_lexer_clarify_and_reformat_comments to graphite-base/10041 March 25, 2025 18:48
@overlookmotel overlookmotel force-pushed the 03-25-fix_parser_store_lone_surrogates_as_escape_sequence branch from 77ed9da to dc171cd Compare March 25, 2025 18:48
@overlookmotel overlookmotel changed the base branch from graphite-base/10041 to 03-26-refactor_lexer_remove_unnecessary_line March 25, 2025 18:48
@Boshen
Copy link
Member

Boshen commented Mar 26, 2025

cc @andreubotella for a quick review.

@andreubotella
Copy link

andreubotella commented Mar 26, 2025

cc @andreubotella for a quick review.

IIUC, if lone_surrogates is false, then any replacement characters in the value are true replacement characters; and if true, then the value string would need decoding, right? I think we can work with that in Nova. (cc @aapoalas)

That said, I worry that other users of oxc_ast might not be aware of this change, and might use value without checking for lone_surrogate. You could say that no AsRef<str> would accurately represent the JS string, but I'd say that decoding any lone surrogate lossily to the replacement character (as String::from_utf16_lossy does) would be a lot more useful for most use cases. If the extra memory use is not an issue, I'd suggest that value should be this lossy decode, and lone_surrogate_string would be an Option<Atom<'a>> that is only Some if there are lone surrogates.

@overlookmotel overlookmotel changed the base branch from 03-26-refactor_lexer_remove_unnecessary_line to graphite-base/10041 March 27, 2025 07:26
@overlookmotel overlookmotel changed the base branch from graphite-base/10041 to 03-26-refactor_lexer_remove_unnecessary_line March 27, 2025 07:26
@overlookmotel overlookmotel changed the base branch from 03-26-refactor_lexer_remove_unnecessary_line to graphite-base/10041 March 27, 2025 07:26
@overlookmotel overlookmotel force-pushed the 03-25-fix_parser_store_lone_surrogates_as_escape_sequence branch from dc171cd to 025be46 Compare March 27, 2025 07:28
@overlookmotel overlookmotel changed the base branch from graphite-base/10041 to 03-27-perf_lexer_faster_decoding_unicode_escape_sequences March 27, 2025 07:28
@overlookmotel overlookmotel marked this pull request as ready for review March 27, 2025 07:30
@overlookmotel overlookmotel marked this pull request as draft March 27, 2025 07:32
@overlookmotel
Copy link
Member Author

Something is wrong. This PR should make no difference to result of cargo minsize, because codegen is only using the value of raw when lone_surrogates flag is set (prior to #10044). I'll investigate.

@overlookmotel overlookmotel changed the base branch from main to graphite-base/10041 March 28, 2025 12:58
@overlookmotel overlookmotel changed the base branch from graphite-base/10041 to main March 28, 2025 12:58
@overlookmotel overlookmotel changed the base branch from main to graphite-base/10041 March 28, 2025 12:59
@overlookmotel overlookmotel force-pushed the 03-25-fix_parser_store_lone_surrogates_as_escape_sequence branch from 0aa89a2 to d822f65 Compare March 28, 2025 12:59
@overlookmotel overlookmotel changed the base branch from graphite-base/10041 to 03-28-test_minifier_update_cargo_minsize_snapshot March 28, 2025 12:59
@overlookmotel
Copy link
Member Author

overlookmotel commented Mar 28, 2025

Something is wrong. This PR should make no difference to result of cargo minsize, because codegen is only using the value of raw when lone_surrogates flag is set (prior to #10044). I'll investigate.

Nope. Snapshot was just out of date.

I just need to add some tests, and this should be good to go.

@Boshen Boshen marked this pull request as ready for review March 29, 2025 12:24
@graphite-app
Copy link
Contributor

graphite-app bot commented Mar 29, 2025

Merge activity

Closes #3526.

Fully parse strings containing lone surrogates and encode the string in `value`.

Encoding schema is to encode a lone surrogate as the lossy replacement character, followed by the code point in hex. i.e. `"\uD800"` in source code would be encoded as `\u{FFFD}d800` in `value`. The lossy replacement character itself is encoded as `\u{FFFD}fffd`.

There's nothing special about the lossy replacement character. Just had to choose *some* valid Unicode character to be the escape marker, and that seemed like a reasonable choice of a character which is likely to be rare in real-world code.

All the ESTree-Test262 test cases related to lone surrogates now pass.

WIP. A bit of tidying up to do yet.
@graphite-app graphite-app bot force-pushed the 03-28-test_minifier_update_cargo_minsize_snapshot branch from 7e15588 to a3c1dd9 Compare March 29, 2025 12:48
@graphite-app graphite-app bot force-pushed the 03-25-fix_parser_store_lone_surrogates_as_escape_sequence branch from d822f65 to f0e1510 Compare March 29, 2025 12:48
graphite-app bot pushed a commit that referenced this pull request Mar 29, 2025
…ithout reference to `raw` (#10044)

#10041 changed how lone surrogates are handled in `StringLiteral`s.

`StringLiteral`s which include lone surrogates now have the `lone_surrogates` flag set, and `value` encodes lone surrogates as `\u{FFFD}XXXX`, where `XXXX` is the code unit encoded as hex.

Codegen check the `lone_surrogates` flag and decode the lone surrogates if they're present. This means that:

1. A `StringLiteral` no longer needs to have `raw` field populated, so you can (if you choose to for some reason) create a new `StringLiteral` containing lone surrogates.

2. `StringLiteral`s containing lone surrogates now have any other characters escaped same as how `StringLiteral`s without lone surrogates are printed.
Base automatically changed from 03-28-test_minifier_update_cargo_minsize_snapshot to main March 29, 2025 13:01
@graphite-app graphite-app bot removed the 0-merge Merge with Graphite Merge Queue label Mar 29, 2025
@graphite-app graphite-app bot merged commit f0e1510 into main Mar 29, 2025
27 checks passed
@graphite-app graphite-app bot deleted the 03-25-fix_parser_store_lone_surrogates_as_escape_sequence branch March 29, 2025 13:04
@overlookmotel
Copy link
Member Author

@overlookmotel Note to self: Add some tests. Especially for strings containing both lone surrogates and lossy replacement characters.

@overlookmotel
Copy link
Member Author

Tests added in #10175.

graphite-app bot pushed a commit that referenced this pull request Apr 2, 2025
…10175)

Add tests for correct parsing of `StringLiteral`s containing lone surrogates and lossy replacement characters, after #10041.

The tests are in `oxc_codegen` but primarily these tests test the logic in parser.
graphite-app bot pushed a commit that referenced this pull request Apr 2, 2025
…e sequence (#10182)

Encode lone surrogates in `cooked` property of `TemplateElementValue` using same encoding scheme as for `StringLiteral`s.

In fact, they were already being encoded like this after #10041, but add a `lone_surrogates` flag to `TemplateLiteral` to decode them correctly in ESTree AST.

`oxc_codegen` ignores `cooked` and just prints `raw`, so needs no alteration.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-ast Area - AST A-codegen Area - Code Generation A-parser Area - Parser C-bug Category - Bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

codegen: string surrogates printed incorrectly

3 participants