fix(es/ast): Fix unicode lone surrogates handling #10987

h-a-n-a · 2025-08-07T12:18:37Z

Description:

This PR fixes an issue with how lone surrogates are handled in Rust.

All credit for this fix goes to the Oxc team (#10978 (comment)). I am porting the fix made in Oxc and making it work in SWC.

Problem:

The problem is related to the fundamental difference between how Rust and JavaScript handle Unicode, especially lone surrogates.

JavaScript's Unicode Model

```javascript
// JavaScript allows this - lone surrogates are stored in UTF-16
let str = "\uD800";  // High surrogate alone - technically invalid Unicode
let obj = { "\uD800": "value" };  // Works fine in JS
```

JavaScript uses UTF-16 internally and tolerates invalid Unicode sequences:

- Strings are UTF-16 code unit sequences, not Unicode scalar sequences
- Lone surrogates (U+D800-U+DFFF) are allowed and preserved
- No validation that surrogates come in proper high/low pairs
- Engine just stores the raw UTF-16 code units
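To make the JavaScript model concrete from the Rust side, here is a minimal sketch that treats a JS string as a slice of `u16` code units and checks for unpaired surrogates. The helper name `has_lone_surrogate` is hypothetical and only for illustration; it is not part of SWC or Oxc.

```rust
/// Hypothetical helper: returns true if `units` contains a surrogate code
/// unit that is not part of a well-formed high/low pair.
fn has_lone_surrogate(units: &[u16]) -> bool {
    // `char::decode_utf16` yields `Err` for every unpaired surrogate.
    char::decode_utf16(units.iter().copied()).any(|r| r.is_err())
}
```

For instance, `has_lone_surrogate(&[0xD800])` is true, while the well-formed pair `[0xD83D, 0xDE00]` is not flagged. The standard `String::from_utf16` rejects the first input, and `String::from_utf16_lossy` replaces the lone surrogate with U+FFFD, which loses the original code unit.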

Rust's Unicode Model

```rust
// This CANNOT exist in Rust:
let s = "\u{D800}";  // ❌ COMPILE ERROR - not a valid Unicode scalar
let c: char = '\u{D800}';  // ❌ COMPILE ERROR - char excludes surrogates
```

Rust enforces strict Unicode validity:

- String is UTF-8 and must contain valid Unicode scalar values
- char represents Unicode scalar values (U+0000-U+D7FF, U+E000-U+10FFFF)
- Surrogate code points (U+D800-U+DFFF) are explicitly excluded
- No way to represent lone surrogates in Rust's standard string types
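The same restriction can be observed at runtime. The sketch below uses the standard `char::from_u32`, which only succeeds for Unicode scalar values; the `classify` function itself is a hypothetical name used for illustration.

```rust
/// Hypothetical helper: classifies a code point by whether Rust's `char`
/// can hold it.
fn classify(code_point: u32) -> &'static str {
    match char::from_u32(code_point) {
        // Any scalar value converts successfully.
        Some(_) => "scalar value",
        // Surrogates are valid code points but not scalar values.
        None if (0xD800..=0xDFFF).contains(&code_point) => "surrogate",
        // Everything above U+10FFFF is outside Unicode entirely.
        None => "out of range",
    }
}
```

The boundaries match the ranges listed above: `0xD7FF` and `0xE000` are scalar values, while everything from `0xD800` through `0xDFFF` is rejected.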

Key Changes:

1. AST Structure: Added a `lone_surrogates: bool` field to the `Str` and `TplElement` structs to track when strings contain lone surrogates
2. Encoding Strategy: Lone surrogates are encoded as `\u{FFFD}` (the replacement character) followed by the original four hex digits for internal representation
3. Code Generation: Modified string output to escape lone surrogates back to `\uXXXX` format during codegen
4. Test: Also fixed some cases related to member expression optimizations and string concatenation optimizations
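As a rough illustration of the encoding strategy, the sketch below converts UTF-16 code units into a Rust `String` and writes each lone surrogate as U+FFFD followed by its four hex digits. The function name and return shape are assumptions made for this example and are not taken from the PR's code. The sketch also ignores how a literal U+FFFD already present in the input should be treated.

```rust
/// Hypothetical helper: builds a valid Rust `String` from UTF-16 code units,
/// writing each lone surrogate as U+FFFD followed by four hex digits.
/// Returns the string and whether any lone surrogate was seen.
fn encode_with_marker(units: &[u16]) -> (String, bool) {
    let mut out = String::new();
    let mut lone_surrogates = false;
    for item in char::decode_utf16(units.iter().copied()) {
        match item {
            // Well-formed code units (including proper pairs) decode to a char.
            Ok(c) => out.push(c),
            // An unpaired surrogate is replaced by the marker plus its hex value.
            Err(e) => {
                lone_surrogates = true;
                out.push('\u{FFFD}');
                out.push_str(&format!("{:04X}", e.unpaired_surrogate()));
            }
        }
    }
    (out, lone_surrogates)
}
```

The returned flag plays the role of the `lone_surrogates` field: consumers need it to know that the U+FFFD characters in the string are markers rather than content.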

TODOs:

1. Add support for serializing and deserializing literals with lone surrogates in `swc_estree_compat`
2. Reflect AST changes in `binding` crates

Breaking changes:

This breaks the AST by adding a `lone_surrogates` field to `Str` and `TplElement`, and it changes the meaning of `value` in `Str` and `cooked` in `TplElement`. Both fields use `\u{FFFD}` (the replacement character) as an escape marker when `lone_surrogates` is `true`.

To consume the real value, first check whether `lone_surrogates` is `true`, then unescape the value by removing the `\u{FFFD}` character and rebuilding the code unit from the four hex digits that follow it (for example, `\u{FFFD}D800` becomes `\uD800`).
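A consumer following these steps might look roughly like the sketch below. The function `to_utf16_units` is hypothetical and takes the string and the flag as plain arguments rather than the real `Str` or `TplElement` types. It assumes the marker is always followed by exactly four hex digits when the flag is set.

```rust
/// Hypothetical helper: recovers the original UTF-16 code units from an
/// escaped value, honoring the `lone_surrogates` flag.
fn to_utf16_units(value: &str, lone_surrogates: bool) -> Vec<u16> {
    if !lone_surrogates {
        // No markers present: the string is ordinary text.
        return value.encode_utf16().collect();
    }
    let mut units = Vec::new();
    let mut rest = value;
    while let Some(c) = rest.chars().next() {
        let after = &rest[c.len_utf8()..];
        // The four characters after the current one, if they are all hex digits.
        let hex = after
            .get(..4)
            .filter(|h| h.bytes().all(|b| b.is_ascii_hexdigit()));
        if c == '\u{FFFD}' {
            // U+FFFD followed by four hex digits stands for one code unit.
            if let Some(unit) = hex.and_then(|h| u16::from_str_radix(h, 16).ok()) {
                units.push(unit);
                rest = &after[4..];
                continue;
            }
        }
        // Any other character is encoded as one or two UTF-16 code units.
        let mut buf = [0u16; 2];
        units.extend_from_slice(c.encode_utf16(&mut buf));
        rest = after;
    }
    units
}
```

With the flag set, `"a\u{FFFD}D800"` yields the code units `0x0061, 0xD800`. With the flag unset, the same string is treated as six ordinary characters.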

Related issue (if exists):

closes #10978
closes #10353

Fixed a regression of #7678

changeset-bot · 2025-08-07T12:18:44Z

🦋 Changeset detected

Latest commit: bad060c

The changes in this PR will be included in the next version bump.

codspeed-hq · 2025-08-14T09:36:30Z

CodSpeed Performance Report

Merging #10987 will not alter performance

Comparing h-a-n-a:fix-unicode-escape (bad060c) with main (be6b695)

Summary

✅ 140 untouched benchmarks

kdy1 · 2025-08-14T22:23:34Z

Thank you so much! Actually, I tried to fix this several times, but it was very confusing :(

socket-security · 2025-08-18T12:53:11Z

No dependency changes detected.

👍 No dependency changes detected in pull request

h-a-n-a · 2025-08-19T08:20:27Z

@kdy1, Hi!
I'm quite confused by the CI failure in https://github.com/swc-project/swc/actions/runs/17063469624/job/48375172890?pr=10987. I also tried running the test locally on main, but had no luck. Could you please shed some light on this?

In addition, I've added a lone_surrogates field to StringLiteral, as suggested in the original issue, and we should probably fix TplElement as well, although that will very likely cause some performance regression. The use case is that lone surrogates in a TplElement get transformed into a StringLiteral when the target is set to es5 or lower.
Failed case: https://github.com/swc-project/swc/actions/runs/17063469624/job/48375172821?pr=10987

EDIT: Added support for TemplateLiteral

kdy1 · 2025-08-20T07:28:51Z

Do we really need to change AST?

kdy1 · 2025-08-20T07:29:37Z

Actually, changing the AST is not allowed in our case, because for v2 we are going to align the AST with babel or typescript-eslint

h-a-n-a · 2025-08-20T07:43:18Z

> Actually, changing the AST is not allowed in our case, because for v2 we are going to align the AST with babel or typescript-eslint

Hi! @kdy1
It's related to the fundamental difference between Rust strings and JavaScript strings. Without the custom escape sequence and the lone_surrogates flag, it's pretty hard to tell whether a value contains a lone surrogate or is just a raw string.

For example, you cannot escape \uD800 to \\uD800, because that changes the semantics of the string. So I introduced the lone_surrogates flag to indicate that we've used \u{FFFD} to escape the lone surrogates. With it, we can convert the lone surrogates into valid code output and still perform the subsequent string optimizations. Without it, I believe we cannot. Alternatively, maybe we could change the Atom type to support UTF-16 code units.
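The ambiguity can be sketched from the code generation side. In the hypothetical function below (not the actual swc codegen), the same Rust string produces different JavaScript output depending on the flag, and a literal backslash is escaped separately so that the text `\uD800` is never confused with the lone surrogate. Quotes, newlines, and other escapes are left out to keep the sketch short.

```rust
/// Hypothetical helper: emits the body of a JS string literal (without
/// quotes) from an escaped value and its `lone_surrogates` flag.
fn emit_js_string_body(value: &str, lone_surrogates: bool) -> String {
    let mut out = String::with_capacity(value.len());
    let mut rest = value;
    while let Some(c) = rest.chars().next() {
        rest = &rest[c.len_utf8()..];
        // The four characters after `c`, if they are all hex digits.
        let hex = rest
            .get(..4)
            .filter(|h| h.bytes().all(|b| b.is_ascii_hexdigit()));
        match hex {
            // Only trust the marker when the flag says it was produced on purpose.
            Some(h) if lone_surrogates && c == '\u{FFFD}' => {
                out.push_str("\\u");
                out.push_str(h);
                rest = &rest[4..];
            }
            // A real backslash in the value must be doubled in source text.
            _ if c == '\\' => out.push_str("\\\\"),
            _ => out.push(c),
        }
    }
    out
}
```

With the flag set, the value U+FFFD followed by `D800` is emitted as the six characters `\uD800`, which JavaScript reads as one lone surrogate. With the flag unset, the same value is emitted unchanged, and a value that literally contains a backslash followed by `uD800` is emitted with the backslash doubled.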

> babel or typescript-eslint

We could probably just change how it serializes and deserializes to align the AST with them? Adding another layer seems necessary to me, given the difference between Rust and JavaScript that I mentioned above.

Boshen · 2025-08-20T07:44:30Z

I'd appreciate a reference to Oxc, because I assume the code is based on my comment #10978 (comment). It took us a lot of time to understand the problem and then make the right fix.

> Actually, changing the AST is not allowed in our case, because for v2 we are going to align the AST with babel or typescript-eslint

You probably want to bend the rule here: the String type we use in Rust is UTF-8, so you need some extra information to represent UTF-16.

h-a-n-a · 2025-08-20T07:45:25Z

> I'd appreciate a reference to Oxc because I assume the code is based on my comment #10978 (comment), it took us a lot of time to understand the problem and then make the right fix.

Will do! Really appreciate the job y'all have done ;-)

kdy1 · 2025-08-20T07:58:32Z

@h-a-n-a @Boshen I see, thank you so much for the hard work and explanation! I think it might be possible to conform to their AST even with some changes... I'll review it later.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

h-a-n-a · 2025-08-28T06:52:38Z

@kdy1 Copilot seems broken with this huge code change ;-(.

kdy1

cargo refuses to publish if there's a dependency cycle

Cargo.lock

kdy1

You deleted ./crates/swc_ecma_transforms_proposal/tests/decorator-tests

Please restore it

kdy1 · 2025-09-04T12:33:31Z

crates/swc_ecma_transforms_proposal/tests/decorator-tests

Fixed. Didn't know why I would ever delete that 🥲

kdy1 · 2025-09-04T12:41:31Z

The submodule change is the last request 👍.
Thank you so much for the hard work.

cc @hardfist to align the time to publish this, as this is a Wasm-ABI breaking change. I'll also talk with the next.js team

hardfist · 2025-09-05T01:09:06Z

We can merge this breaking change in Rspack 1.6 (web-infra-dev/rspack#11554). I'm not sure whether we can keep it compatible with code that doesn't contain lone surrogates.

h-a-n-a · 2025-09-05T07:31:23Z

Thanks for reviewing this! ❤️

kdy1 · 2025-09-05T14:53:47Z

I'll revert this PR as requested by @CPunisher. Can you rebase this PR and resend it to the dev/rust branch?

This reverts commit 0557609.

Reverts #10987

kdy1 · 2025-09-06T03:51:47Z

I'll do it myself

h-a-n-a force-pushed the fix-unicode-escape branch from e32046f to 5f04ddb Compare August 14, 2025 09:39

h-a-n-a force-pushed the fix-unicode-escape branch from 4d12831 to df0b9ee Compare August 19, 2025 07:13

CPunisher self-assigned this Aug 20, 2025

h-a-n-a changed the title ~~fix: unicode surrogates should be handled in string literals~~ fix: fix unicode lone surrogates handling Aug 20, 2025

h-a-n-a marked this pull request as ready for review August 20, 2025 07:03

h-a-n-a requested review from a team as code owners August 20, 2025 07:03

h-a-n-a mentioned this pull request Aug 20, 2025

fix(minifier): fix string concatenation for lone surrogates oxc-project/oxc#13229

Open

CPunisher changed the title ~~fix: fix unicode lone surrogates handling~~ fix: unicode lone surrogates handling Aug 22, 2025

kdy1 requested a review from Copilot August 23, 2025 22:47

Copilot AI reviewed Aug 23, 2025

kdy1 requested a review from Copilot August 24, 2025 01:37

Copilot AI reviewed Aug 24, 2025

h-a-n-a requested a review from Copilot August 25, 2025 03:22

Copilot AI reviewed Aug 25, 2025

kdy1 changed the title ~~fix: unicode lone surrogates handling~~ fix(es/ast): Fix unicode lone surrogates handling Aug 31, 2025

kdy1 requested changes Sep 1, 2025

Cargo.lock Outdated

kdy1 added this to the Planned milestone Sep 1, 2025

fix: try to fix binding_core_node bundle

120c4fb

kdy1 reviewed Sep 4, 2025

kdy1 self-assigned this Sep 4, 2025

Create metal-radios-swim.md

634c47b

chore: restore decorator-tests

bad060c

kdy1 approved these changes Sep 5, 2025

kdy1 merged commit 0557609 into swc-project:main Sep 5, 2025
175 checks passed

h-a-n-a deleted the fix-unicode-escape branch September 5, 2025 07:30

kdy1 added a commit that referenced this pull request Sep 5, 2025

Revert "fix(es/ast): Fix unicode lone surrogates handling (#10987)"

ac9ef55

This reverts commit 0557609.

kdy1 mentioned this pull request Sep 5, 2025

Revert "fix(es/ast): Fix unicode lone surrogates handling" #11063

Merged

kdy1 added a commit that referenced this pull request Sep 5, 2025

Revert "fix(es/ast): Fix unicode lone surrogates handling" (#11063)

d2baf6a

Reverts #10987

kdy1 mentioned this pull request Sep 6, 2025

refactor(es/ast): Cherry-pick #10763 #11060

Merged

claude bot mentioned this pull request Sep 6, 2025

feat(es/ast): Reapply #10987 #11066

Merged

kdy1 added a commit that referenced this pull request Sep 6, 2025

feat(es/ast): Reapply #10987 (#11066)

1db02a1

github-actions bot modified the milestones: Planned, 1.13.7 Sep 19, 2025

h-a-n-a mentioned this pull request Sep 23, 2025

feat(hstr): Introduce Wtf8Atom #11104

Merged

kdy1 pushed a commit that referenced this pull request Sep 23, 2025

feat(hstr): Introduce Wtf8Atom (#11104)

8cfd47b

**Description:** Continue from #11085 This PR adds `Wtf8Atom` to represent unpaired surrogates (i.e. lone surrogates) in Rust. **Related issue:** Reimplemented a part of #10987
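For context on why a WTF-8 based type can hold lone surrogates, the sketch below encodes a code point with the generalized UTF-8 scheme that WTF-8 builds on. It is an illustration only and not the `hstr` API: the function name is hypothetical, and a full WTF-8 implementation must additionally ensure that a high/low surrogate pair is stored as a single four-byte sequence rather than two three-byte ones.

```rust
/// Hypothetical helper: encodes a code point as generalized UTF-8, so that
/// surrogates (U+D800..=U+DFFF) get a three-byte form like any other code
/// point in that range. Returns `None` above U+10FFFF.
fn encode_generalized_utf8(code_point: u32) -> Option<Vec<u8>> {
    let cp = code_point;
    let bytes = match cp {
        0x0000..=0x007F => vec![cp as u8],
        0x0080..=0x07FF => vec![0xC0 | (cp >> 6) as u8, 0x80 | (cp & 0x3F) as u8],
        // This arm deliberately includes the surrogate range.
        0x0800..=0xFFFF => vec![
            0xE0 | (cp >> 12) as u8,
            0x80 | ((cp >> 6) & 0x3F) as u8,
            0x80 | (cp & 0x3F) as u8,
        ],
        0x1_0000..=0x10_FFFF => vec![
            0xF0 | (cp >> 18) as u8,
            0x80 | ((cp >> 12) & 0x3F) as u8,
            0x80 | ((cp >> 6) & 0x3F) as u8,
            0x80 | (cp & 0x3F) as u8,
        ],
        _ => return None,
    };
    Some(bytes)
}
```

Under this scheme U+D800 becomes the bytes `ED A0 80`. The standard `String::from_utf8` rejects exactly that sequence, which is why a separate atom type is needed instead of a plain Rust string.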

swc-project locked as resolved and limited conversation to collaborators Oct 19, 2025
