fix(es/ast): Fix unicode lone surrogates handling #10987
Thank you so much! Actually, I tried to fix this several times but it was very confusing :( |
|
@kdy1, Hi!
EDIT: Added support for |
|
Do we really need to change AST? |
|
Actually, changing the AST is not allowed in our case, because for v2 we are going to align the AST with babel or typescript-eslint
Hi! @kdy1 For example, for
We could probably just change the way it serializes and deserializes to align the AST with them? Adding another layer seems necessary to me, as there are some differences between Rust and JavaScript that I mentioned above. |
|
I'd appreciate a reference to Oxc because I assume the code is based on my comment #10978 (comment), it took us a lot of time to understand the problem and then make the right fix.
You probably want to bend the rule here, because the |
Will do! Really appreciate the job y'all have done ;-) |
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
|
@kdy1 Copilot seems broken with this huge code change ;-(. |
kdy1
left a comment
cargo denies publishing if there's a cycle
kdy1
left a comment
You deleted ./crates/swc_ecma_transforms_proposal/tests/decorator-tests
Please restore it
This one
Fixed. Didn't know why I would ever delete that 🥲
|
The submodule change is the last request 👍.
cc @hardfist to align the time to publish this, as this is a Wasm-ABI breaking change. I'll also talk with the next.js team |
|
We can merge this breaking change in Rspack 1.6 (web-infra-dev/rspack#11554). I'm not sure whether we can make it compatible with code that doesn't have lone surrogates |
|
Thanks for reviewing this! ❤️ |
|
I'll revert this PR as requested by @CPunisher. Can you rebase this PR and resend to |
|
I'll do it by myself |
**Description:**

This PR fixes an issue with lone surrogate handling in Rust.
All credit for this fix goes to the Oxc team swc-project#10978 (comment). I am porting the fix that was made in Oxc and making it work in SWC.

### Problem:

The problem stems from the fundamental difference between how Rust and JavaScript handle Unicode, especially lone surrogates.

**JavaScript's Unicode Model**

```javascript
// JavaScript allows this - lone surrogates are stored in UTF-16
let str = "\uD800"; // High surrogate alone - technically invalid Unicode
let obj = { "\uD800": "value" }; // Works fine in JS
```

JavaScript uses UTF-16 internally and tolerates invalid Unicode sequences:

- Strings are UTF-16 code unit sequences, not Unicode scalar sequences
- Lone surrogates (U+D800-U+DFFF) are allowed and preserved
- No validation that surrogates come in proper high/low pairs
- Engine just stores the raw UTF-16 code units

**Rust's Unicode Model**

```rust
// This CANNOT exist in Rust:
let s = "\u{D800}"; // ❌ COMPILE ERROR - not a valid Unicode scalar
let c: char = '\u{D800}'; // ❌ COMPILE ERROR - char excludes surrogates
```

Rust enforces strict Unicode validity:

- `String` is UTF-8 and must contain valid Unicode scalar values
- `char` represents Unicode scalar values (U+0000-U+D7FF, U+E000-U+10FFFF)
- Surrogate code points (U+D800-U+DFFF) are explicitly excluded
- No way to represent lone surrogates in Rust's standard string types

### Key Changes:

1. AST Structure: Added a `lone_surrogates: bool` field to the `Str` and `TplElement` structs to track when strings contain lone surrogates
2. Encoding Strategy: Lone surrogates are encoded using `\u{FFFD}` (replacement character) followed by the original hex digits for internal representation
3. Code Generation: Modified string output to properly escape lone surrogates back to `\uXXXX` format during codegen
4. Test: Also fixed some cases related to member expression optimizations and string concatenation optimizations

### TODOs:

1. Add support for serializing and deserializing literals with lone surrogates in `swc_estree_compat`
2. Reflect AST changes in `binding` crates

### Breaking changes:

Breaks the AST by adding a `lone_surrogates` field to `Str` and `TplElement`, and changes the contents of `value` in `Str` and `cooked` in `TplElement`. Both fields use `\u{FFFD}` (Replacement Character) as an escape when `lone_surrogates` is set to `true`.

To consume the real value, first check whether `lone_surrogates` is `true`, then unescape it by removing the replacement character and reconstructing the code unit from the four trailing hex digits (from `\u{FFFD}D800` to `\uD800`).

**Related issue:**

- Closes swc-project#10978
- Closes swc-project#10353

Fixed a regression of swc-project#7678
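
The Rust side of the problem can also be observed at runtime. The following is a minimal sketch using only the standard library; the helper names `inspect_utf16` and `is_scalar_value` are hypothetical and exist only for illustration, they are not part of SWC or Oxc.

```rust
/// Hypothetical helper for illustration; not part of SWC or Oxc.
/// Returns whether strict UTF-16 -> String conversion succeeds,
/// together with the result of the lossy conversion.
fn inspect_utf16(units: &[u16]) -> (bool, String) {
    // Strict conversion fails as soon as a surrogate is unpaired.
    let strict_ok = String::from_utf16(units).is_ok();
    // Lossy conversion replaces every unpaired surrogate with U+FFFD.
    let lossy = String::from_utf16_lossy(units);
    (strict_ok, lossy)
}

/// Hypothetical helper for illustration.
/// `char` cannot hold a surrogate code point, so `char::from_u32`
/// returns `None` for anything in U+D800..=U+DFFF.
fn is_scalar_value(code_point: u32) -> bool {
    char::from_u32(code_point).is_some()
}
```

With these helpers, a lone high surrogate (`0xD800`) and a lone low surrogate (`0xDC00`) both come out of the lossy conversion as the same single U+FFFD, so the original code unit is no longer recoverable. That loss is presumably why the encoding strategy described above keeps the original hex digits after the replacement character.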

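
The escape scheme from the breaking-changes note (U+FFFD followed by four hex digits, guarded by the `lone_surrogates` flag) can be sketched as a round trip between UTF-16 code units and a Rust `String`. This is an illustration under stated assumptions, not the code from this PR: `encode_from_utf16` and `decode_to_utf16` are hypothetical names, and escaping a literal U+FFFD as `\u{FFFD}FFFD` inside a flagged string is this sketch's own choice to keep decoding unambiguous, since the description does not say how SWC treats that case.

```rust
/// Hypothetical encoder: turns UTF-16 code units into the escaped form
/// described in the PR, returning `(value, lone_surrogates)`.
fn encode_from_utf16(units: &[u16]) -> (String, bool) {
    // Decode once, keeping unpaired surrogates as `Err(code_unit)`.
    let decoded: Vec<Result<char, u16>> = std::char::decode_utf16(units.iter().copied())
        .map(|r| r.map_err(|e| e.unpaired_surrogate()))
        .collect();
    let lone_surrogates = decoded.iter().any(|r| r.is_err());

    let mut value = String::new();
    for item in decoded {
        match item {
            // Assumption of this sketch: in a flagged string, a literal
            // U+FFFD is escaped too, so the decoder can tell it apart.
            Ok('\u{FFFD}') if lone_surrogates => value.push_str("\u{FFFD}FFFD"),
            Ok(ch) => value.push(ch),
            Err(unit) => {
                value.push('\u{FFFD}');
                value.push_str(&format!("{:04X}", unit));
            }
        }
    }
    (value, lone_surrogates)
}

/// Hypothetical decoder: recovers the UTF-16 code units a consumer
/// would need, checking the flag first as the PR description advises.
fn decode_to_utf16(value: &str, lone_surrogates: bool) -> Vec<u16> {
    if !lone_surrogates {
        // No escapes present: the string is already the real value.
        return value.encode_utf16().collect();
    }
    let mut units = Vec::new();
    let mut buf = [0u16; 2];
    let mut rest = value;
    while let Some(ch) = rest.chars().next() {
        rest = &rest[ch.len_utf8()..];
        if ch == '\u{FFFD}' {
            // Read the four trailing hex digits, if they are there.
            if let Some(hex) = rest.get(..4) {
                if hex.bytes().all(|b| b.is_ascii_hexdigit()) {
                    if let Ok(unit) = u16::from_str_radix(hex, 16) {
                        units.push(unit);
                        rest = &rest[4..];
                        continue;
                    }
                }
            }
        }
        // Ordinary character (or an unescaped U+FFFD): emit as UTF-16.
        units.extend_from_slice(ch.encode_utf16(&mut buf));
    }
    units
}
```

For example, the code units `[0x61, 0xD800, 0x62]` encode to the string `a\u{FFFD}D800b` with the flag set to `true`, and decoding that pair yields the original three code units again. A string without lone surrogates passes through unchanged with the flag set to `false`, which matches the description's advice to check the flag before unescaping.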