The behavior for unassigned codepoint of Shift_JIS is incompatible with WHATWG spec #43962

Closed
cola119 opened this issue Jul 23, 2022 · 2 comments · Fixed by #43999
Labels
confirmed-bug: Issues with confirmed bugs.
encoding: Issues and PRs related to the TextEncoder and TextDecoder APIs.

Comments

@cola119
Member

cola119 commented Jul 23, 2022

Version

v18.5.0

Platform

No response

Subsystem

No response

What steps will reproduce the bug?

const decoder = new TextDecoder('Shift_JIS');
const s = decoder.decode(new Uint8Array([255]));

How often does it reproduce? Is there a required condition?

Always

What is the expected behavior?

const decoder = new TextDecoder('Shift_JIS');
const s = decoder.decode(new Uint8Array([255]));
console.log(s) // '�' === '\ufffd'

According to the WHATWG Encoding Standard, a decoder should emit � (U+FFFD) when it encounters an unassigned code point during decoding, unless it was created with the fatal option.
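
For comparison, here is a minimal sketch of the two error modes using UTF-8, where the byte 0xFF is never valid. It only illustrates the U+FFFD rule described above and does not touch the Shift_JIS code path; the variable names are illustrative.

```javascript
// Sketch: the replacement and fatal error modes, shown with UTF-8.
const bytes = new Uint8Array([0xff]);

// Default error mode: the invalid byte is replaced with U+FFFD.
const lenient = new TextDecoder('utf-8');
const replaced = lenient.decode(bytes);
console.log(replaced === '\ufffd');

// Fatal error mode: the same input throws a TypeError instead.
const strict = new TextDecoder('utf-8', { fatal: true });
let threw = false;
try {
  strict.decode(bytes);
} catch (err) {
  threw = err instanceof TypeError;
}
console.log(threw);
```

The expectation in this issue is that a non-fatal Shift_JIS decoder behaves like the first case.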

What do you see instead?

const decoder = new TextDecoder('Shift_JIS');
const s = decoder.decode(new Uint8Array([255]));
console.log(s) // '\x1A'

From my investigation, ICU intentionally uses \x1A as the substitution character for unassigned code points in the Shift_JIS encoding, and Node.js passes it through as is.
Conversion Data - ICU Documentation
Which substitution character is used if a character cannot be converted?
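
Since � and \x1A are easy to confuse in terminal output, a small helper that prints code points can make the difference explicit. This is only a sketch: `toCodePoints` is a hypothetical helper name, and the try/catch assumes that a build without Shift_JIS support rejects the label with an exception.

```javascript
// Hypothetical helper: render each code point of a string as U+XXXX.
function toCodePoints(str) {
  return Array.from(str, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
}

// Decode the byte from the report and show which code point comes back.
try {
  const decoded = new TextDecoder('Shift_JIS').decode(new Uint8Array([255]));
  console.log(toCodePoints(decoded));
} catch (err) {
  // Assumption: builds without this encoding throw when the decoder is created.
  console.log('Shift_JIS is not available in this build:', err.message);
}
```

Going by the report above, an affected build would show U+001A here, while a spec-compliant one would show U+FFFD.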

Additional information

ICU provides the utility ucnv_setSubstChars to specify the substitution character for any encoding, and Node.js already has it available in its library. I'm working on this.

@daeyeon daeyeon added the encoding label Jul 23, 2022
@cola119 cola119 changed the title from "The behavior for unassigned codepoint of Shift_JIS is Incompatible with WHATWG spec" to "The behavior for unassigned codepoint of Shift_JIS is incompatible with WHATWG spec" Jul 23, 2022
@hemanth
Contributor

hemanth commented Jul 23, 2022

Able to reproduce this on v19.0.0-pre:

Welcome to Node.js v19.0.0-pre.
Type ".help" for more information.
> const decoder = new TextDecoder('Shift_JIS');
> const s = decoder.decode(new Uint8Array([255]));
> s
'\x1A'

@cola119 are you looking into ucnv.cpp for the fix?

@cola119
Member Author

cola119 commented Jul 24, 2022

@hemanth
I'm thinking ConverterObject can set ? as the substitution character explicitly, since node::Converter already has a method to change it.

node/src/node_i18n.cc

Lines 370 to 377 in 7ef069e

void Converter::set_subst_chars(const char* sub) {
  CHECK(conv_);
  UErrorCode status = U_ZERO_ERROR;
  if (sub != nullptr) {
    ucnv_setSubstChars(conv_.get(), sub, strlen(sub), &status);
    CHECK(U_SUCCESS(status));
  }
}
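
To check from the JavaScript side whether a build still leaks ICU's substitution character, something like the following scan could be used. It is a rough sketch rather than the test that accompanies the fix: `findSubstitutedBytes` is a hypothetical name, and it only probes single-byte inputs.

```javascript
// Hypothetical check: list single bytes (other than 0x1A itself) that a
// decoder turns into U+001A, i.e. inputs where a substitution character
// leaked through instead of U+FFFD.
function findSubstitutedBytes(label) {
  const decoder = new TextDecoder(label);
  const leaked = [];
  for (let byte = 0; byte <= 0xff; byte++) {
    if (byte === 0x1a) continue; // 0x1A legitimately decodes to U+001A
    // Without { stream: true }, each call treats the byte as complete input.
    if (decoder.decode(new Uint8Array([byte])) === '\x1A') {
      leaked.push(byte);
    }
  }
  return leaked;
}

// UTF-8 never produces U+001A for any other byte, so this prints [].
console.log(findSubstitutedBytes('utf-8'));
```

Calling `findSubstitutedBytes('Shift_JIS')` on an affected build should include 255, per the reproduction above; with the substitution handled correctly I would expect an empty list.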

@aduh95 aduh95 added the confirmed-bug label Jul 24, 2022
nodejs-github-bot pushed a commit that referenced this issue Jul 29, 2022
PR-URL: #43999
Fixes: #43962
Reviewed-By: Antoine du Hamel
Reviewed-By: Mohammed Keyvanzadeh
Reviewed-By: Darshan Sen
Reviewed-By: LiviaMedeiros
Reviewed-By: Feng Yu
danielleadams pushed a commit that referenced this issue Aug 16, 2022
ruyadorno pushed a commit that referenced this issue Aug 23, 2022
targos pushed a commit that referenced this issue Sep 5, 2022
Fyko pushed a commit to Fyko/node that referenced this issue Sep 15, 2022
juanarbol pushed a commit that referenced this issue Oct 10, 2022
juanarbol pushed a commit that referenced this issue Oct 11, 2022
guangwong pushed a commit to noslate-project/node that referenced this issue Jan 3, 2023
4 participants