fix: completing the InternalEncoder for utf8 by yosion-p · Pull Request #282 · pillarjs/iconv-lite

yosion-p · 2021-09-13T08:56:43Z

Hi bro, it's me again. 🤣
I have completed the InternalEncoder for utf8 with surrogates after reading what you meant (#275).
#275 looked a bit messy, so I submitted a brand new PR.
I didn't know much before, but now I know more.
I think surrogates should always come in pairs.
Thank you for your patience.

ashtuchkin

Looks pretty good! Could you add more tests please (see comment)?

test/streams-test.js

ashtuchkin

See comments below.

The behavior we want is to be always equivalent to what Node does, independently of where chunk breaks happen. More formally, for every string S and for every split of S into chunks, the merged output of iconv-lite encoding must be equal to Node's encoding of the whole string (Buffer.from(S, "utf8")).
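
The requirement above can be checked mechanically. Here is a minimal sketch, assuming a hypothetical `naiveEncodeChunks` function that stands in for an encoder under test (it is not part of iconv-lite). It tries every two-way split of a string and compares the merged chunk output with Node's encoding of the whole string. The naive per-chunk encoder fails exactly when the split lands inside a surrogate pair.

```javascript
// Hypothetical stand-in for an encoder under test: encodes each chunk
// independently, with no state carried across chunk boundaries.
function naiveEncodeChunks(chunks) {
    return Buffer.concat(chunks.map(function (chunk) {
        return Buffer.from(chunk, "utf8");
    }));
}

// Returns the split positions at which the merged chunk output differs
// from Node's encoding of the whole string.
function findBadSplits(str, encodeChunks) {
    var expected = Buffer.from(str, "utf8");
    var bad = [];
    for (var i = 0; i <= str.length; i++) {
        var actual = encodeChunks([str.slice(0, i), str.slice(i)]);
        if (!actual.equals(expected)) bad.push(i);
    }
    return bad;
}

// "a", then U+1F63B as a surrogate pair, then "b".
var sample = "a\uD83D\uDE3Bb";
console.log(findBadSplits(sample, naiveEncodeChunks)); // [ 2 ]
```

Only two-way splits are enumerated here; a fuller check would also cover splits into three or more chunks.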

encodings/internal.js

test/streams-test.js

yosion-p · 2021-09-15T04:35:47Z

I think I got it,

the merged output of iconv-lite encoding must be equal to Node's encoding of the whole string (Buffer.from(S, "utf8")).

Sorry, I thought incomplete surrogate pairs were invalid and could be filtered out.
I will correct all outputs.

But about that:

a loop over the whole string is pretty inefficient. Moreover, you don't need to touch internal surrogates - Node processes them correctly already. We only need to adjust boundaries, as these are what Node can't adjust as it doesn't know about them.

I don't really understand this part.
If I don't loop through the string, I won't know which characters in the string need to be processed.
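
One way to see why no loop is needed: `Buffer.from` already encodes a surrogate pair correctly when both halves sit in the same chunk, so the only place a chunk needs special care is a high surrogate at its very end. A small sketch follows; the helper name `endsWithHighSurrogate` is made up for illustration.

```javascript
// Hypothetical helper: true when the last UTF-16 code unit of the chunk is
// a high surrogate (0xD800..0xDBFF), i.e. its pair may arrive in the next chunk.
function endsWithHighSurrogate(chunk) {
    if (chunk.length === 0) return false;
    var code = chunk.charCodeAt(chunk.length - 1);
    return 0xD800 <= code && code <= 0xDBFF;
}

// A pair inside one chunk needs no special handling.
console.log(Buffer.from("x\uD83D\uDE3By", "utf8")); // <Buffer 78 f0 9f 98 bb 79>

// Only a trailing high surrogate needs to be held back.
console.log(endsWithHighSurrogate("x\uD83D\uDE3By")); // false
console.log(endsWithHighSurrogate("x\uD83D"));        // true
```

The check reads a single code unit, so its cost does not depend on the chunk length.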

yosion-p · 2021-09-15T08:45:39Z

I was wondering whether the problem is in the test method rather than in the Encoder.

Just as I thought before:
if the unit test splits the input into multiple strings, IconvLiteEncoderStream.prototype._transform is invoked once per string, so the Encoder receives a surrogate pair that has been cut in two rather than a complete pair.

If we modify InternalEncoder for the case where a surrogate pair is cut,
the first call returns nothing and only the second call returns the full surrogate pair.

That's why I need to go through a loop and check the characters one by one to see whether each is a surrogate.

yosion-p · 2021-09-15T10:32:41Z

I think I can solve this problem in InternalEncoderUtf8.prototype.end(),
I will try.


yosion-p · 2021-09-16T01:48:43Z

Hi bro, it's me again. 🤣
I did it by modifying the code of InternalEncoderUtf8.prototype.write.
In a word, less is more.

ashtuchkin

Getting close, left a couple more comments.

encodings/internal.js

test/streams-test.js

yosion-p · 2021-09-16T07:04:30Z

have a look😃

encodings/internal.js

test/streams-test.js

yosion-p · 2021-09-16T09:13:49Z

Aha, I tried to make it shorter, but couldn't work it out.

nit: typo in word "surrogates"

ashtuchkin

Almost there, thank you for persistence!

encodings/internal.js

test/streams-test.js

yosion-p · 2021-09-17T03:59:39Z

Almost there, thank you for persistence!

Victory lies ahead!

ashtuchkin · 2021-09-17T14:32:00Z

test/streams-test.js

+    it("a single string", checkEncodeStream({
+        encoding: "utf8",
+        input: ["\uD83D\uDE3B"]
+    }))


Could you explain why you've decided to go back to calling checkEncodeStream directly? Did my previous suggestion not work?

ashtuchkin · 2021-09-17T14:45:55Z

encodings/internal.js

+        str = this.lowSurrogate + str
+        this.lowSurrogate = ''
+    }
+    return Buffer.from(str, this.enc);


I see you're returning Buffer.from in the end, this is good.
The lastr handling became more messy though. I thought we could do something like this:

InternalEncoderUtf8.prototype.write = function (str) {
    if (!str) return;
    if (this.lowSurrogate) {
        str = this.lowSurrogate + str;
        this.lowSurrogate = '';
    }
    var lastCharCode = str.charCodeAt(str.length - 1);
    // 0xD800 is itself a surrogate, so the lower bound is inclusive.
    if (0xD800 <= lastCharCode && lastCharCode <= 0x...) {
        this.lowSurrogate = str.slice(str.length - 1);
        str = str.slice(0, str.length - 1);
    }
    return Buffer.from(str, this.enc);
}

yosion-p · 2021-09-18T02:26:13Z

Hi Buddy,

I see all the tests have the same structure ....

Now I realize you actually made it very clear and I was the one who misunderstood. I'm terribly sorry…

make sure the function does not throw if str is empty.

About what you said, I have my own interpretation:
I had added a condition check on the first line of InternalEncoderUtf8.prototype.write, but then I removed it.

I think we still need to handle an empty incoming string, as InternalEncoder.prototype.write does (internal.js, line 81),
otherwise the unit tests will report an error, as happened with my first submission.

Here's what I thought I'd do:

if (str) {
    // Prepend the surrogate held back from the previous chunk.
    if (this.lowSurrogate) {
        str = this.lowSurrogate + str;
        this.lowSurrogate = '';
    }

    // 55296 = 0xD800 and 56319 = 0xDBFF; both bounds are inclusive.
    var charCode = str.charCodeAt(str.length - 1);
    if (55296 <= charCode && charCode <= 56319) {
        this.lowSurrogate = str[str.length - 1];
        str = str.slice(0, str.length - 1);
    }
}
return Buffer.from(str, this.enc);

But that has nested conditionals,
so I did something like this:

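// Return an empty Buffer (not undefined) for empty input.
if (!str) return Buffer.from('', this.enc);

// Prepend the surrogate held back from the previous chunk.
if (this.lowSurrogate) {
    str = this.lowSurrogate + str;
    this.lowSurrogate = '';
}

// 55296 = 0xD800 and 56319 = 0xDBFF; both bounds are inclusive.
var charCode = str.charCodeAt(str.length - 1);
if (55296 <= charCode && charCode <= 56319) {
    this.lowSurrogate = str[str.length - 1];
    str = str.slice(0, str.length - 1);
}

return Buffer.from(str, this.enc);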

Although it now has two Buffer.from calls.

ashtuchkin · 2021-09-18T17:55:47Z

I agree, we need to return an empty buffer, not undefined, when str is empty.

I've made some final minor adjustments (e.g. it's actually a high surrogate, not low :) and will merge. Thank you for your contribution!
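
For reference, the approach agreed on in this thread can be sketched as a small stateful encoder. This is an illustration under assumptions, not the merged code: the constructor name `Utf8ChunkEncoder` and the `highSurrogate` field are made up here, and flushing a dangling high surrogate in `end()` through `Buffer.from` is an assumption, chosen so that the output still matches Node's encoding of the whole string.

```javascript
// Hypothetical encoder illustrating the boundary handling discussed above.
function Utf8ChunkEncoder() {
    this.highSurrogate = "";
}

Utf8ChunkEncoder.prototype.write = function (str) {
    // Prepend a high surrogate held back from the previous chunk.
    if (this.highSurrogate) {
        str = this.highSurrogate + str;
        this.highSurrogate = "";
    }
    if (str.length > 0) {
        var code = str.charCodeAt(str.length - 1);
        if (0xD800 <= code && code <= 0xDBFF) {
            // Hold the trailing high surrogate until the next chunk arrives.
            this.highSurrogate = str[str.length - 1];
            str = str.slice(0, str.length - 1);
        }
    }
    // Always returns a Buffer, possibly empty, never undefined.
    return Buffer.from(str, "utf8");
};

Utf8ChunkEncoder.prototype.end = function () {
    // Assumption: flush a dangling high surrogate the way Node would encode it.
    var rest = this.highSurrogate;
    this.highSurrogate = "";
    return Buffer.from(rest, "utf8");
};

var enc = new Utf8ChunkEncoder();
var out = Buffer.concat([enc.write("a\uD83D"), enc.write("\uDE3Bb"), enc.end()]);
console.log(out.equals(Buffer.from("a\uD83D\uDE3Bb", "utf8"))); // true
```

Only the last code unit of each chunk is inspected, so the per-chunk overhead stays constant and surrogates inside a chunk are left to `Buffer.from`.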

yosion-p · 2021-09-19T14:43:00Z

Thank you, buddy! This is a milestone for me: the first time my code has been merged into an open source project. Even though my starting point and idea were right from the beginning, the process was rather tortuous. Maybe I need to pay more attention to the format and performance of the code, and try to fully get your point. lol

In short, thanks a lot for your patience and encouragement. Respect!

Best Regards, Yosion.

ashtuchkin · 2021-09-19T16:45:59Z

You did great, keep it up! It all comes with practice, don't worry too much.

fix: completing the InternalEncoder for utf8 with surrogates
fb6847f

ashtuchkin requested changes Sep 14, 2021
test/streams-test.js

update: add test cases,and pass it
17aa1ae

ashtuchkin requested changes Sep 14, 2021
encodings/internal.js
test/streams-test.js

update: correct surroagetes outputs and make InternalEncoder more efficient
ea2a267

ashtuchkin reviewed Sep 16, 2021
encodings/internal.js
test/streams-test.js
test/streams-test.js

fix: Modify boundary judgment conditions
a1496ef

ashtuchkin reviewed Sep 16, 2021
encodings/internal.js
test/streams-test.js
test/streams-test.js

yosion-p added 2 commits September 16, 2021 18:09

fix: to streamline the code
fba6f58

fix: to streamline the code
458f700

ashtuchkin reviewed Sep 16, 2021
encodings/internal.js
test/streams-test.js

fix: more streamlined code
8037e15

ashtuchkin reviewed Sep 17, 2021

yosion-p added 2 commits September 18, 2021 09:44

fix: Improve format of the code.
9b83784

fix: Returns an empty Buffer when a string is passed
d03e2ba

Minor adjustments
9e4888f

ashtuchkin merged commit 1d8d89e into pillarjs:master Sep 18, 2021

bjohansebas mentioned this pull request Aug 18, 2025
release: 0.7.0 #334 (Merged)
