Skip to content

fix: completing the InternalEncoder for utf8#282

Merged
ashtuchkin merged 10 commits intopillarjs:masterfrom
yosion-p:internalCodec
Sep 18, 2021
Merged

fix: completing the InternalEncoder for utf8#282
ashtuchkin merged 10 commits intopillarjs:masterfrom
yosion-p:internalCodec

Conversation

@yosion-p
Copy link
Contributor

@yosion-p yosion-p commented Sep 13, 2021

Hi bro,It's me, again.🤣
I have completed the InternalEncoder for utf8 with surrogates after I read what you meant(#275 ).
#275 looked a bit messy, so I submitted a brand new one,
I didn't know much before, but now I know more.
I think surrogates should be in pairs always.
Thank you for your patience.

Copy link
Contributor

@ashtuchkin ashtuchkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good! Could you add more tests please (see comment)?

Copy link
Contributor

@ashtuchkin ashtuchkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments below.

The behavior we want is to be always equivalent to what Node does, independently of where chunk breaks happen. More formally, for every string S and for every split of S into chunks, the merged output of iconv-lite encoding must be equal to Node's encoding of the whole string (Buffer.from(S, "utf8")).

@yosion-p
Copy link
Contributor Author

I think I got it,

the merged output of iconv-lite encoding must be equal to Node's encoding of the whole string (Buffer.from(S, "utf8")).

sorry,I thought incomplete surrogate pairs were invalid and could be filtered out.
I will correct all outputs.

But about that:

a loop over the whole string is pretty inefficient. Moreover, you don't need to touch internal surrogates - Node processes them correctly already. We only need to adjust boundaries, as these are what Node can't adjust as it doesn't know about them.

I don't really understand,
Because if I don't loop through the string, I won't know which character in the string needs to be processed.

@yosion-p
Copy link
Contributor Author

yosion-p commented Sep 15, 2021

image

I was wondering if there was something wrong with the test method,not Encoder.

Just like I thought before,
If UT into multiple strings in the input,IconvLiteEncoderStream.prototype._transform will be invoked many times, back to characters be cut by Encoder rather than a complete surrogates pair.

If we modify InternalEncoder,
When the surrogates pair was cut,
The first call doesn't return anything, only the second one returns a full surrogates pair,

That's why I need to go through a loop and check for characters one by one to see if it is a surrogate.

@yosion-p
Copy link
Contributor Author

I think I can solve this problem in InternalEncoderUtf8.prototype.end(),
I will try.

@yosion-p
Copy link
Contributor Author

Hi bro,It's me, again.🤣
I did it by modifying the code of InternalEncoderUtf8.prototype.write.
In a word, less is more.

Copy link
Contributor

@ashtuchkin ashtuchkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting close, left a couple more comments.

@yosion-p
Copy link
Contributor Author

have a look😃

@yosion-p
Copy link
Contributor Author

Aha i tried to make it short, but didn't work it out.

nit: typo in work "surrogates"

Copy link
Contributor

@ashtuchkin ashtuchkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there, thank you for persistence!

@yosion-p
Copy link
Contributor Author

Almost there, thank you for persistence!

Victory lies ahead!

it("a single string", checkEncodeStream({
encoding: "utf8",
input: ["\uD83D\uDE3B"]
}))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you explain why you've decided to get back to calling checkEncodeStream directly? Did my previous suggestion not work?

str = this.lowSurrogate + str
this.lowSurrogate = ''
}
return Buffer.from(str, this.enc);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you're returning Buffer.from in the end, this is good.
The lastr handling became more messy though. I thought we could do something like this:

InternalEncoderUtf8.prototype.write = function (str) {
    if (!str) return;

    if (this.lowSurrogate) {
        str = this.lowSurrogate + str;
        this.lowSurrogate = '';
    }

    var lastCharCode = str.charCodeAt(str.length-1);
    if (0xD800 < lastCharCode && lastCharCode <= 0x...) {
        this.lowSurrogate = str.slice(str.length - 1);
        str = str.slice(0, str.length - 1)
    }

    return Buffer.from(str, this.enc);
}

@yosion-p
Copy link
Contributor Author

yosion-p commented Sep 18, 2021

Hi Buddy,

I see all the tests have the same structure ....

Now i realize actually you made it very clear, it's me misunderstand it, I'm terribly sorry…

make sure the function does not throw if str is empty.

About what you said,
I have my own interpretation,
I added one condition judgment was also in the first line of InternalEncoderUtf8.prototype.write , then I removed.

I think we still need to consider the incoming string is empty, as internal. js line: 81 InternalEncoder.prototype.write,
otherwise UT will report an error, like I just submitted for the first time.

Here's what I thought I'd do:

if (str) {
    if (this.lowSurrogate) {
        str = this.lowSurrogate + str;
        this.lowSurrogate = '';
    }

    var charCode = str.charCodeAt(str.length - 1);
    if (55296 < charCode && charCode <= 56319) {
        this.lowSurrogate = str[str.length - 1];
        str = str.slice(0, str.length - 1);
    }
}
return Buffer.from(str, this.enc);

But it has nested conditional judgments,
so, I did something like this:

if (!str) return Buffer.from('', this.enc);;

if (this.lowSurrogate) {
    str = this.lowSurrogate + str;
    this.lowSurrogate = '';
}

var charCode = str.charCodeAt(str.length - 1);
if (55296 < charCode && charCode <= 56319) {
    this.lowSurrogate = str[str.length - 1];
    str = str.slice(0, str.length - 1);
}

return Buffer.from(str, this.enc);

Although there are two Buffer.from in it.

@ashtuchkin
Copy link
Contributor

I agree, we need to return an empty buffer, not undefined, when str is empty.

I've made some final minor adjustments (e.g. it's actually a high surrogate, not low :) and will merge. Thank you for your contribution!

@ashtuchkin ashtuchkin merged commit 1d8d89e into pillarjs:master Sep 18, 2021
@yosion-p
Copy link
Contributor Author

yosion-p commented Sep 19, 2021 via email

@ashtuchkin
Copy link
Contributor

You did great, keep it up! It all comes with practice, don't worry too much.

@bjohansebas bjohansebas mentioned this pull request Aug 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants