fix: completing the InternalEncoder for utf8 #282
ashtuchkin merged 10 commits into pillarjs:master from
Conversation
ashtuchkin
left a comment
Looks pretty good! Could you add more tests please (see comment)?
ashtuchkin
left a comment
See comments below.
The behavior we want is to be always equivalent to what Node does, independently of where chunk breaks happen. More formally, for every string S and for every split of S into chunks, the merged output of iconv-lite encoding must be equal to Node's encoding of the whole string (Buffer.from(S, "utf8")).
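To make that requirement concrete, here is a minimal sketch of a brute-force check in plain Node.js. The helper names (`allSplits`, `findBadSplits`, `naiveEncodeChunks`) are made up for illustration and are not part of iconv-lite or its test suite. It enumerates every split of a short string and reports the splits where chunk-by-chunk encoding differs from `Buffer.from(S, "utf8")`; the naive baseline is only there to show how the property can fail.

```javascript
// Enumerate every way to cut `s` into consecutive non-empty chunks
// (cuts are made between UTF-16 code units, so a surrogate pair can be split).
function allSplits(s) {
  if (s.length <= 1) return [[s]];
  const cuts = s.length - 1;
  const result = [];
  for (let mask = 0; mask < (1 << cuts); mask++) {
    const chunks = [];
    let start = 0;
    for (let i = 0; i < cuts; i++) {
      if (mask & (1 << i)) {
        chunks.push(s.slice(start, i + 1));
        start = i + 1;
      }
    }
    chunks.push(s.slice(start));
    result.push(chunks);
  }
  return result;
}

// Return the splits for which `encodeChunks(chunks)` does not match
// Node's encoding of the whole string.
function findBadSplits(s, encodeChunks) {
  const expected = Buffer.from(s, "utf8");
  return allSplits(s).filter((chunks) => !encodeChunks(chunks).equals(expected));
}

// Naive baseline (hypothetical): encode each chunk on its own and concatenate.
function naiveEncodeChunks(chunks) {
  return Buffer.concat(chunks.map((c) => Buffer.from(c, "utf8")));
}

const bad = findBadSplits("a\uD83D\uDE3Bb", naiveEncodeChunks);
console.log(bad.length + " failing splits");
```

For `"a\uD83D\uDE3Bb"` the naive baseline fails on exactly the splits that cut between `\uD83D` and `\uDE3B`, because Node encodes each lone surrogate as U+FFFD. The number of splits grows as 2^(n-1), so this only suits short test strings.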
|
I think I got it.
Sorry, I thought incomplete surrogate pairs were invalid and could be filtered out. But about that:
I don't really understand, |
|
I was wondering if there was something wrong with the test method, not the Encoder. Just like I thought before, if we modify InternalEncoder, that's why I need to go through a loop and check the characters one by one to see whether each is a surrogate. |
|
I think I can solve this problem in |
|
Hi bro, it's me again. 🤣 |
ashtuchkin
left a comment
Getting close, left a couple more comments.
|
Have a look 😃 |
|
Aha, I tried to make it short, but couldn't work it out.
|
ashtuchkin
left a comment
Almost there, thank you for persistence!
Victory lies ahead! |
test/streams-test.js
Outdated
it("a single string", checkEncodeStream({
    encoding: "utf8",
    input: ["\uD83D\uDE3B"]
}))
Could you explain why you've decided to get back to calling checkEncodeStream directly? Did my previous suggestion not work?
    str = this.lowSurrogate + str
    this.lowSurrogate = ''
}
return Buffer.from(str, this.enc);
I see you're returning Buffer.from in the end, this is good.
The lastr handling became more messy though. I thought we could do something like this:
InternalEncoderUtf8.prototype.write = function (str) {
    if (!str) return;
    if (this.lowSurrogate) {
        str = this.lowSurrogate + str;
        this.lowSurrogate = '';
    }
    var lastCharCode = str.charCodeAt(str.length-1);
    if (0xD800 < lastCharCode && lastCharCode <= 0x...) {
        this.lowSurrogate = str.slice(str.length - 1);
        str = str.slice(0, str.length - 1)
    }
    return Buffer.from(str, this.enc);
}
|
Hi Buddy,
Now I realize you actually made it very clear; it was me who misunderstood it. I'm terribly sorry…
About what you said, I think we still need to consider the case where the incoming string is empty, as
Here's what I thought I'd do:
if (str) {
    if (this.lowSurrogate) {
        str = this.lowSurrogate + str;
        this.lowSurrogate = '';
    }
    var charCode = str.charCodeAt(str.length - 1);
    if (55296 < charCode && charCode <= 56319) {
        this.lowSurrogate = str[str.length - 1];
        str = str.slice(0, str.length - 1);
    }
}
return Buffer.from(str, this.enc);
But it has nested conditional judgments,
if (!str) return Buffer.from('', this.enc);
if (this.lowSurrogate) {
    str = this.lowSurrogate + str;
    this.lowSurrogate = '';
}
var charCode = str.charCodeAt(str.length - 1);
if (55296 < charCode && charCode <= 56319) {
    this.lowSurrogate = str[str.length - 1];
    str = str.slice(0, str.length - 1);
}
return Buffer.from(str, this.enc);
Although there are two |
|
I agree, we need to return an empty buffer, not undefined, when str is empty. I've made some final minor adjustments (e.g. it's actually a high surrogate, not low :) and will merge. Thank you for your contribution! |
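The two points above (always return a Buffer, and the held-back code unit being a high surrogate) can be sketched as a small standalone class. This is only an illustration under stated assumptions, not the code that was merged: the class name `Utf8ChunkEncoder`, the `highSurrogate` field, the `end()` flush, and the inclusive range 0xD800 to 0xDBFF are all choices made for this sketch.

```javascript
// Hypothetical standalone sketch; not iconv-lite's actual InternalEncoderUtf8.
class Utf8ChunkEncoder {
  constructor() {
    // A trailing high surrogate (0xD800..0xDBFF) held back from the previous chunk.
    this.highSurrogate = "";
  }

  write(str) {
    // Re-attach the code unit held back from the previous chunk, if any.
    if (this.highSurrogate) {
      str = this.highSurrogate + str;
      this.highSurrogate = "";
    }
    if (str.length > 0) {
      const last = str.charCodeAt(str.length - 1);
      // Assumption: the lower bound is inclusive, so 0xD800 itself is held back too.
      if (last >= 0xd800 && last <= 0xdbff) {
        this.highSurrogate = str[str.length - 1];
        str = str.slice(0, -1);
      }
    }
    // Always a Buffer, possibly empty, never undefined.
    return Buffer.from(str, "utf8");
  }

  end() {
    // A high surrogate with no partner is encoded the way Node encodes
    // a lone surrogate (as U+FFFD).
    const rest = this.highSurrogate;
    this.highSurrogate = "";
    return Buffer.from(rest, "utf8");
  }
}

// Helper for illustration: feed chunks through the encoder and merge the output.
function encodeChunks(chunks) {
  const enc = new Utf8ChunkEncoder();
  const parts = chunks.map((c) => enc.write(c));
  parts.push(enc.end());
  return Buffer.concat(parts);
}

console.log(encodeChunks(["\uD83D", "\uDE3B"]).toString("hex"));
```

Holding back only the last code unit is enough because a surrogate pair spans exactly two UTF-16 code units, so a chunk boundary can separate at most one high surrogate from its partner.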
|
Thank you, buddy! It's a milestone for me: the first time my code got merged into an open source project.
Even though my starting point and idea were right from the beginning, the process was kind of tortuous.
Maybe I need to pay more attention to the format and performance of the code,
and try to get your full point, lol.
In short, thanks a lot for your patience and encouragement. Respect!
Best Regards,
Yosion.
At 2021-09-19 01:55:57, "Alexander Shtuchkin" ***@***.***> wrote:
I agree, we need to return an empty buffer, not undefined, when str is empty.
I've made some final minor adjustments (e.g. it's actually a high surrogate, not low :) and will merge. Thank you for your contribution!
|
|
You did great, keep it up! It all comes with practice, don't worry too much. |

Hi bro, it's me again. 🤣
I have completed the InternalEncoder for utf8 with surrogates after I read what you meant (#275). #275 looked a bit messy, so I submitted a brand new one.
I didn't know much before, but now I know more.
I think surrogates should always come in pairs.
Thank you for your patience.