[SPARK-9238][SQL] Remove two extra useless entries for bytesOfCodePointInUTF8 #7582

zhichao-li · 2015-07-22T04:28:48Z

Only a trial thing, not sure if I understand correctly or not but I guess only 2 entries in bytesOfCodePointInUTF8 for the case of 6 bytes codepoint(1111110x) is enough.
Details can be found from https://en.wikipedia.org/wiki/UTF-8 in "Description" section.

zhichao-li · 2015-07-22T04:29:45Z

cc @chenghao-intel @rxin

chenghao-intel · 2015-07-22T04:46:39Z

LGTM

rxin · 2015-07-22T05:28:48Z

cc @davies

SparkQA · 2015-07-22T06:47:10Z

Test build #38026 has finished for PR 7582 at commit 8bddd01.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2015-07-23T15:40:34Z

@zhichao-li Two entries are enough for correctness. 254 and 255 are invalid, using 6 for them are even confusing. Right now, we didn't handle invalid UTF8 well (right now, it may raise exception), we could improve that, should pay attention to reduce the overhead for correct one.

zhichao-li · 2015-07-24T03:02:50Z

@davies, currently if the first byte is 254 or 255, numBytesForFirstByte would return 6 which somehow hide the invalid case not raising exception. What's your opinion? 1)Just stay the same 2)Explicitly raise exception here? or 3)Delete these two entries which would raise outOfRange exception when facing invalid char?

davies · 2015-07-24T04:52:04Z

I think it's better to raise an exception than parse it in wrong way silently. If we want to have better behavior, then it should be done case by case for every function, it's not trivial to me. So I'd like to peek 3), not to have this two additional entries.

zhichao-li · 2015-07-24T08:31:43Z

yeah, that's what this pr target to, I guess it's ready to be merged?

davies · 2015-07-24T15:34:32Z

LGTM, merging this into master and 1.4!

…intInUTF8 Only a trial thing, not sure if I understand correctly or not but I guess only 2 entries in `bytesOfCodePointInUTF8` for the case of 6 bytes codepoint(1111110x) is enough. Details can be found from https://en.wikipedia.org/wiki/UTF-8 in "Description" section. Author: zhichao.li <[email protected]> Closes #7582 from zhichao-li/utf8 and squashes the following commits: 8bddd01 [zhichao.li] two extra entries (cherry picked from commit 846cf46) Signed-off-by: Davies Liu <[email protected]> Conflicts: unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java

two extra entries

8bddd01

zhichao-li changed the title ~~[SQL]two extra useless entries for bytesOfCodePointInUTF8~~ [SPARK-9238][SQL]two extra useless entries for bytesOfCodePointInUTF8 Jul 22, 2015

zhichao-li changed the title ~~[SPARK-9238][SQL]two extra useless entries for bytesOfCodePointInUTF8~~ [SPARK-9238][SQL] Remove two extra useless entries for bytesOfCodePointInUTF8 Jul 24, 2015

asfgit closed this in 846cf46 Jul 24, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-9238][SQL] Remove two extra useless entries for bytesOfCodePointInUTF8 #7582

[SPARK-9238][SQL] Remove two extra useless entries for bytesOfCodePointInUTF8 #7582

Uh oh!

zhichao-li commented Jul 22, 2015

Uh oh!

zhichao-li commented Jul 22, 2015

Uh oh!

chenghao-intel commented Jul 22, 2015

Uh oh!

rxin commented Jul 22, 2015

Uh oh!

SparkQA commented Jul 22, 2015

Uh oh!

davies commented Jul 23, 2015

Uh oh!

zhichao-li commented Jul 24, 2015

Uh oh!

davies commented Jul 24, 2015

Uh oh!

zhichao-li commented Jul 24, 2015

Uh oh!

davies commented Jul 24, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

[SPARK-9238][SQL] Remove two extra useless entries for bytesOfCodePointInUTF8 #7582

[SPARK-9238][SQL] Remove two extra useless entries for bytesOfCodePointInUTF8 #7582

Uh oh!

Conversation

zhichao-li commented Jul 22, 2015

Uh oh!

zhichao-li commented Jul 22, 2015

Uh oh!

chenghao-intel commented Jul 22, 2015

Uh oh!

rxin commented Jul 22, 2015

Uh oh!

SparkQA commented Jul 22, 2015

Uh oh!

davies commented Jul 23, 2015

Uh oh!

zhichao-li commented Jul 24, 2015

Uh oh!

davies commented Jul 24, 2015

Uh oh!

zhichao-li commented Jul 24, 2015

Uh oh!

davies commented Jul 24, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants