Skip to content

Conversation

@zhichao-li
Copy link
Contributor

Only a trial thing, not sure if I understand correctly or not but I guess only 2 entries in bytesOfCodePointInUTF8 for the case of 6 bytes codepoint(1111110x) is enough.
Details can be found from https://en.wikipedia.org/wiki/UTF-8 in "Description" section.

@zhichao-li
Copy link
Contributor Author

cc @chenghao-intel @rxin

@zhichao-li zhichao-li changed the title [SQL]two extra useless entries for bytesOfCodePointInUTF8 [SPARK-9238][SQL]two extra useless entries for bytesOfCodePointInUTF8 Jul 22, 2015
@chenghao-intel
Copy link
Contributor

LGTM

@rxin
Copy link
Contributor

rxin commented Jul 22, 2015

cc @davies

@SparkQA
Copy link

SparkQA commented Jul 22, 2015

Test build #38026 has finished for PR 7582 at commit 8bddd01.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Copy link
Contributor

davies commented Jul 23, 2015

@zhichao-li Two entries are enough for correctness. 254 and 255 are invalid, using 6 for them are even confusing. Right now, we didn't handle invalid UTF8 well (right now, it may raise exception), we could improve that, should pay attention to reduce the overhead for correct one.

@zhichao-li
Copy link
Contributor Author

@davies, currently if the first byte is 254 or 255, numBytesForFirstByte would return 6 which somehow hide the invalid case not raising exception. What's your opinion? 1)Just stay the same 2)Explicitly raise exception here? or 3)Delete these two entries which would raise outOfRange exception when facing invalid char?

@davies
Copy link
Contributor

davies commented Jul 24, 2015

I think it's better to raise an exception than parse it in wrong way silently. If we want to have better behavior, then it should be done case by case for every function, it's not trivial to me. So I'd like to peek 3), not to have this two additional entries.

@zhichao-li zhichao-li changed the title [SPARK-9238][SQL]two extra useless entries for bytesOfCodePointInUTF8 [SPARK-9238][SQL] Remove two extra useless entries for bytesOfCodePointInUTF8 Jul 24, 2015
@zhichao-li
Copy link
Contributor Author

yeah, that's what this pr target to, I guess it's ready to be merged?

@davies
Copy link
Contributor

davies commented Jul 24, 2015

LGTM, merging this into master and 1.4!

@asfgit asfgit closed this in 846cf46 Jul 24, 2015
asfgit pushed a commit that referenced this pull request Jul 24, 2015
…intInUTF8

Only a trial thing, not sure if I understand correctly or not but I guess only 2 entries in `bytesOfCodePointInUTF8` for the case of 6 bytes codepoint(1111110x) is enough.
Details can be found from https://en.wikipedia.org/wiki/UTF-8 in "Description" section.

Author: zhichao.li <[email protected]>

Closes #7582 from zhichao-li/utf8 and squashes the following commits:

8bddd01 [zhichao.li] two extra entries

(cherry picked from commit 846cf46)
Signed-off-by: Davies Liu <[email protected]>

Conflicts:
	unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants