[SPARK-9238][SQL] Remove two extra useless entries for bytesOfCodePointInUTF8 #7582
|
LGTM |
|
cc @davies |
|
Test build #38026 has finished for PR 7582 at commit
|
|
@zhichao-li Two entries are enough for correctness. 254 and 255 are invalid, using |
|
@davies, currently if the first byte is 254 or 255, |
|
I think it's better to raise an exception than to silently parse it the wrong way. If we want better behavior, it would have to be done case by case for every function, which doesn't look trivial to me. So I'd like to pick 3), i.e. not have these two additional entries. |
|
Yeah, that's what this PR is aiming for. I guess it's ready to be merged? |
|
LGTM, merging this into master and 1.4! |
…intInUTF8

Only a trial thing, not sure if I understand correctly or not but I guess only 2 entries in `bytesOfCodePointInUTF8` for the case of 6 bytes codepoint (1111110x) is enough. Details can be found from https://en.wikipedia.org/wiki/UTF-8 in "Description" section.

Author: zhichao.li <[email protected]>

Closes #7582 from zhichao-li/utf8 and squashes the following commits:

8bddd01 [zhichao.li] two extra entries

(cherry picked from commit 846cf46)
Signed-off-by: Davies Liu <[email protected]>

Conflicts:
unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
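To illustrate why two entries cover the 6-byte case, here is a minimal sketch that derives the sequence length from the number of leading 1 bits in the first byte, following the original 1-to-6-byte UTF-8 layout described on the Wikipedia page. The class and method names (`Utf8LeadByteSketch`, `numBytesForFirstByte`, `countLeadBytesOfLength`) are hypothetical and are not taken from `UTF8String.java`. Throwing on invalid lead bytes is an assumption made for the example, and Spark's actual lookup-table code may behave differently.

```java
// Hypothetical illustration; not the code from UTF8String.java.
final class Utf8LeadByteSketch {

    /**
     * Returns the length of a UTF-8 sequence starting with the given byte,
     * using the original layout that allowed up to 6 bytes per code point.
     * Throws for bytes that cannot start a sequence (assumption for this sketch).
     */
    static int numBytesForFirstByte(byte first) {
        int b = first & 0xFF; // treat the byte as unsigned (0..255)
        // Shift the byte into the top 8 bits, invert, and count leading zeros:
        // this yields the number of leading 1 bits in the original byte.
        int leadingOnes = Integer.numberOfLeadingZeros(~(b << 24));
        if (leadingOnes == 0) {
            return 1; // 0xxxxxxx: single-byte (ASCII)
        }
        if (leadingOnes >= 2 && leadingOnes <= 6) {
            return leadingOnes; // 110xxxxx .. 1111110x
        }
        // 10xxxxxx is a continuation byte; 0xFE (254) and 0xFF (255) are never valid.
        throw new IllegalArgumentException("Invalid UTF-8 lead byte: " + b);
    }

    /** Counts how many of the 256 byte values start a sequence of exactly n bytes. */
    static int countLeadBytesOfLength(int n) {
        int count = 0;
        for (int v = 0; v < 256; v++) {
            try {
                if (numBytesForFirstByte((byte) v) == n) {
                    count++;
                }
            } catch (IllegalArgumentException e) {
                // invalid lead byte, not counted
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Only 0xFC (252) and 0xFD (253) match the pattern 1111110x.
        System.out.println(countLeadBytesOfLength(6)); // prints 2
    }
}
```

With this layout, 252 and 253 are the only first bytes that start a 6-byte sequence, so a table indexed by first byte needs exactly two entries with the value 6, and 254 and 255 need no entry at all.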