[SPARK-25108][SQL] Fix the show method to display the wide character alignment problem #22048
Changes from 6 commits
1b9b2e7
9aec12f
906c0ba
da37d2e
8737671
697ac04
3d65e6b
363de6b
3649de5
45ac272
52acfd5
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -2794,6 +2794,27 @@ private[spark] object Utils extends Logging { | |
| } | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * Regular expression matching full width characters | ||
| */ | ||
| private lazy val fullWidthRegex = ("""[""" + | ||
| """\u1100-\u115F""" + | ||
| """\u2E80-\uA4CF""" + | ||
| """\uAC00-\uD7A3""" + | ||
| """\uF900-\uFAFF""" + | ||
| """\uFE10-\uFE19""" + | ||
| """\uFE30-\uFE6F""" + | ||
| """\uFF00-\uFF60""" + | ||
| """\uFFE0-\uFFE6""" + | ||
|
Member
A general question.
Can you answer them and post them in the PR description?
Contributor
Author
I displayed every Unicode character in the range 0x0000-0xFFFF in Xshell and picked out all the full-width characters. That gave the regular expression.
I generated 1000 strings, each consisting of 1000 random Unicode characters in the range 0x0000-0xFFFF (1 million characters in total).
Member
Can you describe them there and add a reference to a public Unicode document?
How about some additional overheads when calling
Contributor
Author
The regular expression matches against Unicode characters, regardless of the specific byte encoding.
val bytes = Array[Byte](0xd6.toByte, 0xd0.toByte, 0xB9.toByte, 0xFA.toByte)
val s1 = new String(bytes, "gbk")
println(s1) //中国
val fullWidthRegex = ("""[""" +
// scalastyle:off nonascii
"""\u1100-\u115F""" +
"""\u2E80-\uA4CF""" +
"""\uAC00-\uD7A3""" +
"""\uF900-\uFAFF""" +
"""\uFE10-\uFE19""" +
"""\uFE30-\uFE6F""" +
"""\uFF00-\uFF60""" +
"""\uFFE0-\uFFE6""" +
// scalastyle:on nonascii
"""]""").r
println(fullWidthRegex.findAllIn(s1).size) //2
This regular expression was obtained experimentally under a specific font.
I tested a Dataset of 100 rows with two columns: an index (0-99) and a random string of 100 characters. I then called showString separately.
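The measurement described above can be approximated without Spark. The following is only a sketch under stated assumptions: `randomStrings` and `timeNanos` are hypothetical helpers, the seed is arbitrary, and the character class is the one from this PR rewritten with regex-level escapes. It prints timings but does not claim any particular result.

```scala
import scala.util.Random

object WidthTimingSketch {
  // Same ranges as the regex in this PR; the escapes are resolved by the regex engine.
  private val fullWidthRegex = (
    "[\\u1100-\\u115F\\u2E80-\\uA4CF\\uAC00-\\uD7A3\\uF900-\\uFAFF" +
    "\\uFE10-\\uFE19\\uFE30-\\uFE6F\\uFF00-\\uFF60\\uFFE0-\\uFFE6]").r

  def stringHalfWidth(str: String): Int =
    if (str == null) 0 else str.length + fullWidthRegex.findAllIn(str).size

  // Hypothetical helper: `rows` strings of `len` random 16-bit chars each.
  def randomStrings(rows: Int, len: Int, seed: Long): Seq[String] = {
    val rnd = new Random(seed)
    Seq.fill(rows)(Seq.fill(len)(rnd.nextInt(0x10000).toChar).mkString)
  }

  // Hypothetical helper: wall-clock nanoseconds spent evaluating `body` once.
  def timeNanos[A](body: => A): Long = {
    val start = System.nanoTime()
    body
    System.nanoTime() - start
  }

  def main(args: Array[String]): Unit = {
    val data = randomStrings(100, 100, seed = 42L)
    val plain = timeNanos(data.map(_.length).sum)
    val wide = timeNanos(data.map(stringHalfWidth).sum)
    println(s"length only: $plain ns, with width regex: $wide ns")
  }
}
```

A single unwarmed run like this is noisy on the JVM, so the numbers are only a rough indication of the extra cost.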
Member
I think this is fine. Just copy a summary of your comments here into the comments in the code. Yes, this has nothing to do with UTF-8 encoding directly. You are really matching UCS-2, that is, 16-bit char values.
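The point about encodings and 16-bit chars can be checked with a small sketch. It assumes the same character class as the PR and uses UTF-8 and UTF-16BE (rather than GBK) because those charsets are guaranteed on every JVM. The object name `CharUnitSketch` is made up for illustration.

```scala
import java.nio.charset.StandardCharsets

object CharUnitSketch {
  // Same ranges as the regex in this PR; the escapes are resolved by the regex engine.
  private val fullWidthRegex = (
    "[\\u1100-\\u115F\\u2E80-\\uA4CF\\uAC00-\\uD7A3\\uF900-\\uFAFF" +
    "\\uFE10-\\uFE19\\uFE30-\\uFE6F\\uFF00-\\uFF60\\uFFE0-\\uFFE6]").r

  def fullWidthCount(str: String): Int = fullWidthRegex.findAllIn(str).size

  def stringHalfWidth(str: String): Int =
    if (str == null) 0 else str.length + fullWidthCount(str)

  // U+4E2D U+56FD decoded from two different byte encodings.
  val fromUtf8 = new String(
    Array(0xE4, 0xB8, 0xAD, 0xE5, 0x9B, 0xBD).map(_.toByte), StandardCharsets.UTF_8)
  val fromUtf16 = new String(
    Array(0x4E, 0x2D, 0x56, 0xFD).map(_.toByte), StandardCharsets.UTF_16BE)

  // A code point above 0xFFFF is stored as two 16-bit chars (a surrogate pair).
  val supplementary = new String(Character.toChars(0x20000))

  def main(args: Array[String]): Unit = {
    println(fromUtf8 == fromUtf16)          // both decode to the same String
    println(stringHalfWidth(fromUtf8))      // 2 chars + 2 full-width matches
    println(supplementary.length)           // number of 16-bit chars
    println(fullWidthCount(supplementary))  // matches within the PR's ranges
  }
}
```

Once the bytes are decoded, the regex only sees the resulting chars, so the source encoding no longer matters. Code points above 0xFFFF fall outside every range in the class, so they contribute only their two-char length.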
Contributor
Author
Do I need to merge the commits above into one commit,
||
| """]""").r | ||
xuejianbest marked this conversation as resolved.
|
||
| /** | ||
| * Return the number of half width of a string | ||
| * A full width character occupies two half widths | ||
| */ | ||
xuejianbest marked this conversation as resolved.
|
||
| def stringHalfWidth(str: String): Int = { | ||
| if(str == null) 0 else str.length + fullWidthRegex.findAllIn(str).size | ||
|
||
| } | ||
| } | ||
|
|
||
| private[util] object CallerContext extends Logging { | ||
|
|
||
Doesn't need to be lazy
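A minimal sketch of what that suggestion could look like, together with how the width might be used for column alignment. This is not the code merged into Spark: `ShowPaddingSketch` and `padLeft` are hypothetical names, and the character class is the one from the diff above with regex-level escapes.

```scala
object ShowPaddingSketch {
  // Plain `val` instead of `lazy val`: the pattern is compiled once when the object loads.
  private val fullWidthRegex = (
    "[\\u1100-\\u115F\\u2E80-\\uA4CF\\uAC00-\\uD7A3\\uF900-\\uFAFF" +
    "\\uFE10-\\uFE19\\uFE30-\\uFE6F\\uFF00-\\uFF60\\uFFE0-\\uFFE6]").r

  // Width in half-width columns: every char counts once, full-width chars once more.
  def stringHalfWidth(str: String): Int =
    if (str == null) 0 else str.length + fullWidthRegex.findAllIn(str).size

  // Hypothetical helper: right-align `cell` within `width` half-width columns.
  def padLeft(cell: String, width: Int): String =
    " " * math.max(0, width - stringHalfWidth(cell)) + cell

  def main(args: Array[String]): Unit = {
    // "abc", two CJK characters, and a mixed string.
    val cells = Seq("abc", "\u4E2D\u56FD", "a\u4E2Db")
    val width = cells.map(stringHalfWidth).max
    cells.foreach(c => println("|" + padLeft(c, width) + "|"))
  }
}
```

Padding by `stringHalfWidth` rather than `length` is what keeps the column borders aligned in a monospaced terminal that draws these characters at double width.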