-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-28023][SQL] Add trim logic in UTF8String's toInt/toLong to make it consistent with other string-numeric casting #26622
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
master branch non casting logic modified, cast result will be null |
Double and float 's supported without trim both codegen and non-codegen |
| cast(str as int) as c_int 3169 3530 610 1.3 773.6 1.0X | ||
| cast(str as long) as c_long 1812 1881 60 2.3 442.4 1.7X | ||
| cast(str as int) as c_int 2208 4341 1848 1.9 539.0 1.0X | ||
| cast(str as long) as c_long 2039 3450 2146 2.0 497.8 1.1X |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan the result seems not as we expected, I'd increase the cardinality and do another test round. Can you help me to see if I missed something?
| Intel(R) Core(TM) i5-5287U CPU @ 2.90GHz | ||
| Benchmark trim the string: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative | ||
| ------------------------------------------------------------------------------------------------------------------------ | ||
| cast(str as int) as c_int 2208 4341 1848 1.9 539.0 1.0X |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the result of master branch?
| * | ||
| * @return this string with no spaces at the start or end | ||
| */ | ||
| public UTF8String nonCopyTrim() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another way is to embed the trim logic into toInt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will check this
| cast(str as int) as c_int 2478 3669 1046 1.7 604.9 1.0X | ||
| cast(str as long) as c_long 1439 1548 94 2.8 351.4 1.7X | ||
| cast(str as int) as c_int 3169 3530 610 1.3 773.6 1.0X | ||
| cast(str as long) as c_long 1812 1881 60 2.3 442.4 1.7X |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan this is master branch to original trim result.
| cast(str as int) as c_int 7105 8190 945 1.2 867.3 1.0X | ||
| cast(str as long) as c_long 7520 8670 1629 1.1 918.0 0.9X | ||
| cast(str as int) as c_int 6263 8132 NaN 1.3 764.6 1.0X | ||
| cast(str as long) as c_long 8199 9737 NaN 1.0 1000.9 0.8X |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cardinality * 2, (-) is trim (+) is non-copy trim, the cost for copyMemory an int or long value is trial, I guess.
|
Test build #114207 has finished for PR 26622 at commit
|
|
Test build #114211 has finished for PR 26622 at commit
|
|
Test build #114213 has finished for PR 26622 at commit
|
|
Test build #114210 has finished for PR 26622 at commit
|
Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1
Intel(R) Core(TM) i5-5287U CPU @ 2.90GHz
Benchmark trim the string: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
Cast String to Numeric: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
### nomal trim
cast(str as int) as c_int 6263 8132 NaN 1.3 764.6 1.0X
cast(str as long) as c_long 8199 9737 NaN 1.0 1000.9 0.8X
### inside trim
cast(str as int) as c_int 4015 6735 412 2.0 490.1 1.0X
cast(str as long) as c_long 8251 8597 302 1.0 1007.1 0.5Xcc @cloud-fan |
| val N = 500L << 14 | ||
| val df = spark.range(N) | ||
| val types = Seq("int", "long") | ||
| df.selectExpr(s"concat('${" " * 5}', id, '${" " * 5}') as str") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we only benchmark spaces at right? The parsing logic will return immediately if the first char is a space, so not very useful to benchmark it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sense
normal trim v.s. inside trimThe result shows 10~20% perfomance improvement @cloud-fan [info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1
[info] Intel(R) Core(TM) i5-5287U CPU @ 2.90GHz
[info] Cast String to Numeric: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] cast(trim(str) as int) as c_int 4259 5270 1110 1.9 519.9 1.0X
[info] cast(trim(str) as long) as c_long 4081 5663 1372 2.0 498.2 1.0X
[info] cast(str as int) as c_int 4071 4254 176 2.0 496.9 1.0X
[info] cast(str as long) as c_long 4121 5087 1272 2.0 503.1 1.0X
[info] |
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
Show resolved
Hide resolved
normal trim v.s . no trim |
normal trim v.s. inside trim in toInt and toLong[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1
[info] Intel(R) Core(TM) i5-5287U CPU @ 2.90GHz
[info] Cast String to Integral: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] cast(trim(str) as int) as c_int 4026 7674 NaN 2.0 491.5 1.0X
[info] cast(trim(str) as long) as c_long 4022 7887 1345 2.0 491.0 1.0X
[info] cast(str as int) as c_int 4390 6453 1009 1.9 535.9 0.9X
[info] cast(str as long) as c_long 3759 5388 1413 2.2 458.8 1.1X
[info] |
|
@cloud-fan if this works, we may need to fix Decimal separately |
|
I think the overhead is acceptable to make the behavior consistent. what do you think? @gatorsmile @srowen @dongjoon-hyun @MaxGekk @wangyum |
|
+1 To keep the behavior consistent. |
|
Shall we handle control characters for integral type here as approximate numerics can do? #26626 (comment) |
|
Test build #114242 has finished for PR 26622 at commit
|
|
Test build #114246 has finished for PR 26622 at commit
|
| } | ||
| if (this.numBytes == 0) return false; | ||
| int offset = 0; | ||
| while (offset < this.numBytes && getByte(offset) == ' ') offset++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's handle control characters to be consistent with casting to float/double
| int end = this.numBytes - 1; | ||
| while (end > offset && getByte(end) == ' ') end--; | ||
|
|
||
| int numBytes = end - offset + 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it's only used once, we can inline it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
|
@yaooqinn let's add a migration guide. I think pretty close now. |
|
Test build #114267 has finished for PR 26622 at commit
|
|
|
||
| byte b = getByte(0); | ||
| int end = this.numBytes - 1; | ||
| while (end > offset && getByte(end) <= ' ') end--; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to trim from the right explicitly here. Just break inside the loop https://github.com/apache/spark/pull/26622/files#diff-d2b5337b91f684b9e7fd5cc101e93fc8R1104 if b == ' '
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I guess not, how do you know the ' ' is in the middle or end?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see
|
Test build #114275 has finished for PR 26622 at commit
|
|
retest this please |
|
Test build #114284 has finished for PR 26622 at commit
|
|
thanks, merging to master! |
|
thanks for merging |
…t handle non-ASCII characters correctly ### What changes were proposed in this pull request? The trim logic in Cast expression introduced in #26622 trim non-ASCII characters unexpectly. Before this patch  After this patch  ### Why are the changes needed? The behavior described above doesn't make sense, and also doesn't consistent with the behavior when cast a string to double/float, as well as doesn't consistent with the behavior of Hive ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Added more UT Closes #29375 from WangGuangxin/cast-bugfix. Authored-by: wangguangxin.cn <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…t handle non-ASCII characters correctly The trim logic in Cast expression introduced in apache#26622 trim non-ASCII characters unexpectly. Before this patch  After this patch  The behavior described above doesn't make sense, and also doesn't consistent with the behavior when cast a string to double/float, as well as doesn't consistent with the behavior of Hive Yes Added more UT Closes apache#29375 from WangGuangxin/cast-bugfix. Authored-by: wangguangxin.cn <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
… did't handle non-ASCII characters correctly ### What changes were proposed in this pull request? This is a backport of #29375 The trim logic in Cast expression introduced in #26622 trim non-ASCII characters unexpectly. Before this patch  After this patch  ### Why are the changes needed? The behavior described above doesn't make sense, and also doesn't consistent with the behavior when cast a string to double/float, as well as doesn't consistent with the behavior of Hive ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Added more UT Closes #29393 from WangGuangxin/cast-bugfix-branch-3.0. Authored-by: wangguangxin.cn <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
Modify
UTF8String.toInt/toLongto support trim spaces for both sides before converting it to byte/short/int/long.With this kind of "cheap" trim can help improve performance for casting string to integrals. The idea is from #24872 (comment)
Why are the changes needed?
make the behavior consistent.
Does this PR introduce any user-facing change?
yes, cast string to an integral type, and binary comparison between string and integrals will trim spaces first. their behavior will be consistent with float and double.
How was this patch tested?
the benchmark is modified based on [SPARK-28023][SQL] Trim the string when cast string type to Boolean/Numeric types #24872 (comment)
benchmark result.
normal trim v.s. trim in toInt/toLong