[SPARK-28023][SQL] Trim the string when cast string type to Boolean/Numeric types #24872
Conversation
Test build #106506 has finished for PR 24872 at commit

Test build #106515 has finished for PR 24872 at commit
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala (Outdated)
docs/sql-migration-guide-upgrade.md (Outdated)

```diff
 {:toc}

 ## Upgrading From Spark SQL 2.4 to 3.0
+- Since Spark 3.0, trim the string when cast string type to Boolean/Datetime/Numeric types.
```
Could you complete this sentence by having a subject?
cc @srowen
Yes something like "when casting from string to boolean, date or numeric types, whitespace is trimmed from the ends of the value first"
Updated.
```diff
 case StringType =>
   val result = new LongWrapper()
-  buildCast[UTF8String](_, s => if (s.toLong(result)) result.value else null)
+  buildCast[UTF8String](_, s => if (s.trim.toLong(result)) result.value else null)
```
This will be a correct fix. Do we have a possibility of performance regression?
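In plain Java terms, the one-line change above amounts to trim-before-parse with NULL on failure. The sketch below is an illustrative analogue only: Spark's actual code path goes through `UTF8String.toLong(LongWrapper)`, which is approximated here by `Long.parseLong`.

```java
// Illustrative sketch only, not Spark's actual code:
// UTF8String.toLong(LongWrapper) is approximated by Long.parseLong,
// and the null return mirrors SQL's NULL result for an unparsable cast.
public class TrimCastSketch {
    static Long castToLong(String s) {
        try {
            return Long.parseLong(s.trim()); // trim first, then parse
        } catch (NumberFormatException e) {
            return null; // cast yields NULL instead of throwing
        }
    }
}
```

With this, `castToLong(" 123 ")` succeeds where the untrimmed parse would fail, which is the behavior change under discussion.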
dongjoon-hyun left a comment
BTW, please update the title because Datetime is missing there.
Also, I believe we need tests on partition columns.
cc @MaxGekk and @HyukjinKwon since this is related to Datetime casting and partition column, too.
MaxGekk left a comment
The changes will impact performance, highly likely. I believe we need to benchmark them in any case. And we should consider putting the trimming under a flag.
Just in case, how about introducing a new parameter for
srowen left a comment
If this is a correctness issue, we can't put it under a flag, right? We wouldn't want the behavior to vary with the flag. How would people generally find this, etc.? The overhead of trimming should be trivial compared to parsing.
docs/sql-migration-guide-upgrade.md (Outdated)

```diff
 {:toc}

 ## Upgrading From Spark SQL 2.4 to 3.0
+- Since Spark 3.0, trim the string when cast string type to Boolean/Datetime/Numeric types.
```
> Yes something like "when casting from string to boolean, date or numeric types, whitespace is trimmed from the ends of the value first"

Is it really a correctness issue? Is the correct behavior described somewhere in Spark's docs?

The flag would be useful to restore performance in the cases when the user's input doesn't contain spaces.

As with any other flags: in the docs and in the migration guide.

The statement should be confirmed by benchmarks, shouldn't it?
As I can see,
Taking a step back: is there a correct behavior? Does Hive or a SQL standard suggest that " 3.0" should cast correctly to a double? If so, then there is no question that this is a fix, and we shouldn't offer a flag to make the behavior incorrect. Is there not a clear correct behavior? Then don't enforce it in Spark. Callers trim() input if needed, or don't if it isn't. No call for a flag there. The perf question probably doesn't matter, then, either way. Oh, this is
@dongjoon-hyun

I agree with @srowen's opinion regarding performance. If we add a short-cut to avoid the copy, we could minimize the performance degradation.
Test build #106552 has finished for PR 24872 at commit
When I talk about a flag, I meant a flag for

spark/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java (line 1026 in 17781d7)

where input trimming is integrated into the parsing functions: https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyLong.java#L130-L137

I just want to say that the fix (or a feature, if it is not defined by the SQL standard) should be made in

Speaking about the SQL config, let's consider 2 use cases:

Additional thoughts: if we wrap
Presumably right now, all input to these functions doesn't have spaces -- otherwise it would fail. If it's not clear whether the input should be trimmed by these functions from a standards perspective, then I'd say don't make this change at all. Just leave behavior without a compelling reason to change it. If there is, then we need to enforce it.

You're right, that leaves users with a possibly redundant trim() in their code. If they know enough to know this, they'd just remove the manual trim(), then -- not undo this 'fix' going forward for future usages. Most people won't know about the flag either way, anyway, if one were added.

What's the cost? I put together a crude benchmark of 90% strings that have no whitespace at the ends, and 10% that do. It's 20 nanoseconds per call or so. If I add one extra short-circuit to trim() for that common case, it's 6 nanoseconds. We can at least bring the overhead of the common case down a lot, but it's already very small. I'll propose that change separately anyway.

If this change is important, I think a flag isn't necessary. But it may just not be the right behavior change anyway.
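The "one extra short-circuit to trim()" mentioned above can be sketched as follows. This is a hypothetical helper for illustration, not the actual patch that was proposed separately.

```java
public class FastTrim {
    // Hypothetical short-circuit trim: when neither end has whitespace
    // (the common case in this discussion), return the input reference
    // unchanged so no copy is made; otherwise fall back to String.trim(),
    // which strips all characters <= ' ' from both ends.
    static String fastTrim(String s) {
        if (s.isEmpty() || (s.charAt(0) > ' ' && s.charAt(s.length() - 1) > ' ')) {
            return s; // no leading/trailing whitespace: zero-copy path
        }
        return s.trim();
    }
}
```

For inputs with nothing to strip, the call returns its argument unchanged, which is where the measured overhead drop in the common case would come from.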
I worry mostly about the garbage the

Let's confirm that this is required by the standard. At the moment, the trimming seems like just a nice feature.

If we introduced an optimization to return
See #24884 |
# Conflicts:
#	docs/sql-migration-guide-upgrade.md
Test build #106600 has finished for PR 24872 at commit
```diff
 code"""
   try {
-    Decimal $tmp = Decimal.apply(new java.math.BigDecimal($c.toString()));
+    Decimal $tmp = Decimal.apply(new java.math.BigDecimal($c.toString().trim()));
```
nit: `$c.trim().toString()` may be more efficient?
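For the decimal path above, the effect of trim-before-parse can be sketched in plain Java. Note the assumptions: `java.math.BigDecimal` stands in for Spark's `Decimal.apply`, and the generated code operates on `UTF8String`, not `String` as here.

```java
import java.math.BigDecimal;

public class DecimalCastSketch {
    // Illustrative only: trim-then-parse for the string-to-decimal cast,
    // returning null (SQL NULL) when the input cannot be parsed.
    static BigDecimal castToDecimal(String s) {
        try {
            return new BigDecimal(s.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

The nit is only about where the trim happens in the generated expression; either order yields the same parsed value.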
```diff
 {:toc}

 ## Upgrading From Spark SQL 2.4 to 3.0
+- Since Spark 3.0, trim the string when casting from string to boolean, date, timestamp or numeric types, whitespace is trimmed from the ends of the value first.
```
How about:
Since Spark 3.0, when a string is cast to boolean/date/timestamp/numeric types, it is trimmed before it is parsed.
Benchmark and benchmark result.

```scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.execution.benchmark

import org.apache.spark.benchmark.Benchmark

/**
 * Benchmark trim the string when casting string type to Boolean/Numeric types.
 * To run this benchmark:
 * {{{
 *   1. without sbt:
 *      bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar>
 *   2. build/sbt "sql/test:runMain <this class>"
 *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
 *      Results will be written to "benchmarks/CastBenchmark-results.txt".
 * }}}
 */
object CastBenchmark extends SqlBasedBenchmark {
  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    val title = "Benchmark trim the string"
    runBenchmark(title) {
      withTempPath { dir =>
        val N = 500L << 13
        val df = spark.range(N)
        val withoutWhitespace = "withoutWhitespace"
        val withWhitespace = "withWhitespace"
        val types = Seq("int", "long", "float", "double", "decimal", "boolean")
        df.selectExpr("cast(id as string) as str")
          .write.mode("overwrite").parquet(dir + withoutWhitespace)
        df.selectExpr(s"concat('${" " * 5}', id, '${" " * 5}') as str")
          .write.mode("overwrite").parquet(dir + withWhitespace)
        val benchmark = new Benchmark(title, N, minNumIters = 5, output = output)
        Seq(withoutWhitespace, withWhitespace).foreach { data =>
          Seq(false, true).foreach { isTrimStr =>
            val expr =
              types.map(t => s"cast(${if (isTrimStr) "trim(str)" else "str"} as $t) as c_$t")
            val name = s"$data ${if (isTrimStr) "with" else "without"} trim"
            benchmark.addCase(name) { _ =>
              spark.read.parquet(dir + data).selectExpr(expr: _*).collect()
            }
          }
        }
        benchmark.run()
      }
    }
  }
}
```

Before this PR (after SPARK-28066):

After this PR (after SPARK-28066):
Based on the above discussions, let us keep it open. We can revisit it later.
Hi, @gatorsmile. Do you think we can make a decision for this? (There are some other minor PRs depending on this.)
I personally don't see a compelling reason to do this, even though
Thank you for the opinion, @srowen. The original purpose of this issue and PR is to follow PostgreSQL behavior more closely in Apache Spark 3.0.0. (The umbrella JIRA is https://issues.apache.org/jira/browse/SPARK-27764). Some other PRs are pending because we are waiting for the decision.

I'm okay with both directions. The important thing in this PR is that we need a PMC decision to move forward. Then, do you think we are going to close this PR?
We are unable to 100% match the semantics of PostgreSQL. Adding the extra

I think, in the future, we can add the PostgreSQL-compatible mode for supporting all the corner cases. This effort will be big. If we decide to add it, it might need 1000+ JIRAs and PRs.
Thank you for the conclusion, @gatorsmile and @srowen!

@wangyum Could you resolve the issue as
The problem I see is not about pgsql compatibility, but about internal consistency. Why does casting to date/timestamp trim the spaces while casting to numeric does not?

BTW, for parsing we don't really need a "safe" trim that copies the data. We can do a "cheap" trim that only changes the
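The "cheap" trim idea above — move offsets past the spaces and parse in place, with no copy — might look roughly like this. This is a sketch only: Spark's real implementation lives in `UTF8String.toInt/toLong` and also handles overflow, which is deliberately omitted here to keep the sketch short.

```java
public class CheapTrim {
    // Sketch of an offset-based trim fused into parsing: instead of
    // allocating a trimmed copy, move lo/hi past ASCII spaces and parse
    // the slice in place. Returns null for unparsable input (SQL NULL).
    // Overflow checking is omitted in this sketch.
    static Long parseLongTrimmed(byte[] b) {
        int lo = 0, hi = b.length;
        while (lo < hi && b[lo] == ' ') lo++;        // skip leading spaces
        while (hi > lo && b[hi - 1] == ' ') hi--;    // skip trailing spaces
        if (lo == hi) return null;                   // empty after trimming
        boolean negative = b[lo] == '-';
        if (negative || b[lo] == '+') lo++;          // optional sign
        if (lo == hi) return null;                   // sign with no digits
        long result = 0;
        for (int i = lo; i < hi; i++) {
            int d = b[i] - '0';
            if (d < 0 || d > 9) return null;         // non-digit character
            result = result * 10 + d;
        }
        return negative ? -result : result;
    }
}
```

Because only the two index variables move, no intermediate trimmed string is created, which avoids the garbage-pressure concern raised earlier in the thread.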
Yeah, agreed. Because type coercion changed unexpectedly between Spark versions, the binary comparator is now affected.
@yaooqinn can you send a PR to fix it? We can reuse @wangyum's benchmark: #24872 (comment) and see if the "cheap" trim can help.
OK |
…e it consistent with other string-numeric casting

### What changes were proposed in this pull request?

Modify `UTF8String.toInt/toLong` to support trimming spaces on both sides before converting to byte/short/int/long. This kind of "cheap" trim can help improve performance for casting string to integrals. The idea is from #24872 (comment)

### Why are the changes needed?

Make the behavior consistent.

### Does this PR introduce any user-facing change?

Yes. Casting a string to an integral type, and binary comparison between strings and integrals, will trim spaces first. Their behavior will be consistent with float and double.

### How was this patch tested?

1. add ut.
2. benchmark tests

The benchmark is modified based on #24872 (comment)

```scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.execution.benchmark

import org.apache.spark.benchmark.Benchmark

/**
 * Benchmark trim the string when casting string type to Boolean/Numeric types.
 * To run this benchmark:
 * {{{
 *   1. without sbt:
 *      bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar>
 *   2. build/sbt "sql/test:runMain <this class>"
 *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
 *      Results will be written to "benchmarks/CastBenchmark-results.txt".
 * }}}
 */
object CastBenchmark extends SqlBasedBenchmark {
  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    val title = "Cast String to Integral"
    runBenchmark(title) {
      withTempPath { dir =>
        val N = 500L << 14
        val df = spark.range(N)
        val types = Seq("int", "long")
        (1 to 5).by(2).foreach { i =>
          df.selectExpr(s"concat(id, '${" " * i}') as str")
            .write.mode("overwrite").parquet(dir + i.toString)
        }
        val benchmark = new Benchmark(title, N, minNumIters = 5, output = output)
        Seq(true, false).foreach { trim =>
          types.foreach { t =>
            val str = if (trim) "trim(str)" else "str"
            val expr = s"cast($str as $t) as c_$t"
            (1 to 5).by(2).foreach { i =>
              benchmark.addCase(expr + s" - with $i spaces") { _ =>
                spark.read.parquet(dir + i.toString).selectExpr(expr).collect()
              }
            }
          }
        }
        benchmark.run()
      }
    }
  }
}
```

#### Benchmark result: normal trim v.s. trim in toInt/toLong

```java
================================================================================================
Cast String to Integral
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1
Intel(R) Core(TM) i5-5287U CPU 2.90GHz
Cast String to Integral:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
cast(trim(str) as int) as c_int - with 1 spaces            10220          12994        1337         0.8        1247.5       1.0X
cast(trim(str) as int) as c_int - with 3 spaces             4763           8356         357         1.7         581.4       2.1X
cast(trim(str) as int) as c_int - with 5 spaces             4791           8042         NaN         1.7         584.9       2.1X
cast(trim(str) as long) as c_long - with 1 spaces           4014           6755         NaN         2.0         490.0       2.5X
cast(trim(str) as long) as c_long - with 3 spaces           4737           6938         NaN         1.7         578.2       2.2X
cast(trim(str) as long) as c_long - with 5 spaces           4478           6919        1404         1.8         546.6       2.3X
cast(str as int) as c_int - with 1 spaces                   4443           6222         NaN         1.8         542.3       2.3X
cast(str as int) as c_int - with 3 spaces                   3659           3842         170         2.2         446.7       2.8X
cast(str as int) as c_int - with 5 spaces                   4372           7996         NaN         1.9         533.7       2.3X
cast(str as long) as c_long - with 1 spaces                 3866           5838         NaN         2.1         471.9       2.6X
cast(str as long) as c_long - with 3 spaces                 3793           5449         NaN         2.2         463.0       2.7X
cast(str as long) as c_long - with 5 spaces                 4947           5961        1198         1.7         603.9       2.1X
```

Closes #26622 from yaooqinn/cheapstringtrim.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
This PR trims the string when casting string type to Boolean/Numeric types, for 2 reasons:
PostgreSQL:
Teradata:
Oracle:
DB2:
Vertica:
SQL Server:
MySQL:
Hive fixed this issue in HIVE-17782:
How was this patch tested?
unit tests