Conversation

@wangyum (Member) commented Jun 14, 2019

What changes were proposed in this pull request?

This PR trims the string when casting string type to Boolean/Numeric types, for two reasons:

  1. Spark SQL itself is inconsistent: casting to int does not trim, but casting to float does:
spark-sql> select cast('  1 ' as int), cast(' 1 ' as float);
NULL	1.0
  2. Mainstream databases trim the string when casting string type to Boolean/Numeric types:
    PostgreSQL, Teradata, Oracle, DB2, Vertica, SQL Server, and MySQL all trim (screenshots omitted).
    Hive fixed this issue in HIVE-17782 (screenshot omitted).

How was this patch tested?

Unit tests.

@SparkQA commented Jun 14, 2019

Test build #106506 has finished for PR 24872 at commit 5ab6ccd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 14, 2019

Test build #106515 has finished for PR 24872 at commit 9c28e6a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

{:toc}

## Upgrading From Spark SQL 2.4 to 3.0
- Since Spark 3.0, trim the string when cast string type to Boolean/Datetime/Numeric types.
Member

Could you complete this sentence by having a subject?
cc @srowen

Member

Yes something like "when casting from string to boolean, date or numeric types, whitespace is trimmed from the ends of the value first"

Member Author

Updated.

  case StringType =>
    val result = new LongWrapper()
-   buildCast[UTF8String](_, s => if (s.toLong(result)) result.value else null)
+   buildCast[UTF8String](_, s => if (s.trim.toLong(result)) result.value else null)
Member

This would be a correct fix. Is there any possibility of a performance regression?

@dongjoon-hyun (Member) left a comment

BTW, please update the title because Datetime is missing there.
Also, I believe we need tests on partition columns.

cc @MaxGekk and @HyukjinKwon since this is related to Datetime casting and partition column, too.

@MaxGekk (Member) left a comment

The changes will very likely impact performance. I believe we need to benchmark them in any case, and we should consider putting the trimming under a flag.

@MaxGekk (Member) commented Jun 15, 2019

Just in case: how about introducing a new parameter for toInt, toLong, etc. (on the existing methods, or on new ones) to control trimming? Trimming inside the loops of those methods shouldn't impact performance much, and shouldn't produce additional garbage. A rough sketch of the idea is below.
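
A minimal sketch of the idea, on a standalone byte-array parser for illustration (the LongWrapper class and the toLong signature here are hypothetical, not Spark's actual UTF8String API):

```scala
object ParseSketch {
  final class LongWrapper { var value: Long = 0L }

  // Hypothetical overload: trimming is done by adjusting bounds inside the
  // parser, so no intermediate trimmed string (and no extra garbage) is created.
  def toLong(bytes: Array[Byte], result: LongWrapper, trimInput: Boolean): Boolean = {
    var start = 0
    var end = bytes.length
    if (trimInput) {
      while (start < end && bytes(start) == ' ') start += 1
      while (end > start && bytes(end - 1) == ' ') end -= 1
    }
    if (start == end) return false                // empty after trimming
    val negative = bytes(start) == '-'
    if (negative || bytes(start) == '+') start += 1
    if (start == end) return false                // a bare sign is not a number
    var v = 0L
    var i = start
    while (i < end) {
      val b = bytes(i)
      if (b < '0' || b > '9') return false        // non-digit (overflow checks omitted)
      v = v * 10 + (b - '0')
      i += 1
    }
    result.value = if (negative) -v else v
    true
  }
}
```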

@srowen (Member) left a comment

If this is a correctness issue, we can't put it under a flag, right? We wouldn't want the behavior to vary with the flag, and how would people generally find it? The overhead of trimming should be trivial compared to parsing.

@MaxGekk (Member) commented Jun 15, 2019

> If this is a correctness issue, we can't put it under a flag, right?

Is it really a correctness issue? Is the correct behavior described somewhere in Spark's docs?

> we wouldn't want the behavior to vary with the flag.

The flag would be useful to restore performance in cases where the user's input doesn't contain spaces.

> How would people generally find this, etc.

As with any other flag: in the docs and in the migration guide.

> The overhead of trimming should be trivial compared to parsing.

That statement should be confirmed by benchmarks, shouldn't it?

@MaxGekk (Member) commented Jun 15, 2019

As far as I can see, trim() always makes a copy of the original string.

@srowen (Member) commented Jun 15, 2019

Taking a step back: is there a correct behavior? Does Hive or the SQL standard suggest that " 3.0" should cast correctly to a double? If so, then there is no question that this is a fix, and we shouldn't offer a flag to make the behavior incorrect.

If there is no clear correct behavior, then don't enforce it in Spark. Callers trim() input if needed, or don't if it isn't. No call for a flag there.

The perf question probably doesn't matter either way, then.

Oh, this is UTF8String.trim(). It looks like there are some obvious optimizations here to avoid a copy, like String.trim() does. It doesn't even optimize for the common case. I'll try that in a separate PR, as that will be worthwhile no matter what.

@wangyum (Member, Author) commented Jun 16, 2019

@dongjoon-hyun
Trimming the string when casting string to date/timestamp was fixed by #22943.
Partition columns and non-partition columns go through the same code path.
@srowen
Hive fixed casting string to numeric types with regard to leading/trailing spaces in HIVE-17782.

@kiszk (Member) commented Jun 16, 2019

I agree with @srowen's opinion regarding performance. If we add a short-cut to avoid the copy, we can minimize the performance degradation.

@SparkQA commented Jun 16, 2019

Test build #106552 has finished for PR 24872 at commit aff33b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk (Member) commented Jun 16, 2019

When I talked about a flag, I meant a flag for the toLong, toInt, etc. methods as well; see #24872 (comment). Spark's implementation is a copy of Hive's:

* This code is mostly copied from LazyLong.parseLong in Hive.

where input trimming is integrated into the parsing functions:
https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyLong.java#L130-L137

I just want to say that the fix (or feature, if it is not defined by the SQL standard) should be made in UTF8String instead of Cast.

Speaking about a SQL config, let's consider two use cases:

  • The user's input doesn't have spaces. In that case, we just introduce the additional overhead of trimming strings.
  • The user's input does have spaces.
    • The user is aware of them and calls trim() explicitly. In that case, we add implicit double trimming, which causes additional overhead in the current implementation. There is no benefit either, and if this is a critical place in the user's code, the user has to spend time troubleshooting the performance problem.
    • If trim() is not called explicitly, the behavior actually changes. Here, I think, a SQL flag (in the spark.sql.legacy namespace) could be useful to restore the previous behavior; see the sketch below.

Additional thought: if we wrap Cast's string input in a StringTrim expression, we could eliminate double trimming by folding StringTrim.
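
For illustration, this is how such a legacy flag might be used; the config name below is hypothetical, not an actual Spark config:

```scala
// Hypothetical flag name, for illustration only.
spark.conf.set("spark.sql.legacy.trimStringCast.enabled", "false")
spark.sql("SELECT CAST(' 1 ' AS INT)").show()  // NULL: old, non-trimming behavior
spark.conf.set("spark.sql.legacy.trimStringCast.enabled", "true")
spark.sql("SELECT CAST(' 1 ' AS INT)").show()  // 1: new, trimming behavior
```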

@srowen (Member) commented Jun 16, 2019

Presumably, right now, all input to these functions has no spaces -- otherwise it would fail.

If it's not clear whether the input should be trimmed by these functions from a standards perspective, then I'd say don't make this change at all. Just leave the behavior alone without a compelling reason to change it.

If there is, then we need to enforce it. You're right that this leaves users with a possibly redundant trim() in their code. If they know enough to know this, they'd just remove the manual trim() -- not undo this 'fix' for future usages. Most people won't know about the flag either way, anyway, if one were added.

What's the cost? I put together a crude benchmark of 90% strings that have no whitespace at the ends and 10% that do: it's about 20 nanoseconds per call. If I add one extra short-circuit to trim() for that common case, it's 6 nanoseconds. We can at least bring the overhead of the common case down a lot, but it's already very small. I'll propose that change separately anyway; a rough sketch of the benchmark methodology is below.
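
A minimal sketch of such a micro-benchmark, assuming plain System.nanoTime timing rather than the harness actually used; the exact numbers will vary by JVM and machine:

```scala
object TrimCostSketch {
  def main(args: Array[String]): Unit = {
    // ~90% of inputs have no surrounding whitespace, ~10% do.
    val inputs = Array.tabulate(1000000) { i =>
      if (i % 10 == 0) s"  $i  " else i.toString
    }
    var blackhole = 0L  // consume results to prevent dead-code elimination
    val start = System.nanoTime()
    inputs.foreach(s => blackhole += s.trim.length)
    val elapsed = System.nanoTime() - start
    println(f"${elapsed.toDouble / inputs.length}%.1f ns per trim (blackhole=$blackhole)")
  }
}
```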

If this change is important, I think a flag isn't necessary. But it may just not be the right behavior change anyway.

@MaxGekk (Member) commented Jun 16, 2019

> We can at least bring the overhead of the common case down a lot, but it's already very small.

I worry mostly about the garbage that trim() produces. That could impact GC pauses.

> If it's not clear ... from a standards perspective, then I'd say don't make this change at all.

Let's confirm whether this is required by the standard. At the moment, the trimming seems like just a nice-to-have feature.

@kiszk (Member) commented Jun 16, 2019

If we introduce an optimization to return this when there is no whitespace, trim() will generate no garbage in the common case.
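
A minimal sketch of that short-circuit, on a plain byte array for illustration (the real change would live in UTF8String):

```scala
// Return the same instance when there is nothing to trim, mirroring what
// java.lang.String.trim does; the common case then allocates nothing.
def trimSketch(bytes: Array[Byte]): Array[Byte] = {
  var start = 0
  var end = bytes.length
  while (start < end && bytes(start) == ' ') start += 1
  while (end > start && bytes(end - 1) == ' ') end -= 1
  if (start == 0 && end == bytes.length) bytes  // no copy, no garbage
  else java.util.Arrays.copyOfRange(bytes, start, end)
}
```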

@srowen (Member) commented Jun 16, 2019

See #24884

# Conflicts:
#	docs/sql-migration-guide-upgrade.md
@SparkQA commented Jun 18, 2019

Test build #106600 has finished for PR 24872 at commit e0ae14b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CollapseCodegenStages(
  • case class AdaptiveSparkPlanExec(
  • case class StageSuccess(stage: QueryStageExec, result: Any) extends StageMaterializationEvent
  • case class StageFailure(stage: QueryStageExec, error: Throwable) extends StageMaterializationEvent
  • case class InsertAdaptiveSparkPlan(session: SparkSession) extends Rule[SparkPlan]
  • case class LogicalQueryStage(
  • case class PlanAdaptiveSubqueries(
  • abstract class QueryStageExec extends LeafExecNode
  • case class ShuffleQueryStageExec(
  • case class BroadcastQueryStageExec(
  • case class ReusedQueryStageExec(
  • class ParquetFilters(
  • class ParquetOutputWriter(path: String, context: TaskAttemptContext)
  • class ParquetReadSupport(val convertTz: Option[TimeZone],
  • case class FileTypes(
  • class ParquetWriteSupport extends WriteSupport[InternalRow] with Logging
  • class ParquetDataSourceV2 extends FileDataSourceV2
  • case class ParquetPartitionReaderFactory(
  • case class ParquetScan(
  • case class ParquetScanBuilder(
  • case class ParquetTable(
  • class ParquetWriteBuilder(
  • case class ArrowEvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute], child: SparkPlan,
  • case class SparkListenerSQLAdaptiveExecutionUpdate(

code"""
try {
Decimal $tmp = Decimal.apply(new java.math.BigDecimal($c.toString()));
Decimal $tmp = Decimal.apply(new java.math.BigDecimal($c.toString().trim()));
Contributor

nit: $c.trim().toString() may be more efficient?

{:toc}

## Upgrading From Spark SQL 2.4 to 3.0
- Since Spark 3.0, trim the string when casting from string to boolean, date, timestamp or numeric types, whitespace is trimmed from the ends of the value first.
Member

How about:
Since Spark 3.0, when a string is cast to boolean/date/timestamp/numeric types, it is trimmed before it is parsed.

@wangyum (Member, Author) commented Jun 20, 2019

Benchmark and benchmark results:

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.execution.benchmark

import org.apache.spark.benchmark.Benchmark

/**
 * Benchmark trim the string when casting string type to Boolean/Numeric types.
 * To run this benchmark:
 * {{{
 *   1. without sbt:
 *      bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar>
 *   2. build/sbt "sql/test:runMain <this class>"
 *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
 *      Results will be written to "benchmarks/CastBenchmark-results.txt".
 * }}}
 */
object CastBenchmark extends SqlBasedBenchmark {

  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {

    val title = "Benchmark trim the string"
    runBenchmark(title) {
      withTempPath { dir =>
        val N = 500L << 13
        val df = spark.range(N)
        val withoutWhitespace = "withoutWhitespace"
        val withWhitespace = "withWhitespace"
        val types = Seq("int", "long", "float", "double", "decimal", "boolean")

        df.selectExpr("cast(id as string) as str")
          .write.mode("overwrite").parquet(dir + withoutWhitespace)
        df.selectExpr(s"concat('${" " * 5}', id, '${" " * 5}') as str")
          .write.mode("overwrite").parquet(dir + withWhitespace)

        val benchmark = new Benchmark(title, N, minNumIters = 5, output = output)
        Seq(withoutWhitespace, withWhitespace).foreach { data =>
          Seq(false, true).foreach { isTrimStr =>
            val expr =
              types.map(t => s"cast(${if (isTrimStr) "trim(str)" else "str"} as $t) as c_$t")
            val name = s"$data ${if (isTrimStr) "with" else "without"} trim"
            benchmark.addCase(name) { _ =>
              spark.read.parquet(dir + data).selectExpr(expr: _*).collect()
            }
          }
        }
        benchmark.run()
      }
    }
  }
}

Before this PR (after SPARK-28066):

[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6
[info] Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
[info] Benchmark trim the string:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] withoutWhitespace without trim                     7452           7945         529          0.5        1819.2       1.0X
[info] withoutWhitespace with trim                        8013           8289         277          0.5        1956.3       0.9X
[info] withWhitespace without trim                       12502          13474        1062          0.3        3052.3       0.6X
[info] withWhitespace with trim                           7652           8544         688          0.5        1868.1       1.0X

After this PR (after SPARK-28066):

[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6
[info] Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
[info] Benchmark trim the string:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] withoutWhitespace without trim                     7467           7849         336          0.5        1823.0       1.0X
[info] withoutWhitespace with trim                        7445           8006         702          0.6        1817.6       1.0X
[info] withWhitespace without trim                        5665           6889        1295          0.7        1383.0       1.3X
[info] withWhitespace with trim                           5627           6201         630          0.7        1373.8       1.3X

@gatorsmile (Member)

Based on the above discussions, let us keep it open. We can revisit it later.

@dongjoon-hyun (Member)

Hi, @gatorsmile. Do you think we can make a decision on this? (There are some other minor PRs depending on it.)

@srowen (Member) commented Jul 8, 2019

I personally don't see a compelling reason to do this, even though trim() has been optimized somewhat.

@dongjoon-hyun (Member) commented Jul 8, 2019

Thank you for the opinion, @srowen.

The original purpose of this issue and PR is to make Apache Spark 3.0.0 follow PostgreSQL behavior more closely. (The umbrella JIRA is https://issues.apache.org/jira/browse/SPARK-27764.) Some other PRs are pending because we are waiting for this decision.

  • If we do this, the other PRs will rebase on top of it and go further toward PostgreSQL.
  • If we close this as Won't Fix, the other issues will ignore this difference and proceed anyway.

I'm okay with either direction. The important thing in this PR is that we need the PMC's decision to move forward.

So, do you think we should close this PR?

@gatorsmile (Member)

We are unable to match the semantics of PostgreSQL 100%. Adding the extra trim to Boolean/Numeric type casting will impact the performance of a common operation, even if the cost is not high. Based on the above discussions, it might not be worth supporting this corner case.

I think, in the future, we can add a PostgreSQL-compatible mode to support all the corner cases. That effort will be big: if we decide to do it, it might need 1000+ JIRAs and PRs.

@dongjoon-hyun (Member)

Thank you for the conclusion, @gatorsmile and @srowen!

@wangyum, could you resolve the issue as Later and close this PR? And let's proceed with the other PRs despite this difference.

@cloud-fan (Contributor) commented Nov 21, 2019

The problem I see is not pgsql compatibility but internal consistency: why does casting to date/timestamp trim the spaces while casting to numeric does not?

BTW, for parsing we don't really need a "safe" trim that copies the data. We can do a "cheap" trim that only changes the UTF8String offsets. The trimmed UTF8String instance will only be used in the parsing method and won't be shared by others, so I don't think it will have a big perf overhead.
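
A sketch of that "cheap" trim, illustrative only (not the actual UTF8String code): compute new bounds over the same backing bytes instead of copying them.

```scala
// The "trimmed" view shares the backing array; that is safe here because the
// view never escapes the parsing method.
final case class ByteSlice(bytes: Array[Byte], offset: Int, len: Int) {
  def cheapTrim: ByteSlice = {
    var s = offset
    var e = offset + len
    while (s < e && bytes(s) == ' ') s += 1
    while (e > s && bytes(e - 1) == ' ') e -= 1
    if (s == offset && e == offset + len) this  // nothing to trim: reuse this view
    else ByteSlice(bytes, s, e - s)             // new offsets, same bytes
  }
}
```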

@yaooqinn (Member)

> The problem I see is not pgsql compatibility but internal consistency: why does casting to date/timestamp trim the spaces while casting to numeric does not?
>
> BTW, for parsing we don't really need a "safe" trim that copies the data. We can do a "cheap" trim that only changes the UTF8String offsets. The trimmed UTF8String instance will only be used in the parsing method and won't be shared by others, so I don't think it will have a big perf overhead.

Yeah, agreed. Because type coercion changed unexpectedly between Spark versions, binary comparisons are now affected too.

@cloud-fan (Contributor)

@yaooqinn, can you send a PR to fix it? We can reuse @wangyum's benchmark (#24872 (comment)) and see whether the "cheap" trim helps.

@yaooqinn (Member)

> @yaooqinn, can you send a PR to fix it? We can reuse @wangyum's benchmark (#24872 (comment)) and see whether the "cheap" trim helps.

OK

cloud-fan pushed a commit that referenced this pull request Nov 22, 2019
…e it consistent with other string-numeric casting

### What changes were proposed in this pull request?

Modify `UTF8String.toInt/toLong` to trim spaces on both sides before converting to byte/short/int/long.

This kind of "cheap" trim helps improve performance for casting strings to integrals. The idea is from #24872 (comment).

### Why are the changes needed?

Make the behavior consistent.

### Does this PR introduce any user-facing change?

Yes. Casting a string to an integral type, and binary comparisons between strings and integrals, will trim spaces first. Their behavior will be consistent with float and double.

### How was this patch tested?

1. Added unit tests.
2. Benchmark tests; the benchmark is modified based on #24872 (comment):

```scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.execution.benchmark

import org.apache.spark.benchmark.Benchmark

/**
 * Benchmark trim the string when casting string type to Boolean/Numeric types.
 * To run this benchmark:
 * {{{
 *   1. without sbt:
 *      bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar>
 *   2. build/sbt "sql/test:runMain <this class>"
 *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
 *      Results will be written to "benchmarks/CastBenchmark-results.txt".
 * }}}
 */
object CastBenchmark extends SqlBasedBenchmark {

  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    val title = "Cast String to Integral"
    runBenchmark(title) {
      withTempPath { dir =>
        val N = 500L << 14
        val df = spark.range(N)
        val types = Seq("int", "long")
        (1 to 5).by(2).foreach { i =>
          df.selectExpr(s"concat(id, '${" " * i}') as str")
            .write.mode("overwrite").parquet(dir + i.toString)
        }

        val benchmark = new Benchmark(title, N, minNumIters = 5, output = output)
        Seq(true, false).foreach { trim =>
          types.foreach { t =>
            val str = if (trim) "trim(str)" else "str"
            val expr = s"cast($str as $t) as c_$t"
            (1 to 5).by(2).foreach { i =>
              benchmark.addCase(expr + s" - with $i spaces") { _ =>
                spark.read.parquet(dir + i.toString).selectExpr(expr).collect()
              }
            }
          }
        }
        benchmark.run()
      }
    }
  }
}
```
#### Benchmark results
Normal trim vs. trim in toInt/toLong:
```java
================================================================================================
Cast String to Integral
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1
Intel(R) Core(TM) i5-5287U CPU  2.90GHz
Cast String to Integral:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
cast(trim(str) as int) as c_int - with 1 spaces          10220          12994        1337          0.8        1247.5       1.0X
cast(trim(str) as int) as c_int - with 3 spaces           4763           8356         357          1.7         581.4       2.1X
cast(trim(str) as int) as c_int - with 5 spaces           4791           8042         NaN          1.7         584.9       2.1X
cast(trim(str) as long) as c_long - with 1 spaces           4014           6755         NaN          2.0         490.0       2.5X
cast(trim(str) as long) as c_long - with 3 spaces           4737           6938         NaN          1.7         578.2       2.2X
cast(trim(str) as long) as c_long - with 5 spaces           4478           6919        1404          1.8         546.6       2.3X
cast(str as int) as c_int - with 1 spaces           4443           6222         NaN          1.8         542.3       2.3X
cast(str as int) as c_int - with 3 spaces           3659           3842         170          2.2         446.7       2.8X
cast(str as int) as c_int - with 5 spaces           4372           7996         NaN          1.9         533.7       2.3X
cast(str as long) as c_long - with 1 spaces           3866           5838         NaN          2.1         471.9       2.6X
cast(str as long) as c_long - with 3 spaces           3793           5449         NaN          2.2         463.0       2.7X
cast(str as long) as c_long - with 5 spaces           4947           5961        1198          1.7         603.9       2.1X
```

Closes #26622 from yaooqinn/cheapstringtrim.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>