@yaooqinn
Member

@yaooqinn yaooqinn commented Nov 21, 2019

What changes were proposed in this pull request?

Modify UTF8String.toInt/toLong to trim spaces on both sides of the string before converting it to byte/short/int/long.

This kind of "cheap" trim can help improve performance when casting strings to integral types. The idea is from #24872 (comment)
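
As a rough illustration of the "cheap" trim idea (a minimal sketch, not the actual `UTF8String` code; the class `TrimParseSketch` and its `toInt` method are hypothetical names), the parser can narrow a start and end index over the raw bytes instead of building a trimmed copy first. Returning `null` for invalid input is an assumption made here to mirror a failed cast yielding null:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: parse an int from UTF-8 bytes, skipping spaces on
// both sides by moving indexes rather than copying the bytes.
final class TrimParseSketch {
    static Integer toInt(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int start = 0;
        int end = bytes.length - 1;
        // Narrow the window past leading and trailing spaces.
        while (start <= end && bytes[start] == ' ') start++;
        while (end > start && bytes[end] == ' ') end--;
        if (start > end) return null; // empty or all spaces

        boolean negative = bytes[start] == '-';
        if (negative || bytes[start] == '+') start++;
        if (start > end) return null; // a sign with no digits

        // Accumulate as a negative number so Integer.MIN_VALUE is representable.
        int result = 0;
        final int limit = Integer.MIN_VALUE / 10;
        for (int i = start; i <= end; i++) {
            int digit = bytes[i] - '0';
            if (digit < 0 || digit > 9) return null;             // includes inner spaces
            if (result < limit) return null;                     // would overflow on * 10
            result *= 10;
            if (result < Integer.MIN_VALUE + digit) return null; // would overflow on subtract
            result -= digit;
        }
        if (!negative) {
            if (result == Integer.MIN_VALUE) return null; // +2147483648 does not fit
            return -result;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(toInt("  42  ")); // 42
        System.out.println(toInt("4 2"));    // null
    }
}
```

The trim costs two short index loops and no allocation beyond what the parse already needs, which is the property the benchmarks in this thread try to measure.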

Why are the changes needed?

To make the behavior consistent with other string-to-numeric casts.

Does this PR introduce any user-facing change?

Yes. Casting a string to an integral type, and binary comparisons between strings and integrals, will trim spaces first. Their behavior will be consistent with float and double.

How was this patch tested?

  1. Add unit tests.
  2. Run benchmark tests.
    The benchmark is adapted from the one in [SPARK-28023][SQL] Trim the string when cast string type to Boolean/Numeric types #24872 (comment)
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.execution.benchmark

import org.apache.spark.benchmark.Benchmark

/**
 * Benchmark for trimming the string when casting string type to Boolean/Numeric types.
 * To run this benchmark:
 * {{{
 *   1. without sbt:
 *      bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar>
 *   2. build/sbt "sql/test:runMain <this class>"
 *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
 *      Results will be written to "benchmarks/CastBenchmark-results.txt".
 * }}}
 */
object CastBenchmark extends SqlBasedBenchmark {

  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    val title = "Cast String to Integral"
    runBenchmark(title) {
      withTempPath { dir =>
        val N = 500L << 14
        val df = spark.range(N)
        val types = Seq("int", "long")
        (1 to 5).by(2).foreach { i =>
          df.selectExpr(s"concat(id, '${" " * i}') as str")
            .write.mode("overwrite").parquet(dir + i.toString)
        }

        val benchmark = new Benchmark(title, N, minNumIters = 5, output = output)
        Seq(true, false).foreach { trim =>
          types.foreach { t =>
            val str = if (trim) "trim(str)" else "str"
            val expr = s"cast($str as $t) as c_$t"
            (1 to 5).by(2).foreach { i =>
              benchmark.addCase(expr + s" - with $i spaces") { _ =>
                spark.read.parquet(dir + i.toString).selectExpr(expr).collect()
              }
            }
          }
        }
        benchmark.run()
      }
    }
  }
}

Benchmark result:

normal trim vs. trim in toInt/toLong

================================================================================================
Cast String to Integral
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1
Intel(R) Core(TM) i5-5287U CPU @ 2.90GHz
Cast String to Integral:                            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------
cast(trim(str) as int) as c_int - with 1 spaces             10220          12994        1337          0.8        1247.5       1.0X
cast(trim(str) as int) as c_int - with 3 spaces              4763           8356         357          1.7         581.4       2.1X
cast(trim(str) as int) as c_int - with 5 spaces              4791           8042         NaN          1.7         584.9       2.1X
cast(trim(str) as long) as c_long - with 1 spaces            4014           6755         NaN          2.0         490.0       2.5X
cast(trim(str) as long) as c_long - with 3 spaces            4737           6938         NaN          1.7         578.2       2.2X
cast(trim(str) as long) as c_long - with 5 spaces            4478           6919        1404          1.8         546.6       2.3X
cast(str as int) as c_int - with 1 spaces                    4443           6222         NaN          1.8         542.3       2.3X
cast(str as int) as c_int - with 3 spaces                    3659           3842         170          2.2         446.7       2.8X
cast(str as int) as c_int - with 5 spaces                    4372           7996         NaN          1.9         533.7       2.3X
cast(str as long) as c_long - with 1 spaces                  3866           5838         NaN          2.1         471.9       2.6X
cast(str as long) as c_long - with 3 spaces                  3793           5449         NaN          2.2         463.0       2.7X
cast(str as long) as c_long - with 5 spaces                  4947           5961        1198          1.7         603.9       2.1X

@yaooqinn
Member Author

On the master branch, with no casting logic modified, the cast result will be null:

================================================================================================
Benchmark trim the string
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1
Intel(R) Core(TM) i5-5287U CPU @ 2.90GHz
Benchmark trim the string:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
cast(str as long) as c_long                        3969           4401         402          1.0         968.9       1.0X
cast(str as double) as c_double                    3048           4448        1985          1.3         744.1       1.3X
cast(str as decimal) as c_decimal                 10042          11716         NaN          0.4        2451.8       0.4X

@yaooqinn
Member Author

scala> java.lang.Double.valueOf(" 234 ")
res1: Double = 234.0


scala> " 234 ".toDouble
res3: Double = 234.0

Double and float are supported without an explicit trim, in both the codegen and non-codegen paths.
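
The same difference is visible in the JDK itself, which may help explain why float/double casts tolerate surrounding whitespace while integral parsing needs its own trim. The check below uses only standard-library calls (`Double.parseDouble` ignores leading and trailing whitespace as if by `String.trim`, whereas `Integer.parseInt` rejects it); the class name `JdkParseCheck` is illustrative:

```java
// Standard-library behavior only; JdkParseCheck is an illustrative name.
final class JdkParseCheck {
    // Returns true if Integer.parseInt accepts the input as-is.
    static boolean intParses(String s) {
        try {
            Integer.parseInt(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(Double.parseDouble(" 234 ")); // 234.0
        System.out.println(intParses(" 234 "));          // false
        System.out.println(intParses(" 234 ".trim()));   // true
    }
}
```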

Comment on lines 9 to 10
cast(str as int) as c_int 3169 3530 610 1.3 773.6 1.0X
cast(str as long) as c_long 1812 1881 60 2.3 442.4 1.7X
cast(str as int) as c_int 2208 4341 1848 1.9 539.0 1.0X
cast(str as long) as c_long 2039 3450 2146 2.0 497.8 1.1X
Member Author

@cloud-fan the result is not what we expected. I'll increase the cardinality and run another round of tests. Can you help me check whether I missed something?

Intel(R) Core(TM) i5-5287U CPU @ 2.90GHz
Benchmark trim the string:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
cast(str as int) as c_int                          2208           4341        1848          1.9         539.0       1.0X
Contributor

What's the result on the master branch?

*
* @return this string with no spaces at the start or end
*/
public UTF8String nonCopyTrim() {
Contributor

Another way is to embed the trim logic into toInt.

Member Author

I will check this

Comment on lines 9 to 10
cast(str as int) as c_int 2478 3669 1046 1.7 604.9 1.0X
cast(str as long) as c_long 1439 1548 94 2.8 351.4 1.7X
cast(str as int) as c_int 3169 3530 610 1.3 773.6 1.0X
cast(str as long) as c_long 1812 1881 60 2.3 442.4 1.7X
Member Author

@cloud-fan this is the master branch result compared with the original trim result.

Comment on lines 9 to 10
cast(str as int) as c_int 7105 8190 945 1.2 867.3 1.0X
cast(str as long) as c_long 7520 8670 1629 1.1 918.0 0.9X
cast(str as int) as c_int 6263 8132 NaN 1.3 764.6 1.0X
cast(str as long) as c_long 8199 9737 NaN 1.0 1000.9 0.8X
Member Author

Cardinality * 2; (-) is trim, (+) is non-copy trim. The cost of copyMemory for an int or long value is trivial, I guess.
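
To make the copy versus non-copy distinction concrete, here is a minimal sketch under assumed names (`TrimCopySketch`, `copyTrim`, and `viewTrim` are hypothetical and not Spark APIs). The copying variant allocates a new array for the trimmed bytes, while the non-copying variant wraps the same backing array with a narrower window:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class TrimCopySketch {
    // Returns {start, endExclusive} of the bytes with spaces stripped from both sides.
    static int[] bounds(byte[] b) {
        int start = 0;
        int end = b.length;
        while (start < end && b[start] == ' ') start++;
        while (end > start && b[end - 1] == ' ') end--;
        return new int[] {start, end};
    }

    // Copying trim: allocates a new array and copies the kept bytes.
    static byte[] copyTrim(byte[] b) {
        int[] r = bounds(b);
        return Arrays.copyOfRange(b, r[0], r[1]);
    }

    // Non-copying trim: a view over the same array, no bytes are moved.
    static ByteBuffer viewTrim(byte[] b) {
        int[] r = bounds(b);
        return ByteBuffer.wrap(b, r[0], r[1] - r[0]);
    }

    public static void main(String[] args) {
        byte[] raw = "  123   ".getBytes(StandardCharsets.UTF_8);
        System.out.println(copyTrim(raw).length);         // 3
        System.out.println(viewTrim(raw).remaining());    // 3
        System.out.println(viewTrim(raw).array() == raw); // true: shared storage
    }
}
```

For inputs as short as an int or long literal, the copy moves only a handful of bytes, which fits the guess above that the copy itself is not the dominant cost.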

@SparkQA

SparkQA commented Nov 21, 2019

Test build #114207 has finished for PR 26622 at commit 05ab098.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 21, 2019

Test build #114211 has finished for PR 26622 at commit 4fa535b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 21, 2019

Test build #114213 has finished for PR 26622 at commit cce5a55.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 21, 2019

Test build #114210 has finished for PR 26622 at commit ee94f98.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn
Member Author

Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1
Intel(R) Core(TM) i5-5287U CPU @ 2.90GHz
Benchmark trim the string:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
Cast String to Numeric:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
### normal trim
cast(str as int) as c_int                          6263           8132         NaN          1.3         764.6       1.0X
cast(str as long) as c_long                        8199           9737         NaN          1.0        1000.9       0.8X
### inside trim
cast(str as int) as c_int                          4015           6735         412          2.0         490.1       1.0X
cast(str as long) as c_long                        8251           8597         302          1.0        1007.1       0.5X

cc @cloud-fan

val N = 500L << 14
val df = spark.range(N)
val types = Seq("int", "long")
df.selectExpr(s"concat('${" " * 5}', id, '${" " * 5}') as str")
Contributor

Can we benchmark only spaces on the right? The parsing logic returns immediately if the first char is a space, so benchmarking leading spaces is not very useful.

Member Author

Makes sense.
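
A small sketch of why leading spaces make a poor benchmark for an untrimmed parser (hypothetical code, assuming a parser that stops at the first non-digit byte, as the review comment describes): with a leading space the scan exits after one byte, whereas with trailing spaces it walks every digit before failing.

```java
import java.nio.charset.StandardCharsets;

final class EarlyExitSketch {
    // Counts how many bytes a digits-only scan examines before it
    // finishes or hits the first byte that is not an ASCII digit.
    static int bytesExamined(String s) {
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        int examined = 0;
        for (byte value : b) {
            examined++;
            if (value < '0' || value > '9') break; // invalid byte: stop here
        }
        return examined;
    }

    public static void main(String[] args) {
        System.out.println(bytesExamined("     8192000")); // 1
        System.out.println(bytesExamined("8192000     ")); // 8
    }
}
```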

@yaooqinn
Member Author

normal trim vs. inside trim

The result shows a 10~20% performance improvement @cloud-fan

[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1
[info] Intel(R) Core(TM) i5-5287U CPU @ 2.90GHz
[info] Cast String to Numeric:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] cast(trim(str) as int) as c_int                    4259           5270        1110          1.9         519.9       1.0X
[info] cast(trim(str) as long) as c_long                  4081           5663        1372          2.0         498.2       1.0X
[info] cast(str as int) as c_int                          4071           4254         176          2.0         496.9       1.0X
[info] cast(str as long) as c_long                        4121           5087        1272          2.0         503.1       1.0X
[info]

@yaooqinn
Member Author

normal trim vs. no trim

[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1
[info] Intel(R) Core(TM) i5-5287U CPU @ 2.90GHz
[info] Cast String to Integral:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] cast(trim(str) as int) as c_int                    4174           6090        1434          2.0         509.6       1.0X
[info] cast(trim(str) as long) as c_long                  5940           6829        1113          1.4         725.1       0.7X
[info] cast(str as int) as c_int                          2932           4191        1242          2.8         357.9       1.4X
[info] cast(str as long) as c_long                        3202           4767         NaN          2.6         390.8       1.3X
[info]

@yaooqinn
Member Author

normal trim vs. inside trim in toInt and toLong

[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1
[info] Intel(R) Core(TM) i5-5287U CPU @ 2.90GHz
[info] Cast String to Integral:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] cast(trim(str) as int) as c_int                    4026           7674         NaN          2.0         491.5       1.0X
[info] cast(trim(str) as long) as c_long                  4022           7887        1345          2.0         491.0       1.0X
[info] cast(str as int) as c_int                          4390           6453        1009          1.9         535.9       0.9X
[info] cast(str as long) as c_long                        3759           5388        1413          2.2         458.8       1.1X
[info]

@yaooqinn
Member Author

@cloud-fan if this works, we may need to fix Decimal separately

@cloud-fan
Contributor

I think the overhead is acceptable in order to make the behavior consistent. What do you think? @gatorsmile @srowen @dongjoon-hyun @MaxGekk @wangyum

@wangyum
Member

wangyum commented Nov 21, 2019

+1 to keeping the behavior consistent.

@yaooqinn
Member Author

Shall we handle control characters for integral types here, as approximate numerics already do? #26626 (comment)

@SparkQA

SparkQA commented Nov 21, 2019

Test build #114242 has finished for PR 26622 at commit 09f38f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn yaooqinn changed the title [WIP][SPARK-28023][Test] Cheap UTF8String Trim [SPARK-28023][SQL] Add trim logic in UTF8String's toInt/toLong to make it consistent with other string-numeric casting Nov 21, 2019
@SparkQA

SparkQA commented Nov 21, 2019

Test build #114246 has finished for PR 26622 at commit 76639b7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
if (this.numBytes == 0) return false;
int offset = 0;
while (offset < this.numBytes && getByte(offset) == ' ') offset++;
Contributor

Let's handle control characters too, to be consistent with casting to float/double.
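
One way to read this suggestion, sketched with hypothetical names: `String.trim()` strips every character whose code is at most `' '` (U+0020), which covers ASCII control characters such as tab and newline, and `Double.parseDouble` trims its input the same way. A space-only check does not match that behavior:

```java
final class ControlCharSketch {
    // Space-only predicate, as in the loop under review.
    static boolean isSpaceOnly(char c) {
        return c == ' ';
    }

    // Predicate matching String.trim(): any char up to and including ' '.
    static boolean isTrimmable(char c) {
        return c <= ' ';
    }

    public static void main(String[] args) {
        System.out.println(Double.parseDouble("\t42\n")); // 42.0
        System.out.println("\t42\n".trim());              // 42
        System.out.println(isSpaceOnly('\t'));            // false
        System.out.println(isTrimmable('\t'));            // true
    }
}
```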

int end = this.numBytes - 1;
while (end > offset && getByte(end) == ' ') end--;

int numBytes = end - offset + 1;
Contributor

If it's only used once, we can inline it.

Member Author

ok

@cloud-fan
Contributor

@yaooqinn let's add a migration guide. I think it's pretty close now.

@SparkQA

SparkQA commented Nov 22, 2019

Test build #114267 has finished for PR 26622 at commit f37f467.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


byte b = getByte(0);
int end = this.numBytes - 1;
while (end > offset && getByte(end) <= ' ') end--;
Member

You don't need to trim from the right explicitly here. Just break inside the loop https://github.com/apache/spark/pull/26622/files#diff-d2b5337b91f684b9e7fd5cc101e93fc8R1104 if b == ' '

Member Author

Hm, I guess not. How do you know whether the ' ' is in the middle or at the end?

Member

I see
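
The exchange above can be illustrated with a short sketch (hypothetical helper names, not the PR's code): a loop that simply stops at the first space cannot tell `"12 "` from `"1 2"`, so the trailing boundary has to be located before the digits are scanned.

```java
final class InnerSpaceSketch {
    // Naive: stop at the first space and accept whatever came before it.
    static boolean acceptsBreakOnSpace(String s) {
        int digits = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == ' ') break; // cannot know whether more digits follow
            if (c < '0' || c > '9') return false;
            digits++;
        }
        return digits > 0;
    }

    // Find where the trailing spaces begin first, then require digits only.
    static boolean acceptsTrimThenScan(String s) {
        int end = s.length();
        while (end > 0 && s.charAt(end - 1) == ' ') end--;
        if (end == 0) return false;
        for (int i = 0; i < end; i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(acceptsBreakOnSpace("1 2"));  // true, wrongly accepted
        System.out.println(acceptsTrimThenScan("1 2"));  // false
        System.out.println(acceptsTrimThenScan("12  ")); // true
    }
}
```

Breaking on the space and then checking that only spaces remain would also work, but either way the tail has to be inspected separately from the digit loop.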

@SparkQA

SparkQA commented Nov 22, 2019

Test build #114275 has finished for PR 26622 at commit d5c2a40.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn
Member Author

retest this please

@SparkQA

SparkQA commented Nov 22, 2019

Test build #114284 has finished for PR 26622 at commit d5c2a40.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan closed this in 2dd6807 Nov 22, 2019
@cloud-fan
Contributor

thanks, merging to master!

@yaooqinn yaooqinn deleted the cheapstringtrim branch November 22, 2019 11:34
@yaooqinn
Member Author

thanks for merging

cloud-fan pushed a commit that referenced this pull request Aug 7, 2020
…t handle non-ASCII characters correctly

### What changes were proposed in this pull request?
The trim logic in the Cast expression introduced in #26622 trims non-ASCII characters unexpectedly.

Before this patch
![image](https://user-images.githubusercontent.com/1312321/89513154-caad9b80-d806-11ea-9ebe-17c9e7d1b5b3.png)

After this patch
![image](https://user-images.githubusercontent.com/1312321/89513196-d731f400-d806-11ea-959c-6a7dc29dcd49.png)

### Why are the changes needed?
The behavior described above doesn't make sense. It is also inconsistent with the behavior when casting a string to double/float, and with the behavior of Hive.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Added more unit tests.

Closes #29375 from WangGuangxin/cast-bugfix.

Authored-by: wangguangxin.cn <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
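
A plausible mechanism for this follow-up bug, offered as an assumption since the thread does not show the fix itself: Java's `byte` is signed, so every byte of a multi-byte UTF-8 character (all of which are 0x80 or above) is negative and therefore satisfies a `<= ' '` comparison like the one reviewed in this PR. The sketch below uses hypothetical names and only demonstrates the comparison pitfall:

```java
import java.nio.charset.StandardCharsets;

final class SignedByteSketch {
    // Mirrors a `getByte(i) <= ' '` style check on a signed byte.
    static boolean signedCheck(byte b) {
        return b <= ' ';
    }

    // Treats the byte as unsigned (0..255) before comparing.
    static boolean unsignedCheck(byte b) {
        return (b & 0xFF) <= ' ';
    }

    // Counts the trailing bytes that the chosen check would strip.
    static int trailingStripped(String s, boolean signed) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int end = bytes.length - 1;
        while (end >= 0
                && (signed ? signedCheck(bytes[end]) : unsignedCheck(bytes[end]))) {
            end--;
        }
        return bytes.length - 1 - end;
    }

    public static void main(String[] args) {
        // U+4E2D is a CJK character encoded as three UTF-8 bytes.
        System.out.println(trailingStripped("1\u4e2d", true));  // 3
        System.out.println(trailingStripped("1\u4e2d", false)); // 0
    }
}
```

Under the signed comparison, the digit `1` followed by a CJK character would be reduced to `1` before parsing, which matches the kind of unexpected trimming the commit message describes.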
WangGuangxin added a commit to WangGuangxin/spark that referenced this pull request Aug 7, 2020
…t handle non-ASCII characters correctly

dongjoon-hyun pushed a commit that referenced this pull request Aug 9, 2020
… didn't handle non-ASCII characters correctly

### What changes were proposed in this pull request?

This is a backport of #29375
The trim logic in the Cast expression introduced in #26622 trims non-ASCII characters unexpectedly.

Before this patch
![image](https://user-images.githubusercontent.com/1312321/89513154-caad9b80-d806-11ea-9ebe-17c9e7d1b5b3.png)

After this patch
![image](https://user-images.githubusercontent.com/1312321/89513196-d731f400-d806-11ea-959c-6a7dc29dcd49.png)

### Why are the changes needed?
The behavior described above doesn't make sense. It is also inconsistent with the behavior when casting a string to double/float, and with the behavior of Hive.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Added more unit tests.

Closes #29393 from WangGuangxin/cast-bugfix-branch-3.0.

Authored-by: wangguangxin.cn <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>