[SPARK-29533][SQL][TEST] Benchmark casting strings to intervals #26189
Conversation
```scala
private def buildString(withPrefix: Boolean, units: Seq[String] = Seq.empty): String = {
  val sep = if (units.length > 0) ", " else ""
  val otherUnits = sep + s"'${units.mkString(" ")}'"
```
nit: I would use string interpolation: `s"$sep'${units.mkString(" ")}'"`
```scala
val sep = if (units.length > 0) ", " else ""
val otherUnits = sep + s"'${units.mkString(" ")}'"
val prefix = if (withPrefix) "'interval'" else "''"
s"concat_ws(' ', ${prefix}, cast(id % 10000 AS string), 'years'${otherUnits})"
```
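For readers following along, the quoted helper can be mocked up as plain Scala outside Spark. This standalone sketch (a hypothetical mock-up mirroring the quoted code, not the benchmark itself) shows the SQL expression text it generates:

```scala
// Standalone mock-up of the quoted buildString helper (not the actual
// benchmark class): it produces the SQL expression used to generate
// interval strings from the id column.
def buildString(withPrefix: Boolean, units: Seq[String] = Seq.empty): String = {
  val sep = if (units.length > 0) ", " else ""
  val otherUnits = sep + s"'${units.mkString(" ")}'"
  val prefix = if (withPrefix) "'interval'" else "''"
  s"concat_ws(' ', ${prefix}, cast(id % 10000 AS string), 'years'${otherUnits})"
}

println(buildString(withPrefix = true, Seq("months", "weeks")))
// concat_ws(' ', 'interval', cast(id % 10000 AS string), 'years', 'months weeks')
```

Note that with empty `units` the helper still appends an empty `''` literal directly after `'years'`, which is the line the interpolation nit touches.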
Out of curiosity, why do you use a SQL string instead of the Scala API functions? I personally find it better to use them in such cases.
I think we should construct the string manually and only benchmark casting a string literal to an interval. Otherwise the benchmark result might be affected by the `concat_ws` function.
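One illustrative reading of this suggestion (a hypothetical, non-Spark sketch; `parse` is a stand-in for the string-to-interval conversion under test): materialize all input strings before the timed region, so the measurement covers only parsing:

```scala
// Hypothetical sketch, not the actual benchmark code: inputs are built
// up front so the timed loop exercises only the parse path.
def timeParsingOnly(n: Int, parse: String => Unit): Long = {
  // Input preparation happens outside the measured region.
  val inputs = Array.tabulate(n)(i => s"interval ${i % 10000} years")
  val start = System.nanoTime()
  inputs.foreach(parse)        // only parsing is timed
  System.nanoTime() - start    // elapsed nanoseconds
}
```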
The overhead of preparing benchmark input is non-zero in most cases. That's why I always measure the input preparation; see the first lines in the results: https://github.com/apache/spark/pull/26189/files#diff-586487fac2b9b1303aaf80adf8fa37abR5-R6 . So we can subtract the preparation time from the other numbers.
> I think we should construct the string manually and only benchmark string literal to interval.

Could you explain, please, what you mean by "manually", and how this will make the preparation overhead insignificant?
OK, I see that the `id` column is used to construct the interval string, so we must use the `concat_ws` function.
I agree with @HyukjinKwon that it's more readable to use the DataFrame `Column` API instead of a bare SQL string.
HyukjinKwon left a comment
While it seems fine otherwise, it looks to me like it targets too narrow a case.
Maybe it's time to think about what we should add into benchmarks ...
Test build #112374 has finished for PR 26189 at commit
^^ reopened https://jira.apache.org/jira/browse/SPARK-25923 cc @viirya
This is a narrow case, but it is a hot one now. There is the current implementation, @cloud-fan's implementation #26190, and mine, #26180. I do believe this benchmark can be useful for comparing those implementations.
Thank you, all.
It's okay to have the macOS result, but let's not forget to generate the JDK 11 result together.
Test build #112397 has finished for PR 26189 at commit
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/IntervalBenchmark.scala (outdated review threads, since resolved)
```scala
($"id" % 10000).cast("string") ::
  lit("years") :: Nil
```

```scala
concat_ws(" ", (init ++ units.map(lit)): _*)
```
Can we avoid the `concat_ws` cost in order to focus the interval benchmark more? Due to the `id` column, this seems not to be a foldable expression.
Oops, sorry, this was already mentioned in a comment above.
Could you add more description about how the results should be interpreted?
It's not clear that the first two results should be subtracted from the other results.
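As an illustration of the interpretation being requested (all numbers hypothetical): the first results measure only string preparation, so the net casting cost of a later line is the difference:

```scala
// Hypothetical timings, in ms, illustrating how to read the results:
// the "prepare" baseline only builds the interval strings; a later line
// also casts them, so the net cast cost is the difference.
val prepareMs = 120.0
val prepareAndCastMs = 470.0
val netCastMs = prepareAndCastMs - prepareMs
println(f"net casting time: $netCastMs%.1f ms") // net casting time: 350.0 ms
```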
Test build #112411 has finished for PR 26189 at commit
Test build #112412 has finished for PR 26189 at commit
Merged to master.
Hi, all.
### What changes were proposed in this pull request?

This is a follow-up of #26189 to regenerate the result on EC2.

### Why are the changes needed?

This will be used for the other PR reviews.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

N/A.

Closes #26233 from dongjoon-hyun/SPARK-29533.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: DB Tsai <[email protected]>
What changes were proposed in this pull request?

Added a new benchmark, `IntervalBenchmark`, to measure the performance of interval-related functions. In the PR, I added benchmarks for casting strings to intervals; in particular, interval strings with the `interval` prefix and without it, because there is special code for this:

spark/common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java (lines 100 to 103 in da576a7)

For example, `interval 10 years`; with 2 units and w/o `interval` it is `10 years 5 months`, etc.

Why are the changes needed?

To benchmark string-to-interval casting via `CalendarInterval.fromString()` or `CalendarInterval.fromCaseInsensitiveString()`.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

By running the benchmark via the command:

```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.IntervalBenchmark"
```
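The prefix special-casing in `CalendarInterval.java` referenced above can be illustrated with a simplified stand-in (hypothetical code, not Spark's implementation): the parser must accept interval strings both with and without the leading `interval` keyword:

```scala
// Simplified, hypothetical stand-in for the prefix handling benchmarked
// here (not Spark's CalendarInterval code): accept both "interval 10 years"
// and the bare "10 years 5 months" forms by stripping an optional prefix.
def stripIntervalPrefix(s: String): String = {
  val trimmed = s.trim
  if (trimmed.toLowerCase.startsWith("interval ")) {
    trimmed.substring("interval ".length).trim
  } else {
    trimmed
  }
}

println(stripIntervalPrefix("interval 10 years"))  // 10 years
println(stripIntervalPrefix("10 years 5 months"))  // 10 years 5 months
```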