
Conversation

@MaxGekk
Member

@MaxGekk MaxGekk commented Oct 21, 2019

What changes were proposed in this pull request?

Added a new benchmark, IntervalBenchmark, to measure the performance of interval-related functions. In the PR, I added benchmarks for casting strings to interval, in particular interval strings with and without the interval prefix, because there is special code for this:

if (!intervalStr.regionMatches(true, 0, prefix, 0, prefix.length())) {
  // Prepend `interval` if it is not present because
  // the regular expression strictly requires it.
  intervalStr = prefix + " " + trimmed;
}
I also added benchmarks for different numbers of units in interval strings; for example, 1 unit is `interval 10 years`, 2 units without the `interval` prefix is `10 years 5 months`, and so on.
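The prefix check quoted above can be sketched as a standalone method. This is a minimal sketch, not the exact Spark implementation: the class and method names are made up here, and the real code operates on an already-trimmed input inside CalendarInterval.

```java
// Sketch of the prefix-prepend step described above (hypothetical wrapper,
// not the actual CalendarInterval code). The comparison is case-insensitive,
// so inputs that already start with "interval" in any case pass through.
public class IntervalPrefix {
    static String ensureIntervalPrefix(String trimmed) {
        String prefix = "interval";
        String intervalStr = trimmed;
        // Prepend "interval" if it is not present, because the parsing
        // regular expression strictly requires it.
        if (!intervalStr.regionMatches(true, 0, prefix, 0, prefix.length())) {
            intervalStr = prefix + " " + trimmed;
        }
        return intervalStr;
    }
}
```

This is why the benchmark distinguishes strings with and without the prefix: the branch taken differs between the two shapes of input.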

Why are the changes needed?

  • To find out current performance issues in casting to intervals
  • The benchmark can be used while refactoring/re-implementing CalendarInterval.fromString() or CalendarInterval.fromCaseInsensitiveString().

Does this PR introduce any user-facing change?

No

How was this patch tested?

By running the benchmark via the command:

SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.IntervalBenchmark"

@MaxGekk
Member Author

MaxGekk commented Oct 21, 2019


private def buildString(withPrefix: Boolean, units: Seq[String] = Seq.empty): String = {
  val sep = if (units.length > 0) ", " else ""
  val otherUnits = sep + s"'${units.mkString(" ")}'"
Member

nit: I would use string interpolation: s"$sep'${units.mkString(" ")}'"

  val sep = if (units.length > 0) ", " else ""
  val otherUnits = sep + s"'${units.mkString(" ")}'"
  val prefix = if (withPrefix) "'interval'" else "''"
  s"concat_ws(' ', ${prefix}, cast(id % 10000 AS string), 'years'${otherUnits})"
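Pieced together, the helper under review could be rendered as the following sketch. This is a hypothetical Java translation of the Scala hunks above, assuming the lines shown are the whole method body; it only builds the SQL expression string that the benchmark then casts to interval.

```java
import java.util.List;

// Hypothetical Java rendering of the buildString helper from the hunks above.
// It mirrors the shown logic verbatim, including the separator handling.
public class BuildString {
    static String buildString(boolean withPrefix, List<String> units) {
        String sep = units.isEmpty() ? "" : ", ";
        String otherUnits = sep + "'" + String.join(" ", units) + "'";
        String prefix = withPrefix ? "'interval'" : "''";
        return "concat_ws(' ', " + prefix
            + ", cast(id % 10000 AS string), 'years'" + otherUnits + ")";
    }
}
```

For example, with the prefix and one extra unit this produces an expression like concat_ws(' ', 'interval', cast(id % 10000 AS string), 'years', '3 months').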
Member

Out of curiosity, why do you use a SQL string instead of Scala API functions? I personally find it better to use them in such cases.

Contributor

I think we should construct the string manually and only benchmark string literal to interval. Otherwise the benchmark result might be affected by the concat_ws function.

Member Author

The overhead of preparing benchmark input is non-zero in most cases. That's why I always measure the input preparation; see the first lines in the results: https://github.com/apache/spark/pull/26189/files#diff-586487fac2b9b1303aaf80adf8fa37abR5-R6 . So, we can subtract the preparation time from the other numbers.

I think we should construct the string manually and only benchmark string literal to interval.

Could you explain, please, what you mean by "manually", and how this would make the preparation overhead insignificant?

Contributor

OK, I see that the id column is used to construct the interval string, so we must use the concat_ws function.

Agree with @HyukjinKwon that it's more readable to use the DataFrame Column API instead of a bare SQL string.

Member

@HyukjinKwon HyukjinKwon left a comment

While it seems fine otherwise, it looks to me like it targets too narrow a case.
Maybe it's time to think about what we should add into the benchmark ...

@SparkQA

SparkQA commented Oct 21, 2019

Test build #112374 has finished for PR 26189 at commit 4eaae97.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

^^ reopened https://jira.apache.org/jira/browse/SPARK-25923 cc @viirya

@MaxGekk
Member Author

MaxGekk commented Oct 21, 2019

While it seems fine otherwise, it looks to me like it targets too narrow a case.

This is a narrow case, but it is a hot one right now. There is the current implementation, @cloud-fan's implementation (#26190), and mine (#26180). I do believe this benchmark can be useful for comparing those implementations.

Member

@dongjoon-hyun dongjoon-hyun left a comment

Thank you, all.
It's okay to have the macOS result, but let's not forget to generate the JDK11 result as well.

@SparkQA

SparkQA commented Oct 21, 2019

Test build #112397 has finished for PR 26189 at commit b35439a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  ($"id" % 10000).cast("string") ::
  lit("years") :: Nil

concat_ws(" ", (init ++ units.map(lit)): _*)
Member

Can we avoid the concat_ws cost in order to focus the benchmark more on intervals? Due to the id column, this does not seem to be a foldable expression.

Member

Oops, sorry, this was already mentioned in a comment above.

Member

Could you add more description about how the results should be interpreted?
It's not clear that the first two results should be subtracted from the other results.
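The interpretation being asked about amounts to simple subtraction of the preparation baseline from each measurement. A sketch, with made-up figures (not taken from the actual benchmark report):

```java
// Illustration only: subtract the input-preparation baseline from a measured
// result to isolate the cast-to-interval cost. Both inputs are hypothetical
// millisecond timings, not numbers from the real IntervalBenchmark output.
public class Interpret {
    static double intervalOnlyMs(double measuredMs, double prepareMs) {
        return measuredMs - prepareMs;
    }
}
```

So if preparing the input strings took 400 ms and a full "prepare + cast" run took 1300 ms, roughly 900 ms would be attributable to the cast itself.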

@SparkQA

SparkQA commented Oct 22, 2019

Test build #112411 has finished for PR 26189 at commit 165ee36.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 22, 2019

Test build #112412 has finished for PR 26189 at commit 1772543.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

@dongjoon-hyun
Member

Hi, all.
This was good, but I'll make a follow-up to regenerate the results on EC2 in order to compare with the other PR.

dbtsai pushed a commit that referenced this pull request Oct 23, 2019
### What changes were proposed in this pull request?

This is a follow-up of #26189 to regenerate the result on EC2.

### Why are the changes needed?

This will be used for the other PR reviews.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

N/A.

Closes #26233 from dongjoon-hyun/SPARK-29533.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: DB Tsai <[email protected]>
@MaxGekk MaxGekk deleted the interval-from-string-benchmark branch June 5, 2020 19:40