[SPARK-29533][SQL][TEST] Benchmark casting strings to intervals #26189
Conversation
```scala
private def buildString(withPrefix: Boolean, units: Seq[String] = Seq.empty): String = {
  val sep = if (units.length > 0) ", " else ""
  val otherUnits = sep + s"'${units.mkString(" ")}'"
```
nit: I would use string interpolation: `s"$sep'${units.mkString(" ")}'"`
```scala
val sep = if (units.length > 0) ", " else ""
val otherUnits = sep + s"'${units.mkString(" ")}'"
val prefix = if (withPrefix) "'interval'" else "''"
s"concat_ws(' ', ${prefix}, cast(id % 10000 AS string), 'years'${otherUnits})"
```
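For readers following along, the quoted helper can be mocked up as plain Scala outside Spark. This standalone sketch (a hypothetical mock-up mirroring the quoted code, not the benchmark itself) shows the SQL expression text it generates:

```scala
// Standalone mock-up of the quoted buildString helper (not the actual
// benchmark class): it produces the SQL expression used to generate
// interval strings from the id column.
def buildString(withPrefix: Boolean, units: Seq[String] = Seq.empty): String = {
  val sep = if (units.length > 0) ", " else ""
  val otherUnits = sep + s"'${units.mkString(" ")}'"
  val prefix = if (withPrefix) "'interval'" else "''"
  s"concat_ws(' ', ${prefix}, cast(id % 10000 AS string), 'years'${otherUnits})"
}

println(buildString(withPrefix = true, Seq("months", "weeks")))
// concat_ws(' ', 'interval', cast(id % 10000 AS string), 'years', 'months weeks')
```

Note that with empty `units` the helper still appends an empty `''` literal directly after `'years'`, which is the line the interpolation nit touches.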
Out of curiosity, why do you use a SQL string instead of the Scala API functions? I personally find it better to use them in such cases.
I think we should construct the string manually and only benchmark casting a string literal to an interval. Otherwise the benchmark result might be affected by the `concat_ws` function.
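One illustrative reading of this suggestion (a hypothetical, non-Spark sketch; `parse` is a stand-in for the string-to-interval conversion under test): materialize all input strings before the timed region, so the measurement covers only parsing:

```scala
// Hypothetical sketch, not the actual benchmark code: inputs are built
// up front so the timed loop exercises only the parse path.
def timeParsingOnly(n: Int, parse: String => Unit): Long = {
  // Input preparation happens outside the measured region.
  val inputs = Array.tabulate(n)(i => s"interval ${i % 10000} years")
  val start = System.nanoTime()
  inputs.foreach(parse)        // only parsing is timed
  System.nanoTime() - start    // elapsed nanoseconds
}
```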
The overhead of preparing benchmark input is non-zero in most cases. That's why I always measure the input preparation; see the first lines in the results: https://github.com/apache/spark/pull/26189/files#diff-586487fac2b9b1303aaf80adf8fa37abR5-R6 . So we can subtract the preparation time from the other numbers.
> I think we should construct the string manually and only benchmark string literal to interval.

Could you explain, please, what you mean by "manually", and how this will make the preparation overhead insignificant?
OK, I see that the `id` column is used to construct the interval string, so we must use the `concat_ws` function.
I agree with @HyukjinKwon that it's more readable to use the DataFrame `Column` API instead of a bare SQL string.
HyukjinKwon left a comment
While it seems fine otherwise, it looks to me like it targets too narrow a case.
Maybe it's time to think about what we should add into benchmarks ...
Test build #112374 has finished for PR 26189 at commit
^^ reopened https://jira.apache.org/jira/browse/SPARK-25923 cc @viirya
This is a narrow case, but it is a hot one now. There is the current implementation, @cloud-fan's implementation #26190, and mine, #26180. I do believe this benchmark can be useful for comparing those implementations.
Thank you, all.
It's okay to have the macOS result, but let's not forget to generate the JDK 11 result together.
Test build #112397 has finished for PR 26189 at commit
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/IntervalBenchmark.scala (outdated review threads, since resolved)
```scala
($"id" % 10000).cast("string") ::
  lit("years") :: Nil
```

```scala
concat_ws(" ", (init ++ units.map(lit)): _*)
```
Can we avoid the `concat_ws` cost in order to focus the interval benchmark more? Due to the `id` column, this seems not to be a foldable expression.
Oops, sorry, this was already mentioned in a comment above.
Could you add more description about how the results should be interpreted?
It's not clear that the first two results should be subtracted from the other results.
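As an illustration of the interpretation being requested (all numbers hypothetical): the first results measure only string preparation, so the net casting cost of a later line is the difference:

```scala
// Hypothetical timings, in ms, illustrating how to read the results:
// the "prepare" baseline only builds the interval strings; a later line
// also casts them, so the net cast cost is the difference.
val prepareMs = 120.0
val prepareAndCastMs = 470.0
val netCastMs = prepareAndCastMs - prepareMs
println(f"net casting time: $netCastMs%.1f ms") // net casting time: 350.0 ms
```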
Test build #112411 has finished for PR 26189 at commit
Test build #112412 has finished for PR 26189 at commit
Merged to master.
Hi, all.
### What changes were proposed in this pull request?

This is a follow-up of #26189 to regenerate the result on EC2.

### Why are the changes needed?

This will be used for the other PR reviews.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

N/A.

Closes #26233 from dongjoon-hyun/SPARK-29533.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: DB Tsai <[email protected]>
What changes were proposed in this pull request?

Added a new benchmark, `IntervalBenchmark`, to measure the performance of interval-related functions. In the PR, I added benchmarks for casting strings to intervals; in particular, interval strings with the `interval` prefix and without it, because there is special code for this:

spark/common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java (lines 100 to 103 in da576a7)

For example, `interval 10 years`; with 2 units and w/o `interval` it is `10 years 5 months`, etc.

Why are the changes needed?

To benchmark string-to-interval casting via `CalendarInterval.fromString()` or `CalendarInterval.fromCaseInsensitiveString()`.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

By running the benchmark via the command:

```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.IntervalBenchmark"
```
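The prefix special-casing in `CalendarInterval.java` referenced above can be illustrated with a simplified stand-in (hypothetical code, not Spark's implementation): the parser must accept interval strings both with and without the leading `interval` keyword:

```scala
// Simplified, hypothetical stand-in for the prefix handling benchmarked
// here (not Spark's CalendarInterval code): accept both "interval 10 years"
// and the bare "10 years 5 months" forms by stripping an optional prefix.
def stripIntervalPrefix(s: String): String = {
  val trimmed = s.trim
  if (trimmed.toLowerCase.startsWith("interval ")) {
    trimmed.substring("interval ".length).trim
  } else {
    trimmed
  }
}

println(stripIntervalPrefix("interval 10 years"))  // 10 years
println(stripIntervalPrefix("10 years 5 months"))  // 10 years 5 months
```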