
Conversation

@MaxGekk MaxGekk commented Sep 20, 2019

What changes were proposed in this pull request?

Changed DateTimeUtils.getMilliseconds() to avoid the decimal division: instead of dividing, the scale and precision are set while converting microseconds to the decimal type.
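
As a minimal sketch of the idea, assuming Spark's org.apache.spark.sql.types.Decimal and the (8, 3) precision/scale used by getMilliseconds (the literal values below are only illustrative):

import org.apache.spark.sql.types.Decimal

// New approach: Decimal(unscaled, precision, scale) treats the Long as an unscaled value,
// so scale = 3 turns a microsecond count into milliseconds without any division.
Decimal(123456, 8, 3)                                // 123.456 (123456 us == 123.456 ms)

// Old approach: an explicit Decimal division (1000 == MICROS_PER_MILLIS) plus rounding to (8, 3).
(Decimal(123456) / Decimal(1000)).toPrecision(8, 3)  // 123.456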

Why are the changes needed?

This improves performance of extract and date_part() by more than 50 times:
Before:

Invoke extract for timestamp:             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
cast to timestamp                                   397            428          45         25.2          39.7       1.0X
MILLISECONDS of timestamp                         36723          36761          63          0.3        3672.3       0.0X

After:

Invoke extract for timestamp:             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
cast to timestamp                                   278            284           6         36.0          27.8       1.0X
MILLISECONDS of timestamp                           592            606          13         16.9          59.2       0.5X
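
For context, a hedged sketch of the kind of query this speeds up, runnable in a spark-shell session (the exact benchmark queries may differ):

// Both extract and date_part with the MILLISECONDS field go through getMilliseconds.
spark.sql("SELECT EXTRACT(MILLISECONDS FROM CAST(id AS TIMESTAMP)) FROM range(5)").show()
spark.sql("SELECT DATE_PART('MILLISECONDS', CAST(id AS TIMESTAMP)) FROM range(5)").show()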

Does this PR introduce any user-facing change?

No

How was this patch tested?

By the existing test suite DateExpressionsSuite.

SparkQA commented Sep 20, 2019

Test build #111084 has finished for PR 25871 at commit 53e57ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

MaxGekk commented Sep 21, 2019

@dongjoon-hyun @cloud-fan @HyukjinKwon Please take a look at this PR.

 def getMilliseconds(timestamp: SQLTimestamp, timeZone: TimeZone): Decimal = {
-  val micros = Decimal(getMicroseconds(timestamp, timeZone))
-  (micros / Decimal(MICROS_PER_MILLIS)).toPrecision(8, 3)
+  Decimal(getMicroseconds(timestamp, timeZone), 8, 3)
Member

@MaxGekk, is it safe? It seems the previous code could return null, but now it cannot return null in some conditions. For instance:

scala> Decimal(9223372036854775L)
res27: org.apache.spark.sql.types.Decimal = 9223372036854775

scala> Decimal(9223372036854775L, 8, 3)
java.lang.ArithmeticException: Unscaled value too large for precision
  at org.apache.spark.sql.types.Decimal.set(Decimal.scala:79)
  at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:564)
  ... 49 elided

whereas toPrecision seems able to return null for overflow.
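
Put differently, a minimal sketch of the contrast (the null result relies on the null-on-overflow behavior of toPrecision described here, and 1000 stands in for MICROS_PER_MILLIS):

import org.apache.spark.sql.types.Decimal

// Old code path: divide first, then round to DECIMAL(8, 3); on overflow this yields null.
val viaDivision = (Decimal(9223372036854775L) / Decimal(1000L)).toPrecision(8, 3)  // null

// New code path: constructing with (precision, scale) directly throws instead:
// Decimal(9223372036854775L, 8, 3)  // java.lang.ArithmeticException, as shown above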

Member Author

I think so. getMicroseconds returns an int in the range [0, 60000000), for which Decimal(..., 8, 3) is always valid, for example:

scala> Decimal(60000000, 8, 3)
res1: org.apache.spark.sql.types.Decimal = 60000.000

Member

Oh, yes. I checked the other cases too, and they seem fine.

@HyukjinKwon HyukjinKwon left a comment

Looks good but it would be great if I or somebody else double checks.

dongjoon-hyun commented Sep 21, 2019

Looks feasible. Let me run the benchmark for you, @MaxGekk.

@dongjoon-hyun

Hi, @MaxGekk.
I made a PR with both JDK8/JDK11 results for you. Could you review it?

@dongjoon-hyun

BTW, after this PR, we need to repeat this for #25881 once more.

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you, @MaxGekk and @HyukjinKwon.
Since the last two commits are only about the benchmark result text file, I'll merge this.
This PR has already passed Jenkins.

Merged to master.

SparkQA commented Sep 22, 2019

Test build #111141 has finished for PR 25871 at commit fb64679.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk MaxGekk deleted the optimize-epoch-millis branch October 5, 2019 19:18