
Conversation

@MaxGekk MaxGekk commented Sep 20, 2019

What changes were proposed in this pull request?

Changed DateTimeUtils.getMilliseconds() to avoid the decimal division: instead of dividing, the scale and precision are set while converting microseconds to the decimal type.
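
As a minimal sketch of the idea, assuming Spark's org.apache.spark.sql.types.Decimal and the (8, 3) precision/scale used by getMilliseconds (the literal values below are only illustrative):

import org.apache.spark.sql.types.Decimal

// New approach: Decimal(unscaled, precision, scale) treats the Long as an unscaled value,
// so scale = 3 turns a microsecond count into milliseconds without any division.
Decimal(123456, 8, 3)                                // 123.456 (123456 us == 123.456 ms)

// Old approach: an explicit Decimal division (1000 == MICROS_PER_MILLIS) plus rounding to (8, 3).
(Decimal(123456) / Decimal(1000)).toPrecision(8, 3)  // 123.456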

Why are the changes needed?

This improves performance of extract and date_part() by more than 50 times:
Before:

Invoke extract for timestamp:             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
cast to timestamp                                   397            428          45         25.2          39.7       1.0X
MILLISECONDS of timestamp                         36723          36761          63          0.3        3672.3       0.0X

After:

Invoke extract for timestamp:             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
cast to timestamp                                   278            284           6         36.0          27.8       1.0X
MILLISECONDS of timestamp                           592            606          13         16.9          59.2       0.5X
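
For context, a hedged sketch of the kind of query this speeds up, runnable in a spark-shell session (the exact benchmark queries may differ):

// Both extract and date_part with the MILLISECONDS field go through getMilliseconds.
spark.sql("SELECT EXTRACT(MILLISECONDS FROM CAST(id AS TIMESTAMP)) FROM range(5)").show()
spark.sql("SELECT DATE_PART('MILLISECONDS', CAST(id AS TIMESTAMP)) FROM range(5)").show()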

Does this PR introduce any user-facing change?

No

How was this patch tested?

By the existing test suite DateExpressionsSuite.

SparkQA commented Sep 20, 2019

Test build #111084 has finished for PR 25871 at commit 53e57ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

MaxGekk commented Sep 21, 2019

@dongjoon-hyun @cloud-fan @HyukjinKwon Please take a look at this PR.

 def getMilliseconds(timestamp: SQLTimestamp, timeZone: TimeZone): Decimal = {
-  val micros = Decimal(getMicroseconds(timestamp, timeZone))
-  (micros / Decimal(MICROS_PER_MILLIS)).toPrecision(8, 3)
+  Decimal(getMicroseconds(timestamp, timeZone), 8, 3)
Member

@MaxGekk, is it safe? It seems the previous code could return null, but now it cannot return null in some conditions. For instance:

scala> Decimal(9223372036854775L)
res27: org.apache.spark.sql.types.Decimal = 9223372036854775

scala> Decimal(9223372036854775L, 8, 3)
java.lang.ArithmeticException: Unscaled value too large for precision
  at org.apache.spark.sql.types.Decimal.set(Decimal.scala:79)
  at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:564)
  ... 49 elided

whereas toPrecision seems able to return null for overflow.
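
Put differently, a minimal sketch of the contrast (the null result relies on the null-on-overflow behavior of toPrecision described here, and 1000 stands in for MICROS_PER_MILLIS):

import org.apache.spark.sql.types.Decimal

// Old code path: divide first, then round to DECIMAL(8, 3); on overflow this yields null.
val viaDivision = (Decimal(9223372036854775L) / Decimal(1000L)).toPrecision(8, 3)  // null

// New code path: constructing with (precision, scale) directly throws instead:
// Decimal(9223372036854775L, 8, 3)  // java.lang.ArithmeticException, as shown above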

Member Author

I think so. getMicroseconds returns an int in the range [0, 60000000), for which Decimal(..., 8, 3) is always valid, for example:

scala> Decimal(60000000, 8, 3)
res1: org.apache.spark.sql.types.Decimal = 60000.000

Member

Oh, yes. I checked the other cases too, and they seem fine.

@HyukjinKwon HyukjinKwon left a comment

Looks good but it would be great if I or somebody else double checks.

dongjoon-hyun commented Sep 21, 2019

Looks feasible. Let me run the benchmark for you, @MaxGekk.

@dongjoon-hyun

Hi, @MaxGekk.
I made a PR with both JDK8/JDK11 results for you. Could you review it?

@dongjoon-hyun

BTW, after this PR, we need to repeat this for #25881 once more.

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you, @MaxGekk and @HyukjinKwon.
Since the last two commits are only about the benchmark result text file, I'll merge this.
This PR has already passed Jenkins.

Merged to master.

SparkQA commented Sep 22, 2019

Test build #111141 has finished for PR 25871 at commit fb64679.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk MaxGekk deleted the optimize-epoch-millis branch October 5, 2019 19:18