-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-29190][SQL] Optimize extract/date_part for the milliseconds field
#25871
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #111084 has finished for PR 25871 at commit
|
|
@dongjoon-hyun @cloud-fan @HyukjinKwon Please, take a look at this PR. |
| def getMilliseconds(timestamp: SQLTimestamp, timeZone: TimeZone): Decimal = { | ||
| val micros = Decimal(getMicroseconds(timestamp, timeZone)) | ||
| (micros / Decimal(MICROS_PER_MILLIS)).toPrecision(8, 3) | ||
| Decimal(getMicroseconds(timestamp, timeZone), 8, 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MaxGekk, is it safe? Seems previously it could return null but now we cannot return null in some conditions. For instance:
scala> Decimal(9223372036854775L)
res27: org.apache.spark.sql.types.Decimal = 9223372036854775
scala> Decimal(9223372036854775L, 8, 3)
java.lang.ArithmeticException: Unscaled value too large for precision
at org.apache.spark.sql.types.Decimal.set(Decimal.scala:79)
at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:564)
... 49 elidedwhereas toPrecision seems able to return null for overflow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so, getMicroseconds returns an int in the range [0, 60000000) for which Decimal(..., 8, 3) is always valid, for example:
scala> Decimal(60000000, 8, 3)
res1: org.apache.spark.sql.types.Decimal = 60000.000There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, yes. I checked other cases too and seems fine.
HyukjinKwon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good but it would be great if I or somebody else double checks.
|
Looks feasible. Let me run the benchmark for you, @MaxGekk . |
|
Hi, @MaxGekk . |
|
BTW, after this PR, we need to repeat this for #25881 once more. |
Regenerate JDK8/11 result
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Thank you, @MaxGekk and @HyukjinKwon .
Since the last two commits are about benchmark result text file, I'll merge this.
This PR already passed the Jenkins.
Merged to master.
|
Test build #111141 has finished for PR 25871 at commit
|
What changes were proposed in this pull request?
Changed the
DateTimeUtils.getMilliseconds()by avoiding the decimal division, and replacing it by setting scale and precision while converting microseconds to the decimal type.Why are the changes needed?
This improves performance of
extractanddate_part()by more than 50 times:Before:
After:
Does this PR introduce any user-facing change?
No
How was this patch tested?
By existing test suite -
DateExpressionsSuite