-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-29328][SQL] Fix calculation of mean seconds per month #25998
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 13 commits
65385a1
9b58059
b0f765a
1019645
285af30
f59e006
e97f419
71dc2c0
e7c9920
6125a6b
7e30ece
b4ecba4
f444123
2e33203
00a1988
5256ff4
c59443f
93a680b
df2d97a
9d78910
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -17,9 +17,8 @@ | |
|
|
||
| package org.apache.spark.sql.catalyst.plans.logical | ||
|
|
||
| import java.util.concurrent.TimeUnit | ||
|
|
||
| import org.apache.spark.sql.catalyst.expressions.Attribute | ||
| import org.apache.spark.sql.catalyst.util.DateTimeUtils.MILLIS_PER_MONTH | ||
| import org.apache.spark.sql.types.MetadataBuilder | ||
| import org.apache.spark.unsafe.types.CalendarInterval | ||
|
|
||
|
|
@@ -28,9 +27,7 @@ object EventTimeWatermark { | |
| val delayKey = "spark.watermarkDelayMs" | ||
|
|
||
| def getDelayMs(delay: CalendarInterval): Long = { | ||
| // We define month as `31 days` to simplify calculation. | ||
| val millisPerMonth = TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31 | ||
| delay.milliseconds + delay.months * millisPerMonth | ||
| delay.milliseconds + delay.months * MILLIS_PER_MONTH | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What we need here is seconds per month, not seconds per year. I think we should still assume 31 days per month here.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you point out where we need seconds/days per year in the codebase?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I believe any place including this one when we need a duration (in seconds or its fractions). The difference between
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
It's rare that someone would specify "1 month" here, let alone "10 years" right? or am I missing something? these are things like watermark intervals. Not that it means the semantics don't matter, it's just quite a corner case. I therefore just don't feel strongly either way about it. We don't need to match |
||
| } | ||
| } | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -18,9 +18,9 @@ | |
| package org.apache.spark.sql.execution.streaming | ||
|
|
||
| import java.sql.Date | ||
| import java.util.concurrent.TimeUnit | ||
|
|
||
| import org.apache.spark.sql.catalyst.plans.logical.{EventTimeTimeout, ProcessingTimeTimeout} | ||
| import org.apache.spark.sql.catalyst.util.DateTimeUtils.MILLIS_PER_MONTH | ||
| import org.apache.spark.sql.execution.streaming.GroupStateImpl._ | ||
| import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout} | ||
| import org.apache.spark.unsafe.types.CalendarInterval | ||
|
|
@@ -164,8 +164,7 @@ private[sql] class GroupStateImpl[S] private( | |
| throw new IllegalArgumentException(s"Provided duration ($duration) is not positive") | ||
| } | ||
|
|
||
| val millisPerMonth = TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31 | ||
| cal.milliseconds + cal.months * millisPerMonth | ||
| cal.milliseconds + cal.months * MILLIS_PER_MONTH | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's interesting since this change doesn't affect our tests. |
||
| } | ||
|
|
||
| private def checkTimeoutTimestampAllowed(): Unit = { | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an aside, I would have expected
months_betweenreturns an integer, like just the difference in months ignoring day, but that's not what other DBs do. However browsing some links like https://www.ibm.com/support/knowledgecenter/SSCRJT_5.0.1/com.ibm.swg.im.bigsql.commsql.doc/doc/r0053631.html and https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Date-Time/MONTHS_BETWEEN.htm I see that some implementations just assume all months including Feb have 31 days (!?) .I agree that this is more accurate, but is it less consistent with Hive or other DBs? maybe it's already not consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The
months_betweenis only particular case where the fix impacts on 3rd digit in the fractional part. I think it is more important to have more precise year/month duration in conversions of an interval to its duration in seconds (msec, usec) in other cases. Using 372 days per year leads to significant and visible errors in such conversions.