-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-29328][SQL] Fix calculation of mean seconds per month #25998
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #111688 has finished for PR 25998 at commit
|
|
cc @gatorsmile |
|
Test build #111694 has finished for PR 25998 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MaxGekk . Since the query result is changed, we can consider this is a kind of correctness issue which fixes the behavior at 3.0.0. I added some of previous Spark versions on SPARK-29328. In addition to that, I'd like to recommend you the following.
- Label the JIRA issue as
correctnessto give more visibility. - Compare with the other DBMS for that queries (especially Hive).
- Add a migration document.
- We may need to add conf (if needed).
cc @cloud-fan, @gatorsmile , @jiangxb1987 (3.0.0 release manager)
| The result is rounded off to 8 digits unless `roundOff` is set to `False`. | ||
| >>> df = spark.createDataFrame([('1997-02-28 10:30:00', '1996-10-30')], ['date1', 'date2']) | ||
| >>> df.select(months_between(df.date1, df.date2).alias('months')).collect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an aside, I would have expected months_between returns an integer, like just the difference in months ignoring day, but that's not what other DBs do. However browsing some links like https://www.ibm.com/support/knowledgecenter/SSCRJT_5.0.1/com.ibm.swg.im.bigsql.commsql.doc/doc/r0053631.html and https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Date-Time/MONTHS_BETWEEN.htm I see that some implementations just assume all months including Feb have 31 days (!?) .
I agree that this is more accurate, but is it less consistent with Hive or other DBs? maybe it's already not consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The months_between is only particular case where the fix impacts on 3rd digit in the fractional part. I think it is more important to have more precise year/month duration in conversions of an interval to its duration in seconds (msec, usec) in other cases. Using 372 days per year leads to significant and visible errors in such conversions.
I can compare only |
|
Yeah, that's my worry. Now we wouldn't match Hive, or I presume, many databases. Apparently databases have long used the current definition, even if sounds odd. I'm not sure we want to do this then? |
|
What is more important for us - consistency across of Spark API or consistency with other DBMS? I mean if we have 31 days per months or 372 days per year in |
|
Do the mainstream databases assume 31 days per month in many places, or just |
|
My premise is that The only other relevant function is |
Sure, we can define this function as it operates on 31 days per month. Looking at the |
|
And this PR can fix other parts not related to |
|
shall we consistently assume 31 days per month in all the places in Spark? If that's also what other mainstream databases do. |
PostgreSQL has global constant: |
|
The pgsql code you referred to is interesting. It assumes 30 days per month and 365.25 days per year. I think the idea makes sense, days per year doesn't have to be 12 * days per month. That said, I think we should keep assuming 31 days per month in |
ok. I will revert |
|
Test build #111749 has finished for PR 25998 at commit
|
|
To be clear, the change here now affects streaming operations that specify intervals like "2 months"? OK I can imagine trying to be more precise in this case. It's really a corner case, as I don't expect many streaming operations to have a duration in months anyway. |
| // We define month as `31 days` to simplify calculation. | ||
| val millisPerMonth = TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31 | ||
| delay.milliseconds + delay.months * millisPerMonth | ||
| delay.milliseconds + delay.months * MILLIS_PER_MONTH |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What we need here is seconds per month, not seconds per year. I think we should still assume 31 days per month here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you point out where we need seconds/days per year in the codebase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe any place including this one when we need a duration (in seconds or its fractions). The difference between months_between() and this place is months_between uses month length to calculate fraction of month, and 28 or 31 days per months don't really matter because it impacts on 2nd or 3rd digit in fractions but here we operate on bigger numbers when months form years. And it becomes matter how much days we use per year. Let's say we calculate the duration of 10 years which 120 months. If we use 31 days per months, this duration is 31 * 120 = 10 * 372 = 3720 days but if one year is 365.2425 than 10 years = 3652 days. The difference is 3720 - 3652 = 68 days or the calculation error is more than 2 months. That's matter I believe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
months_between is sort of a special case because "31 days per month" is (it seems) actually how it is supposed to work, correctly.
It's rare that someone would specify "1 month" here, let alone "10 years" right? or am I missing something? these are things like watermark intervals. Not that it means the semantics don't matter, it's just quite a corner case.
I therefore just don't feel strongly either way about it. We don't need to match months_between semantics. More precision is nice, but surely it almost never comes up anyway? I don't mind the change, as a result.
|
|
||
| val millisPerMonth = TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31 | ||
| cal.milliseconds + cal.months * millisPerMonth | ||
| cal.milliseconds + cal.months * MILLIS_PER_MONTH |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's interesting since this change doesn't affect our tests.
|
Can I do something else in the PR? |
|
I'm OK with this change, mostly on the grounds that it will almost never matter. I left it open for a while for comments. |
If we would have this, we could calculate duration of any interval, and make the maxim=# select interval '1 month 1 microsecond' > interval '1 month';
?column?
----------
t
(1 row)I even don't speak about this PR #25981 which is blocked by this somehow. |
|
That sounds like a third issue, what "1 month" means as an interval and how to support it. This doesn't change that, right? what it does change seems OK, if it's narrowly about what a streaming interval of "1 month" means or something. Or have I really missed the point? |
This doesn't.
Here you didn't miss anything but in the PR #25981, I need to calculate interval length in |
|
Merged to master |
|
I took a look at this one. The change is incorrect. Both This just reminds me that we don't have any document or test for this behavior. |
What changes were proposed in this pull request?
I introduced new constants
SECONDS_PER_MONTHandMILLIS_PER_MONTH, and reused it in calculations of seconds/milliseconds per month.SECONDS_PER_MONTHis 2629746 because the average year of the Gregorian calendar is 365.2425 days long or 60 * 60 * 24 * 365.2425 = 31556952.0 = 12 * 2629746 seconds per year.Why are the changes needed?
Spark uses the proleptic Gregorian calendar (see https://issues.apache.org/jira/browse/SPARK-26651) in which the average year is 365.2425 days (see https://en.wikipedia.org/wiki/Gregorian_calendar) but existing implementation assumes 31 days per months or 12 * 31 = 372 days. That's far away from the the truth.
Does this PR introduce any user-facing change?
Yes, the changes affect at least 3 methods in
GroupStateImpl,EventTimeWatermarkandMonthsBetween. For example, themonth_between()function will return different result in some cases.Before:
After:
How was this patch tested?
By existing test suite
DateTimeUtilsSuite,DateFunctionsSuiteandDateExpressionsSuite.