
Conversation

@MaxGekk
Member

@MaxGekk MaxGekk commented Oct 2, 2019

What changes were proposed in this pull request?

I introduced the new constants SECONDS_PER_MONTH and MILLIS_PER_MONTH and reused them in calculations of seconds/milliseconds per month. SECONDS_PER_MONTH is 2629746 because the average year of the Gregorian calendar is 365.2425 days long, i.e. 60 * 60 * 24 * 365.2425 = 31556952.0 = 12 * 2629746 seconds per year.
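
A minimal Scala sketch of the new constants as described above (where exactly they live in the codebase is not stated here, so treat the placement and exact definitions as illustrative):

// Average Gregorian year: 365.2425 days = 31556952 seconds,
// so the average month is 31556952 / 12 = 2629746 seconds.
val SECONDS_PER_MONTH: Long = 2629746L
val MILLIS_PER_MONTH: Long = SECONDS_PER_MONTH * 1000L // 2629746000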

Why are the changes needed?

Spark uses the proleptic Gregorian calendar (see https://issues.apache.org/jira/browse/SPARK-26651) in which the average year is 365.2425 days (see https://en.wikipedia.org/wiki/Gregorian_calendar), but the existing implementation assumes 31 days per month, or 12 * 31 = 372 days per year. That's far from the truth.

Does this PR introduce any user-facing change?

Yes, the changes affect at least 3 methods in GroupStateImpl, EventTimeWatermark and MonthsBetween. For example, the months_between() function will return different results in some cases.

Before:

spark-sql> select months_between('2019-09-15', '1970-01-01');
596.4516129

After:

spark-sql> select months_between('2019-09-15', '1970-01-01');
596.45996838
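
For the record, the fractional part in this example comes from the 14 days left over after the 596 whole months between the two dates: before, the fraction is 14 / 31 = 0.4516129; with the new constant it is 14 * 86400 / 2629746 = 0.45996838 (my reading of where the numbers come from, not a quote from the implementation).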

How was this patch tested?

By the existing test suites DateTimeUtilsSuite, DateFunctionsSuite and DateExpressionsSuite.

@SparkQA

SparkQA commented Oct 2, 2019

Test build #111688 has finished for PR 25998 at commit 7e30ece.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

cc @gatorsmile

@SparkQA

SparkQA commented Oct 2, 2019

Test build #111694 has finished for PR 25998 at commit f444123.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment

@MaxGekk . Since the query result is changed, we can consider this a kind of correctness issue which fixes the behavior at 3.0.0. I added some of the previous Spark versions on SPARK-29328. In addition to that, I'd like to recommend the following.

  • Label the JIRA issue as correctness to give more visibility.
  • Compare with other DBMSs for these queries (especially Hive).
  • Add a migration document.
  • We may need to add a conf (if needed).

cc @cloud-fan, @gatorsmile, @jiangxb1987 (3.0.0 release manager)

The result is rounded off to 8 digits unless `roundOff` is set to `False`.
>>> df = spark.createDataFrame([('1997-02-28 10:30:00', '1996-10-30')], ['date1', 'date2'])
>>> df.select(months_between(df.date1, df.date2).alias('months')).collect()
Member

As an aside, I would have expected months_between to return an integer, like just the difference in months ignoring the day, but that's not what other DBs do. However, browsing some links like https://www.ibm.com/support/knowledgecenter/SSCRJT_5.0.1/com.ibm.swg.im.bigsql.commsql.doc/doc/r0053631.html and https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Date-Time/MONTHS_BETWEEN.htm I see that some implementations just assume all months, including Feb, have 31 days (!?).

I agree that this is more accurate, but is it less consistent with Hive or other DBs? Maybe it's already not consistent.

Member Author

months_between is only a particular case where the fix impacts the 3rd digit of the fractional part. I think it is more important to have a more precise year/month duration in conversions of an interval to its duration in seconds (msec, usec) in other cases. Using 372 days per year leads to significant and visible errors in such conversions.
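
For a concrete sense of the scale: with 372 days per year, a 12-month interval converts to 372 * 86400 = 32140800 seconds instead of 31556952, an error of 583848 seconds, or roughly 6.8 days per year.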

@MaxGekk
Member Author

MaxGekk commented Oct 3, 2019

Compare with other DBMSs for these queries (especially Hive).

I can compare only months_between because there is no streaming-related functionality in Hive. Hive assumes 31 days per month as well, as we can see in its code:
https://github.com/apache/hive/blob/287e5d5e4c43beb2bc84a80e342f897494e32c6c/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFMonthsBetween.java#L143-L145

@srowen
Member

srowen commented Oct 3, 2019

Yeah, that's my worry. Now we wouldn't match Hive or, I presume, many databases. Apparently databases have long used the current definition, even if it sounds odd. I'm not sure we want to do this, then?

@MaxGekk
Member Author

MaxGekk commented Oct 3, 2019

What is more important for us: consistency across the Spark API or consistency with other DBMSs? I mean, if we have 31 days per month (i.e. 372 days per year) in months_between, 365.25 days per year in date_part('epoch', ...) (https://github.com/apache/spark/pull/25981/files#diff-eba257f41b49f470321579875f054f00R85-R89), and 365.2425 days per year in other places, we may get different results using different Spark APIs.
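
To spell the inconsistency out, the assumptions under discussion are:

  • months_between: 31 days per month, i.e. 12 * 31 = 372 days per year.
  • date_part('epoch', ...) as proposed in PR #25981: 365.25 days per year.
  • The places touched by this PR: 365.2425 days per year, the average Gregorian year.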

@cloud-fan
Contributor

Do the mainstream databases assume 31 days per month in many places, or just months_between?

@srowen
Member

srowen commented Oct 3, 2019

My premise is that months_between appears to be defined as "the number of 31-day periods between two dates". If that's the definition, then it's currently correct. Is that the definition? Well, according to Hive, Oracle, IBM, and Vertica at least. I believe we want to stick with that definition, then.

The only other relevant function is add_months, but it doesn't quite have this same question about semantics.

@MaxGekk
Member Author

MaxGekk commented Oct 3, 2019

I believe we want to stick with that definition, then.

Sure, we can define this function as operating on 31 days per month.

Looking at date_part('epoch', ...) in PR #25981, we can say the function uses 365.25 days per year by definition, and we should not care about consistency with other Spark functions. Am I right?

@MaxGekk
Member Author

MaxGekk commented Oct 3, 2019

And can this PR fix the other parts not related to months_between?

@cloud-fan
Contributor

Shall we consistently assume 31 days per month in all the places in Spark? If that's also what other mainstream databases do.

@MaxGekk
Member Author

MaxGekk commented Oct 3, 2019

If that's also what other mainstream databases do.

PostgreSQL has a global constant:
https://github.com/postgres/postgres/blob/97c39498e5ca9208d3de5a443a2282923619bf91/src/include/datatype/timestamp.h#L77. I don't think they make such a bold assumption of 31 days per month in other places.

@cloud-fan
Contributor

The pgsql code you referred to is interesting. It assumes 30 days per month and 365.25 days per year. I think the idea makes sense: days per year doesn't have to be 12 * days per month.

That said, I think we should keep assuming 31 days per month in months_between. For other places that need days per year, I agree that 365.2425 is more accurate.

@MaxGekk
Member Author

MaxGekk commented Oct 3, 2019

That said, I think we should keep assuming 31 days per month in months_between.

OK. I will revert the months_between-related changes here and keep the other changes as is.

@SparkQA

SparkQA commented Oct 3, 2019

Test build #111749 has finished for PR 25998 at commit 9d78910.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Oct 3, 2019

To be clear, the change here now affects streaming operations that specify intervals like "2 months"? OK, I can imagine trying to be more precise in this case. It's really a corner case, as I don't expect many streaming operations to have a duration in months anyway.
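
For a sense of the magnitude in that corner case: a delay of "2 months" used to translate to 2 * 31 * 86400000 = 5356800000 ms (exactly 62 days), and with MILLIS_PER_MONTH it becomes 2 * 2629746000 = 5259492000 ms (about 60.87 days), so the effective delay shifts by a bit over a day.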

- // We define month as `31 days` to simplify calculation.
- val millisPerMonth = TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31
- delay.milliseconds + delay.months * millisPerMonth
+ delay.milliseconds + delay.months * MILLIS_PER_MONTH
Contributor

What we need here is seconds per month, not seconds per year. I think we should still assume 31 days per month here.

Contributor

Can you point out where we need seconds/days per year in the codebase?

Member Author

@MaxGekk MaxGekk Oct 4, 2019

I believe any place, including this one, where we need a duration (in seconds or fractions thereof). The difference between months_between() and this place is that months_between uses the month length to calculate a fraction of a month, and 28 vs. 31 days per month doesn't really matter there because it only impacts the 2nd or 3rd digit of the fraction. Here, however, we operate on bigger numbers, where months form years, and it starts to matter how many days we use per year.

Let's say we calculate the duration of 10 years, which is 120 months. If we use 31 days per month, this duration is 31 * 120 = 10 * 372 = 3720 days, but if one year is 365.2425 days then 10 years = 3652 days. The difference is 3720 - 3652 = 68 days, i.e. a calculation error of more than 2 months. That matters, I believe.

Member

months_between is sort of a special case because "31 days per month" is (it seems) actually how it is supposed to work, correctly.

It's rare that someone would specify "1 month" here, let alone "10 years", right? Or am I missing something? These are things like watermark intervals. Not that it means the semantics don't matter; it's just quite a corner case.

I therefore just don't feel strongly either way about it. We don't need to match months_between semantics. More precision is nice, but surely it almost never comes up anyway? I don't mind the change, as a result.


- val millisPerMonth = TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31
- cal.milliseconds + cal.months * millisPerMonth
+ cal.milliseconds + cal.months * MILLIS_PER_MONTH
Member

It's interesting since this change doesn't affect our tests.

@MaxGekk
Member Author

MaxGekk commented Oct 5, 2019

Is there anything else I can do in this PR?

@srowen
Member

srowen commented Oct 5, 2019

I'm OK with this change, mostly on the grounds that it will almost never matter. I left it open for a while for comments.

@MaxGekk
Member Author

MaxGekk commented Oct 6, 2019

... mostly on the grounds that it will almost never matter.

If we had this, we could calculate the duration of any interval and make the INTERVAL type comparable, as PostgreSQL does:

maxim=# select interval '1 month 1 microsecond' > interval '1 month';
 ?column? 
----------
 t
(1 row)
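
A rough Scala sketch of what that could look like; the helper name and signature below are hypothetical, not something this PR adds:

// Hypothetical helper: total duration of an interval in microseconds, using the
// average Gregorian month (2629746 seconds) for the months component.
def intervalToMicros(months: Int, microseconds: Long): Long =
  months * 2629746L * 1000000L + microseconds

// With such a helper, '1 month 1 microsecond' compares as greater than '1 month',
// matching the PostgreSQL result above.
val oneMonth = intervalToMicros(1, 0L)
val oneMonthAndMicro = intervalToMicros(1, 1L)
assert(oneMonthAndMicro > oneMonth)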

I'm not even speaking of PR #25981, which is somewhat blocked by this.

@srowen
Member

srowen commented Oct 6, 2019

That sounds like a third issue: what "1 month" means as an interval and how to support it. This doesn't change that, right? What it does change seems OK, if it's narrowly about what a streaming interval of "1 month" means or something. Or have I really missed the point?

@MaxGekk
Member Author

MaxGekk commented Oct 6, 2019

This doesn't change that, right?

This doesn't.

Or have I really missed the point?

Here you didn't miss anything, but in PR #25981 I need to calculate the interval length in date_part('EPOCH', interval). PostgreSQL defines it, for interval values, as "the total number of seconds in the interval" (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT). And the month and year durations will matter there.

@srowen srowen closed this in 18b7ad2 Oct 7, 2019
@srowen
Member

srowen commented Oct 7, 2019

Merged to master

@zsxwing
Member

zsxwing commented Oct 12, 2019

I took a look at this one. The change is incorrect. Both GroupStateImpl and EventTimeWatermark pick 31 days intentionally. Please see the discussion here: #16304 (comment)

This just reminds me that we don't have any documentation or tests for this behavior.

@MaxGekk
Member Author

MaxGekk commented Oct 12, 2019

@zsxwing Thank you. Here is the PR to revert the changes: #26101
