SPARK-31447 Fix issue in ExtractIntervalPart expression #28222
Conversation
Can one of the admins verify this patch?
@cloud-fan @MaxGekk @yaooqinn I am looking for help reviewing this PR, created two weeks ago. Since you were involved in a PR related to a similar change (https://issues.apache.org/jira/browse/SPARK-31469), I am tagging you to see if you can help review it. Since it is my first PR, please bear with me if I missed anything. I am happy to get guidance to improve it.
@sathyaprakashg Please take a look at the PRs
Thanks @MaxGekk for the prompt reply. The CalendarInterval change is not required to fix the issue, so I can revert that proposed change. How does my proposed change for ExtractIntervalPart look? If it looks good, I will update my PR to include only the ExtractIntervalPart change. We need to change ExtractIntervalPart so that the query below returns 14 instead of 0 (please refer to "Why are the changes needed?" for more information):

```sql
SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp)))
```
cc @yaooqinn, can you take a look? This seems like a hard problem, as we have a non-standard interval definition. It would be interesting to see what results other systems return, like Presto, Hive, Snowflake, etc.
Checked PostgreSQL (not an ANSI interval type) and Presto (ANSI); both of them return the proper days.
Presto returns day-time intervals only for timestamp subtractions, so I think we can handle this change with our own CalendarInterval. BTW, it seems both interval extraction and timestamp subtraction are newly added in 3.0.0? I am +1 for returning both days and microseconds.
I think extracting days should return `interval.days + interval.microseconds / micros_per_day`, to simulate the day-time interval semantics, but we shouldn't consider months in this case. @yaooqinn can you send a PR to make this happen? @sathyaprakashg we are trying hard to avoid assuming 1 month = 30 days. This makes intervals not comparable, but we need to see how other systems implement it. I know pgsql assumes 1 month = 30 days; can you check other systems like Presto, Hive, Snowflake, etc.?
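A minimal sketch of that extraction rule (a standalone model for illustration only; the `Interval` case class and `ExtractDaysSketch` object are stand-ins, not Spark's actual `CalendarInterval` or expression code):

```scala
// Hypothetical stand-in for Spark's CalendarInterval (months, days, microseconds).
case class Interval(months: Int, days: Int, microseconds: Long)

object ExtractDaysSketch {
  val MicrosPerDay: Long = 24L * 60 * 60 * 1000 * 1000 // 86,400,000,000

  // Suggested semantics: the days field plus whole days carried in the time
  // part; months are deliberately ignored (no 1 month = 30 days assumption).
  def extractDays(i: Interval): Long = i.days + i.microseconds / MicrosPerDay

  def main(args: Array[String]): Unit = {
    // 14 days expressed purely in microseconds, as SubtractTimestamps produces.
    val diff = Interval(0, 0, 14L * MicrosPerDay)
    println(extractDays(diff)) // 14
  }
}
```

Under this rule, an interval like `(months=2, days=10, microseconds=0)` would still extract 10 days, since the months field is intentionally not converted.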
@cloud-fan, nice suggestion, I get your point. I will make a PR then.
@cloud-fan Here are the outputs of the month difference between two timestamps returned by different databases:

SQL Server: Statement: Statement: Statement:
Presto: Statement:
Postgres: Statement:
MySQL: Statement:

In summary, it looks like none of the above databases assumes 30 days per month, since 365 days does not give 12 months as the result. SQL Server takes the difference of the month values of the dates regardless of the day values, but all the other databases seem to check whether the day of date1 is greater than or equal to the day of date2 to decide whether to count that month. Oracle and Hive only have a MONTHS_BETWEEN function, so I didn't include them in the analysis.
@cloud-fan Similar to getDays, we need to adjust the getMonths function to consider this. Any suggestions?
Can we use…
@cloud-fan Unfortunately, there is no date_diff to find the month difference in Hive, so I used the MONTHS_BETWEEN function.
…ays + days in interval.microsecond

### What changes were proposed in this pull request?

Following the suggestion from cloud-fan in #28222 (comment), I checked with both Presto and PostgreSQL: one implements intervals with ANSI-style year-month/day-time, and the other is mixed and non-ANSI. Both add the whole days carried in the interval's time part to the total days of the operation that extracts the day from interval values.

```sql
presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp)));
 _col0
-------
    14
(1 row)

Query 20200428_135239_00000_ahn7x, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:01 [0 rows, 0B] [0 rows/s, 0B/s]

presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:01' as timestamp)));
 _col0
-------
    13
(1 row)

Query 20200428_135246_00001_ahn7x, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]
```

```sql
postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp)));
 date_part
-----------
        14
(1 row)

postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:01' as timestamp)));
 date_part
-----------
        13
```

```
spark-sql> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:01' as timestamp)));
0
spark-sql> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp)));
0
```

In the ANSI standard, a day is exactly 24 hours, so we don't need to worry about the conceptual day for interval extraction; the meaning of the conceptual day only takes effect when we add it to a zoned timestamp value.

### Why are the changes needed?

To satisfy both the ANSI standard and common use cases in modern SQL platforms.

### Does this PR introduce any user-facing change?

No, it is new in 3.0.

### How was this patch tested?

Add more UTs.

Closes #28396 from yaooqinn/SPARK-31597.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…ays + days in interval.microsecond (cherry picked from commit ea525fe) Signed-off-by: Wenchen Fan <[email protected]>
Spark has… Can you look at MySQL, Oracle, SQL Server, etc., to see if they support…
@cloud-fan I also updated SQL Server and MySQL in my original comment. Hive and Oracle only support the MONTHS_BETWEEN function. Please let me know your suggestion on how to handle this in Spark. If possible, I would like to take this opportunity to contribute and get this fixed in Spark :)
OK, seems only pgsql has…
Below is the output of PostgreSQL.

The output translates to 420 days, 10 hours, 9 minutes and 7 seconds.
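That breakdown can be reproduced with plain integer arithmetic over the interval's microseconds (a sketch independent of Spark's internals; the `IntervalBreakdown` object is hypothetical):

```scala
object IntervalBreakdown {
  val MicrosPerSecond: Long = 1000000L
  val MicrosPerMinute: Long = 60 * MicrosPerSecond
  val MicrosPerHour: Long   = 60 * MicrosPerMinute
  val MicrosPerDay: Long    = 24 * MicrosPerHour

  // Split a pure-microseconds interval into (days, hours, minutes, seconds).
  def breakdown(micros: Long): (Long, Long, Long, Long) = (
    micros / MicrosPerDay,
    micros % MicrosPerDay / MicrosPerHour,
    micros % MicrosPerHour / MicrosPerMinute,
    micros % MicrosPerMinute / MicrosPerSecond
  )

  def main(args: Array[String]): Unit = {
    // 420 days, 10 hours, 9 minutes, 7 seconds expressed in microseconds.
    val total = 420 * MicrosPerDay + 10 * MicrosPerHour +
      9 * MicrosPerMinute + 7 * MicrosPerSecond
    println(breakdown(total)) // (420,10,9,7)
  }
}
```

This is exactly the arithmetic that breaks when an extraction reads only a single field instead of carrying the time part into days.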
The same query fails in Presto. I don't have a good idea here, as Spark's interval type is non-standard. Returning 0 seems reasonable as well.
I think the problem is not about the… The… If the… The Spark project currently doesn't have an… Then we can get the proper results, as you did above:

```scala
test("age") {
  checkAnswer(
    sql("select age(timestamp '2001-04-10', timestamp '1957-06-13')"),
    Seq(Row(new CalendarInterval(525, 28, 0)))
  )
  checkAnswer(
    sql(
      """
        |SELECT EXTRACT(YEAR FROM age) year_part,
        |EXTRACT(MONTH FROM age) month_part,
        |EXTRACT(DAY FROM age) day_part,
        |EXTRACT(hour FROM age) hour_part,
        |EXTRACT(minute FROM age) minute_part,
        |EXTRACT(second FROM age) second_part
        |FROM values (age(TIMESTAMP '2021-03-16 10:09:07', TIMESTAMP '2020-01-15 00:00:00')) AS t(age)
        |""".stripMargin),
    Seq(Row(1, 2, 1, 10, 9, Decimal(7000000, 8, 6).toJavaBigDecimal))
  )
  checkAnswer(
    sql("select age(date 'today')"),
    Seq(Row(new CalendarInterval(0, 0, 0)))
  )
}
```

In other words, if the… One more thing, if you really want to justify a month with 30 days, there are…
@cloud-fan @yaooqinn Thanks for the great discussion on this. @yaooqinn has fixed getDays in the PR below, and for getMonths we can agree that keeping the existing behavior is appropriate. I have another quick question before we close this PR. SubtractTimestamps seems to always populate only the microseconds field of CalendarInterval (Line 2242 in b7cde42), whereas SubtractDates seems to populate the months field along with the days field (Line 2261 in b7cde42; spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala Line 940 in ea525fe). My question is whether we should change SubtractTimestamps to also populate the appropriate months and days fields in CalendarInterval, or whether you feel the existing implementation is appropriate.
@cloud-fan @yaooqinn Based on that, we can make a decision to close this PR.
Hi @sathyaprakashg
What changes were proposed in this pull request?
ExtractIntervalPart change
The `ExtractIntervalPart` expression gets the interval parts information from `CalendarInterval`, which has three fields: months, days and microseconds.

To get the days of an interval, the `ExtractIntervalDays` expression simply returns the days field of the provided `CalendarInterval` input.

If the input is `CalendarInterval(months=2, days=10, microseconds=0)`, it returns 10, but it should instead return the total days in the interval; in this case, it should return 70 days.

CalendarInterval change
Another change I am proposing is the equality comparison between two CalendarInterval objects. Right now, `CalendarInterval(months=1, days=0, microseconds=0)` and `CalendarInterval(months=0, days=30, microseconds=0)` are not equal. I understand a month could have 30 or 31 days, but there should be a way to compare two intervals: even if they have different values in the three fields, if both represent the same epoch they should be considered equal.
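One way this proposed equality could be sketched (a standalone model, not Spark's actual `CalendarInterval`; note it bakes in the contested 1 month = 30 days assumption):

```scala
// Hypothetical model of the author's proposed equality semantics.
// NOTE: this assumes a fixed 1 month = 30 days, a convention the reviewers
// pushed back on because real month lengths vary.
case class Interval(months: Int, days: Int, microseconds: Long) {
  private val MicrosPerDay: Long = 86400000000L

  // Collapse all three fields into a single duration in microseconds.
  def totalMicros: Long = (months * 30L + days) * MicrosPerDay + microseconds

  def sameEpoch(other: Interval): Boolean = totalMicros == other.totalMicros
}

object EqualitySketch {
  def main(args: Array[String]): Unit = {
    val oneMonth = Interval(1, 0, 0)
    val thirtyDays = Interval(0, 30, 0)
    println(oneMonth == thirtyDays)         // false: case-class equality is field-wise
    println(oneMonth.sameEpoch(thirtyDays)) // true under the 30-day assumption
  }
}
```

Fixing a month to 30 days makes the comparison well-defined but semantically questionable for calendar arithmetic, which is why this part of the proposal was ultimately dropped.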
The change to the equals method in `CalendarInterval` is not required to fix the issue in `ExtractIntervalPart`, but I feel we should fix how two CalendarInterval objects are compared in the equals method.

Why are the changes needed?
The SQL statement below returns 0 instead of 14 days:

```sql
SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp)))
```

Here is why it returns the wrong output:

```scala
val start = Instant.parse("2019-01-01T00:00:00.000000Z")
val end = Instant.parse("2019-01-15T00:00:00.000000Z")
```

The `SubtractTimestamps(Literal(end), Literal(start))` expression returns `CalendarInterval(months=0, days=0, microseconds=1209600000000)`. So, when evaluating the expression below, `ExtractIntervalDays` returns the value of the days field in the interval, which is zero, but the correct answer is 14 days:

```scala
ExtractIntervalDays(SubtractTimestamps(Literal(end), Literal(start)))
```

Does this PR introduce any user-facing change?
No
How was this patch tested?