-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-29679][SQL] Make interval type comparable and orderable #26337
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #112997 has finished for PR 26337 at commit
|
|
retest this please |
MaxGekk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think CalendarIntervalType values could be comparable while the microseconds field is not limited.
| return -1; | ||
| } | ||
| } else { | ||
| return mc; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about interval 1 month 120 days and interval 2 month?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're right, will fix this, thanks.
|
Test build #113008 has finished for PR 26337 at commit
|
|
retest this please |
|
Test build #113001 has finished for PR 26337 at commit
|
|
retest this please |
|
Test build #113014 has finished for PR 26337 at commit
|
|
Test build #113012 has finished for PR 26337 at commit
|
|
Test build #113024 has finished for PR 26337 at commit
|
|
retest this please |
|
Test build #113029 has finished for PR 26337 at commit
|
|
retest this please |
|
Test build #113065 has finished for PR 26337 at commit
|
|
Test build #113103 has finished for PR 26337 at commit
|
|
ping @cloud-fan @maropu @HyukjinKwon @dongjoon-hyun, thanks in advance. |
|
also cc @maropu |
| public static final long MICROS_PER_HOUR = MICROS_PER_MINUTE * 60; | ||
| public static final long MICROS_PER_DAY = MICROS_PER_HOUR * 24; | ||
| public static final long MICROS_PER_WEEK = MICROS_PER_DAY * 7; | ||
| public static final long MICROS_PER_MONTH = MICROS_PER_DAY * 30; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where do we use it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I leave it here by mistake, will delete it
common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java
Show resolved
Hide resolved
|
|
||
| @Override | ||
| public int compare(CalendarInterval that) { | ||
| long thisAdjustDays = this.microseconds / MICROS_PER_DAY + this.days + this.months * 30; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MaxGekk IIRC we have a different definition of days-per-month in Spark?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala
Line 38 in a4382f7
| final val DAYS_PER_MONTH: Byte = 30 |
I just find one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
31 days per months is used at the moment, see
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala
Lines 333 to 343 in a4382f7
| * @param daysPerMonth The number of days per one month. The default value is 31 days | |
| * per month. This value was taken as the default because it is used | |
| * in Structured Streaming for watermark calculations. Having 31 days | |
| * per month, we can guarantee that events are not dropped before | |
| * the end of any month (February with 29 days or January with 31 days). | |
| * @return Duration in the specified time units | |
| */ | |
| def getDuration( | |
| interval: CalendarInterval, | |
| targetUnit: TimeUnit, | |
| daysPerMonth: Int = 31): Long = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
30 is for PostgreSQL compatibility in
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala
Lines 92 to 98 in a4382f7
| def getEpoch(interval: CalendarInterval): Decimal = { | |
| var result = interval.microseconds | |
| result += DateTimeUtils.MICROS_PER_DAY * interval.days | |
| result += MICROS_PER_YEAR * (interval.months / MONTHS_PER_YEAR) | |
| result += MICROS_PER_MONTH * (interval.months % MONTHS_PER_YEAR) | |
| Decimal(result, 18, 6) | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok let's use 30 here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about to extract the code:
var result = interval.microseconds result += DateTimeUtils.MICROS_PER_DAY * interval.days result += MICROS_PER_YEAR * (interval.months / MONTHS_PER_YEAR) result += MICROS_PER_MONTH * (interval.months % MONTHS_PER_YEAR)and reuse it in the comparison? At least, we will have consistent behavior.
It not accurate enough for MICROS_PER_YEAR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your code is not accurate too. 30 * 12 = 360 is slightly far away from average days per year - 365.2425, see #25998
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be accurate, we need to convert everything to microseconds and compare them, right? i.e. interval 30 days is smaller than internal 1 month.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
postgres=# select interval '1 year' = interval '360 days';
?column?
----------
t
(1 row)
postgres=# select interval '1 mon' = interval '30 days';
?column?
----------
t
(1 row)
postgres=# select interval '1 mon' = interval '31 days';
?column?
----------
f
(1 row)
postgres=# select interval '1 year' = interval '365 days';
?column?
----------
f
(1 row)
If we are more likely to keep compatibility with pg, so a year is 360 days not 365 days, a month is 30 days
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Postgres, interval comparison will treat year as 12 months if comparing with month, 360 days if comparing day and 8640 hours if comparing with hours. This is different from date add/subtract, where '1 year' is based on what kind of year it is added to.
This is same as interval division
postgres=# select interval '1 year' / 360;
?column?
-----------
1 0:00:00
(1 row)
postgres=# select interval '1 year' / 365;
?column?
---------------
23:40:16.4064
(1 row)|
Test build #113234 has finished for PR 26337 at commit
|
|
(I looked over this pr and I have no comment except for the current discussing ones) |
|
@maropu thank you very much for your review. |
|
presto> select interval '30' day = interval '1' month;
Query 20191107_145330_00007_f239d failed: line 1:26: '=' cannot be applied to interval day to second, interval year to month
select interval '30' day = interval '1' month
presto> select interval '30' day < interval '1' month;
Query 20191107_150903_00008_f239d failed: line 1:26: '<' cannot be applied to interval day to second, interval year to month
select interval '30' day < interval '1' month
presto>
presto> select interval '30' day < cast(interval '1' month as interval day to second);
Query 20191107_153514_00009_f239d failed: line 1:28: Cannot cast interval year to month to interval day to second
select interval '30' day < cast(interval '1' month as interval day to second)As we are more likely to define interval types and parse them as consistent as PostgreSQL’s approach not other dbs. Option 1: Use a year with a average value of 365.25 days and a month of 30 days to adjust. @cloud-fan @MaxGekk @maropu I hope we can reach an agreement here soon, thanks for your time. |
|
so it's mostly about years. 1) 1 year = 12 month = 360 days, or 2) 1 year = 362.25 days I think option 1 is more consistent. What does pgsql do? |
|
Postgres use option 1 that a year is 360 days and a month is 30 day in interval binary comparison. As the test case I listed here #26337 (comment) |
|
let's go option 1 then |
|
Test build #113392 has finished for PR 26337 at commit
|
| * The internal representation of interval type. | ||
| */ | ||
| public final class CalendarInterval implements Serializable { | ||
| public final class CalendarInterval implements Serializable, Ordered<CalendarInterval> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible to use java orderable? it's a java class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should be able to use java comparable. I will try this.
| SubtractTimestamps(Cast(l, TimestampType), r) | ||
|
|
||
| case b @ BinaryOperator(l @ CalendarIntervalType(), r @ NullType()) => | ||
| b.withNewChildren(Seq(l, Cast(r, CalendarIntervalType))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we move it closer to other CalendarIntervalType cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
|
Test build #113428 has finished for PR 26337 at commit
|
|
retest this please |
2 similar comments
|
retest this please |
|
retest this please |
|
|
||
| -- less than or equal | ||
| select interval '1 minutes' < interval '1 hour'; | ||
| select interval '-1 day' >= interval '-23 hour'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: <= to match the comment
|
Test build #113454 has finished for PR 26337 at commit
|
|
Test build #113459 has finished for PR 26337 at commit
|
| case Divide(l @ CalendarIntervalType(), r @ NumericType()) => | ||
| DivideInterval(l, r) | ||
|
|
||
| case b @ BinaryOperator(l @ CalendarIntervalType(), r @ NullType()) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a little hacky. Maybe we should introduce UnresolvedMultiply and UnresolvedDivide, so that we don't need to hack the type coercion rules.
We can try it in followup.
| @@ -0,0 +1,43 @@ | |||
| -- test for intervals | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now we have a dedicated sql test file for interval, maybe we should put all interval related tests here. We can do it in followup
|
thanks, merging to master! |
### What changes were proposed in this pull request? As we are not going to follow ANSI to implement year-month and day-time interval types, it is weird to compare the year-month part to the day-time part for our current implementation of interval type now. Additionally, the current ordering logic comes from PostgreSQL where the implementation of the interval is messy. And we are not aiming PostgreSQL compliance at all. THIS PR will revert #26681 and #26337 ### Why are the changes needed? make interval type more future-proofing ### Does this PR introduce any user-facing change? there are new in 3.0, so no ### How was this patch tested? existing uts shall work Closes #27262 from yaooqinn/SPARK-30551. Authored-by: Kent Yao <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
interval type support >, >=, <, <=, =, <=>, order by, min,max..
Why are the changes needed?
Part of SPARK-27764 Feature Parity between PostgreSQL and Spark
Does this PR introduce any user-facing change?
yes, we now support compare intervals
How was this patch tested?
add ut