Conversation


@yaooqinn yaooqinn commented Oct 31, 2019

What changes were proposed in this pull request?

Make the interval type support the binary comparators >, >=, <, <=, =, and <=>, as well as ORDER BY and the MIN/MAX aggregates.

Why are the changes needed?

Part of SPARK-27764 Feature Parity between PostgreSQL and Spark

Does this PR introduce any user-facing change?

Yes, intervals can now be compared.

How was this patch tested?

Added unit tests.

@yaooqinn yaooqinn changed the title [SPARK-29679][SQL] Make interval type support binary comparator [SPARK-29679][SQL] Make interval type comparable and orderable Oct 31, 2019

SparkQA commented Oct 31, 2019

Test build #112997 has finished for PR 26337 at commit 74ce207.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn
Member Author

retest this please


@MaxGekk MaxGekk left a comment


Don't think CalendarIntervalType values could be comparable while the microseconds field is not limited.

return -1;
}
} else {
return mc;
Member


How about interval 1 month 120 days and interval 2 month?
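The pitfall this counterexample points at can be sketched in plain Java. The class below is an illustrative stand-in for Spark's `CalendarInterval` (not the real class), and `compareMonthsFirst` mirrors the months-first logic of the diff fragment above; under it, `interval 1 month 120 days` (roughly 150 days) sorts before `interval 2 months` (roughly 60 days), the opposite of their actual durations:

```java
// Illustrative stand-in for Spark's CalendarInterval; not the real class.
final class Interval {
    final int months;
    final int days;
    final long microseconds;

    Interval(int months, int days, long microseconds) {
        this.months = months;
        this.days = days;
        this.microseconds = microseconds;
    }

    // The flawed comparison: order by months first, then days, then microseconds.
    static int compareMonthsFirst(Interval a, Interval b) {
        int mc = Integer.compare(a.months, b.months);
        if (mc != 0) return mc;
        int dc = Integer.compare(a.days, b.days);
        if (dc != 0) return dc;
        return Long.compare(a.microseconds, b.microseconds);
    }

    public static void main(String[] args) {
        Interval a = new Interval(1, 120, 0); // 1 month 120 days: roughly 150 days
        Interval b = new Interval(2, 0, 0);   // 2 months: roughly 60 days
        // Months-first claims a < b even though a spans the longer duration.
        System.out.println(compareMonthsFirst(a, b)); // prints -1
    }
}
```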

Member Author


you're right, will fix this, thanks.


SparkQA commented Oct 31, 2019

Test build #113008 has finished for PR 26337 at commit 39d0665.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn
Member Author

retest this please


SparkQA commented Oct 31, 2019

Test build #113001 has finished for PR 26337 at commit 74ce207.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn
Member Author

retest this please


SparkQA commented Oct 31, 2019

Test build #113014 has finished for PR 26337 at commit 39d0665.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 31, 2019

Test build #113012 has finished for PR 26337 at commit 39d0665.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 31, 2019

Test build #113024 has finished for PR 26337 at commit 4bba921.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn
Member Author

retest this please


SparkQA commented Oct 31, 2019

Test build #113029 has finished for PR 26337 at commit 4bba921.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn
Member Author

yaooqinn commented Nov 1, 2019

retest this please


SparkQA commented Nov 1, 2019

Test build #113065 has finished for PR 26337 at commit 4bba921.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 1, 2019

Test build #113103 has finished for PR 26337 at commit 9c842ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn
Member Author

yaooqinn commented Nov 4, 2019

ping @cloud-fan @maropu @HyukjinKwon @dongjoon-hyun, thanks in advance.

@yaooqinn
Member Author

yaooqinn commented Nov 5, 2019

also cc @maropu

public static final long MICROS_PER_HOUR = MICROS_PER_MINUTE * 60;
public static final long MICROS_PER_DAY = MICROS_PER_HOUR * 24;
public static final long MICROS_PER_WEEK = MICROS_PER_DAY * 7;
public static final long MICROS_PER_MONTH = MICROS_PER_DAY * 30;
Contributor


where do we use it?

Member Author


I left it there by mistake; I will delete it.


@Override
public int compare(CalendarInterval that) {
long thisAdjustDays = this.microseconds / MICROS_PER_DAY + this.days + this.months * 30;
Contributor


@MaxGekk IIRC we have a different definition of days-per-month in Spark?

Member Author



I only found one.

Member

@MaxGekk MaxGekk Nov 5, 2019


31 days per month is used at the moment; see

* @param daysPerMonth The number of days per one month. The default value is 31 days
* per month. This value was taken as the default because it is used
* in Structured Streaming for watermark calculations. Having 31 days
* per month, we can guarantee that events are not dropped before
* the end of any month (February with 29 days or January with 31 days).
* @return Duration in the specified time units
*/
def getDuration(
interval: CalendarInterval,
targetUnit: TimeUnit,
daysPerMonth: Int = 31): Long = {

Member


30 is for PostgreSQL compatibility in

def getEpoch(interval: CalendarInterval): Decimal = {
var result = interval.microseconds
result += DateTimeUtils.MICROS_PER_DAY * interval.days
result += MICROS_PER_YEAR * (interval.months / MONTHS_PER_YEAR)
result += MICROS_PER_MONTH * (interval.months % MONTHS_PER_YEAR)
Decimal(result, 18, 6)
}

Contributor


ok let's use 30 here

Member Author


How about extracting the code:

   var result = interval.microseconds 
   result += DateTimeUtils.MICROS_PER_DAY * interval.days 
   result += MICROS_PER_YEAR * (interval.months / MONTHS_PER_YEAR) 
   result += MICROS_PER_MONTH * (interval.months % MONTHS_PER_YEAR)

and reusing it in the comparison? At least we would have consistent behavior.

It is not accurate enough because of MICROS_PER_YEAR.

Member


Your code is not accurate either. 30 * 12 = 360 is quite far from the average number of days per year, 365.2425; see #25998.

Contributor


To be accurate, we need to convert everything to microseconds and compare them, right? i.e. interval 30 days is smaller than interval 1 month.

Member Author


postgres=# select interval '1 year' = interval '360 days';
 ?column?
----------
 t
(1 row)

postgres=# select interval '1 mon' = interval '30 days';
 ?column?
----------
 t
(1 row)

postgres=# select interval '1 mon' = interval '31 days';
 ?column?
----------
 f
(1 row)

postgres=# select interval '1 year' = interval '365 days';
 ?column?
----------
 f
(1 row)

If we want to keep compatibility with pg, then a year is 360 days (not 365 days) and a month is 30 days.

Member Author


In Postgres, interval comparison treats a year as 12 months when comparing against months, 360 days when comparing against days, and 8640 hours when comparing against hours. This differs from date add/subtract, where '1 year' depends on the kind of year it is added to.

The same holds for interval division:

postgres=# select interval '1 year' / 360;
 ?column?
-----------
 1 0:00:00
(1 row)

postgres=# select interval '1 year' / 365;
   ?column?
---------------
 23:40:16.4064
(1 row)
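The PostgreSQL results above fall out of normalizing both operands to microseconds with fixed factors (1 month = 30 days, hence 1 year = 12 months = 360 days). A minimal Java sketch of that normalization; the class and constant names here are illustrative, not Spark's actual code:

```java
// Illustrative normalization; class and constant names are not Spark's.
final class IntervalCmp {
    static final long MICROS_PER_SECOND = 1_000_000L;
    static final long MICROS_PER_DAY = 24L * 60 * 60 * MICROS_PER_SECOND;
    static final long MICROS_PER_MONTH = 30L * MICROS_PER_DAY; // PostgreSQL-style: 1 month = 30 days

    // Collapse (months, days, microseconds) into one comparable microsecond count.
    static long toMicros(int months, int days, long microseconds) {
        return months * MICROS_PER_MONTH + days * MICROS_PER_DAY + microseconds;
    }

    public static void main(String[] args) {
        // 1 year = 12 months = 360 days under these factors, matching the psql output above.
        System.out.println(toMicros(12, 0, 0) == toMicros(0, 360, 0)); // true  ('t')
        System.out.println(toMicros(12, 0, 0) == toMicros(0, 365, 0)); // false ('f')
    }
}
```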


SparkQA commented Nov 5, 2019

Test build #113234 has finished for PR 26337 at commit 34c794a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Nov 5, 2019

(I looked over this PR and have no comments besides the ones currently under discussion.)

@yaooqinn
Member Author

yaooqinn commented Nov 5, 2019

@maropu thank you very much for your review.

@yaooqinn
Member Author

yaooqinn commented Nov 7, 2019

Oracle, MSSQL, Presto, etc. have two interval types, interval year to month and interval day to second, and the binary comparison operators cannot be applied across the two. I tested this with Presto, with the results below; it fails even with explicit casting.

presto> select interval '30' day = interval '1' month;
Query 20191107_145330_00007_f239d failed: line 1:26: '=' cannot be applied to interval day to second, interval year to month
select interval '30' day = interval '1' month

presto> select interval '30' day < interval '1' month;
Query 20191107_150903_00008_f239d failed: line 1:26: '<' cannot be applied to interval day to second, interval year to month
select interval '30' day < interval '1' month

presto>
presto> select interval '30' day < cast(interval '1' month as interval day to second);
Query 20191107_153514_00009_f239d failed: line 1:28: Cannot cast interval year to month to interval day to second
select interval '30' day < cast(interval '1' month as interval day to second)

As we are more likely to define and parse interval types consistently with PostgreSQL's approach rather than other databases', the options are:

Option 1: when binary-comparing two intervals where the year-month part must be adjusted against the day-second part, follow Postgres as in #26337 (comment) and #26337 (comment) to keep feature parity with Postgres. This is the current PR's approach. It is "accurate" only insofar as intervals alone participate in the calculation; we may need to add a doc to notify users of this behavior.

Option 2: use a year with an average value of 365.25 days and a month of 30 days to adjust.

@cloud-fan @MaxGekk @maropu I hope we can reach an agreement here soon; thanks for your time.

@cloud-fan
Contributor

so it's mostly about years: 1) 1 year = 12 months = 360 days, or 2) 1 year = 365.25 days

I think option 1 is more consistent. What does pgsql do?

@yaooqinn
Member Author

yaooqinn commented Nov 7, 2019

Postgres uses option 1: a year is 360 days and a month is 30 days in interval binary comparison, as in the test cases I listed in #26337 (comment).

@cloud-fan
Contributor

let's go option 1 then


SparkQA commented Nov 7, 2019

Test build #113392 has finished for PR 26337 at commit 2f3d233.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public final class CalendarInterval implements Serializable, Ordered<CalendarInterval>

* The internal representation of interval type.
*/
public final class CalendarInterval implements Serializable {
public final class CalendarInterval implements Serializable, Ordered<CalendarInterval> {
Contributor


is it possible to use Java's Comparable? it's a Java class

Member Author


I should be able to use Java's Comparable. I will try that.
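The java.lang.Comparable route suggested here can look like the following sketch. `Ival` is a hypothetical stand-in for `CalendarInterval`, and the 30-days-per-month normalization matches the option discussed earlier in the thread:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in; Spark's real class is org.apache.spark.unsafe.types.CalendarInterval.
final class Ival implements Comparable<Ival> {
    static final long MICROS_PER_DAY = 24L * 60 * 60 * 1_000_000L;
    final int months;
    final int days;
    final long microseconds;

    Ival(int months, int days, long microseconds) {
        this.months = months;
        this.days = days;
        this.microseconds = microseconds;
    }

    // java.lang.Comparable works from plain Java and Scala alike; no Scala Ordered needed.
    @Override
    public int compareTo(Ival that) {
        long a = this.months * 30L * MICROS_PER_DAY + this.days * MICROS_PER_DAY + this.microseconds;
        long b = that.months * 30L * MICROS_PER_DAY + that.days * MICROS_PER_DAY + that.microseconds;
        return Long.compare(a, b);
    }

    @Override
    public String toString() {
        return months + "m " + days + "d";
    }

    public static void main(String[] args) {
        List<Ival> xs = new ArrayList<>(List.of(new Ival(1, 120, 0), new Ival(2, 0, 0)));
        Collections.sort(xs); // uses compareTo; 2 months (60d) sorts before 1 month 120 days (150d)
        System.out.println(xs); // [2m 0d, 1m 120d]
    }
}
```

Implementing Comparable also lets JVM collections and Spark's ordered operations (sorting, min/max) work on the type without any Scala-specific trait.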

SubtractTimestamps(Cast(l, TimestampType), r)

case b @ BinaryOperator(l @ CalendarIntervalType(), r @ NullType()) =>
b.withNewChildren(Seq(l, Cast(r, CalendarIntervalType)))
Contributor


can we move it closer to other CalendarIntervalType cases?

Member Author


ok


SparkQA commented Nov 8, 2019

Test build #113428 has finished for PR 26337 at commit d9cc157.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public final class CalendarInterval implements Serializable, Comparable<CalendarInterval>

@yaooqinn
Member Author

yaooqinn commented Nov 8, 2019

retest this please

2 similar comments
@maropu
Member

maropu commented Nov 8, 2019

retest this please

@cloud-fan
Contributor

retest this please


-- less than or equal
select interval '1 minutes' < interval '1 hour';
select interval '-1 day' >= interval '-23 hour';
Contributor


nit: <= to match the comment


SparkQA commented Nov 8, 2019

Test build #113454 has finished for PR 26337 at commit d9cc157.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public final class CalendarInterval implements Serializable, Comparable<CalendarInterval>


SparkQA commented Nov 8, 2019

Test build #113459 has finished for PR 26337 at commit 5404d70.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case Divide(l @ CalendarIntervalType(), r @ NumericType()) =>
DivideInterval(l, r)

case b @ BinaryOperator(l @ CalendarIntervalType(), r @ NullType()) =>
Contributor


This is a little hacky. Maybe we should introduce UnresolvedMultiply and UnresolvedDivide, so that we don't need to hack the type coercion rules.

We can try it in a followup.

@@ -0,0 +1,43 @@
-- test for intervals
Contributor


Now we have a dedicated SQL test file for intervals; maybe we should move all interval-related tests there. We can do that in a followup.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in e026412 Nov 8, 2019
@yaooqinn yaooqinn deleted the SPARK-29679 branch November 27, 2019 11:45
cloud-fan pushed a commit that referenced this pull request Jan 19, 2020
### What changes were proposed in this pull request?

As we are not going to follow ANSI to implement year-month and day-time interval types, it is weird to compare the year-month part to the day-time part for our current implementation of interval type now.

Additionally, the current ordering logic comes from PostgreSQL, where the implementation of intervals is messy, and we are not aiming for PostgreSQL compliance at all.

This PR reverts #26681 and #26337.

### Why are the changes needed?

Make the interval type more future-proof.

### Does this PR introduce any user-facing change?

No; these changes are new in 3.0 and have not been released yet.

### How was this patch tested?

Existing unit tests should cover this.

Closes #27262 from yaooqinn/SPARK-30551.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>