
Conversation

@sathyaprakashg

### What changes were proposed in this pull request?

ExtractIntervalPart change

The ExtractIntervalPart expressions get the interval parts from CalendarInterval, which has three fields: months, days and microseconds.

To get the days, the ExtractIntervalDays expression simply returns the days field of the provided CalendarInterval input.

If the input is CalendarInterval(months=2, days=10, microseconds=0), it returns 10, but it should instead return the total days in the interval; in this case, 70 days.

CalendarInterval change

Another change I am proposing is the equality comparison between two CalendarInterval objects.
Right now, CalendarInterval(months=1, days=0, microseconds=0) and CalendarInterval(months=0, days=30, microseconds=0) are not equal.

I understand a month can have 30 or 31 days, but there should be a way to compare two intervals: even if the three fields hold different values, if both represent the same duration they should be considered equal.

Changing the equals method in CalendarInterval is not required to fix the issue in ExtractIntervalPart, but I feel we should also fix how two CalendarInterval objects are compared.
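As a sketch of the idea (hypothetical Python, not the actual CalendarInterval.equals; it hard-codes the contested 1 month = 30 days assumption):

```python
# Hypothetical sketch of the proposed duration-based equality, NOT
# Spark's actual CalendarInterval.equals. It assumes 1 month = 30 days,
# which is an assumption, not something the ANSI standard defines.
MICROS_PER_DAY = 24 * 60 * 60 * 1_000_000
DAYS_PER_MONTH = 30  # assumed conversion

def interval_micros(months, days, microseconds):
    # Collapse the three fields into a single duration in microseconds.
    return (months * DAYS_PER_MONTH + days) * MICROS_PER_DAY + microseconds

def intervals_equal(a, b):
    return interval_micros(*a) == interval_micros(*b)

# Under this rule the two intervals from the example above compare equal:
print(intervals_equal((1, 0, 0), (0, 30, 0)))  # True
```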

### Why are the changes needed?

The SQL statement below returns 0 instead of 14 days:
SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp)))

Below is why it returns the wrong output:

val start = Instant.parse("2019-01-01T00:00:00.000000Z")
val end = Instant.parse("2019-01-15T00:00:00.000000Z")

SubtractTimestamps(Literal(end), Literal(start)) expression returns CalendarInterval(months=0, days=0, microseconds=1209600000000)

So, when evaluating the expression below, ExtractIntervalDays returns the value of the days field in the interval, which is zero, but the correct answer is 14 days:
ExtractIntervalDays(SubtractTimestamps(Literal(end), Literal(start)))
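For intuition, the arithmetic can be checked with a standalone sketch (Python, illustration only, not Spark code):

```python
# The subtraction above stores the entire 14-day gap in microseconds,
# so the days field alone is 0 while the total duration is 14 days.
MICROS_PER_DAY = 24 * 60 * 60 * 1_000_000  # 86_400_000_000

days, microseconds = 0, 14 * MICROS_PER_DAY
print(days)                                   # 0  (what ExtractIntervalDays returns today)
print(days + microseconds // MICROS_PER_DAY)  # 14 (the expected answer)
```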

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

  • Additional + existing unit tests

@AmplabJenkins

Can one of the admins verify this patch?

@sathyaprakashg
Author

@cloud-fan @MaxGekk @yaooqinn I am looking for help reviewing this PR, created two weeks ago. Since you were involved in a PR related to a similar change (https://issues.apache.org/jira/browse/SPARK-31469), I am tagging you to see if you can help review it.

Since this is my first PR, please bear with me if I missed anything. I am happy to get guidance to improve it.

@MaxGekk
Member

MaxGekk commented Apr 28, 2020

@sathyaprakashg Please, take a look at the PRs
#26337
#27262

@sathyaprakashg
Author

> @sathyaprakashg Please, take a look at the PRs
> #26337
> #27262

Thanks @MaxGekk for the prompt reply. The CalendarInterval change is not required to fix the issue; I can revert that proposed change.

How does my proposed change for ExtractIntervalPart look? If it looks good, I will update my PR to include only the ExtractIntervalPart change.

We need to change ExtractIntervalPart so that the query below returns 14 instead of 0. Please refer to "Why are the changes needed?" above for more information.

SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp)))

@cloud-fan
Contributor

cc @yaooqinn can you take a look? This seems like a hard problem as we have a non-standard interval definition. It's interesting to see what results other systems return, like presto, hive, snowflake, etc.

@yaooqinn
Member

Checked PostgreSQL (not an ANSI interval type) and Presto (ANSI); both of them return the proper days.

@yaooqinn
Member

Presto returns day-time intervals only for timestamp subtractions, so I think we can handle this change with our own CalendarInterval.

BTW, it seems both interval extraction and timestamp subtraction are newly added in 3.0.0? I am +1 for returning both days and microseconds.

@cloud-fan
Contributor

I think extracting days should return `interval.days + interval.microseconds / micros_per_day`, to simulate the day-time interval semantics. But we shouldn't consider months in this case.

@yaooqinn can you send a PR to make this happen?

@sathyaprakashg we are trying hard to avoid assuming 1 month = 30 days. This makes intervals not comparable, but we need to see how other systems implement it. I know pgsql assumes 1 month = 30 days; can you check other systems like Presto, Hive, Snowflake, etc.?
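The suggested rule can be sketched quickly (Python for illustration only; the real change belongs in Spark's Scala interval code, and division semantics for negative intervals would need care):

```python
# Illustrative sketch (not Spark's implementation) of the suggested day
# extraction: the days field plus the whole days carried in microseconds.
# Months are deliberately ignored.
MICROS_PER_DAY = 24 * 60 * 60 * 1_000_000

def extract_days(months, days, microseconds):
    # Note: Python's // floors, while Scala/Java integer division
    # truncates toward zero, so negative intervals would differ slightly.
    return days + microseconds // MICROS_PER_DAY

print(extract_days(0, 0, 14 * MICROS_PER_DAY))  # 14
print(extract_days(2, 10, 0))                   # 10 (months not counted)
```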

@yaooqinn
Member

@cloud-fan, nice suggestion, I get your point. I will make a PR then.

@sathyaprakashg
Author

sathyaprakashg commented Apr 29, 2020

@cloud-fan Here are the outputs of the month difference between two timestamps returned by different databases.

SQL Server:
Statement: SELECT DATEDIFF(month, '2020-01-15 00:00:00', '2021-01-15 00:00:00')
Result : 12

Statement: SELECT DATEDIFF(month, '2020-01-16 00:00:00', '2021-01-15 00:00:00')
Result : 12

Statement: SELECT DATEDIFF(month, '2020-01-31 00:00:00', '2021-01-15 00:00:00')
Result : 12

Statement: SELECT DATEDIFF(month, '2020-02-01 00:00:00', '2021-01-15 00:00:00')
Result : 11

Presto:
Statement: SELECT date_diff('month', cast('2020-01-15 00:00:00' as timestamp),cast('2021-01-15 00:00:00' as timestamp) )
Result : 12

Statement: SELECT date_diff('month', cast('2020-01-16 00:00:00' as timestamp),cast('2021-01-15 00:00:00' as timestamp) )
Result : 11

Postgres:
Statement: SELECT EXTRACT(YEAR FROM age) * 12 + EXTRACT(MONTH FROM age) AS months_between FROM age(TIMESTAMP '2021-01-15 00:00:00', TIMESTAMP '2020-01-15 00:00:00') AS t(age)
Result : 12

Statement: SELECT EXTRACT(YEAR FROM age) * 12 + EXTRACT(MONTH FROM age) AS months_between FROM age(TIMESTAMP '2021-01-15 00:00:00', TIMESTAMP '2020-01-16 00:00:00') AS t(age)
Result : 11

MySQL:
Statement: SELECT TIMESTAMPDIFF(MONTH, '2020-01-15 00:00:00', '2021-01-15 00:00:00')
Result : 12

Statement: SELECT TIMESTAMPDIFF(MONTH, '2020-01-16 00:00:00', '2021-01-15 00:00:00')
Result : 11

In summary, it looks like none of the above databases assumes 30 days per month, since 365 days does not yield 12 months.

SQL Server simply diffs the month components of the two dates, regardless of the day values.

All the other databases, however, seem to check whether the day-of-month of date2 has reached the day-of-month of date1 before counting that month.

Oracle and Hive have only the MONTHS_BETWEEN function, so I didn't include them in the analysis.
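The day-of-month rule most of these systems appear to follow can be modeled roughly (illustrative Python; it ignores time-of-day and is not any database's actual implementation):

```python
from datetime import datetime

def month_diff(start, end):
    # Count whole months; only count the final month once the end's
    # day-of-month has reached the start's day-of-month.
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1
    return months

print(month_diff(datetime(2020, 1, 15), datetime(2021, 1, 15)))  # 12
print(month_diff(datetime(2020, 1, 16), datetime(2021, 1, 15)))  # 11
```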

@sathyaprakashg
Author

sathyaprakashg commented Apr 29, 2020

@cloud-fan
Spark gives 0 for the same query:
SELECT EXTRACT(MONTH FROM (cast('2021-01-15 00:00:00' as timestamp) - cast('2020-01-15 00:00:00' as timestamp)))

Similar to getDays, we need to adjust the getMonths function to consider interval.microseconds as well. But assuming 30 days per month would give wrong results.

Any suggestions?
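To illustrate why a flat 30-days-per-month conversion disagrees with the calendar-based answers above (hypothetical naive_months helper, Python for illustration):

```python
from datetime import datetime

DAYS_PER_MONTH = 30  # the naive assumption under discussion

def naive_months(start, end):
    # Convert the elapsed days to months at a flat 30 days each.
    return (end - start).days // DAYS_PER_MONTH

# 2020-01-16 .. 2021-01-15 spans 365 days:
print(naive_months(datetime(2020, 1, 16), datetime(2021, 1, 15)))  # 12
# ...but Presto, MySQL and PostgreSQL all report 11 months for this pair.
```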

@cloud-fan
Contributor

Can we check whether the EXTRACT function behaves consistently in other systems? Otherwise it's hard to justify whether the behavior of EXTRACT should follow MONTHS_BETWEEN or date_diff.

@sathyaprakashg
Author

@cloud-fan Unfortunately, there is no date_diff to find the month difference in Hive, so I used the MONTHS_BETWEEN function.

cloud-fan pushed a commit that referenced this pull request Apr 29, 2020
…ays + days in interval.microsecond

### What changes were proposed in this pull request?

With suggestion from cloud-fan #28222 (comment)

I checked both Presto and PostgreSQL; one implements intervals with ANSI-style year-month/day-time types, and the other is mixed and non-ANSI. Both add the whole days carried in the interval's time part to the total days when extracting day from interval values.

```sql

presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp)));
_col0
-------
14
(1 row)

Query 20200428_135239_00000_ahn7x, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:01 [0 rows, 0B] [0 rows/s, 0B/s]

presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:01' as timestamp)));
_col0
-------
13
(1 row)

Query 20200428_135246_00001_ahn7x, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

presto>

```

```sql

postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp)));
date_part
-----------
14
(1 row)

postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:01' as timestamp)));
date_part
-----------
13

```

```
spark-sql> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:01' as timestamp)));
0
spark-sql> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp)));
0
```

In the ANSI standard, a day is exactly 24 hours, so we don't need to worry about the conceptual day for interval extraction. The meaning of the conceptual day only takes effect when we add it to a zoned timestamp value.

### Why are the changes needed?

This satisfies both the ANSI standard and common use cases in modern SQL platforms.

### Does this PR introduce any user-facing change?

No, it is new in 3.0.
### How was this patch tested?

Added more unit tests.

Closes #28396 from yaooqinn/SPARK-31597.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Apr 29, 2020
…ays + days in interval.microsecond
(cherry picked from commit ea525fe)
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan
Contributor

Spark has the MONTHS_BETWEEN function as well, and its behavior is the same as Hive's.

Can you look at MySQL, Oracle, SQL Server, etc. to see if they support EXTRACT?

@sathyaprakashg
Author

@cloud-fan I updated SQL Server and MySQL also in my original comment. Hive and Oracle only support MONTHS_BETWEEN function.

Please let me know your suggestion on how to handle this in spark. If possible, I would like to take this opportunity to contribute to get this fixed in spark :)

@cloud-fan
Contributor

OK, it seems only pgsql has EXTRACT. Can we check the behavior of extract(month ...) in pgsql? I can't tell it from SELECT EXTRACT(YEAR FROM age) * 12 + EXTRACT(MONTH FROM age) AS months_between FROM ...

@sathyaprakashg
Author

sathyaprakashg commented Apr 29, 2020

Below is the output of postgresql.

SELECT EXTRACT(YEAR FROM age) year_part, 
EXTRACT(MONTH FROM age) month_part,
EXTRACT(DAY FROM age) day_part,
EXTRACT(hour FROM age) hour_part,
EXTRACT(minute FROM age) minute_part,
EXTRACT(second FROM age) second_part
FROM age(TIMESTAMP '2021-03-16 10:09:07', TIMESTAMP '2020-01-15 00:00:00') AS t(age)

| year_part | month_part | day_part | hour_part | minute_part | second_part |
|---|---|---|---|---|---|
| 1 | 2 | 1 | 10 | 9 | 7 |

select age(TIMESTAMP '2021-03-15 10:09:07', TIMESTAMP '2020-01-15 00:00:00')

age
420.10:09:07

Output translates to 420 days, 10 hours, 9 minutes and 7 seconds.

@cloud-fan
Contributor

The same query fails in Presto:

presto> select extract(month from interval '40' hour);
Query 20200428_193926_00006_ahn7x failed: Unexpected parameters (interval day to second) for function month. Expected: month(timestamp) , month(timestamp with time zone) , month(date) , month(interval year to month)

I don't have a good idea here as Spark's interval type is non-standard. Returning 0 seems reasonable as well.

@yaooqinn
Member

> Below is the output of PostgreSQL.
>
> SELECT EXTRACT(YEAR FROM age) year_part,
> EXTRACT(MONTH FROM age) month_part,
> EXTRACT(DAY FROM age) day_part,
> EXTRACT(hour FROM age) hour_part,
> EXTRACT(minute FROM age) minute_part,
> EXTRACT(second FROM age) second_part
> FROM age(TIMESTAMP '2021-03-16 10:09:07', TIMESTAMP '2020-01-15 00:00:00') AS t(age)
>
> | year_part | month_part | day_part | hour_part | minute_part | second_part |
> |---|---|---|---|---|---|
> | 1 | 2 | 1 | 10 | 9 | 7 |
>
> select age(TIMESTAMP '2021-03-15 10:09:07', TIMESTAMP '2020-01-15 00:00:00')
>
> age
> 420.10:09:07
>
> Output translates to 420 days, 10 hours, 9 minutes and 7 seconds.

I think the problem is not about the extract function but about the age expression it relies on.

The extract function itself should follow the natural rules of the Gregorian calendar and achieve its semantics based on the output of the extract source.

If the age expression can result in a proper CalendarInterval value that contains the right months, days, and microseconds, then the extract function can work as you expected.

The Spark project currently doesn't have an age function, so I quickly implemented one in my personal project https://github.com/yaooqinn/spark-func-extras

Then we can get the proper results, as you did above:

https://github.com/yaooqinn/spark-func-extras/blob/master/src/test/scala/org/apache/spark/sql/extra/PostgreSQLExtensionsTest.scala#L127-L150

test("age") {
    checkAnswer(
      sql("select age(timestamp '2001-04-10',  timestamp '1957-06-13')"),
      Seq(Row(new CalendarInterval(525, 28, 0)))
    )

    checkAnswer(
      sql(
        """
          |SELECT EXTRACT(YEAR FROM age) year_part,
          |EXTRACT(MONTH FROM age) month_part,
          |EXTRACT(DAY FROM age) day_part,
          |EXTRACT(hour FROM age) hour_part,
          |EXTRACT(minute FROM age) minute_part,
          |EXTRACT(second FROM age) second_part
          |FROM values (age(TIMESTAMP '2021-03-16 10:09:07', TIMESTAMP '2020-01-15 00:00:00')) AS t(age)
          |""".stripMargin),
        Seq(Row(1, 2, 1, 10, 9, Decimal(7000000, 8, 6).toJavaBigDecimal))
    )

    checkAnswer(
      sql("select age(date 'today')"),
      Seq(Row(new CalendarInterval(0, 0, 0)))
    )
  }

In other words, if the age function had a result type of Integer representing just the years, it would quickly fail the EXTRACT with an analysis error here.

One more thing: if you really want to justify a month as 30 days, there are justifyDays, justifyHours and justifyInterval functions in that project.

@sathyaprakashg
Author

Thanks @cloud-fan @yaooqinn for the great discussion on this.

@yaooqinn has fixed getDays in the PR below, and for getMonths we can agree that keeping the existing behavior is appropriate.

#28396

I have another quick question before we close this PR.

SubtractTimestamps seems to always populate only the microseconds field of CalendarInterval:

new CalendarInterval(0, 0, end.asInstanceOf[Long] - start.asInstanceOf[Long])

Whereas SubtractDates populates the months field along with the days field:

DateTimeUtils.subtractDates(leftDays.asInstanceOf[Int], rightDays.asInstanceOf[Int])

My question is whether we should change SubtractTimestamps to also populate the months and days fields of CalendarInterval appropriately, or whether you feel the existing implementation is fine.

@sathyaprakashg
Author

@cloud-fan @yaooqinn
I would like to get your opinion on my question related to SubtractTimestamps in the discussion above.

Based on that, we can decide whether to close this PR.

@yaooqinn
Member

Hi @sathyaprakashg
This could be an API change and may cause a performance regression for this operator. IMHO, we would need much stronger evidence to make the change, e.g. the SQL standard, or the behavior of timestamp - timestamp in as many other modern DBMS systems as possible. Otherwise, returning only the micros part may be Spark-specific, but it is efficient and causes no ambiguity.
