[SPARK-38195][SQL] Add the TIMESTAMPADD() function #35502

MaxGekk wants to merge 24 commits into apache:master

Conversation
@srielau @entong @superdupershant Any feedback is welcome.
|THEN|reserved|non-reserved|reserved|
|TIME|reserved|non-reserved|reserved|
|TIMESTAMP|non-reserved|non-reserved|non-reserved|
|TIMESTAMPADD|non-reserved|non-reserved|non-reserved|
Do you want to add DATEADD as an alias for the same function in this PR?
Let's discuss overloading of DATEADD in a separate JIRA. This is arguable, and we need to reach a consensus, from my point of view.
Merging to master. Thank you, @gengliangwang @HyukjinKwon and @superdupershant for review.
| case "YEAR" => | ||
| timestampAddMonths(micros, quantity * MONTHS_PER_YEAR, zoneId) | ||
| case _ => | ||
| throw QueryExecutionErrors.invalidUnitInTimestampAdd(unit) |
Shall we check the unit names in the parser? To fail earlier.
We can, but I just wonder what kind of problem the earlier check would solve. The parser and compiler already do a lot of work; adding more unnecessary things should be motivated somehow, from my point of view.
Failing earlier is a pretty strong reason, right? It's a waste of resources if we submit a Spark job that fails with a wrong unit name.
> It's a waste of resources if we submit a Spark job that fails with a wrong unit name.
Not sure. Can you imagine a cluster of 1000 executors waiting for the driver that is still processing a query, because we eagerly want to check everything even when users' queries and data don't have any issues? This is a real waste of users' resources.
Also I would like to add that what you are talking about is actually a mistake in a query, like:

SELECT timestampadd(YEER, 1, timestampColumn);

Such mistakes are not permanent, and users usually fix them during the debugging stage. There are not so many reasons to double-check such mistakes both at parsing and at runtime (we must do the runtime check since unit can be non-foldable).
I don't understand why we suddenly want to stop doing it from this PR.
- The `unit` param can be non-foldable. I made it generic intentionally. If you wonder why, I will answer that separately.
- As `unit` can be non-foldable, we need the runtime check.
- If we add checks in the parser, we will do the checks twice, at parsing and at execution... which is not necessary because
- we can handle a foldable `unit` in codegen as an optimization, where we (of course) have to check `unit` values at the optimization phase.
In summary, taking into account that we will optimize a foldable `unit` in codegen in the near future, where we will validate the correctness of `unit`, there is no need to do that in the parser as you proposed.
Example: EXTRACT, TO_BINARY, TO_NUMBER
These expressions require one of their params (format, field, etc.) to always be foldable. In the case of TIMESTAMPADD(), this is an unnecessary restriction, I believe. I have faced the situation a few times in my life when some code was deployed in production after testing, and we needed to increase the precision of intervals. Let's say we had:

`select timestampadd(SECOND, tbl.quantity, tbl.ts1)`, and we want to bump the precision of `tbl.quantity` to milliseconds. Since `quantity` is a column in a table, we can just multiply it by 1000 during a maintenance window, but what should we do with `SECOND`? We would have to re-deploy the code, including passing the whole release cycle, only because a Spark dev forced us to hard-code `SECOND` in our code, for some unclear reasons.
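For illustration, here is a sketch of how a non-foldable unit keeps that precision change a data-only migration (the table and the `unit_col` column are hypothetical):

```sql
-- Before: quantity is stored in seconds and the unit is hard-coded in the query.
select timestampadd(SECOND, tbl.quantity, tbl.ts1) from tbl;

-- After a data-only migration: quantity was multiplied by 1000 and the unit is
-- read from a column, so no code re-deployment is needed.
select timestampadd(tbl.unit_col, tbl.quantity, tbl.ts1) from tbl;
```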
I'm totally OK with your decision if the unit parameter can be non-foldable, but it seems that's not the case? We even added a special parse rule for this function so that the unit parameter is an identifier.
> so that the unit parameter is an identifier.
It can be an identifier or a string column, see:
spark-sql> create table tbl as select 'SECOND' as u, 1 as q, timestamp'now' as t;
spark-sql> select * from tbl;
SECOND	1	2022-02-18 18:33:34.939
spark-sql> select timestampadd(tbl.u, q, t) from tbl;
2022-02-18 18:33:35.939

or

spark-sql> select timestampadd('HOUR', 1, timestamp'now');
2022-02-18 19:38:54.817
Technically speaking, the first argument, unit, should be a datetime interval type, as in what's used with EXTRACT. It isn't meant to be a string, if that makes things any simpler.
> the first argument, unit, should be a datetime interval type
I didn't get your point. How could it be the interval type?
> ... as in what's used with EXTRACT
I just wonder why you linked TIMESTAMPADD to EXTRACT but not to TIMESTAMPDIFF, for example. Anyway, technically speaking, the type of the first argument is the same - the string type.
> ... makes things any simpler.
This PR achieves that goal, I believe. It makes the migration process to Spark SQL simpler, and gives additional benefits for using Spark SQL in real production (see my comment above).
select timestampadd('MONTH', -1, timestamp'2022-02-14 01:02:03');
select timestampadd(MINUTE, 58, timestamp'2022-02-14 01:02:03');
select timestampadd(YEAR, 1, date'2022-02-15');
select timestampadd('SECOND', -1, date'2022-02-15');
can we have some negative tests? e.g. invalid unit name, overflow, etc.
The test for an invalid unit name is in this PR; see the test for the error class.
Regarding overflow: actually, I reused the methods for adding ANSI intervals to timestamps. I think we should test overflow for both ANSI intervals and timestampadd(). I will open a JIRA for that.
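For illustration, negative tests of that kind could look roughly like the sketch below (the queries and expected failure modes are assumptions, not the tests added in this PR):

```sql
-- invalid unit name: should fail with the invalid-unit error class
select timestampadd(YEER, 1, timestamp'2022-02-14 01:02:03');

-- overflow: adding a huge number of years should overflow the timestamp range
select timestampadd(YEAR, 1000000, timestamp'2022-02-14 01:02:03');
```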
 */
override def visitTimestampadd(ctx: TimestampaddContext): Expression = withOrigin(ctx) {
  val arguments = Seq(
    Literal(ctx.unit.getText),
@MaxGekk I think this indicates the unit parameter must be foldable?
This is not the only entry point for timestampadd, so the word must is not applicable. See the example above, #35502 (comment): it wouldn't work if the unit parameter had to be foldable. BTW, I will open a JIRA to implement the optimization for the case when unit is foldable.
Ah I see, then I think the current implementation is fine.
…TIMESTAMPADD()`

### What changes were proposed in this pull request?
In the PR, I propose to add two aliases for the `TIMESTAMPADD()` function introduced by #35502:
- `DATEADD()`
- `DATE_ADD()`

### Why are the changes needed?
1. To make the migration process from other systems to Spark SQL easier.
2. To achieve feature parity with other DBMSs.

### Does this PR introduce _any_ user-facing change?
No. The new aliases just extend Spark SQL API.

### How was this patch tested?
1. By running the existing test suites:
```
$ build/sbt "test:testOnly *SQLKeywordSuite"
```
2. and new checks:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z date.sql"
```

Closes #35661 from MaxGekk/dateadd.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
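A sketch of how those aliases would be used, assuming they behave exactly like `TIMESTAMPADD()` (the results shown are illustrative):

```sql
spark-sql> select dateadd(MONTH, -1, timestamp'2022-02-14 01:02:03');
2022-01-14 01:02:03
spark-sql> select date_add(MINUTE, 58, timestamp'2022-02-14 01:02:03');
2022-02-14 02:00:03
```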
What changes were proposed in this pull request?
In the PR, I propose to add a new function `TIMESTAMPADD` with the following parameters:
- `unit` - specifies the unit of the interval. It can be a string or an identifier. The following values are supported (case-insensitive): `YEAR`, `QUARTER`, `MONTH`, `WEEK`, `DAY`, `HOUR`, `MINUTE`, `SECOND`, `MILLISECOND`, `MICROSECOND`.
- `quantity` - the amount of `unit`s to add. It has the `INT` type. It can be positive or negative.
- `timestamp` - a timestamp (w/ or w/o timezone) to which you want to add the interval.

The function returns the original timestamp plus the given interval. The result has the same type as the input `timestamp` (for `timestamp_ntz` it returns `timestamp_ntz`, and for `timestamp_ltz` -> `timestamp_ltz`).

For example:
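A minimal sketch of the usage (the results shown are illustrative):

```sql
spark-sql> select timestampadd(HOUR, 1, timestamp'2022-02-14 01:02:03');
2022-02-14 02:02:03
spark-sql> select timestampadd('MONTH', -1, timestamp'2022-02-14 01:02:03');
2022-01-14 01:02:03
```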
Note: if the `timestamp` has the type `timestamp_ltz`, and `unit` is:
- `YEAR`, `QUARTER`, `MONTH` - the function converts the timestamp to a local timestamp at the session time zone (see the SQL config `spark.sql.session.timeZone`). After that, the function adds the amount of months to the local timestamp, and converts the result to a `timestamp_ltz` at the same session time zone.
- `WEEK`, `DAY` - in a similar way as above, the function adds the total amount of days to the timestamp at the session time zone.
- `HOUR`, `MINUTE`, `SECOND`, `MILLISECOND`, `MICROSECOND` - the function converts the interval to the total amount of microseconds, and adds them to the given timestamp (expressed as an offset from the epoch).

For example, Sun 13-Mar-2022 at 02:00:00 A.M. is when daylight saving time starts in the `America/Los_Angeles` time zone; see the sketch below. In fact, such behavior is similar to adding an ANSI interval to a timestamp.
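A sketch of the behavior around that DST gap, assuming the session time zone is set to `America/Los_Angeles` (the output is illustrative):

```sql
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark-sql> select timestampadd(HOUR, 1, timestamp'2022-03-13 01:30:00');
2022-03-13 03:30:00
```

Adding one hour of physical time jumps over the non-existent local hour between 02:00 and 03:00, so the local result is 03:30 rather than 02:30.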
The function also supports implicit conversion of the input date to a timestamp according to the general rules of Spark SQL. By default, Spark SQL converts dates to timestamps (which are `timestamp_ltz` by default).
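For instance, one of the queries from the new tests, with an illustrative result (the date is implicitly cast to a timestamp first):

```sql
spark-sql> select timestampadd('SECOND', -1, date'2022-02-15');
2022-02-14 23:59:59
```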
Why are the changes needed?
To make the migration process from other systems to Spark SQL easier, and to achieve feature parity with other DBMSs (see the discussion above).
Does this PR introduce any user-facing change?
No. This is a new feature.
How was this patch tested?
By running new tests: