[SPARK-38195][SQL] Add the TIMESTAMPADD() function #35502

MaxGekk wants to merge 24 commits into apache:master

Conversation
@srielau @entong @superdupershant Any feedback is welcome.
|THEN|reserved|non-reserved|reserved|
|TIME|reserved|non-reserved|reserved|
|TIMESTAMP|non-reserved|non-reserved|non-reserved|
|TIMESTAMPADD|non-reserved|non-reserved|non-reserved|
Do you want to add DATEADD as an alias for the same function in this PR?
Let's discuss overloading of DATEADD in a separate JIRA. This is arguable, and we need to reach a consensus, from my point of view.
Merging to master. Thank you, @gengliangwang @HyukjinKwon and @superdupershant for review.
| case "YEAR" => | ||
| timestampAddMonths(micros, quantity * MONTHS_PER_YEAR, zoneId) | ||
| case _ => | ||
| throw QueryExecutionErrors.invalidUnitInTimestampAdd(unit) |
Shall we check the unit names in the parser? To fail earlier.
We can, but I just wonder what kind of problem the earlier check would solve. The parser and compiler already do a lot of work; adding more unnecessary things should be motivated somehow, from my point of view.
Failing earlier is a pretty strong reason, right? It's a waste of resources if we submit a Spark job that fails with a wrong unit name.
> It's a waste of resources if we submit a Spark job that fails with a wrong unit name.
Not sure. Can you imagine a cluster of 1000 executors waiting for the driver that is still processing a query, because we eagerly want to check everything even when users' queries and data don't have any issues? This is a real waste of users' resources.
Also I would like to add that what you are talking about is actually a mistake in a query, like:

SELECT timestampadd(YEER, 1, timestampColumn);

Such mistakes are not permanent, and users usually fix them during the debugging stage. There are not so many reasons to double-check such mistakes both at parsing and at runtime (we must do the runtime check since unit can be non-foldable).
I don't understand why we suddenly want to stop doing it from this PR.
- The `unit` param can be non-foldable. I made it generic intentionally. If you wonder why, I will answer that separately.
- As `unit` can be non-foldable, we need the runtime check.
- If we add checks in the parser, we will do the checks twice, at parsing and at execution... which is not necessary because
- we can handle a foldable `unit` in codegen as an optimization, where we (of course) have to check `unit` values at the optimization phase.
In summary, taking into account that we will optimize a foldable `unit` in codegen in the near future, where we will validate the correctness of `unit`, there is no need to do that in the parser as you proposed.
Example: EXTRACT, TO_BINARY, TO_NUMBER
These expressions require one of their params (format, field, etc.) to always be foldable. In the case of TIMESTAMPADD(), this is an unnecessary restriction, I believe. I have faced the situation a few times in my life when some code was deployed in production after testing, and we needed to increase the precision of intervals. Let's say we had:

`select timestampadd(SECOND, tbl.quantity, tbl.ts1)`, and we want to bump the precision of `tbl.quantity` to milliseconds. Since `quantity` is a column in a table, we can just multiply it by 1000 during a maintenance window, but what should we do with `SECOND`? We would have to re-deploy the code, including passing the whole release cycle, only because a Spark dev forced us to hard-code `SECOND` in our code, for some unclear reasons.
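For illustration, here is a sketch of how a non-foldable unit keeps that precision change a data-only migration (the table and the `unit_col` column are hypothetical):

```sql
-- Before: quantity is stored in seconds and the unit is hard-coded in the query.
select timestampadd(SECOND, tbl.quantity, tbl.ts1) from tbl;

-- After a data-only migration: quantity was multiplied by 1000 and the unit is
-- read from a column, so no code re-deployment is needed.
select timestampadd(tbl.unit_col, tbl.quantity, tbl.ts1) from tbl;
```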
I'm totally OK with your decision if the unit parameter can be non-foldable, but it seems that's not the case? We even added a special parse rule for this function so that the unit parameter is an identifier.
> so that the unit parameter is an identifier.
It can be an identifier or a string column, see:
spark-sql> create table tbl as select 'SECOND' as u, 1 as q, timestamp'now' as t;
spark-sql> select * from tbl;
SECOND	1	2022-02-18 18:33:34.939
spark-sql> select timestampadd(tbl.u, q, t) from tbl;
2022-02-18 18:33:35.939

or

spark-sql> select timestampadd('HOUR', 1, timestamp'now');
2022-02-18 19:38:54.817
Technically speaking, the first argument, unit, should be a datetime interval type, as in what's used with EXTRACT. It isn't meant to be a string, if that makes things any simpler.
> the first argument, unit, should be a datetime interval type
I didn't get your point. How could it be the interval type?
> ... as in what's used with EXTRACT
I just wonder why you linked TIMESTAMPADD to EXTRACT but not to TIMESTAMPDIFF, for example. Anyway, technically speaking, the type of the first argument is the same - the string type.
> ... makes things any simpler.
This PR achieves that goal, I believe. It makes the migration process to Spark SQL simpler, and gives additional benefits for using Spark SQL in real production (see my comment above).
select timestampadd('MONTH', -1, timestamp'2022-02-14 01:02:03');
select timestampadd(MINUTE, 58, timestamp'2022-02-14 01:02:03');
select timestampadd(YEAR, 1, date'2022-02-15');
select timestampadd('SECOND', -1, date'2022-02-15');
can we have some negative tests? e.g. invalid unit name, overflow, etc.
The test for an invalid unit name is in this PR; see the test for the error class.
Regarding overflow: actually, I reused the methods for adding ANSI intervals to timestamps. I think we should test overflow for both ANSI intervals and timestampadd(). I will open a JIRA for that.
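For illustration, negative tests of that kind could look roughly like the sketch below (the queries and expected failure modes are assumptions, not the tests added in this PR):

```sql
-- invalid unit name: should fail with the invalid-unit error class
select timestampadd(YEER, 1, timestamp'2022-02-14 01:02:03');

-- overflow: adding a huge number of years should overflow the timestamp range
select timestampadd(YEAR, 1000000, timestamp'2022-02-14 01:02:03');
```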
 */
override def visitTimestampadd(ctx: TimestampaddContext): Expression = withOrigin(ctx) {
  val arguments = Seq(
    Literal(ctx.unit.getText),
@MaxGekk I think this indicates the unit parameter must be foldable?
This is not the only entry point for timestampadd, so the word must is not applicable. See the example above, #35502 (comment): it wouldn't work if the unit parameter had to be foldable. BTW, I will open a JIRA to implement the optimization for the case when unit is foldable.
Ah I see, then I think the current implementation is fine.
…TIMESTAMPADD()`

### What changes were proposed in this pull request?
In the PR, I propose to add two aliases for the `TIMESTAMPADD()` function introduced by #35502:
- `DATEADD()`
- `DATE_ADD()`

### Why are the changes needed?
1. To make the migration process from other systems to Spark SQL easier.
2. To achieve feature parity with other DBMSs.

### Does this PR introduce _any_ user-facing change?
No. The new aliases just extend Spark SQL API.

### How was this patch tested?
1. By running the existing test suites:
```
$ build/sbt "test:testOnly *SQLKeywordSuite"
```
2. and new checks:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z date.sql"
```

Closes #35661 from MaxGekk/dateadd.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
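A sketch of how those aliases would be used, assuming they behave exactly like `TIMESTAMPADD()` (the results shown are illustrative):

```sql
spark-sql> select dateadd(MONTH, -1, timestamp'2022-02-14 01:02:03');
2022-01-14 01:02:03
spark-sql> select date_add(MINUTE, 58, timestamp'2022-02-14 01:02:03');
2022-02-14 02:00:03
```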
What changes were proposed in this pull request?
In the PR, I propose to add a new function `TIMESTAMPADD` with the following parameters:
- `unit` - specifies the unit of the interval. It can be a string or an identifier. The following values are supported (case-insensitive): `YEAR`, `QUARTER`, `MONTH`, `WEEK`, `DAY`, `HOUR`, `MINUTE`, `SECOND`, `MILLISECOND`, `MICROSECOND`.
- `quantity` - the amount of `unit`s to add. It has the `INT` type. It can be positive or negative.
- `timestamp` - a timestamp (w/ or w/o timezone) to which you want to add the interval.

The function returns the original timestamp plus the given interval. The result has the same type as the input `timestamp` (for `timestamp_ntz` it returns `timestamp_ntz`, and for `timestamp_ltz` -> `timestamp_ltz`).

For example:
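A minimal sketch of the usage (the results shown are illustrative):

```sql
spark-sql> select timestampadd(HOUR, 1, timestamp'2022-02-14 01:02:03');
2022-02-14 02:02:03
spark-sql> select timestampadd('MONTH', -1, timestamp'2022-02-14 01:02:03');
2022-01-14 01:02:03
```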
Note: if the `timestamp` has the type `timestamp_ltz`, and `unit` is:
- `YEAR`, `QUARTER`, `MONTH` - the function converts the timestamp to a local timestamp at the session time zone (see the SQL config `spark.sql.session.timeZone`). After that, the function adds the amount of months to the local timestamp, and converts the result to a `timestamp_ltz` at the same session time zone.
- `WEEK`, `DAY` - in a similar way as above, the function adds the total amount of days to the timestamp at the session time zone.
- `HOUR`, `MINUTE`, `SECOND`, `MILLISECOND`, `MICROSECOND` - the function converts the interval to the total amount of microseconds, and adds them to the given timestamp (expressed as an offset from the epoch).

For example, Sun 13-Mar-2022 at 02:00:00 A.M. is when daylight saving time starts in the `America/Los_Angeles` time zone; see the sketch below. In fact, such behavior is similar to adding an ANSI interval to a timestamp.
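A sketch of the behavior around that DST gap, assuming the session time zone is set to `America/Los_Angeles` (the output is illustrative):

```sql
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark-sql> select timestampadd(HOUR, 1, timestamp'2022-03-13 01:30:00');
2022-03-13 03:30:00
```

Adding one hour of physical time jumps over the non-existent local hour between 02:00 and 03:00, so the local result is 03:30 rather than 02:30.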
The function also supports implicit conversion of the input date to a timestamp according to the general rules of Spark SQL. By default, Spark SQL converts dates to timestamps (which are `timestamp_ltz` by default).
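For instance, one of the queries from the new tests, with an illustrative result (the date is implicitly cast to a timestamp first):

```sql
spark-sql> select timestampadd('SECOND', -1, date'2022-02-15');
2022-02-14 23:59:59
```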
Why are the changes needed?
To make the migration process from other systems to Spark SQL easier, and to achieve feature parity with other DBMSs (see the discussion above).
Does this PR introduce any user-facing change?
No. This is a new feature.
How was this patch tested?
By running new tests: