-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-31982][SQL]Function sequence doesn't handle date increments that cross DST #28856
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
04aabd8
2a7caf7
d23d155
d732984
076ebce
19d8c48
c88efec
6a341bf
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -2589,6 +2589,8 @@ object Sequence { | |
| } | ||
| } | ||
|
|
||
| // To generate time sequences, we use scale 1 in TemporalSequenceImpl | ||
| // for `TimestampType`, while MICROS_PER_DAY for `DateType` | ||
| private class TemporalSequenceImpl[T: ClassTag] | ||
| (dt: IntegralType, scale: Long, fromLong: Long => T, zoneId: ZoneId) | ||
| (implicit num: Integral[T]) extends SequenceImpl { | ||
|
|
@@ -2623,8 +2625,16 @@ object Sequence { | |
| // about a month length in days and a day length in microseconds | ||
| val intervalStepInMicros = | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. it looks like the step should not be physical seconds, but logical interval. cc @MaxGekk
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The current implementation is strange mix, it seems. There are following options:
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. hi @cloud-fan,as @MaxGekk explain here, I am not sure if this patch looks ok,I am willing to |
||
| stepMicros + stepMonths * microsPerMonth + stepDays * microsPerDay | ||
| val startMicros: Long = num.toLong(start) * scale | ||
| val stopMicros: Long = num.toLong(stop) * scale | ||
|
|
||
| // Date to timestamp is not equal from GMT and Chicago timezones | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why do these codes depend on few specific timezones? Also, other comments look valid here. We should take the timezone into account for timestamps too
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Seems the date if different from west to east, when it is date, we might need to consider to zone info to convert to time stamp, if it is already a time stamp, not a date here, we may ignore the zone because the time stamp is already consider it. |
||
| val (startMicros, stopMicros) = if (scale == 1) { | ||
| (num.toLong(start), num.toLong(stop)) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Don't think this is correct, see my comment https://github.com/apache/spark/pull/28856/files#r442366706 but it is at least backward compatible?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe we could separate this into different methods ? |
||
| } | ||
| else { | ||
| (daysToMicros(num.toInt(start), zoneId), | ||
| daysToMicros(num.toInt(stop), zoneId)) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you explain a bit more? It's hard to understand this change without any comment.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeh, please, explain why if
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Because when
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
How can we tell it?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. maybe we should add more documents to
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done, maybe we can pass the scale through constructor: |
||
| } | ||
|
|
||
| val maxEstimatedArrayLength = | ||
| getSequenceLength(startMicros, stopMicros, intervalStepInMicros) | ||
|
|
||
|
|
@@ -2635,7 +2645,11 @@ object Sequence { | |
| var i = 0 | ||
|
|
||
| while (t < exclusiveItem ^ stepSign < 0) { | ||
| arr(i) = fromLong(t / scale) | ||
TJX2014 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| arr(i) = if (scale == 1) { | ||
| fromLong(t) | ||
| } else { | ||
| fromLong(Math.round(t / scale.toFloat)) | ||
TJX2014 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| } | ||
| i += 1 | ||
| t = timestampAddInterval( | ||
| startMicros, i * stepMonths, i * stepDays, i * stepMicros, zoneId) | ||
|
|
@@ -2685,8 +2699,19 @@ object Sequence { | |
| |} else if ($stepMonths == 0 && $stepDays == 0 && ${scale}L == 1) { | ||
| | ${backedSequenceImpl.genCode(ctx, start, stop, stepMicros, arr, elemType)}; | ||
| |} else { | ||
| | final long $startMicros = $start * ${scale}L; | ||
| | final long $stopMicros = $stop * ${scale}L; | ||
| | long $startMicros; | ||
| | long $stopMicros; | ||
| | if (${scale}L == 1L) { | ||
| | $startMicros = $start; | ||
| | $stopMicros = $stop; | ||
| | } else { | ||
| | $startMicros = | ||
| | org.apache.spark.sql.catalyst.util.DateTimeUtils.daysToMicros( | ||
| | (int)$start, $zid); | ||
| | $stopMicros = | ||
| | org.apache.spark.sql.catalyst.util.DateTimeUtils.daysToMicros( | ||
| | (int)$stop, $zid); | ||
| | } | ||
| | | ||
| | $sequenceLengthCode | ||
| | | ||
|
|
@@ -2698,7 +2723,11 @@ object Sequence { | |
| | int $i = 0; | ||
| | | ||
| | while ($t < $exclusiveItem ^ $stepSign < 0) { | ||
| | $arr[$i] = ($elemType) ($t / ${scale}L); | ||
| | if (${scale}L == 1L) { | ||
| | $arr[$i] = ($elemType) $t; | ||
| | } else { | ||
| | $arr[$i] = ($elemType) (Math.round($t / (float)${scale}L)); | ||
| | } | ||
| | $i += 1; | ||
| | $t = org.apache.spark.sql.catalyst.util.DateTimeUtils.timestampAddInterval( | ||
| | $startMicros, $i * $stepMonths, $i * $stepDays, $i * $stepMicros, $zid); | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if start/end is date, can the step by seconds/minutes/hours?
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, seems we can, the result as follows:
`scala> sql("select explode(sequence(cast('2011-03-01' as date), cast('2011-05-01' as date), interval 1 hour))").count
res19: Long = 1465
scala> sql("select explode(sequence(cast('2011-03-01' as date), cast('2011-05-01' as date), interval 1 minute))").count
res20: Long = 87841
scala> sql("select explode(sequence(cast('2011-03-01' as date), cast('2011-05-01' as date), interval 1 second))").count
res21: Long = 5270401
scala> sql("select explode(sequence(cast('2011-03-01' as date), cast('2011-05-01' as date), interval 1 minute))").head(3)
res25: Array[org.apache.spark.sql.Row] = Array([2011-03-01], [2011-03-01], [2011-03-01])
scala> sql("select explode(sequence(cast('2011-03-01' as date), cast('2011-05-01' as date), interval 1 second))").head(3)
res26: Array[org.apache.spark.sql.Row] = Array([2011-03-01], [2011-03-01], [2011-03-01])
scala> sql("select explode(sequence(cast('2011-03-01' as date), cast('2011-05-01' as date), interval 1 minute))").head(3)
res27: Array[org.apache.spark.sql.Row] = Array([2011-03-01], [2011-03-01], [2011-03-01])
scala> sql("select explode(sequence(cast('2011-03-01' as date), cast('2011-05-01' as date), interval 1 hour))").head(3)
res28: Array[org.apache.spark.sql.Row] = Array([2011-03-01], [2011-03-01], [2011-03-01])
`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does pgsql support it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems pgsql can only support int as follows:
postgres= create sequence seq_test;
CREATE SEQUENCE
postgres= select nextval('seq_test');
1
(1 行记录)
postgres= select nextval('seq_test');
2
(1 行记录)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually this function is from presto: https://prestodb.io/docs/current/functions/array.html
Can you check the behavior of presto? It looks confusing to use time fields as the step for date start/stop.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Base
presto-server-0.236:presto> select sequence(date('2011-03-01'),date('2011-03-02'),interval '1' hour);
Query 20200624_122744_00002_pehix failed: sequence step must be a day interval if start and end values are dates
presto> select sequence(date('2011-03-01'),date('2011-03-02'),interval '1' day);
_col0
[2011-03-01, 2011-03-02]
(1 row)
Query 20200624_122757_00003_pehix, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]
presto> select sequence(date('2011-03-01'),date('2011-03-02'),interval '1' month);
_col0
[2011-03-01]
(1 row)
Query 20200624_122806_00004_pehix, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]
presto> select sequence(date('2011-03-01'),date('2011-03-02'),interval '1' year);
_col0
[2011-03-01]
(1 row)
Query 20200624_122810_00005_pehix, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan Done, It seems can be sequence
day,month,yearwhen start and end areDateTypein presto.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the presto behavior makes sense. Can we send a PR to follow it first? e.g. throw an exception if the step is time fields while start/end is date. This can also simplify the implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I will do it tomorrow.