[SPARK-40892][SQL][SS] Loosen the requirement of window_time rule - allow multiple window_time calls #38361
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -599,23 +599,37 @@ class DataFrameTimeWindowingSuite extends QueryTest with SharedSparkSession { | |
| ("2016-03-27 19:38:18"), ("2016-03-27 19:39:25") | ||
| ).toDF("time") | ||
|
|
||
| val e = intercept[AnalysisException] { | ||
| df | ||
| .withColumn("time2", expr("time - INTERVAL 5 minutes")) | ||
| .select( | ||
| window($"time", "10 seconds").as("window1"), | ||
| window($"time2", "10 seconds").as("window2") | ||
| ) | ||
| .select( | ||
| $"window1.end".cast("string"), | ||
| window_time($"window1").cast("string"), | ||
| $"window2.end".cast("string"), | ||
| window_time($"window2").cast("string") | ||
| ) | ||
| } | ||
| assert(e.getMessage.contains( | ||
| "Multiple time/session window expressions would result in a cartesian product of rows, " + | ||
| "therefore they are currently not supported")) | ||
| val df2 = df | ||
| .withColumn("time2", expr("time - INTERVAL 15 minutes")) | ||
| .select(window($"time", "10 seconds").as("window1"), $"time2") | ||
|
Contributor
Author
We have to call select twice because the time window rule does not allow multiple time window function calls to coexist in the same projection. The time window and session window functions are effectively TVFs (table-valued functions).
||
| .select($"window1", window($"time2", "10 seconds").as("window2")) | ||
|
|
||
| checkAnswer( | ||
| df2.select( | ||
| $"window1.end".cast("string"), | ||
| window_time($"window1").cast("string"), | ||
| $"window2.end".cast("string"), | ||
| window_time($"window2").cast("string")), | ||
| Seq( | ||
| Row("2016-03-27 19:38:20", "2016-03-27 19:38:19.999999", | ||
| "2016-03-27 19:23:20", "2016-03-27 19:23:19.999999"), | ||
| Row("2016-03-27 19:39:30", "2016-03-27 19:39:29.999999", | ||
| "2016-03-27 19:24:30", "2016-03-27 19:24:29.999999")) | ||
| ) | ||
|
|
||
| // check column names | ||
| val df3 = df2 | ||
| .select( | ||
| window_time($"window1").cast("string"), | ||
| window_time($"window2").cast("string"), | ||
| window_time($"window2").as("wt2_aliased").cast("string") | ||
|
Contributor
Author
This checks that repeating the "same" window_time call does not cause a conflict, and that the result can be given a different column name via an alias.
||
| ) | ||
|
|
||
| val schema = df3.schema | ||
|
|
||
| assert(schema.fields.exists(_.name == "window_time(window1)")) | ||
| assert(schema.fields.exists(_.name == "window_time(window2)")) | ||
| assert(schema.fields.exists(_.name == "wt2_aliased")) | ||
| } | ||
|
|
||
| test("window_time function on agg output") { | ||
|
|
||
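The expected rows in the updated test follow from how window_time is defined: it returns the window end minus one microsecond, because a window is the half-open interval [start, end). The plain-Scala sketch below reproduces that arithmetic for a tumbling window without using Spark. The object and method names (`WindowTimeSketch`, `windowEndAndTime`) are hypothetical, and the code assumes UTC and a window size given in whole seconds; it is an illustration of the arithmetic, not Spark's implementation.

```scala
import java.time.{Duration, LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

// Hypothetical helper, not part of Spark: mirrors the tumbling-window arithmetic only.
object WindowTimeSketch {
  private val inFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
  private val outFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS")

  /** Returns (window end, window time) for the tumbling window that contains `ts`. */
  def windowEndAndTime(ts: String, size: Duration): (String, String) = {
    val epochSec = LocalDateTime.parse(ts, inFmt).toEpochSecond(ZoneOffset.UTC)
    // Align the window start to a multiple of the window size (assumption: UTC, whole seconds).
    val startSec = epochSec - Math.floorMod(epochSec, size.getSeconds)
    val end = LocalDateTime.ofEpochSecond(startSec + size.getSeconds, 0, ZoneOffset.UTC)
    // The window is [start, end), so the last instant inside it is end - 1 microsecond.
    val windowTime = end.minus(1, ChronoUnit.MICROS)
    (end.format(inFmt), windowTime.format(outFmt))
  }
}
```

For the first input row, `WindowTimeSketch.windowEndAndTime("2016-03-27 19:38:18", Duration.ofSeconds(10))` gives the pair `("2016-03-27 19:38:20", "2016-03-27 19:38:19.999999")`, which matches the window1 columns of the first expected Row in the test.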
Will this be materialized in the checkpoint or state store? The SQL string for an expression is unstable, as it depends on the resolved expression, and resolution may change over time (e.g. type coercion may add casts differently).
The state store schema checker allows column names to change. It mainly checks type and nullability.
The ideal column name would be exactly what the user wrote in the call, but I can't find a way to do that, and I'm not sure it would be available for all usages, since there are multiple ways to call a SQL function. This seems like a best effort, provided this is already how the resulting column name is defined for other SQL functions as well.
If we allow name change, this is fine.
In practice, end users will apply another time window function to the column produced by window_time, so the final resulting column will be another "window".
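A minimal sketch of that usage pattern, assuming a Spark version that ships `window_time` (3.4 or later) and a local session: a 10-second windowed count is re-aggregated into 1-minute windows by passing `window_time($"window")` to a second `window` call. The object name, method name, and column alias (`ChainedWindowSketch`, `perMinuteCounts`, `cnt`) are hypothetical, and the input rows are borrowed from the test above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{count, sum, window, window_time}

// Hypothetical example object; only the Spark functions are real API.
object ChainedWindowSketch {
  def perMinuteCounts(spark: SparkSession): DataFrame = {
    import spark.implicits._

    val events = Seq("2016-03-27 19:38:18", "2016-03-27 19:39:25")
      .toDF("time")
      .select($"time".cast("timestamp").as("time"))

    // First aggregation: count events per 10-second tumbling window.
    val perTenSeconds = events
      .groupBy(window($"time", "10 seconds"))
      .agg(count("*").as("cnt"))

    // Second aggregation: window_time extracts an event-time timestamp from the
    // first window, and the outer window() call turns it into a new "window" column.
    perTenSeconds
      .groupBy(window(window_time($"window"), "1 minute"))
      .agg(sum($"cnt").as("cnt"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("chained-window-sketch")
      .getOrCreate()
    try perMinuteCounts(spark).show(truncate = false)
    finally spark.stop()
  }
}
```

The final grouping column here is again named `window`, so the generated `window_time(...)` column name does not appear in the output schema of the second aggregation.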