-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-40892][SQL][SS] Loosen the requirement of window_time rule - allow multiple window_time calls #38361
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…llow multiple window_time calls
| "therefore they are currently not supported")) | ||
| val df2 = df | ||
| .withColumn("time2", expr("time - INTERVAL 15 minutes")) | ||
| .select(window($"time", "10 seconds").as("window1"), $"time2") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to do select twice since time window rule does not allow multiple time window function call to co-exist in same projection. time_window/session_window function is effectively a TVF.
| .select( | ||
| window_time($"window1").cast("string"), | ||
| window_time($"window2").cast("string"), | ||
| window_time($"window2").as("wt2_aliased").cast("string") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This checks the functionality that "same" call of window_time function won't bring conflict, and can be tagged with different column name.
| .remove(TimeWindow.marker) | ||
| .remove(SessionWindow.marker) | ||
| .build() | ||
| val colName = windowTime.sql |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will this be materialized in the checkpoint or state store? The SQL string for an expression is unstable, as it depends on resolved expression, and resolution may change over time (e.g. type coercion may add cast differently).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The schema checker of state store allows changing name on the column. It majority checks type and nullability.
The ideal column name would be exactly the same as what users call as it is, but I can't find the way to do, and I'm not sure it is available for all usages since there are multiple ways to call the SQL function. Seems like this is a best effort, if this way is already a thing to define the resulting column name for other SQL functions as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we allow name change, this is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In practice, end users will apply another time window function against the resulting column of window time, hence the final resulting column will be another "window".
|
cc. @viirya I'd be lucky if I can get your traction to review this, who are familiar with both areas. Thanks in advance! |
|
Thanks @cloud-fan ! Given this PR stayed for 4 days and no feedback so far, I'm merging this. |
…llow multiple window_time calls ### What changes were proposed in this pull request? This PR proposes to loosen the requirement of window_time rule to allow multiple distinct window_time calls. After this change, users can call the window_time function with different windows in the same logical node (select, where, etc.). Given that we allow multiple calls of window_time in projection, we no longer be able to use the reserved column name "window_time". This PR picked up the SQL representation of the WindowTime, to distinguish each distinct function call. (This is different from time window/session window, but "arguably" saying, they are incorrect. Just that we can't fix them now since the change would incur backward incompatibility...) ### Why are the changes needed? The rule for window time followed the existing rules of time window / session window which only allows a single function call in a same projection (strictly saying, it considers the call of function as once if the function is called with same parameters). For time window/session window rules , the restriction makes sense since allowing this would produce cartesian product of rows (although Spark can handle it). But given that window_time only produces one value, the restriction no longer makes sense. ### Does this PR introduce _any_ user-facing change? Yes since it changes the resulting column name from window_time function call, but the function is not released yet. ### How was this patch tested? New test case. Closes apache#38361 from HeartSaVioR/SPARK-40892. Authored-by: Jungtaek Lim <[email protected]> Signed-off-by: Jungtaek Lim <[email protected]>
What changes were proposed in this pull request?
This PR proposes to loosen the requirement of window_time rule to allow multiple distinct window_time calls. After this change, users can call the window_time function with different windows in the same logical node (select, where, etc.).
Given that we allow multiple calls of window_time in projection, we no longer be able to use the reserved column name "window_time". This PR picked up the SQL representation of the WindowTime, to distinguish each distinct function call.
(This is different from time window/session window, but "arguably" saying, they are incorrect. Just that we can't fix them now since the change would incur backward incompatibility...)
Why are the changes needed?
The rule for window time followed the existing rules of time window / session window which only allows a single function call in a same projection (strictly saying, it considers the call of function as once if the function is called with same parameters).
For time window/session window rules , the restriction makes sense since allowing this would produce cartesian product of rows (although Spark can handle it). But given that window_time only produces one value, the restriction no longer makes sense.
Does this PR introduce any user-facing change?
Yes since it changes the resulting column name from window_time function call, but the function is not released yet.
How was this patch tested?
New test case.