@HeartSaVioR HeartSaVioR commented Oct 23, 2022

What changes were proposed in this pull request?

This PR proposes to loosen the window_time rule to allow multiple distinct window_time calls. After this change, users can call the window_time function with different windows in the same logical node (select, where, etc.).

Given that we allow multiple calls of window_time in a projection, we can no longer use the reserved column name "window_time". This PR instead uses the SQL representation of WindowTime as the column name, to distinguish each distinct function call.
(This differs from time window/session window, but arguably their naming is incorrect. We just can't fix them now, since the change would break backward compatibility.)
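To illustrate the change, here is a minimal sketch modeled on the test fragments quoted in the review below. It assumes Spark 3.4+ (where `window_time` is available) and a local session; the object name, method name and sample timestamps are made up for illustration, and the second `select` that builds `window2` is my reading of the test rather than a verbatim copy.

```scala
import java.sql.Timestamp

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{expr, window, window_time}

// Sketch only: object and method names are hypothetical; assumes Spark 3.4+.
object MultipleWindowTimeSketch {
  def build(spark: SparkSession): DataFrame = {
    import spark.implicits._

    val df = Seq(
      Timestamp.valueOf("2016-03-27 19:38:18"),
      Timestamp.valueOf("2016-03-27 19:39:25")
    ).toDF("time")

    // The two time windows are created in two separate projections, because
    // the time window rule allows only one distinct window call per projection.
    val windowed = df
      .withColumn("time2", expr("time - INTERVAL 15 minutes"))
      .select(window($"time", "10 seconds").as("window1"), $"time2")
      .select($"window1", window($"time2", "10 seconds").as("window2"))

    // Several window_time calls in one projection: two distinct windows,
    // plus a repeated call on window2 that is given an explicit alias.
    windowed.select(
      window_time($"window1").cast("string"),
      window_time($"window2").cast("string"),
      window_time($"window2").as("wt2_aliased").cast("string"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("window-time-sketch")
      .getOrCreate()
    try {
      val out = build(spark)
      out.printSchema() // shows the generated column names
      out.show(truncate = false)
    } finally {
      spark.stop()
    }
  }
}
```

Printing the schema is the easiest way to see which column names the analyzer actually generates for the unaliased calls on a given Spark version.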

Why are the changes needed?

The rule for window time followed the existing rules for time window / session window, which allow only a single function call in the same projection (strictly speaking, calls with the same parameters are counted as one call).

For the time window/session window rules, the restriction makes sense, since allowing multiple calls would produce a cartesian product of rows (although Spark can handle it). But given that window_time produces only one value, the restriction no longer makes sense.
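As a rough illustration of the difference, the sketch below uses a sliding window, where a single `window` call already expands one input row into one row per overlapping window, while `window_time` only derives a single value from each window. It assumes Spark 3.4+; the object name and sample data are hypothetical.

```scala
import java.sql.Timestamp

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{window, window_time}

// Sketch only: names are hypothetical; assumes Spark 3.4+.
object WindowExpansionSketch {
  def build(spark: SparkSession): DataFrame = {
    import spark.implicits._

    val df = Seq(Timestamp.valueOf("2016-03-27 19:38:18")).toDF("time")

    df
      // 10-second windows sliding every 5 seconds: each timestamp falls into
      // two overlapping windows, so this projection emits two rows per input row.
      .select(window($"time", "10 seconds", "5 seconds").as("window"))
      // window_time adds exactly one value per row; it does not multiply rows.
      .select($"window.start", $"window.end", window_time($"window").as("event_time"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("window-expansion-sketch")
      .getOrCreate()
    try build(spark).show(truncate = false)
    finally spark.stop()
  }
}
```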

Does this PR introduce any user-facing change?

Yes, since it changes the resulting column name of a window_time function call, but the function has not been released yet.

How was this patch tested?

New test case.

@github-actions github-actions bot added the SQL label Oct 23, 2022
"therefore they are currently not supported"))
val df2 = df
  .withColumn("time2", expr("time - INTERVAL 15 minutes"))
  .select(window($"time", "10 seconds").as("window1"), $"time2")
Contributor Author

We have to call select twice, since the time window rule does not allow multiple time window function calls to co-exist in the same projection. The time_window/session_window function is effectively a table-valued function (TVF).

.select(
  window_time($"window1").cast("string"),
  window_time($"window2").cast("string"),
  window_time($"window2").as("wt2_aliased").cast("string")
Contributor Author

This checks that the "same" call of the window_time function does not cause a conflict, and that it can be given a different column name.

@HeartSaVioR
Contributor Author

cc. @cloud-fan @alex-balikov

  .remove(TimeWindow.marker)
  .remove(SessionWindow.marker)
  .build()
val colName = windowTime.sql
Contributor

@cloud-fan cloud-fan Oct 24, 2022

Will this be materialized in the checkpoint or state store? The SQL string for an expression is unstable, as it depends on the resolved expression, and resolution may change over time (e.g. type coercion may add casts differently).

Contributor Author

@HeartSaVioR HeartSaVioR Oct 24, 2022

The schema checker of the state store allows column names to change. It mainly checks type and nullability.

The ideal column name would be exactly what users wrote in the call, but I couldn't find a way to do that, and I'm not sure it is available for all usages, since there are multiple ways to call a SQL function. This seems like a best effort, especially if this is already how the resulting column name is defined for other SQL functions.
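To make the "names may change, types and nullability may not" idea concrete, here is a small sketch of a name-insensitive schema comparison. This is not the actual state store schema checker; `compatibleIgnoringNames` is a hypothetical helper written only for illustration, and the example column names are made up.

```scala
import org.apache.spark.sql.types._

// Sketch only: a hypothetical helper, NOT Spark's state store schema checker.
object SchemaNameInsensitiveSketch {
  // Compares two types position by position, ignoring struct field names
  // but requiring the same data types and the same nullability.
  // Structs nested inside arrays or maps fall through to plain equality,
  // which still compares their field names.
  def compatibleIgnoringNames(a: DataType, b: DataType): Boolean = (a, b) match {
    case (s1: StructType, s2: StructType) =>
      s1.length == s2.length && s1.fields.zip(s2.fields).forall { case (f1, f2) =>
        f1.nullable == f2.nullable && compatibleIgnoringNames(f1.dataType, f2.dataType)
      }
    case _ => a == b
  }

  def main(args: Array[String]): Unit = {
    // Same type and nullability, different column names.
    val before = StructType(Seq(StructField("window_time", TimestampType, nullable = false)))
    val after = StructType(Seq(StructField("window_time(window)", TimestampType, nullable = false)))
    println(compatibleIgnoringNames(before, after))
  }
}
```

Under such a check, a column name derived from the expression's SQL string could change between versions without being flagged, as long as the type and nullability stay the same.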

Contributor

If we allow name change, this is fine.

Contributor Author

In practice, end users will apply another time window function to the column produced by window_time, so the final resulting column will be another "window".
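A sketch of that usage pattern, shown as a batch query for simplicity: a first windowed aggregation, then a second time window applied to `window_time` of the first one, so the final output column is again a "window". It assumes Spark 3.4+; the object name, sample data and the `total` alias are hypothetical.

```scala
import java.sql.Timestamp

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{sum, window, window_time}

// Sketch only: names are hypothetical; assumes Spark 3.4+.
object ChainedWindowSketch {
  def build(spark: SparkSession): DataFrame = {
    import spark.implicits._

    val events = Seq(
      Timestamp.valueOf("2016-03-27 19:38:18"),
      Timestamp.valueOf("2016-03-27 19:38:19"),
      Timestamp.valueOf("2016-03-27 19:39:25")
    ).toDF("time")

    // First stage: counts per 10-second tumbling window.
    val perTenSeconds = events.groupBy(window($"time", "10 seconds")).count()

    // Second stage: re-window on the event time of the first window.
    // The intermediate window_time column never reaches the output.
    perTenSeconds
      .groupBy(window(window_time($"window"), "1 minute"))
      .agg(sum($"count").as("total"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("chained-window-sketch")
      .getOrCreate()
    try build(spark).show(truncate = false)
    finally spark.stop()
  }
}
```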

@HeartSaVioR
Contributor Author

cc. @viirya I'd appreciate it if you could review this, since you are familiar with both areas. Thanks in advance!

@HeartSaVioR
Contributor Author

Thanks @cloud-fan! Given that this PR has been open for 4 days with no further feedback, I'm merging this.

SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…llow multiple window_time calls

Closes apache#38361 from HeartSaVioR/SPARK-40892.

Authored-by: Jungtaek Lim <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>