Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add windowrel support in proto #399

Merged
merged 31 commits into from
Aug 21, 2023

Conversation

JkSelf
Copy link
Contributor

@JkSelf JkSelf commented Dec 1, 2022

No description provided.

@CLAassistant
Copy link

CLAassistant commented Dec 1, 2022

CLA assistant check
All committers have signed the CLA.

@JkSelf
Copy link
Contributor Author

JkSelf commented Dec 1, 2022

@jacques-n Please help to review if you have available time. Thanks for your help!

@JkSelf JkSelf changed the title feat: Add Window rel feat: add window rel in proto Dec 1, 2022
@JkSelf JkSelf force-pushed the add-window-operator branch 2 times, most recently from 10c8823 to c59b70f Compare December 1, 2022 09:30
@JkSelf
Copy link
Contributor Author

JkSelf commented Dec 1, 2022

cc @FelixYBW @baibaichen

WINDOW_TYPE_ROWS = 1;
WINDOW_TYPE_RANGE = 2;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a small suggestion. I think using window frame is more proper and more clear to user than using window type. Right?

  enum WindowFrame {
    WINDOW_FRAME_UNSPECIFIED = 0;
    WINDOW_FRAME_ROWS = 1;
    WINDOW_FRAME_RANGE = 2;
  }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for you suggestion. And very sorry that I did not reply in time. I have already changed to WindowFrame.

@github-actions
Copy link

github-actions bot commented Mar 2, 2023

ACTION NEEDED

Substrait follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@JkSelf JkSelf changed the title feat: add window rel in proto feat: add windowrel support in proto Mar 2, 2023
@JkSelf JkSelf force-pushed the add-window-operator branch 2 times, most recently from 3600c8a to 343bf7b Compare March 3, 2023 00:32
@JkSelf
Copy link
Contributor Author

JkSelf commented Mar 3, 2023

@jacques-n Can you help to review again?

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few questions but this seems like a good start.

proto/substrait/algebra.proto Outdated Show resolved Hide resolved
proto/substrait/algebra.proto Outdated Show resolved Hide resolved
proto/substrait/algebra.proto Show resolved Hide resolved
proto/substrait/algebra.proto Outdated Show resolved Hide resolved
@JkSelf JkSelf force-pushed the add-window-operator branch 3 times, most recently from f5acd5e to 5e24a88 Compare March 6, 2023 07:30
@JkSelf
Copy link
Contributor Author

JkSelf commented Mar 14, 2023

@westonpace @cpcloud Can you help to review again? Thanks for your help!

@westonpace
Copy link
Member

@JkSelf thanks for the update. Based on previous discussion I think this is a physical counterpart to WindowFunction. In other words:

A logical plan will only need WindowFunction
A physical (or less logical plan) may instead use WindowRel.

If you agree I think we should make a few changes:

  • WindowRel should not use WindowFunction. Instead, copy the fields that you need (I think it will be similar to ScalarFunction) into a separate message that is local to WindowRel.
    • For example, WindowRel::Function will not have sorts. This way there is no confusion which instance of sorts should be used.
  • Similar to above, but should lower_bound/upper_bound be added to WindowRel?
  • We should update physical_relations.md to reflect this new operator. Currently, there is "Hashing Window Operator" and "Streaming Window Operator". Does this replace one of these? Or is it a third kind of node? Or does "Hashing" and "Streaming" not make sense and we should combine them into this?
  • Does this operator handle "scalar window functions" (e.g. SELECT RANK() OVER (ORDER BY time)) or "window aggregates" (e.g. SELECT SUM(val) OVER (PARTITION BY month)) or both? I think it is best if it only does one or the other. That way the number of output rows is clear.

@JkSelf
Copy link
Contributor Author

JkSelf commented Mar 15, 2023

@westonpace Thanks for your reply.

WindowRel should not use WindowFunction. Instead, copy the fields that you need (I think it will be similar to ScalarFunction) into a separate message that is local to WindowRel. For example, WindowRel::Function will not have sorts. This way there is no confusion which instance of sorts should be used.

Yes. We can define a WindowRel::Function with not sorts. But this will bring ambiguity in understanding, because WindowRel is originally designed to perform a series of Window functions on a specific partition. And it will also bring redundancy in the code, because both functions need to define Bound. Can we remove the sort in WindowFunction here? If not, I will add a new Function later.

We should update physical_relations.md to reflect this new operator. Currently, there is "Hashing Window Operator" and "Streaming Window Operator". Does this replace one of these? Or is it a third kind of node? Or does "Hashing" and "Streaming" not make sense and we should combine them into this?

The difference between Hashing and Steaming is whether it is necessary to obtain the entire partition before execution described in WindowType. For example, SUM does not need it, but NTILE does. The WindowRel we support here includes these two situations.

Does this operator handle "scalar window functions" (e.g. SELECT RANK() OVER (ORDER BY time)) or "window aggregates" (e.g. SELECT SUM(val) OVER (PARTITION BY month)) or both? I think it is best if it only does one or the other. That way the number of output rows is clear.

If we use WindowFunction without sort here, we need to support both here. Because the current WindowFunction supports both here.

@westonpace
Copy link
Member

Yes. We can define a WindowRel::Function with not sorts. But this will bring ambiguity in understanding, because WindowRel is originally designed to perform a series of Window functions on a specific partition. And it will also bring redundancy in the code, because both functions need to define Bound.

I think this redundancy is expected when we have logical and physical operators. For example, the physical HashJoinRel::left_keys and HashJoinRel::right_keys are narrower, but redundant, when compared with the logical JoinRel::expression.

Can we remove the sort in WindowFunction here?

No, because the logical representation needs to be able to stand alone. It should be legal to have a "purely logical plan" that has WindowFunction but no WindowRel (the window functions would be part of a ProjectRel).

The difference between Hashing and Steaming is whether it is necessary to obtain the entire partition before execution described in WindowType. For example, SUM does not need it, but NTILE does.

Yes. Also, streaming could make sense if the data was also sorted by the partition field.

The WindowRel we support here includes these two situations.

Ok, then I think we need to add a new relation to physical_relations.md which is WindowOperation. Perhaps:

## Window Operation

A window operation is a special type of project operation where every function
is a window function and all of the window functions share the same sorting and
partitioning.  This allows for the sort and partition to be calculated once and
shared between the various function evaluations.

| Signature | Value |
...

### Window Operation Properties

| Property | Description | Required |
...

## Hashing Window Operation
...

@westonpace
Copy link
Member

If we use WindowFunction without sort here, we need to support both here. Because the current WindowFunction supports both here.

Ok, I agree this should support both. However, the output of an aggregate window function is scalar. In other words:

Given x=[1,2,3,4] then SUM(x) OVER () is [10, 10, 10, 10] and not [10].

@JkSelf
Copy link
Contributor Author

JkSelf commented Mar 16, 2023

If we use WindowFunction without sort here, we need to support both here. Because the current WindowFunction supports both here.

Ok, I agree this should support both. However, the output of an aggregate window function is scalar. In other words:

Given x=[1,2,3,4] then SUM(x) OVER () is [10, 10, 10, 10] and not [10].

Yes. the sum(x) over() return [10, 10, 10, 10] and also the rank() over() return [1, 2, 3, 4]. They all return a single value for each row. So I think the output of rows is same both "scalar window functions" and "window aggregates".

@JkSelf
Copy link
Contributor Author

JkSelf commented Mar 16, 2023

@westonpace Thanks for your detailed explanation. I will define a new function without sort to replace the Window function. And also update the physical_relations.md based on your suggestions later.

@baibaichen
Copy link

@westonpace is this ok to merge?

@JkSelf
Copy link
Contributor Author

JkSelf commented Mar 21, 2023

@westonpace Can you help to review again? Thanks for your help.

@jacques-n
Copy link
Contributor

A few minor comments. I'm not sure I understand the comparison to hashing and streaming yet.

@jacques-n is there any engine out there that uses hashing to calculate window functions? If not, then we should get rid of the hashing and streaming operators.

I'm fine with not implementing this for now.

@vbarua
Copy link
Member

vbarua commented Aug 16, 2023

Per discussion in the sync this morning, @westonpace and I have merged in main and applied some small changes to move this along.

@vbarua vbarua requested a review from westonpace August 16, 2023 18:26
westonpace
westonpace previously approved these changes Aug 16, 2023
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Structurally I think we've reached the correct point for this PR. I have some minor suggestions regarding how we document this. We recently updated (or are updating) the wording of WindowFunction.function_reference to state that we allow both window and aggregate functions. However, that change is not mirrored here.

Rather than try and keep the comments in sync between the two definitions I think we should only have one set of definitions and only document the parts that are different here.

I'm marking this as approved because this PR has been around for a while. If these changes are not addressed then I propose we merge it as is and clean up the docs in a follow-up PR.

proto/substrait/algebra.proto Outdated Show resolved Hide resolved
proto/substrait/algebra.proto Outdated Show resolved Hide resolved
proto/substrait/algebra.proto Outdated Show resolved Hide resolved
proto/substrait/algebra.proto Outdated Show resolved Hide resolved
proto/substrait/algebra.proto Outdated Show resolved Hide resolved
proto/substrait/algebra.proto Outdated Show resolved Hide resolved
proto/substrait/algebra.proto Outdated Show resolved Hide resolved
proto/substrait/algebra.proto Outdated Show resolved Hide resolved
proto/substrait/algebra.proto Outdated Show resolved Hide resolved
site/docs/relations/physical_relations.md Outdated Show resolved Hide resolved
@JkSelf
Copy link
Contributor Author

JkSelf commented Aug 18, 2023

@vbarua @westonpace @jacques-n
Sorry for the late response. I have updated the changes. Can you help to review again?

Copy link
Contributor

@jacques-n jacques-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@westonpace westonpace merged commit bd14e0e into substrait-io:main Aug 21, 2023
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants