Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: adding SUM0 definition for aggregate functions #465

Merged
merged 3 commits into from
Mar 27, 2023

Conversation

vibhatha
Copy link
Contributor

@vibhatha vibhatha commented Mar 7, 2023

This PR includes a resolution to #259

@github-actions
Copy link

github-actions bot commented Mar 7, 2023

ACTION NEEDED

Substrait follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@vibhatha vibhatha self-assigned this Mar 7, 2023
@vibhatha vibhatha added the enhancement New feature or request label Mar 7, 2023
extensions/functions_arithmetic.yaml Show resolved Hide resolved
@@ -922,6 +922,69 @@ aggregate_functions:
decomposable: MANY
intermediate: fp64?
return: fp64?
- name: "sum0"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An analog could be the multiply function returning one for a set of zero elements. Do we want to make a similar change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you’re correct. We should probably add that in another PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, no need to make this PR do too much.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -922,6 +922,69 @@ aggregate_functions:
decomposable: MANY
intermediate: fp64?
return: fp64?
- name: "sum0"
description: Sum a set of values. The sum of zero elements yields zero.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative is to have one sum function with an option for how to handle the set of zero inputs. Potential complications of course arise if we want to add an option after the fact but is that something we are interested in?

Copy link
Contributor Author

@vibhatha vibhatha Mar 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point. Let’s discuss further.
Options feature is interesting and that could simplify things, but not sure how to represent nullability in return statements.

cc @westonpace @jacques-n

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also pretty easy to get the behavior already with COALESCE(SUM(x), 0) (though this is not exactly the same because it would return 0 for the sum of all null input where this function would hopefully return null). However, I don't think we need to shy away from functions that can be represented elsewhere. So I think adding a new function or option is still a good idea.

I'm not sure where exactly to draw the line between "new function" and "option to an existing function". I don't have a really strong opinion either way.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I think it would behave the same as COALESCE(SUM(x), 0) . It is my understanding that if you had a set of all null values and you do a SUM0() on them, you would get 0, not null.

This is a description I found for SUM0 that I found here https://www.tabnine.com/code/java/classes/org.apache.calcite.sql.fun.SqlSumEmptyIsZeroAggFunction

Sum0 is an aggregator which returns the sum of the values which go into it like Sum. It differs in that when no non null values are applied zero is returned instead of null. 

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then let's update the description to something like...

Sum a set of values. The sum of zero elements yields zero.  Null values are ignored.

We should probably clarify this for sum as well but that can be left for a future PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@westonpace And I think we should probably leave the return type without nulls. Is that okay?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good catch. Given this updated understanding I think the return type should be DECALRED_OUTPUT nullability and a non-nullable return type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I will update the PR. Thanks @wmalpica for your feedback as well.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given:

sum0([1, 2, 3]) = 6
sum0([]) = 0

What does this result in:

sum0([null, null, null]) = ?

Is it null or 0? I think it should be null. If you agree we should note that in the description. In this case, I think the return attribute still needs to be i64?/fp64?.

@westonpace
Copy link
Member

sqlite has sum and total. total is the sqlite equivalent of sum0. Curiously, in sqlite, total([null, null, null]) = 0 so perhaps my intuition is off :)

@EpsilonPrime
Copy link
Member

sqlite has sum and total. total is the sqlite equivalent of sum0. Curiously, in sqlite, total([null, null, null]) = 0 so perhaps my intuition is off :)

Ah, so total is more of "add all the values that are present" where sum is treating null as unknown. (So for sum adding a known value to an unknown one is still unknown.)

@westonpace
Copy link
Member

Note that #401 is probably related

@wmalpica
Copy link

sqlite has sum and total. total is the sqlite equivalent of sum0. Curiously, in sqlite, total([null, null, null]) = 0 so perhaps my intuition is off :)

Yes I think SUM0 would be like total

@vibhatha vibhatha marked this pull request as ready for review March 24, 2023 16:58
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small change.

values: [ SILENT, SATURATE, ERROR ]
nullability: DECLARED_OUTPUT
decomposable: MANY
intermediate: i64?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
intermediate: i64?
intermediate: i64

The intermediate is also non-nullable.

values: [ SILENT, SATURATE, ERROR ]
nullability: DECLARED_OUTPUT
decomposable: MANY
intermediate: i64?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
intermediate: i64?
intermediate: i64

values: [ SILENT, SATURATE, ERROR ]
nullability: DECLARED_OUTPUT
decomposable: MANY
intermediate: i64?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
intermediate: i64?
intermediate: i64

values: [ SILENT, SATURATE, ERROR ]
nullability: DECLARED_OUTPUT
decomposable: MANY
intermediate: i64?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
intermediate: i64?
intermediate: i64

values: [ SILENT, SATURATE, ERROR ]
nullability: DECLARED_OUTPUT
decomposable: MANY
intermediate: fp64?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
intermediate: fp64?
intermediate: fp64

values: [ SILENT, SATURATE, ERROR ]
nullability: DECLARED_OUTPUT
decomposable: MANY
intermediate: fp64?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
intermediate: fp64?
intermediate: fp64

@westonpace westonpace merged commit 73228b4 into substrait-io:main Mar 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants