Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add standard deviation functions #257

Merged
merged 1 commit into from
Jul 28, 2022

Conversation

vibhatha
Copy link
Contributor

@vibhatha vibhatha commented Jul 25, 2022

This PR includes standard_deviation definition.

@jvanstraten
Copy link
Contributor

#251 applies. I would expect to see only fp32 -> fp32 and fp64 -> fp64 stddev, mean, etc.

@vibhatha
Copy link
Contributor Author

#251 applies. I would expect to see only fp32 -> fp32 and fp64 -> fp64 stddev, mean, etc.

I thought about this, since the discussio is active here, I decided to keep the promotion. I am also aligned to non-promotion. I will update this.

@vibhatha vibhatha marked this pull request as ready for review July 25, 2022 14:25
Comment on lines 368 to 470
- args:
- options: [ SILENT, SATURATE, ERROR ]
required: false
- value: i8
nullability: DECLARED_OUTPUT
decomposable: MANY
intermediate: "fp64?"
return: i8?
- args:
- options: [ SILENT, SATURATE, ERROR ]
required: false
- value: i16
nullability: DECLARED_OUTPUT
decomposable: MANY
intermediate: "fp64?"
return: i16?
- args:
- options: [ SILENT, SATURATE, ERROR ]
required: false
- value: i32
nullability: DECLARED_OUTPUT
decomposable: MANY
intermediate: "fp64?"
return: i32?
- args:
- options: [ SILENT, SATURATE, ERROR ]
required: false
- value: i64
nullability: DECLARED_OUTPUT
decomposable: MANY
intermediate: "fp64?"
return: i64?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a use case in mind for stddev on integer types? This seems like one of the things where you'd promote to fp32 or fp64 first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially, I looked into the usage in higher level, so even if we use integers the output should be in float. That's why I included the return as fp64. I am not sure about a practical use case for this, assuming we are excluding a data type, does that mean Substrait doesn't approve standard deviation calculated on integer values?

required: false
- value: fp32
nullability: DECLARED_OUTPUT
decomposable: MANY
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure standard deviation isn't decomposable. AFAIK it can only be computed once the mean is already known.

Copy link
Contributor Author

@vibhatha vibhatha Jul 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please correct me if I am wrong.

Standard deviation is the sqrt(variance).

Variance can be calculated by using mean (avg) and deviation for each sample.

So if we partition the data, we can calculate the mean for each partition in an embarassingly parallel manner. Then using an allreduce (aggregate in parallel) call we can find the global mean. Then we can calculate the deviation for each sample in parallel for each partition. Again we can use allreduce to sum these values up and also get the count of values in each partition. Now we have total sample count and deviation sum for all values. In the rank=0 process we can divide the sum by sample count. Then taking the sqrt we can calculate std_dev.

Is this wrong? Am I not following the definition of decomposability?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if this is correct, my definition is wrong, there has to be a struct<V1, V2>, I need to fix this either way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That works for parallelizing stddev in general, but violates causality for the way Substrait deals with decomposition. Decomposition in Substrait means separating the aggregation between independent plans, so for instance one plan is pushed to several storage nodes and does the first part of the aggregation, and then another plan operates on a column containing the results from those plans and finishes the computation. That's why the intermediate type matters for as far as Substrait is concerned; how it's parallelized at the level of a single plan is not Substrait's concern. When applied to stddev, you can first indeed compute the mean in a decomposable manner, but then you need that mean and all individual rows in order to compute the variance (which again you could do in parallel, but you don't have the individual rows anymore at this point). At best the intermediate type could be something like struct<fp64, list<fp64>> to track the mean and all the values, but that seems a bit silly; at that point you can make all aggregates "decomposable" by just making the intermediate type the list of all input values thus far.

Copy link
Contributor

@jacques-n jacques-n Jul 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference, this is the way that Calcite does decomposition of these functions:

AVG(x) → SUM(x) / COUNT(x)
STDDEV_POP(x) → SQRT( (SUM(x * x) - SUM(x) * SUM(x) / COUNT(x)) / COUNT(x))
STDDEV_SAMP(x) → SQRT( (SUM(x * x) - SUM(x) * SUM(x) / COUNT(x)) / CASE COUNT(x) WHEN 1 THEN NULL ELSE COUNT(x) - 1 END)
VAR_POP(x) → (SUM(x * x) - SUM(x) * SUM(x) / COUNT(x)) / COUNT(x)
VAR_SAMP(x) → (SUM(x * x) - SUM(x) * SUM(x) / COUNT(x)) / CASE COUNT(x) WHEN 1 THEN NULL ELSE COUNT(x) - 1 END
COVAR_POP(x, y) → (SUM(x * y) - SUM(x, y) * SUM(y, x) / REGR_COUNT(x, y)) / REGR_COUNT(x, y)
COVAR_SAMP(x, y) → (SUM(x * y) - SUM(x, y) * SUM(y, x) / REGR_COUNT(x, y)) / CASE REGR_COUNT(x, y) WHEN 1 THEN NULL ELSE REGR_COUNT(x, y) - 1 END
REGR_SXX(x, y) → REGR_COUNT(x, y) * VAR_POP(y)
REGR_SYY(x, y) → REGR_COUNT(x, y) * VAR_POP(x)

From there, SUM is decomposed into SUM0 in numerous places.

That being said, I don't think this decomposition needs to be defined within the context of the function. I'd be inclined for this to be done as a rewrite as opposed to a decomposable function. (The decomposition here is far more complex than what is intended for the decomposable concept.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed explanation @jacques-n
Should these be included in an Appendix section somewhere in the documentation. For instance, there are terminologies/standards that the Substrait community is trying to bring forward. So it would be very helpful if pointers to such definitions are there with more facts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who is the audience for the decomposition field? Is this advice to consumers / planners on how they might be able to decompose something? Is it a restriction that limits what decompositions are allowed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the audience is a hypothetical optimizer & task distribution engine. If it gets a plan with a decomposable aggregation function, it can use the decomposability information and intermediate type to split the plan up and distribute it over nodes. Using [X] to signify a column of type X, saying an aggregate function from [X] -> Z is ONE- or MANY-decomposable with intermediate type Y implies that [X] -> Y (INITIAL_TO_INTERMEDIATE) and [Y] -> Z (INTERMEDIATE_TO_RESULT) can also be used in plans, and that piping the results of a number of [X] -> Y invocations to a single [Y] -> Z invocation yields the same result as doing [X] -> Z on all Xs in one go. If it's MANY-decomposable [Y] -> Y (INTERMEDIATE_TO_INTERMEDIATE) also exists, allowing a distributor to use more exotic distribution patterns as well. Selection between these modes of operation happen using the "aggregation phase" option when binding the aggregate function (the default being INITIAL_TO_RESULT for the [X] -> Z variant), so effectively you're declaring several aggregate functions in one go by defining one to be decomposable.

How an engine implements a function internally (parallelization strategy, distribution over nodes, etc.) is out of scope for Substrait. It's ultimately just an interface, after all; such things are implementation details that are intentionally hidden by that interface.

@jvanstraten jvanstraten changed the title feat: Adding Standard Deviation feat: add standard deviation functions Jul 25, 2022
@vibhatha
Copy link
Contributor Author

@jvanstraten updated the PR with the recent suggestions.

@vibhatha vibhatha requested a review from jacques-n July 26, 2022 04:32
Copy link
Contributor

@jvanstraten jvanstraten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't know how I feel about the integer variants of these functions. I don't really see a use case for them; in fact I'd sooner implement them for decimals than for integers.

Comment on lines 539 to 541
decomposable: MANY
intermediate: "fp64?"
return: fp64?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look right.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo, fixed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, what doesn't look right is that these three lines came out of nowhere at the bottom of the file. They should be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I didn't catch that sorry. Now it should be okay.

Comment on lines 471 to 713
- args:
- options: [ SILENT, SATURATE, ERROR ]
required: false
- value: fp32
nullability: DECLARED_OUTPUT
return: fp32?
- args:
- options: [ SILENT, SATURATE, ERROR ]
required: false
- value: fp64
nullability: DECLARED_OUTPUT
return: fp64?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use the floating point rounding options rather than integer overflow options for floats.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure, if I understood this right, we are not including this for integer inputs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still an open question to me if the integer input versions should even exist (I feel pretty strongly that they shouldn't).

If they do, well, you're going to have to specify something about rounding for sure, just not the floating point options because those are specific to floating points. I mean, you're doing divisions and a sqrt at the end of the computation there. Division is defined for integers but won't do what you want (it rounds toward zero or negative infinity depending on who you ask, which in itself is a problem, but also means it's statistically biased), but sqrt certainly isn't.

Anyone intending to use the integer versions is still going to expect floating point return types from these. But then the question is which, i.e. fp32 or fp64? Or even some decimal? The solution to that, from a user-perspective, would be to just cast to the real-number type of your choice first, and then use that version of the function. So the integer versions would never be used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That make sense, let me remove the integer types from this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the integer types.

jvanstraten
jvanstraten previously approved these changes Jul 27, 2022
Copy link
Contributor

@jvanstraten jvanstraten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now.

@jvanstraten
Copy link
Contributor

Hm, CI doesn't agree with me though.

  • You'll need to rebase and make all your commits pass commitlint for the semantic release check. IMO it's a bit silly because your commits will be squashed anyway, but still...
  • The YAML validation error is because the intermediate type is erroneously marked as required. This has come up in an issue before, but I forget which. Apparently that one didn't get merged yet. Can you loosen the restriction by changing
    required: [ intermediate, return ]
    to just:
    required: [ return ]
    to make CI pass?

@jacques-n
Copy link
Contributor

@jvanstraten, I think the intermediate change was only for window functions. Looks like the same should be applied to aggregates.

@jvanstraten
Copy link
Contributor

Ah, I forgot those were separate things in the YAML schema. That makes sense.

@vibhatha
Copy link
Contributor Author

vibhatha commented Jul 28, 2022

@jacques-n @jvanstraten I squashed it to a single commit message. Let's check the CI's.
Do I need to make any changes to CI rules? Did I misundertood this

jvanstraten
jvanstraten previously approved these changes Jul 28, 2022
Copy link
Contributor

@jvanstraten jvanstraten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just made the schema change myself. LGTM now.

@jacques-n
Copy link
Contributor

@vibhatha , can you rebase so there aren't merge conflicts?

@vibhatha
Copy link
Contributor Author

@jacques-n rebased and resolved conflicts. Please check it once more.

Copy link
Contributor

@jacques-n jacques-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @vibhatha !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants