feat: add standard deviation functions #257

vibhatha · 2022-07-25T08:12:07Z

This PR includes standard_deviation definition.

jvanstraten · 2022-07-25T11:06:24Z

#251 applies. I would expect to see only fp32 -> fp32 and fp64 -> fp64 stddev, mean, etc.

vibhatha · 2022-07-25T14:08:31Z

#251 applies. I would expect to see only fp32 -> fp32 and fp64 -> fp64 stddev, mean, etc.

I thought about this, since the discussio is active here, I decided to keep the promotion. I am also aligned to non-promotion. I will update this.

jvanstraten · 2022-07-25T14:30:20Z

extensions/functions_arithmetic.yaml

+      - args:
+          - options: [ SILENT, SATURATE, ERROR ]
+            required: false
+          - value: i8
+        nullability: DECLARED_OUTPUT
+        decomposable: MANY
+        intermediate: "fp64?"
+        return: i8?
+      - args:
+          - options: [ SILENT, SATURATE, ERROR ]
+            required: false
+          - value: i16
+        nullability: DECLARED_OUTPUT
+        decomposable: MANY
+        intermediate: "fp64?"
+        return: i16?
+      - args:
+          - options: [ SILENT, SATURATE, ERROR ]
+            required: false
+          - value: i32
+        nullability: DECLARED_OUTPUT
+        decomposable: MANY
+        intermediate: "fp64?"
+        return: i32?
+      - args:
+          - options: [ SILENT, SATURATE, ERROR ]
+            required: false
+          - value: i64
+        nullability: DECLARED_OUTPUT
+        decomposable: MANY
+        intermediate: "fp64?"
+        return: i64?


Do you have a use case in mind for stddev on integer types? This seems like one of the things where you'd promote to fp32 or fp64 first.

Initially, I looked into the usage in higher level, so even if we use integers the output should be in float. That's why I included the return as fp64. I am not sure about a practical use case for this, assuming we are excluding a data type, does that mean Substrait doesn't approve standard deviation calculated on integer values?

jvanstraten · 2022-07-25T14:33:07Z

extensions/functions_arithmetic.yaml

+            required: false
+          - value: fp32
+        nullability: DECLARED_OUTPUT
+        decomposable: MANY


I'm pretty sure standard deviation isn't decomposable. AFAIK it can only be computed once the mean is already known.

Please correct me if I am wrong.

Standard deviation is the sqrt(variance).

Variance can be calculated by using mean (avg) and deviation for each sample.

So if we partition the data, we can calculate the mean for each partition in an embarassingly parallel manner. Then using an allreduce (aggregate in parallel) call we can find the global mean. Then we can calculate the deviation for each sample in parallel for each partition. Again we can use allreduce to sum these values up and also get the count of values in each partition. Now we have total sample count and deviation sum for all values. In the rank=0 process we can divide the sum by sample count. Then taking the sqrt we can calculate std_dev.

Is this wrong? Am I not following the definition of decomposability?

Even if this is correct, my definition is wrong, there has to be a struct<V1, V2>, I need to fix this either way.

That works for parallelizing stddev in general, but violates causality for the way Substrait deals with decomposition. Decomposition in Substrait means separating the aggregation between independent plans, so for instance one plan is pushed to several storage nodes and does the first part of the aggregation, and then another plan operates on a column containing the results from those plans and finishes the computation. That's why the intermediate type matters for as far as Substrait is concerned; how it's parallelized at the level of a single plan is not Substrait's concern. When applied to stddev, you can first indeed compute the mean in a decomposable manner, but then you need that mean and all individual rows in order to compute the variance (which again you could do in parallel, but you don't have the individual rows anymore at this point). At best the intermediate type could be something like struct<fp64, list<fp64>> to track the mean and all the values, but that seems a bit silly; at that point you can make all aggregates "decomposable" by just making the intermediate type the list of all input values thus far.

For reference, this is the way that Calcite does decomposition of these functions:

AVG(x) → SUM(x) / COUNT(x) STDDEV_POP(x) → SQRT( (SUM(x * x) - SUM(x) * SUM(x) / COUNT(x)) / COUNT(x)) STDDEV_SAMP(x) → SQRT( (SUM(x * x) - SUM(x) * SUM(x) / COUNT(x)) / CASE COUNT(x) WHEN 1 THEN NULL ELSE COUNT(x) - 1 END) VAR_POP(x) → (SUM(x * x) - SUM(x) * SUM(x) / COUNT(x)) / COUNT(x) VAR_SAMP(x) → (SUM(x * x) - SUM(x) * SUM(x) / COUNT(x)) / CASE COUNT(x) WHEN 1 THEN NULL ELSE COUNT(x) - 1 END COVAR_POP(x, y) → (SUM(x * y) - SUM(x, y) * SUM(y, x) / REGR_COUNT(x, y)) / REGR_COUNT(x, y) COVAR_SAMP(x, y) → (SUM(x * y) - SUM(x, y) * SUM(y, x) / REGR_COUNT(x, y)) / CASE REGR_COUNT(x, y) WHEN 1 THEN NULL ELSE REGR_COUNT(x, y) - 1 END REGR_SXX(x, y) → REGR_COUNT(x, y) * VAR_POP(y) REGR_SYY(x, y) → REGR_COUNT(x, y) * VAR_POP(x)

From there, SUM is decomposed into SUM0 in numerous places.

That being said, I don't think this decomposition needs to be defined within the context of the function. I'd be inclined for this to be done as a rewrite as opposed to a decomposable function. (The decomposition here is far more complex than what is intended for the decomposable concept.)

Thanks for the detailed explanation @jacques-n
Should these be included in an Appendix section somewhere in the documentation. For instance, there are terminologies/standards that the Substrait community is trying to bring forward. So it would be very helpful if pointers to such definitions are there with more facts.

Who is the audience for the decomposition field? Is this advice to consumers / planners on how they might be able to decompose something? Is it a restriction that limits what decompositions are allowed?

I think the audience is a hypothetical optimizer & task distribution engine. If it gets a plan with a decomposable aggregation function, it can use the decomposability information and intermediate type to split the plan up and distribute it over nodes. Using [X] to signify a column of type X, saying an aggregate function from [X] -> Z is ONE- or MANY-decomposable with intermediate type Y implies that [X] -> Y (INITIAL_TO_INTERMEDIATE) and [Y] -> Z (INTERMEDIATE_TO_RESULT) can also be used in plans, and that piping the results of a number of [X] -> Y invocations to a single [Y] -> Z invocation yields the same result as doing [X] -> Z on all Xs in one go. If it's MANY-decomposable [Y] -> Y (INTERMEDIATE_TO_INTERMEDIATE) also exists, allowing a distributor to use more exotic distribution patterns as well. Selection between these modes of operation happen using the "aggregation phase" option when binding the aggregate function (the default being INITIAL_TO_RESULT for the [X] -> Z variant), so effectively you're declaring several aggregate functions in one go by defining one to be decomposable.

How an engine implements a function internally (parallelization strategy, distribution over nodes, etc.) is out of scope for Substrait. It's ultimately just an interface, after all; such things are implementation details that are intentionally hidden by that interface.

vibhatha · 2022-07-26T04:32:02Z

@jvanstraten updated the PR with the recent suggestions.

jvanstraten

I still don't know how I feel about the integer variants of these functions. I don't really see a use case for them; in fact I'd sooner implement them for decimals than for integers.

jvanstraten · 2022-07-26T09:31:05Z

extensions/functions_arithmetic.yaml

+        decomposable: MANY
+        intermediate: "fp64?"
+        return: fp64?


This doesn't look right.

Typo, fixed.

No, what doesn't look right is that these three lines came out of nowhere at the bottom of the file. They should be removed.

Yes, I didn't catch that sorry. Now it should be okay.

jvanstraten · 2022-07-26T09:32:57Z

extensions/functions_arithmetic.yaml

+      - args:
+          - options: [ SILENT, SATURATE, ERROR ]
+            required: false
+          - value: fp32
+        nullability: DECLARED_OUTPUT
+        return: fp32?
+      - args:
+          - options: [ SILENT, SATURATE, ERROR ]
+            required: false
+          - value: fp64
+        nullability: DECLARED_OUTPUT
+        return: fp64?


Please use the floating point rounding options rather than integer overflow options for floats.

Just to make sure, if I understood this right, we are not including this for integer inputs?

It's still an open question to me if the integer input versions should even exist (I feel pretty strongly that they shouldn't).

If they do, well, you're going to have to specify something about rounding for sure, just not the floating point options because those are specific to floating points. I mean, you're doing divisions and a sqrt at the end of the computation there. Division is defined for integers but won't do what you want (it rounds toward zero or negative infinity depending on who you ask, which in itself is a problem, but also means it's statistically biased), but sqrt certainly isn't.

Anyone intending to use the integer versions is still going to expect floating point return types from these. But then the question is which, i.e. fp32 or fp64? Or even some decimal? The solution to that, from a user-perspective, would be to just cast to the real-number type of your choice first, and then use that version of the function. So the integer versions would never be used.

That make sense, let me remove the integer types from this.

Removed the integer types.

jvanstraten

LGTM now.

jvanstraten · 2022-07-27T16:57:37Z

Hm, CI doesn't agree with me though.

You'll need to rebase and make all your commits pass commitlint for the semantic release check. IMO it's a bit silly because your commits will be squashed anyway, but still...
The YAML validation error is because the intermediate type is erroneously marked as required. This has come up in an issue before, but I forget which. Apparently that one didn't get merged yet. Can you loosen the restriction by changing

substrait/text/simple_extensions_schema.yaml

Line 203 in d4cfbe0

required: [ intermediate, return ]

to just:
```
required: [ return ]
```
to make CI pass?

jacques-n · 2022-07-28T04:41:43Z

@jvanstraten, I think the intermediate change was only for window functions. Looks like the same should be applied to aggregates.

jvanstraten · 2022-07-28T11:34:17Z

Ah, I forgot those were separate things in the YAML schema. That makes sense.

vibhatha · 2022-07-28T13:54:28Z

@jacques-n @jvanstraten I squashed it to a single commit message. Let's check the CI's.
Do I need to make any changes to CI rules? Did I misundertood this

jvanstraten

I just made the schema change myself. LGTM now.

jacques-n · 2022-07-28T16:27:50Z

@vibhatha , can you rebase so there aren't merge conflicts?

vibhatha · 2022-07-28T17:14:47Z

@jacques-n rebased and resolved conflicts. Please check it once more.

jacques-n

Thanks @vibhatha !

vibhatha marked this pull request as ready for review July 25, 2022 14:25

jvanstraten reviewed Jul 25, 2022

View reviewed changes

vibhatha requested a review from jvanstraten July 25, 2022 15:11

jvanstraten changed the title ~~feat: Adding Standard Deviation~~ feat: add standard deviation functions Jul 25, 2022

vibhatha force-pushed the aggregate-std_dev branch from a2791cb to e95afb5 Compare July 26, 2022 04:22

vibhatha requested a review from jacques-n July 26, 2022 04:32

jvanstraten reviewed Jul 26, 2022

View reviewed changes

jvanstraten requested changes Jul 26, 2022

View reviewed changes

vibhatha mentioned this pull request Jul 26, 2022

feat: add variance function #263

Merged

vibhatha force-pushed the aggregate-std_dev branch from ca7bb91 to 34e1368 Compare July 27, 2022 13:29

vibhatha requested a review from jvanstraten July 27, 2022 13:30

jvanstraten previously approved these changes Jul 27, 2022

View reviewed changes

vibhatha dismissed jvanstraten’s stale review via 6725d86 July 28, 2022 13:53

vibhatha force-pushed the aggregate-std_dev branch from 34e1368 to 6725d86 Compare July 28, 2022 13:53

vibhatha requested review from westonpace and jvanstraten July 28, 2022 13:56

jvanstraten previously approved these changes Jul 28, 2022

View reviewed changes

vibhatha dismissed jvanstraten’s stale review via 995fc2f July 28, 2022 16:51

vibhatha force-pushed the aggregate-std_dev branch from 20d5959 to 995fc2f Compare July 28, 2022 16:51

fix(rebase): resolve conflicts

ac57859

vibhatha force-pushed the aggregate-std_dev branch from 6ada42b to ac57859 Compare July 28, 2022 17:12

jacques-n approved these changes Jul 28, 2022

View reviewed changes

jacques-n merged commit 1339534 into substrait-io:main Jul 28, 2022

jacques-n mentioned this pull request Aug 3, 2022

[ISTHMUS] Cannot upgrade to substrait 0.9.0 substrait-io/substrait-java#83

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add standard deviation functions #257

feat: add standard deviation functions #257

vibhatha commented Jul 25, 2022 •

edited

Loading

jvanstraten commented Jul 25, 2022

vibhatha commented Jul 25, 2022

jvanstraten Jul 25, 2022

vibhatha Jul 25, 2022

jvanstraten Jul 25, 2022

vibhatha Jul 25, 2022 •

edited

Loading

vibhatha Jul 25, 2022

jvanstraten Jul 25, 2022

jacques-n Jul 25, 2022 •

edited

Loading

vibhatha Jul 26, 2022

westonpace Jul 26, 2022

jvanstraten Jul 26, 2022

vibhatha commented Jul 26, 2022

jvanstraten left a comment

jvanstraten Jul 26, 2022

vibhatha Jul 26, 2022

jvanstraten Jul 27, 2022

vibhatha Jul 27, 2022

jvanstraten Jul 26, 2022

vibhatha Jul 26, 2022

jvanstraten Jul 27, 2022

vibhatha Jul 27, 2022

vibhatha Jul 27, 2022

jvanstraten left a comment

jvanstraten commented Jul 27, 2022

jacques-n commented Jul 28, 2022

jvanstraten commented Jul 28, 2022

vibhatha commented Jul 28, 2022 •

edited

Loading

jvanstraten left a comment

jacques-n commented Jul 28, 2022

vibhatha commented Jul 28, 2022

jacques-n left a comment

feat: add standard deviation functions #257

feat: add standard deviation functions #257

Conversation

vibhatha commented Jul 25, 2022 • edited Loading

jvanstraten commented Jul 25, 2022

vibhatha commented Jul 25, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vibhatha Jul 25, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jacques-n Jul 25, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vibhatha commented Jul 26, 2022

jvanstraten left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jvanstraten left a comment

Choose a reason for hiding this comment

jvanstraten commented Jul 27, 2022

jacques-n commented Jul 28, 2022

jvanstraten commented Jul 28, 2022

vibhatha commented Jul 28, 2022 • edited Loading

jvanstraten left a comment

Choose a reason for hiding this comment

jacques-n commented Jul 28, 2022

vibhatha commented Jul 28, 2022

jacques-n left a comment

Choose a reason for hiding this comment

vibhatha commented Jul 25, 2022 •

edited

Loading

vibhatha Jul 25, 2022 •

edited

Loading

jacques-n Jul 25, 2022 •

edited

Loading

vibhatha commented Jul 28, 2022 •

edited

Loading