Improve correctness of stddev and variance with partial aggregation#23447
Merged
amitkdutta merged 1 commit into prestodb:master on Aug 15, 2024
Contributor
Nit: suggest a minor edit to the release notes entry, following the Order of Changes in the Release Notes Guidelines and based on the commit message. Please modify my suggestion if you think of better wording!
amitkdutta
approved these changes
Aug 14, 2024
Contributor
amitkdutta
left a comment
Looks great. Thanks @rschlussel
Additionally, this will remove verification noise between the native and Java engines, since the native engine already computes statistical aggregates (e.g. stddev, variance) properly and returns identical values.
elharo
approved these changes
Aug 14, 2024
feilong-liu
approved these changes
Aug 14, 2024
Description
When merging varianceStates for partial aggregation, if the current state has zero rows, use the values from the other state directly, without doing any computation. This avoids introducing error from floating-point imprecision.
Additionally, change the way we combine means so that no error is introduced by imprecise multiplication/division when the delta is 0. I think this should in general reduce the error introduced by the mean computation, but I don't have a rigorous proof or even experimental data for this.
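A minimal sketch of the merge described above, assuming the usual count/mean/M2 representation of a variance state and a delta-based mean update. The names `VarianceMergeSketch`, `State`, and `merge` are hypothetical and are not the actual Presto code; the exact formulas used in this PR may differ.

```java
public final class VarianceMergeSketch
{
    // Hypothetical stand-in for a variance state: row count, running mean,
    // and M2 (the sum of squared deviations from the mean).
    static final class State
    {
        long count;
        double mean;
        double m2;

        State(long count, double mean, double m2)
        {
            this.count = count;
            this.mean = mean;
            this.m2 = m2;
        }
    }

    // Merges `other` into `state` in place.
    static void merge(State state, State other)
    {
        if (other.count == 0) {
            // Nothing to merge.
            return;
        }
        if (state.count == 0) {
            // Zero-row shortcut: copy the other state verbatim so that no
            // floating-point arithmetic touches its mean or M2.
            state.count = other.count;
            state.mean = other.mean;
            state.m2 = other.m2;
            return;
        }
        long newCount = state.count + other.count;
        double delta = other.mean - state.mean;
        // Delta-based mean update: when delta is 0 the correction term is
        // exactly 0, so the mean is left untouched.
        double newMean = state.mean + delta * other.count / newCount;
        // Pairwise combination of M2; the cross term also vanishes when delta is 0.
        double newM2 = state.m2 + other.m2 + delta * delta * state.count * other.count / newCount;
        state.count = newCount;
        state.mean = newMean;
        state.m2 = newM2;
    }
}
```

With this shape, merging any number of partial states that all hold the same constant leaves the mean bit-for-bit unchanged and M2 at 0, so variance and stddev come out as exactly 0.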
Motivation and Context
Queries with 0 variance on large values can return inconsistent and incorrect stddev due to error introduced by floating point arithmetic. For example, stddev and variance computations over a constant can return incorrect results when partial aggregation is enabled, while the same query with partial aggregation disabled returns correct results.
This change reduces the error introduced when merging variance states during partial aggregation in certain cases, improving the accuracy of our variance and stddev functions.
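To illustrate the kind of round-off involved, here is a small sketch comparing two ways of combining means. It assumes a weighted-mean formula as the error-prone variant and a delta-based formula as the alternative; `MeanRoundTripDemo`, `weightedMean`, and `deltaMean` are hypothetical names for illustration, not Presto functions, and the constant 0.1 is chosen only because its round-off is easy to see.

```java
public final class MeanRoundTripDemo
{
    // Weighted-mean combine: multiplies each mean by its count, then divides by the total.
    static double weightedMean(long countA, double meanA, long countB, double meanB)
    {
        return (countA * meanA + countB * meanB) / (double) (countA + countB);
    }

    // Delta-based combine: adds a correction that is exactly zero when the means agree.
    static double deltaMean(long countA, double meanA, long countB, double meanB)
    {
        double delta = meanB - meanA;
        return meanA + delta * countB / (countA + countB);
    }

    public static void main(String[] args)
    {
        double constant = 0.1;
        // Merging an empty state (count 0, mean 0) with three rows of the constant:
        // 3 * 0.1 / 3 does not round-trip exactly in double arithmetic.
        System.out.println(weightedMean(0, 0.0, 3, constant) == constant); // prints false
        // Merging two states that already agree on the mean: delta is 0, so the mean is preserved.
        System.out.println(deltaMean(3, constant, 3, constant) == constant); // prints true
    }
}
```

Once a mean has drifted by even one unit in the last place, a later merge sees a nonzero delta between states that should agree, which is one way a constant input can end up with a nonzero variance.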
Impact
Ensures that when variance or stddev is zero, results are always correct, and reduces the error we introduce for other cases.
Test Plan
new unit tests
production verifier run (in progress)
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.