Skip to content

Improve correctness of stddev and variance with partial aggregation#23447

Merged
amitkdutta merged 1 commit intoprestodb:masterfrom
rschlussel:fix-stddev
Aug 15, 2024
Merged

Improve correctness of stddev and variance with partial aggregation#23447
amitkdutta merged 1 commit intoprestodb:masterfrom
rschlussel:fix-stddev

Conversation

@rschlussel
Copy link
Contributor

@rschlussel rschlussel commented Aug 14, 2024

Description

When merging varianceStates for partial aggregation, if the current state has zero rows, use the values from the other state without doing computation. This prevents introducing error due to imprecision in floating point numbers.

Additionally, change the way we combine means. This ensures that we do not introduce error due to imprecision in multiplication/division when the delta is 0. I think it should in general improve the error introduced by the mean computation, but I don't have a rigorous proof or even experimental data for this.

Motivation and Context

Queries with 0 variance on large values can return inconsistent and incorrect stddev due to error introduced by floating point arithmetic. For example, see the following result for a stddev and variance computations over a constant.

presto> SELECT count(*), stddev(6523763181031200), variance(6523763181031200) from test_table;
  _col0   |       _col1        |      _col2       
----------+--------------------+------------------
 61782553 | 2.2677238191332383 | 5.14257131986424 
(1 row)

that same query with partial aggregation disabled returns correct results


presto> set session prefer_partial_aggregation=false;
SET SESSION           
presto> SELECT count(*), stddev(6523763181031200), variance(6523763181031200) from test_table;
  _col0   | _col1 | _col2 
----------+-------+-------
 61782553 |   0.0 |   0.0 
(1 row)

This change reduces the amount of error we introduce in merging variance states during partial aggregation for certain cases to improve the accuracy of our variance and stddev functions.

Impact

Ensures that when variance or stddev is zero, results are always correct, and reduces the error we introduce for other cases.

Test Plan

new unit tests
production verifier run (in progress)

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Improve stddev and variance functions to always return correct results when input is constant. :pr:`23447`

When merging varianceStates for partial aggregation, if the current
state has zero rows, use the values from the other state without doing
computation. This prevents introducing error due to imprecision in
floating point numbers.

Additionally, change the way we combine means. This ensures that we
do not introduce error due to imprecision in multiplication/division
when the delta is 0. I think it should in general improve the error
introduced by the mean computation, but I don't have a rigorous proof
or even experimental data for this.
@steveburnett
Copy link
Contributor

Nit: suggest a minor edit to release notes entry following the Order of Changes in the Release Notes Guidelines, based on the commit message. Please modify my suggestion if you think of a better wording!

== RELEASE NOTES ==

General Changes
* Improve stddev and variance functions to always return correct results when input is constant. :pr:`23447`

@rschlussel rschlussel marked this pull request as ready for review August 14, 2024 20:43
@rschlussel rschlussel requested a review from presto-oss August 14, 2024 20:43
Copy link
Contributor

@amitkdutta amitkdutta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Thanks @rschlussel
Additionally, this will remove verification noise between native and java engines, as native engine computes it properly today with identical values for statistical aggregates (e.g. stddev, variance)

Copy link
Member

@zacw7 zacw7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the prompt fix!

@amitkdutta amitkdutta merged commit 4f74f52 into prestodb:master Aug 15, 2024
@tdcmeehan tdcmeehan mentioned this pull request Aug 23, 2024
34 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants