@cyb70289 cyb70289 commented Oct 12, 2020

Improve variance merging method to address stability issue when merging
short chunks with approximate mean value.

Improve reference variance accuracy by leveraging Kahan summation.
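
As an illustration of the second point, here is a minimal sketch of a reference variance computed with Kahan (compensated) summation. The names `KahanSum` and `ReferenceVariance` are hypothetical and are not taken from the PR; the sketch only assumes a two-pass variance where both the mean and the sum of squared deviations are accumulated with compensation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Compensated summation: `c` carries the low-order bits lost when a small
// addend is added to a much larger running sum.
double KahanSum(const std::vector<double>& values) {
  double sum = 0.0;
  double c = 0.0;
  for (double v : values) {
    const double y = v - c;    // apply the correction from the previous step
    const double t = sum + y;  // low-order bits of y may be lost here
    c = (t - sum) - y;         // recover what was lost (negated)
    sum = t;
  }
  return sum;
}

// Hypothetical helper: two-pass variance with both passes summed via Kahan.
// ddof = 0 gives the population variance, ddof = 1 the sample variance.
double ReferenceVariance(const std::vector<double>& values, int ddof) {
  assert(ddof >= 0);
  assert(values.size() > static_cast<std::size_t>(ddof));
  const double n = static_cast<double>(values.size());
  const double mean = KahanSum(values) / n;
  std::vector<double> squared;
  squared.reserve(values.size());
  for (double v : values) squared.push_back((v - mean) * (v - mean));
  return KahanSum(squared) / (n - ddof);
}
```

Note that aggressive floating-point flags such as `-ffast-math` allow the compiler to simplify the compensation term away, so a routine like this should be built without them.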

@cyb70289
Contributor Author

The CI failure looks unrelated.

this->AssertVarStdIs("[100000004, 100000007, 100000013, 100000016]", options, 30.0);
this->AssertVarStdIs("[1000000004, 1000000007, 1000000013, 1000000016]", options, 30.0);

#ifndef __MINGW32__ // MinGW has precision issues
Member

This failed only on the 32-bit MinGW build, so the cause may be x87 floating point rather than MinGW itself (perhaps you can check with a 32-bit Linux build?).
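
One way to check whether a given build evaluates floating-point expressions in extended precision (as x87 code generation typically does) is to inspect the standard `FLT_EVAL_METHOD` macro from `<cfloat>`. This is only a sketch; the function names are hypothetical and nothing like this appears in the PR.

```cpp
#include <cfloat>

// Returns the implementation's FLT_EVAL_METHOD:
//    0  operations are evaluated in the range and precision of their type
//    1  float and double operations are evaluated as double
//    2  all operations are evaluated as long double (typical of x87 builds)
//   -1  indeterminate
constexpr int FloatEvalMethod() { return FLT_EVAL_METHOD; }

// True when intermediate results may carry extra precision, which can make
// results differ from a build that rounds every operation to double.
constexpr bool UsesExtendedPrecision() { return FloatEvalMethod() == 2; }
```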

Contributor Author

This test failed on the MinGW 32-bit community CI, and I see similar comments in the decimal unit test:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/decimal_test.cc#L695

I haven't tested it on my side. Maybe I can start a 32-bit VM to check.

Member

Ok, I've checked and there is no failure on Linux i386. It does seem MinGW-related.

Member

@pitrou pitrou left a comment

Will merge. Thanks a lot for doing this!

@pitrou pitrou closed this in f07a415 Oct 13, 2020
@cyb70289 cyb70289 deleted the variance-stability branch October 13, 2020 09:14

alippai commented Oct 13, 2020

Are there any before/after benchmarks? It's really nice that we can have extra numerical stability, I'm just curious what's the penalty for it.

@cyb70289
Contributor Author

> Are there any before/after benchmarks? It's really nice that we can have extra numerical stability, I'm just curious what's the penalty for it.

This change only affects combining variances from multiple arrays. The time spent there is trivial compared with computing the variance of each array.
The benchmark also shows no difference (the benchmark PR is pending review, #8407).
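
To make the "combining variances from multiple arrays" step concrete, here is a minimal sketch of one common way to merge per-chunk summaries. It assumes each chunk is summarized as a count, a mean, and `m2` (the sum of squared deviations from that chunk's mean); the struct and function names are hypothetical, and the formula is a standard combined-variance identity rather than a copy of the code in this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-chunk summary: m2 = sum((x - mean)^2) over the chunk.
struct VarState {
  int64_t count = 0;
  double mean = 0.0;
  double m2 = 0.0;
};

// Two-pass summary of a single chunk.
VarState Summarize(const std::vector<double>& values) {
  VarState s;
  s.count = static_cast<int64_t>(values.size());
  if (s.count == 0) return s;
  double sum = 0.0;
  for (double v : values) sum += v;
  s.mean = sum / s.count;
  for (double v : values) s.m2 += (v - s.mean) * (v - s.mean);
  return s;
}

// Merge two summaries. Each chunk's m2 is shifted to the combined mean by
// adding count * (chunk_mean - combined_mean)^2; the cost is O(1) per merge.
VarState Merge(const VarState& a, const VarState& b) {
  if (a.count == 0) return b;
  if (b.count == 0) return a;
  VarState out;
  out.count = a.count + b.count;
  out.mean = (a.mean * a.count + b.mean * b.count) / out.count;
  const double da = a.mean - out.mean;
  const double db = b.mean - out.mean;
  out.m2 = a.m2 + b.m2 + a.count * da * da + b.count * db * db;
  return out;
}

// ddof = 0 gives the population variance, ddof = 1 the sample variance.
double Variance(const VarState& s, int ddof) {
  assert(s.count > ddof);
  return s.m2 / (s.count - ddof);
}
```

Because a merge touches only three numbers per chunk regardless of chunk length, its cost is independent of the array sizes, which is consistent with the observation that it is negligible next to the per-array pass.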


alippai commented Oct 14, 2020

Amazing, thanks!

kszucs pushed a commit that referenced this pull request Oct 19, 2020
Improve variance merging method to address stability issue when merging
short chunks with approximate mean value.

Improve reference variance accuracy by leveraging Kahan summation.

Closes #8437 from cyb70289/variance-stability

Authored-by: Yibo Cai <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>