-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upgrade t-digest to 3.2 #28305
Upgrade t-digest to 3.2 #28305
Conversation
Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually? |
1 similar comment
Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to understand why some of these values are different now, especially as for the one on line 100 of the tests the change in the values is relatively big. I think its ok for the values to change if they are more correct but we should be able to explain why the values needed to change.
Thanks @colings86 ! I'm not a expert on t-digest, please correct me if I'm wrong. I made a test for both t-digest 3.2 and 3.1, I found the big change is introduced in 3.2:
The output is:
For 3.1, the output:
Then I found another implementation https://github.com/CamDavidsonPilon/tdigest and made a test:
The output is:
So I trend to think it's more accurate than before. |
@liketic I had a look into the differences between t-digest 3.0 and t-digest 3.2 specifically around the AVLTreeDigest that we use in Elasticsearch. The key difference that is causing the tests here to need changing seems to be a different approach to interpolating the value when the requested quantile is between two centroids. As an example, if we are looking for the 81.25th percentile and we have 8 values the percentile is located at the 6.5th value. Let say that we have a centroid that contains the 6th value with a centroid value of 4 and a centroid containing the 7th value with centroid value of 9. In 3.0 the interpolation was linear between the two centroids. So it will return the value that is half way between 5 and 8, since the value we want is half way between the 6th and 7th value. So the returned value is 8.5 In 3.2 the interpolation is a weighted average between the two value which takes into account the count of values in the 6th and 7th centroids. So if the 7th centroid has more values than the 6th centroid, the interpolation is biased more towards the 7th centroid. I think the weighted interpolation is a better approach so I am happy that the change is a positive one. Would you mind updating your branch with the latest master and then I can set of another CI build to make sure we are still green before I merge this? Thanks for working on this and also for your patience waiting for my response (sorry for the delay) |
7309ac5
to
d1890b5
Compare
Wow! @colings86 Thank you so much for your professional investigation! I rebased the pull request, please trigger jenkins. |
@liketic no problem, I'll trigger a build now |
jenkins please test this |
jenkins test this please |
@liketic it looks like there are still test failures in |
jenkins test this please |
Jenkins retest this please |
@liketic it looks like there are some test failures left in the documentation tests:
I think the expected outputs in the percentiles aggregation page in the docs just need adjusting. Are you able to make those changes? |
@colings86 That's my mistake. I pushed 6f65266f3d9bd6b67e42201ef8b4f518f2612410 to fix the tests. Please trigger jenkins again. |
jenkins please retest this |
@colings86 The testing is still failed, I'll rebase the PR, WDYT? |
@liketic yes I think rebasing the PR would be good to try here as I think the failure is related to some changes that were made overnight. Thanks |
6f65266
to
a307bdd
Compare
jenkins test this please |
Merged to master and backported to 6.x. Thanks for working on this @liketic and for sticking with it through all the build failures 😄 |
* master: Add a note to the docs that _cat api `help` option cannot be used if an optional url param is used (elastic#28686) Lift error finding utility to exceptions helpers Change "tweet" type to "_doc" (elastic#28690) [Docs] Add missing word in nested.asciidoc (elastic#28507) Simplify the Translog constructor by always expecting an existing translog (elastic#28676) Upgrade t-digest to 3.2 (elastic#28295) (elastic#28305) Add comment explaining lazy declared versions
…on on Percentiles aggregations (#1531) I've updated the: - docker-compose to a newer version (3.4) - updated to ElasticSearch 6.4.x (available on [HERE](https://www.docker.elastic.co/)) - Updated Percentiles Aggregation Tests as the T-Digest algorithm has been updated and so all the tests failed. Have a look at [the PR on ES](elastic/elasticsearch#28305)
Upgrade t-digest to 3.2. Some unit tests are failed seems like the
percentile
algorithm is different as the current version.Closes #28295
/cc @colings86