Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update to reflect new changes to the t-digest algorithm #20

Open
Dimplexion opened this issue Mar 19, 2019 · 2 comments
Open

Update to reflect new changes to the t-digest algorithm #20

Dimplexion opened this issue Mar 19, 2019 · 2 comments

Comments

@Dimplexion
Copy link

Lately there have been many changes to the t-digest that improve both accuracy and speed of the algorithm. Having the changes incorporated into this module will be very useful when this implementation of t-digest is used to implement the user facing quantile function in Dask.

There are couple of questions open still before the work can begin:

  1. Is the algorithm described in this paper the final one?
  2. Does the Java implementation in the t-digest repository contain the new changes? Can it be used as a reference?
@jcrist
Copy link
Member

jcrist commented Mar 19, 2019

Agreed. I'd have to give the paper a read to see what changes were made since I implemented this. However, I suggest you start with adding the existing implementation to Dask, and update the algorithm here afterwards. The algorithm as written works, and seeing it in use will likely influence what changes (if any) need to be made.

Is the algorithm described in this paper the final one?

Yes. The paper is tracked in the github repo as well: https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf.

Does the Java implementation in the t-digest repository contain the new changes? Can it be used as a reference?

Yes, and kind of. The tdigest license (apache) is included in our license file with a note in the relevant files that the algorithm is borrowed from the tdigest repo. However, the fastest way to do things in java is not the same as in C - you will likely have to make changes rather than doing a direct port.


I personally have no bandwidth right now to help with this project, but would accept whatever changes are made, provided they come with speed/accuracy improvements and proper tests.

@Dimplexion
Copy link
Author

Dimplexion commented Mar 19, 2019

It makes sense to do the Dask implementation first. The speed improvement is nice to have but probably not essential, the current version should work alright.

Thanks for the answers. My C is actually better than my Java so hopefully it's not too complicated to convert it.

I should have time to do this after I have the Dask implementation ready so I'll just put this on my pile.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants