refactor cluster selection so it can be replaced, add dbscan based clustering #28

kchaliki · 2020-02-04T20:40:00Z

Sending this here in case it is useful for anyone else, I found that for my use case of searching through a large body of music tracks (converted to 3-grams), trying to find the same track in different parts of the library I was getting bad results even with k_clusters=500 (and the obvious speed disadvantage). This change seems to make it behave as good as the greedy knn in scikit with k_clusters=5 hence many orders of magnitude faster.

… based

facebook-github-bot · 2020-02-04T20:40:17Z

Hi kchaliki! Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file.In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

facebook-github-bot · 2020-02-04T21:04:07Z

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

spencebeecher · 2020-02-05T15:27:27Z

This is great! I like the architecture change to the code.

One thing I would like to add is an example in one of the ipython notebooks - specifically this one - https://github.com/facebookresearch/pysparnn/blob/master/examples/sparse_search_comparison.ipynb

The other question that I have - is this only intended to be used with ClusterIndex and not MultiClusterIndex?

kchaliki · 2020-02-05T22:44:57Z

Thanks for looking at this @spencebeecher - to be honest I didn't really look at what MultiClusterIndex is about in a lot of detail, do you think it should also be used there? I will definitely add to the notebook in the coming days.

kchaliki · 2020-02-06T08:38:44Z

So it looks like LSHForest which you are comparing against in the notebook has been deprecated and in fact removed in later versions of sklearn in favour of other ANN libraries like this one 😂 - discussion is here

I tried going back to the original requirements versions in the package but can no longer get numpy 1.11.2 to pip install due to header changes in Ubuntu. I can most likely get it to work by juggling versions but perhaps it doesn't make sense to compare to LSHForest any more anyway?

kchaliki · 2020-02-14T11:19:32Z

Hey @spencebeecher - have you decided what you want to do about the notebook comparisons?

refactor cluster selection so it can be replaced, start adding dbscan…

3f732b6

… based

facebook-github-bot added the CLA Signed label Feb 4, 2020

kchaliki mentioned this pull request Feb 5, 2020

Improve top-level item selection with k-means++ initialisation #24

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor cluster selection so it can be replaced, add dbscan based clustering #28

refactor cluster selection so it can be replaced, add dbscan based clustering #28

kchaliki commented Feb 4, 2020

facebook-github-bot commented Feb 4, 2020

facebook-github-bot commented Feb 4, 2020

spencebeecher commented Feb 5, 2020

kchaliki commented Feb 5, 2020

kchaliki commented Feb 6, 2020

kchaliki commented Feb 14, 2020

refactor cluster selection so it can be replaced, add dbscan based clustering #28

Are you sure you want to change the base?

refactor cluster selection so it can be replaced, add dbscan based clustering #28

Conversation

kchaliki commented Feb 4, 2020

facebook-github-bot commented Feb 4, 2020

facebook-github-bot commented Feb 4, 2020

spencebeecher commented Feb 5, 2020

kchaliki commented Feb 5, 2020

kchaliki commented Feb 6, 2020

kchaliki commented Feb 14, 2020