Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix incorrect dot_factor usage #628

Merged
merged 13 commits into from
Aug 20, 2023
Merged

Conversation

pkorobov
Copy link
Contributor

@pkorobov pkorobov commented Jan 19, 2023

Annoy surprisingly didn't use dot_factor during DotProduct index construction. Here is my attempt to fix it.

I tried to reproduce the behavior of algorithm when we manually add (f+1)'th component to the data, as this is made in ann benchmarks (more information in the issue #619).

The list of fixes includes the following:

  1. I made an alternative two_means method called two_means_dot which uses dot_factor. Maybe not the neatest solution, I'm open to improve it.
  2. Fixed the formula in margin method of DotProduct class.

margin(a, b) = dot(a[:f], b[:f]) + a->dot_factor * b->dot_factor , but the code was a bit different.

Note that during index construction we need to use dot_factor, therefore I added extra margin and side methods with different signatures. They take a Node (not an array) as the second input vector and we'll call them while making a tree.

But during inference we don't need dot_factor, as the input vector must have it being equal to 0.
So during the inference the side / margin methods with old signature will be used.

  1. I've changed distance calculation

According to the xbox paper, we should use Angular or L2 distance with the extra component.
I fixed this according to that.


After fixes runs on a lastfm dataset show the following:

search_k 1000 10000 50000
Angular index recall (65 components) 0.563 0.736 0.783
Dot index (master, 64 components) 0.003 0.034 0.175
Dot index (my fix, 64 components) 0.514 0.692 0.787

So the results on this dataset finally became comparable.

Note that dot tests are passed, except the ones that check distance.
Of course, after all, I've changed it! :)
I am open to discuss how would it be better to fix them. If we really need exact dot scores, it would be better, I guess, to add some extra method to compute exactly dot scores.
But, as I think, it is better to use angular distance to form more proper splits and search for neighbors.

Also I would add an accuracy test for the lastfm dataset with only 64 components.
@erikbern may I ask you to load such a dataset with a link like here?
Then I would shortly add one more test to accuracy_test.py.

@pkorobov
Copy link
Contributor Author

Removed two_means_dot, must look better now.

@erikbern
Copy link
Collaborator

There's a dot product test in https://github.com/erikbern/ann-benchmarks/ that you could use if you want to.

This looks great – I'll merge this and will publish a new version to PyPI. Let me know when it's ready!

@pkorobov
Copy link
Contributor Author

pkorobov commented Jan 26, 2023

Added extra accuracy tests for lastfm-64-dot: angular test for this dataset as is and also dot test for shortened vectors. Had to complicate code a bit to override a metric written in the dataset, but, I hope, not too much.

Also I managed to return dot products as distances and fix 4 broken tests.
I added a flag built to nodes. When an index is built, index item nodes will remember it.
Angular distance is not needed after the index is already constructed, so when we calculate distance between some node and an index item node we can just return dot products.

We only need order made by a metric for NNs sorting during the inference, and both angular distance and dot make the same order because of our index construction.

Hope, this solution looks good.
As for me, I guess, it looks ready. However, I am ready to improve the PR if you have any comments!

@erikbern
Copy link
Collaborator

It doesn't change anything for the other distance metrics right? (angular etc)

@pkorobov
Copy link
Contributor Author

pkorobov commented Jan 27, 2023

No, it doesn't. Particularly angular performance hasn't changed at all. Haven't checked other metrics exact recalls, but tests seem to be fine.

I'm only adding some new methods for other metrics that don't actually do anything new (they do for DotProduct). These are update_mean and extra side and margin methods.

@pkorobov
Copy link
Contributor Author

pkorobov commented Jan 30, 2023

@erikbern I've checked the exact recalls in accuracy_test, including the new tests:

master:
fashion-mnist-784-euclidean accuracy: 99.72% (expected 90.00%)
glove-25-angular accuracy: 95.72% (expected 69.00%)
lastfm-65-angular accuracy: 66.80% (expected 60.00%)
lastfm-64-dot accuracy: 3.69% (expected 60.00%)
nytimes-16-angular accuracy: 98.12% (expected 80.00%)

PR:
fashion-mnist-784-euclidean accuracy: 99.72% (expected 90.00%)
glove-25-angular accuracy: 95.72% (expected 69.00%)
lastfm-65-angular accuracy: 66.80% (expected 60.00%)
lastfm-64-dot accuracy: 68.61% (expected 60.00%)
nytimes-16-angular accuracy: 98.12% (expected 80.00%)

@pkorobov
Copy link
Contributor Author

@erikbern a gentle reminder that this one is waiting for review, PTAL

@erikbern
Copy link
Collaborator

I think it would be great if @psobot could look at this! Since he's the author of the dot product code

@psobot
Copy link
Member

psobot commented Aug 17, 2023

Oof, my apologies - I didn't have GitHub notifications on for this repo and just stumbled across this today.

Very nice work @pkorobov! Changes look good to me, and the accuracy improvements speak for themselves. Thank you!

@erikbern
Copy link
Collaborator

Sorry for also dropping the ball on this. Let's try to get this merged. Let me see how bad the conflicts are.

@erikbern
Copy link
Collaborator

Will merge this if tests are passing!

@pkorobov
Copy link
Contributor Author

This is great to see updates on the PR, thanks!
It seems that the lastfm-dot-64.hdf5 dataset is not available on ann benchmarks anymore.

@erikbern
Copy link
Collaborator

Cool let me try to fix

@erikbern erikbern merged commit 2be37c9 into spotify:main Aug 20, 2023
15 checks passed
@erikbern
Copy link
Collaborator

Thanks!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants