Fix incorrect dot_factor usage #628

pkorobov · 2023-01-19T11:48:38Z

Annoy surprisingly didn't use dot_factor during DotProduct index construction. Here is my attempt to fix it.

I tried to reproduce the behavior of the algorithm when the (f+1)-th component is added to the data manually, as is done in ann-benchmarks (see issue #619 for more information).

The list of fixes includes the following:

I added an alternative to the two_means method, called two_means_dot, which uses dot_factor. It may not be the neatest solution, and I'm open to improving it.
Fixed the formula in the margin method of the DotProduct class.

The correct formula is margin(a, b) = dot(a[:f], b[:f]) + a->dot_factor * b->dot_factor, but the code computed something slightly different.

Note that we need to use dot_factor during index construction, so I added extra margin and side methods with different signatures. They take a Node (not an array) as the second input vector, and they are called while building a tree.

During inference, however, we don't need dot_factor, because the input vector must have it equal to 0.
So during inference the side / margin methods with the old signature are used.
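The two call paths described above can be illustrated with a minimal Python sketch. This is not Annoy's C++ code: the `Node` class and the names `margin_build` and `margin_query` are hypothetical, and only the formula itself comes from the description above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for an index node: f components plus dot_factor."""
    v: List[float]
    dot_factor: float = 0.0


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product over the first f components."""
    return sum(x * y for x, y in zip(a, b))


def margin_build(n: Node, y: Node) -> float:
    """Margin while building the tree: both sides carry a dot_factor."""
    return dot(n.v, y.v) + n.dot_factor * y.dot_factor


def margin_query(n: Node, y: List[float]) -> float:
    """Margin at query time: the query's extra component is 0, so it drops out."""
    return dot(n.v, y)


split = Node(v=[1.0, -2.0], dot_factor=0.5)
item = Node(v=[3.0, 1.0], dot_factor=2.0)

build_margin = margin_build(split, item)    # 1*3 + (-2)*1 + 0.5*2.0
query_margin = margin_query(split, item.v)  # 1*3 + (-2)*1
```

With a `dot_factor` of 0 on the second argument, `margin_build` reduces to `margin_query`, which is why the old signature is sufficient for queries.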

Changed the distance calculation.

According to the Xbox paper, we should use angular or L2 distance on the vectors that include the extra component.
I changed the distance calculation to follow that.
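The transformation this refers to can be sketched as follows. This is a minimal illustration in plain Python, assuming that each item receives the extra component sqrt(M^2 - ||x||^2), where M is the largest item norm, and that the query's extra component is 0. The helper names are hypothetical and this is not Annoy's implementation.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def augment_items(items: List[List[float]]) -> List[List[float]]:
    """Append sqrt(M^2 - ||x||^2) to every item, with M the largest item norm."""
    max_sq_norm = max(dot(x, x) for x in items)
    augmented = []
    for x in items:
        # max(0.0, ...) guards against tiny negative values caused by rounding
        extra = math.sqrt(max(0.0, max_sq_norm - dot(x, x)))
        augmented.append(x + [extra])
    return augmented


def augment_query(q: List[float]) -> List[float]:
    """The query gets an extra component equal to 0."""
    return q + [0.0]


def squared_l2(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


items = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
query = [1.0, 1.0]

aug_items = augment_items(items)
aug_query = augment_query(query)
max_sq_norm = max(dot(x, x) for x in items)  # M^2

# For every item: ||q' - x'||^2 = ||q||^2 + M^2 - 2 * dot(q, x)
sq_dists = [squared_l2(aug_query, x) for x in aug_items]
expected = [dot(query, query) + max_sq_norm - 2 * dot(query, x) for x in items]
```

Because ||q||^2 and M^2 are the same for every item, the smallest L2 distance on the augmented vectors corresponds to the largest dot product on the original ones.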

After the fixes, runs on the lastfm dataset show the following:

| Index (recall) | search_k = 1000 | search_k = 10000 | search_k = 50000 |
|---|---|---|---|
| Angular index (65 components) | 0.563 | 0.736 | 0.783 |
| Dot index (master, 64 components) | 0.003 | 0.034 | 0.175 |
| Dot index (my fix, 64 components) | 0.514 | 0.692 | 0.787 |

So the dot index results on this dataset are finally comparable to the angular index results.

Note that the dot tests pass, except for the ones that check distance.
That is expected, since I changed the distance calculation! :)
I am open to discussing how best to fix them. If we really need exact dot scores, I think it would be better to add an extra method that computes them exactly.
But in my opinion it is better to use angular distance to form more proper splits and to search for neighbors.

I would also like to add an accuracy test for the lastfm dataset with only 64 components.
@erikbern may I ask you to upload such a dataset with a link like here?
Then I would add one more test to accuracy_test.py shortly afterwards.

pkorobov · 2023-01-20T08:02:06Z

Removed two_means_dot; it should look better now.

erikbern · 2023-01-24T13:57:45Z

There's a dot product test in https://github.com/erikbern/ann-benchmarks/ that you could use if you want to.

This looks great – I'll merge this and will publish a new version to PyPI. Let me know when it's ready!

pkorobov · 2023-01-26T17:32:34Z

Added extra accuracy tests for lastfm-64-dot: an angular test for this dataset as is, and a dot test for the shortened vectors. I had to complicate the code a bit to override the metric stored in the dataset, but I hope not too much.

I also managed to return dot products as distances and fix the 4 broken tests.
I added a built flag to nodes. Once an index is built, its item nodes remember that.
Angular distance is not needed after the index has been constructed, so when we calculate the distance between some node and an index item node we can simply return the dot product.

During inference we only need the ordering that a metric induces in order to sort the nearest neighbors, and angular distance and dot product induce the same ordering because of how the index is constructed.
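The ordering claim can be checked on a toy example. The sketch below is plain Python rather than Annoy code. It assumes that items are stored with an extra component that equalizes their norms and that the query's extra component is 0. It uses `1 - cos` as a stand-in angular distance, which ranks candidates the same way as any other decreasing function of the cosine.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def augment(items: List[List[float]]) -> List[List[float]]:
    """Append an extra component so that all items share the same norm."""
    max_sq_norm = max(dot(x, x) for x in items)
    return [x + [math.sqrt(max(0.0, max_sq_norm - dot(x, x)))] for x in items]


def angular(a: List[float], b: List[float]) -> float:
    """Stand-in angular distance: 1 - cosine similarity."""
    return 1.0 - dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


items = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]
query = [1.0, 1.0]

aug_items = augment(items)
aug_query = query + [0.0]  # the query's extra component is 0

ids = range(len(items))
# Nearest first by angular distance on the augmented vectors
by_angular = sorted(ids, key=lambda i: angular(aug_query, aug_items[i]))
# Largest first by dot product on the original vectors
by_dot = sorted(ids, key=lambda i: -dot(query, items[i]))
```

Since every augmented item has the same norm, the cosine with the augmented query is the original dot product divided by a constant, so both sorts yield the same order.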

I hope this solution looks good.
I think it is ready, but I am happy to improve the PR if you have any comments!

erikbern · 2023-01-27T03:55:00Z

It doesn't change anything for the other distance metrics right? (angular etc)

pkorobov · 2023-01-27T04:16:05Z

No, it doesn't. In particular, angular performance hasn't changed at all. I haven't checked the exact recalls for the other metrics, but the tests seem to be fine.

For the other metrics I'm only adding some new methods that don't actually do anything new (they do for DotProduct). These are update_mean and the extra side and margin methods.

pkorobov · 2023-01-30T09:49:16Z

@erikbern I've checked the exact recalls in accuracy_test, including the new tests:

master:
fashion-mnist-784-euclidean accuracy: 99.72% (expected 90.00%)
glove-25-angular accuracy: 95.72% (expected 69.00%)
lastfm-65-angular accuracy: 66.80% (expected 60.00%)
lastfm-64-dot accuracy: 3.69% (expected 60.00%)
nytimes-16-angular accuracy: 98.12% (expected 80.00%)

PR:
fashion-mnist-784-euclidean accuracy: 99.72% (expected 90.00%)
glove-25-angular accuracy: 95.72% (expected 69.00%)
lastfm-65-angular accuracy: 66.80% (expected 60.00%)
lastfm-64-dot accuracy: 68.61% (expected 60.00%)
nytimes-16-angular accuracy: 98.12% (expected 80.00%)

pkorobov · 2023-02-15T14:56:25Z

@erikbern a gentle reminder that this one is waiting for review, PTAL

erikbern · 2023-02-15T22:07:49Z

I think it would be great if @psobot could look at this, since he's the author of the dot product code!

psobot · 2023-08-17T21:17:40Z

Oof, my apologies - I didn't have GitHub notifications on for this repo and just stumbled across this today.

Very nice work @pkorobov! Changes look good to me, and the accuracy improvements speak for themselves. Thank you!

erikbern · 2023-08-17T22:07:50Z

Sorry for also dropping the ball on this. Let's try to get this merged. Let me see how bad the conflicts are.

erikbern · 2023-08-17T22:19:58Z

Will merge this if tests are passing!

pkorobov · 2023-08-18T10:29:11Z

It's great to see updates on the PR, thanks!
It seems that the lastfm-dot-64.hdf5 dataset is not available on ann benchmarks anymore.

erikbern · 2023-08-18T10:31:29Z

Cool let me try to fix

erikbern · 2023-08-20T17:39:00Z

Thanks!!

pkorobov force-pushed the fix-dot-recall branch from 4c6e636 to 555a3e4 on January 19, 2023 12:24

pkorobov added 2 commits January 19, 2023 18:40

Fix incorrect dot_factor usage

53d0007

Make it the same as in angular

0c00b06

pkorobov force-pushed the fix-dot-recall branch from 555a3e4 to 0c00b06 on January 19, 2023 12:40

Remove two_means duplication

66209ad

pkorobov added 4 commits January 26, 2023 19:18

Fix distance

66b74da

Add extra accuracy tests

e75dfc3

Add reference

7fef46a

Improve some details

d68d025

pkorobov force-pushed the fix-dot-recall branch from 8f6b476 to d68d025 on January 26, 2023 17:09

Remove redundant whitespace

a436f78

Update comments

359aaa5

Merge branch 'main' into fix-dot-recall

edca4f6

erikbern added 3 commits August 18, 2023 00:24

Update accuracy_test.py

7ee2dce

fix merge resolve typo

Fix another merge conflict

9ae9801

Fix another merge conflict (remove self from arguments)

a70dac2

erikbern merged commit 2be37c9 into spotify:main Aug 20, 2023
15 checks passed
