Fix incorrect dot_factor usage #628
Removed |
There's a dot product test in https://github.com/erikbern/ann-benchmarks/ that you could use if you want to. This looks great – I'll merge this and will publish a new version to PyPI. Let me know when it's ready! |
Added extra accuracy tests. Also, I managed to return dot products as distances and fix the 4 broken tests. We only need the ordering induced by a metric to sort nearest neighbors during inference, and angular distance and dot product give the same ordering because of how the index is constructed. I hope this solution looks good. |
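The ordering claim above can be checked with a small plain-Python sketch. It is an illustration only, not code from this PR: it assumes every item gets an extra component sqrt(M^2 - |x|^2), where M is the largest item norm, that the query gets 0 in that slot, and that angular distance is 2 - 2*cos. All names here are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def angular_distance(a, b):
    # Assumed form: 2 - 2*cos(a, b); any monotone function of the cosine
    # would give the same ordering.
    return 2.0 - 2.0 * dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

items = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0], [-2.0, 1.0]]

# Append the extra component so that every augmented item has norm M.
max_sq = max(dot(x, x) for x in items)
augmented = [x + [math.sqrt(max_sq - dot(x, x))] for x in items]

query = [1.0, 2.0]
query_aug = query + [0.0]  # the query's extra component is 0

# Largest dot product first vs. smallest angular distance first.
by_dot = sorted(range(len(items)), key=lambda i: -dot(query, items[i]))
by_angular = sorted(range(len(items)),
                    key=lambda i: angular_distance(query_aug, augmented[i]))
```

Because all augmented items share the same norm and the query's extra component is 0, the cosine is the original dot product divided by a constant, so both sorts agree.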
It doesn't change anything for the other distance metrics right? (angular etc) |
No, it doesn't. In particular, angular performance hasn't changed at all. I haven't checked exact recalls for the other metrics, but the tests seem to be fine. I'm only adding some new methods for the other metrics that don't actually do anything new (they do for DotProduct). These are |
@erikbern I've checked the exact recalls in master: PR: |
@erikbern a gentle reminder that this one is waiting for review, PTAL |
I think it would be great if @psobot could look at this! Since he's the author of the dot product code |
Oof, my apologies - I didn't have GitHub notifications on for this repo and just stumbled across this today. Very nice work @pkorobov! Changes look good to me, and the accuracy improvements speak for themselves. Thank you! |
Sorry for also dropping the ball on this. Let's try to get this merged. Let me see how bad the conflicts are. |
Will merge this if tests are passing! |
This is great to see updates on the PR, thanks! |
Cool let me try to fix |
Thanks!! |
Annoy surprisingly didn't use `dot_factor` during `DotProduct` index construction. Here is my attempt to fix it.
I tried to reproduce the behavior of the algorithm when we manually add the (f+1)'th component to the data, as is done in ann-benchmarks (more information in issue #619).
The list of fixes includes the following:
- A new `two_means` method called `two_means_dot`, which uses `dot_factor`. Maybe not the neatest solution; I'm open to improving it.
- The `margin` method of the `DotProduct` class. It should compute `margin(a, b) = dot(a[:f], b[:f]) + a->dot_factor * b->dot_factor`, but the code was a bit different.
  Note that during index construction we need to use `dot_factor`, so I added extra `margin` and `side` methods with different signatures. They take a `Node` (not an array) as the second input vector, and we call them while building a tree.
  During inference we don't need `dot_factor`, because the input vector must have it equal to 0. So during inference the `side`/`margin` methods with the old signature are used.
- According to the xbox paper, we should use Angular or L2 distance with the extra component. I fixed this accordingly.
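For reviewers, the margin change can be sketched in plain Python. This is an illustration only, not the C++ code in this PR: it assumes `dot_factor` holds the extra component sqrt(M^2 - |x|^2), with M the largest item norm (the xbox-style transform), and the helper names `dot_factors`, `margin_build` and `margin_query` are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dot_factors(data):
    # Assumed definition of the extra (f+1)'th component per item:
    # sqrt(M^2 - |x|^2), where M is the largest norm in the data.
    max_sq = max(dot(x, x) for x in data)
    return [math.sqrt(max(max_sq - dot(x, x), 0.0)) for x in data]

def margin_build(normal, normal_factor, item, item_factor):
    # Index construction: both sides carry a dot_factor, so the extra
    # component contributes to the margin.
    return dot(normal, item) + normal_factor * item_factor

def margin_query(normal, query):
    # Inference: the query's extra component is 0, so that term vanishes
    # and only the first f components matter.
    return dot(normal, query)

data = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
factors = dot_factors(data)

# Explicitly augmented vectors, as if the component were added by hand.
augmented = [x + [d] for x, d in zip(data, factors)]
```

Under these assumptions, `margin_build` on the original vectors matches a plain dot product on the hand-augmented vectors, which is the behavior the fix tries to reproduce.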
After the fixes, runs on the lastfm dataset show the following:
So the results on this dataset finally became comparable.
Note that the dot tests pass, except the ones that check distance.
Of course they don't: after all, I've changed it! :)
I am open to discussing how best to fix them. If we really need exact dot scores, I guess it would be better to add an extra method that computes them exactly.
But I think it is better to use angular distance to form more proper splits and to search for neighbors.
Also, I would like to add an accuracy test for the lastfm dataset with only 64 components.
@erikbern, may I ask you to upload such a dataset with a link like here?
Then I would shortly add one more test to `accuracy_test.py`.