
sklearn and faiss give different result with range_search #1630

Closed

lampretl opened this issue Jan 17, 2021 · 4 comments

@lampretl
Hello! This is probably not an issue but just me not using faiss properly. I could not reproduce the results of sklearn for finding the R-neighbourhood of x in y. Here is my Python 3 code that compares the results:

import numpy as np
import pandas as pd
import time
import plotly.express as px
from sklearn import neighbors
import faiss

def ti(): return time.perf_counter()  # timing results in seconds

def plotS(xx, sz):  # scatterplot in 2D; xx = list of point clouds; sz = marker sizes
    d = len(xx[0][0]); a = 'abcdefghijklmnopqrstuvwxyz'  # d = dimension of input points; a = alphabet of labels
    if len(xx) > len(a): a = [c1+c2 for c1 in a for c2 in a]
    x = pd.DataFrame(np.concatenate(xx), columns=['x', 'y'])
    x['c'] = a[0]; x['s'] = sz[0]; l = [len(x) for x in xx]; l = [sum(l[:i+1]) for i in range(len(l))]
    for i in range(1, len(l)): x.iloc[l[i-1]:l[i], [2, 3]] = [a[i], sz[i]]
    fig = px.scatter(data_frame=x, x='x', y='y', color='c', size='s', size_max=max(sz)); fig.show()

r = 100; R = 10  # given arrays x and y of points in 2D, find the R-neighbourhood of x in y
x = np.random.multivariate_normal(mean=[50, 20], cov=np.diag([150, 50]), size=r**2)  # Gaussian point cloud
y = np.mgrid[0:50:50/r, 0:100:100/r]; y = np.array(list(zip(y[0].ravel(), y[1].ravel())))  # r x r grid of points
M, N = len(x), len(y); print('|x|=', M, ' |y|=', N, sep='')

t0 = ti(); kdt = neighbors.KDTree(y, metric='euclidean'); l1 = np.array([], dtype=int)  # via sklearn
for i in range(M):
    l = kdt.query_radius(x[i:i+1, :], r=R)
    l1 = np.unique(np.concatenate([l1, l[0]]))
l1.sort(); z = y[l1]; t1 = ti(); print(t1-t0, len(l1)); plotS([y, z, x], [7, 4, 4])

t0 = ti(); x = x.astype('float32'); y = y.astype('float32')  # via faiss
ind = faiss.IndexFlatL2(x.shape[1]); ind.add(y); _, d, l2 = ind.range_search(x, R)
l2 = np.sort(np.unique(l2)); z = y[l2]; t1 = ti(); print(t1-t0, len(l2), max(d)); plotS([y, z, x], [7, 4, 4])

The output on my HP ZBook 17 G5 laptop with i7-8850H CPU and 64GB DDR4 RAM is:

|x|=10000 |y|=10000
1.4615631540000322 5132
0.3123309249999693 3615 9.999756
[two scatter plots produced by plotS, one per method]

As you can see, the faiss neighbourhood is smaller than the sklearn one. How can I fix this?
Also, is there a way to make the faiss code run even faster? (I'm not sure I've used all the built-in optimization options.)

@mdouze (Contributor) commented Jan 18, 2021

Did you use the squared radius in Faiss?
This may just be demo code, but Faiss is not particularly efficient for low-dim data (2D or 3D).
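For reference, `IndexFlatL2` reports *squared* L2 distances, so the threshold passed to `range_search` must be the squared radius. Here is a minimal NumPy sketch (synthetic data, illustration only) of why passing R instead of R**2 shrinks the neighbourhood whenever R > 1:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))    # queries
y = rng.normal(size=(500, 2))   # database points
R = 2.0

# all pairwise *squared* L2 distances, shape (50, 500) -- the quantity IndexFlatL2 compares against
d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)

n_correct = int((d2 <= R**2).sum())  # what range_search(x, R**2) would select
n_wrong = int((d2 <= R).sum())       # what range_search(x, R) selects: effective radius sqrt(R) ~ 1.41, not 2

assert n_wrong < n_correct  # the mistaken call returns a strictly smaller neighbourhood
```

This matches the symptom in the original post: with R = 10, the bad call effectively searched with radius sqrt(10) ≈ 3.16.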

@lampretl (Author)
Aaah, you're right: replacing `_,d,l2=ind.range_search(x,R)` with `_,d,l2=ind.range_search(x,R**2); d=np.sqrt(d)` solves my problem : ), the results are then identical. Thank you!

Just one more thing. When x and y each have 10^6 points, faiss already far outperforms sklearn: 160 min for sklearn vs 58 min for faiss. I am just wondering whether I have used all the available tools to make my program faster. The code above used only 1 of my 12 CPU threads; how can I use all of them? Can range_search work on GPU?
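On the threading part of the question: faiss's CPU indexes are parallelized internally with OpenMP, and the thread count can be set explicitly. A sketch, assuming a standard faiss build (the count 12 matches the CPU mentioned above and is otherwise arbitrary):

```python
import faiss

# faiss CPU indexes use OpenMP internally; request all 12 hardware threads.
# Batched calls such as range_search over a whole query matrix parallelize
# across queries; single-query calls in a Python loop will not.
faiss.omp_set_num_threads(12)
```

Equivalently, the `OMP_NUM_THREADS` environment variable can be set before launching Python.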

@wickedfoo (Contributor)
Faiss is not optimized for low-dimensional (< 20 dimensions, say) use cases in general; i.e., it would likely take about as much time on 32-dimensional vectors as on 2-dimensional ones.

range_search is not currently implemented on the GPU (it is planned for H1 2021). An alternative is to use something like PyTorch: matrix-multiply x against slices of y, then reduce the results with your less-than-radius selector. Note that you can't compute all the distances of x against y in one go, as that would be 10^12 results; you need to generate partial results and reduce them. But again, PyTorch would still be using cuBLAS, which is also not really optimized for such low-dimensional data.
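The chunk-and-reduce scheme described above can be sketched in NumPy (a PyTorch/CUDA version would be structurally identical, with the per-chunk distance computation running on the GPU); the sizes, the chunk width, and the brute-force cross-check are illustration-only choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((1000, 2)).astype('float32')   # queries
y = rng.random((5000, 2)).astype('float32')   # database
R = 0.05

hit_ids = []
chunk = 512  # process y in slices to bound the size of the distance matrix held in memory
for start in range(0, len(y), chunk):
    sl = y[start:start + chunk]
    d2 = ((x[:, None, :] - sl[None, :, :]) ** 2).sum(-1)  # squared distances, shape (len(x), len(sl))
    within = (d2 <= R**2).any(axis=0)                     # slice points within R of at least one query
    hit_ids.append(np.flatnonzero(within) + start)        # offset back to global indices
ids = np.concatenate(hit_ids)

# sanity check against the unchunked computation (only feasible at this toy size)
d2_full = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
ids_full = np.flatnonzero((d2_full <= R**2).any(axis=0))
assert np.array_equal(ids, ids_full)
```

Because each slice is reduced to a small index array before the next slice is processed, peak memory stays at len(x) × chunk distances rather than the full 10^12-entry matrix.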

There might be a primitive in pytorch3d that is more appropriate and better optimized for 2-dimensional data on GPU.

@lampretl (Author)
Thank you for your reply and time!
