
sklearn and faiss give different result with range_search #1630

Closed

lampretl opened this issue Jan 17, 2021 · 4 comments

@lampretl
Hello! This is probably not an issue but just me not using faiss properly. I could not reproduce the results of sklearn for finding the R-neighbourhood of x in y. Here is my Python 3 code that compares the results:

import numpy as np
import pandas as pd
import time
import plotly.express as px
from sklearn import neighbors
import faiss

def ti(): return time.perf_counter()  # timing results in seconds

def plotS(xx, sz):  # scatterplot in 2D; xx = list of point clouds; sz = marker sizes
    d = len(xx[0][0]); a = 'abcdefghijklmnopqrstuvwxyz'  # d = dimension of input points; a = alphabet of labels
    if len(xx) > len(a): a = [c1+c2 for c1 in a for c2 in a]
    x = pd.DataFrame(np.concatenate(xx), columns=['x', 'y'])
    x['c'] = a[0]; x['s'] = sz[0]; l = [len(x) for x in xx]; l = [sum(l[:i+1]) for i in range(len(l))]
    for i in range(1, len(l)): x.iloc[l[i-1]:l[i], [2, 3]] = [a[i], sz[i]]
    fig = px.scatter(data_frame=x, x='x', y='y', color='c', size='s', size_max=max(sz)); fig.show()

r = 100; R = 10  # given arrays x and y of points in 2D, find the R-neighbourhood of x in y
x = np.random.multivariate_normal(mean=[50, 20], cov=np.diag([150, 50]), size=r**2)  # Gaussian point cloud
y = np.mgrid[0:50:50/r, 0:100:100/r]; y = np.array(list(zip(y[0].ravel(), y[1].ravel())))  # r x r grid of points
M, N = len(x), len(y); print('|x|=', M, ' |y|=', N, sep='')

t0 = ti(); kdt = neighbors.KDTree(y, metric='euclidean'); l1 = np.array([], dtype=int)  # via sklearn
for i in range(M):
    l = kdt.query_radius(x[i:i+1, :], r=R)
    l1 = np.unique(np.concatenate([l1, l[0]]))
l1.sort(); z = y[l1]; t1 = ti(); print(t1-t0, len(l1)); plotS([y, z, x], [7, 4, 4])

t0 = ti(); x = x.astype('float32'); y = y.astype('float32')  # via faiss
ind = faiss.IndexFlatL2(x.shape[1]); ind.add(y); _, d, l2 = ind.range_search(x, R)
l2 = np.sort(np.unique(l2)); z = y[l2]; t1 = ti(); print(t1-t0, len(l2), max(d)); plotS([y, z, x], [7, 4, 4])

The output on my HP ZBook 17 G5 laptop with i7-8850H CPU and 64GB DDR4 RAM is:

|x|=10000 |y|=10000
1.4615631540000322 5132
0.3123309249999693 3615 9.999756
[two scatter plots produced by plotS, one per method]

As you can see, the faiss neighbourhood is smaller than the sklearn one. How can I fix this?
Also, is there a way to make the faiss code run even faster? (I'm not sure I've used all the built-in optimization options.)

@mdouze (Contributor) commented Jan 18, 2021

Did you use the squared radius in Faiss?
This may just be demo code, but Faiss is not particularly efficient for low-dim data (2D or 3D).
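For reference, `IndexFlatL2` reports *squared* L2 distances, so the threshold passed to `range_search` must be the squared radius. Here is a minimal NumPy sketch (synthetic data, illustration only) of why passing R instead of R**2 shrinks the neighbourhood whenever R > 1:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))    # queries
y = rng.normal(size=(500, 2))   # database points
R = 2.0

# all pairwise *squared* L2 distances, shape (50, 500) -- the quantity IndexFlatL2 compares against
d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)

n_correct = int((d2 <= R**2).sum())  # what range_search(x, R**2) would select
n_wrong = int((d2 <= R).sum())       # what range_search(x, R) selects: effective radius sqrt(R) ~ 1.41, not 2

assert n_wrong < n_correct  # the mistaken call returns a strictly smaller neighbourhood
```

This matches the symptom in the original post: with R = 10, the bad call effectively searched with radius sqrt(10) ≈ 3.16.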

@lampretl (Author)
Aaah, you're right: replacing `_,d,l2=ind.range_search(x,R)` with `_,d,l2=ind.range_search(x,R**2); d=np.sqrt(d)` solves my problem : ), the results are then identical. Thank you!

Just one more thing. When x and y each have 10^6 points, faiss already far outperforms sklearn: 160 min for sklearn vs 58 min for faiss. I am just wondering whether I have used all the available tools to make my program faster. The code above used only 1 of my 12 CPU threads; how can I use all of them? Can range_search work on GPU?
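On the threading part of the question: faiss's CPU indexes are parallelized internally with OpenMP, and the thread count can be set explicitly. A sketch, assuming a standard faiss build (the count 12 matches the CPU mentioned above and is otherwise arbitrary):

```python
import faiss

# faiss CPU indexes use OpenMP internally; request all 12 hardware threads.
# Batched calls such as range_search over a whole query matrix parallelize
# across queries; single-query calls in a Python loop will not.
faiss.omp_set_num_threads(12)
```

Equivalently, the `OMP_NUM_THREADS` environment variable can be set before launching Python.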

@wickedfoo (Contributor)
Faiss is not optimized for low-dimensional (< 20 dimensions, say) use cases in general; i.e., it would likely take about as much time on 32-dimensional vectors as on 2-dimensional ones.

range_search is not currently implemented on the GPU (it is planned for H1 2021). An alternative is to use something like PyTorch: matrix-multiply x against slices of y, then reduce the results with your less-than-radius selector. Note that you can't compute all the distances of x against y in one go, as that would be 10^12 results; you need to generate partial results and reduce them. But again, PyTorch would still be using cuBLAS, which is also not really optimized for such low-dimensional data.
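The chunk-and-reduce scheme described above can be sketched in NumPy (a PyTorch/CUDA version would be structurally identical, with the per-chunk distance computation running on the GPU); the sizes, the chunk width, and the brute-force cross-check are illustration-only choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((1000, 2)).astype('float32')   # queries
y = rng.random((5000, 2)).astype('float32')   # database
R = 0.05

hit_ids = []
chunk = 512  # process y in slices to bound the size of the distance matrix held in memory
for start in range(0, len(y), chunk):
    sl = y[start:start + chunk]
    d2 = ((x[:, None, :] - sl[None, :, :]) ** 2).sum(-1)  # squared distances, shape (len(x), len(sl))
    within = (d2 <= R**2).any(axis=0)                     # slice points within R of at least one query
    hit_ids.append(np.flatnonzero(within) + start)        # offset back to global indices
ids = np.concatenate(hit_ids)

# sanity check against the unchunked computation (only feasible at this toy size)
d2_full = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
ids_full = np.flatnonzero((d2_full <= R**2).any(axis=0))
assert np.array_equal(ids, ids_full)
```

Because each slice is reduced to a small index array before the next slice is processed, peak memory stays at len(x) × chunk distances rather than the full 10^12-entry matrix.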

There might be a primitive in pytorch3d that is more appropriate and better optimized for 2-dimensional data on GPU.

@lampretl (Author)
Thank you for your reply and time!
