Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compare performance to faiss #100

Open
redhog opened this issue Oct 5, 2023 · 6 comments
Open

Compare performance to faiss #100

redhog opened this issue Oct 5, 2023 · 6 comments

Comments

@redhog
Copy link

redhog commented Oct 5, 2023

Add https://github.com/facebookresearch/faiss to the performance comparison matrix

@djhoese
Copy link
Collaborator

djhoese commented Oct 5, 2023

It wasn't clear to me what an equivalent workflow would be for using faiss. I don't think the originally benchmarking code in the README was put into git so we might have to come up with an equivalent piece of code.

@redhog
Copy link
Author

redhog commented Oct 5, 2023

This is how I have compared scipy:s cKDTree to Faiss previously:

import faiss
import scipy.spatial
import numpy as np
import pandas as pd
import datetime


d = 64                           # dimension
nb = 100000                      # database size
nq = 10000                       # nb of queries
np.random.seed(1234)             # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.

k = 4

fstart = datetime.datetime.now()
for a in range(10):
    index = faiss.IndexFlatL2(d)   # build the index
    index.add(xb)                  # add vectors to the index
    fD, fI = index.search(xq, k)     # actual search
    fend = datetime.datetime.now()
flen = fend - fstart

tstart = datetime.datetime.now()
for a in range(10):
    tree = scipy.spatial.cKDTree(xb)
    tD, tI = tree.query(xq, k)
tend = datetime.datetime.now()
tlen = tend - tstart

print("scipy.spatial.cKDTree", tlen, "faiss.IndexFlatL2", flen)

This should be directly portable to pykdtree:s query() function.

@redhog
Copy link
Author

redhog commented Oct 5, 2023

To be fair faiss has many other indices than L2, and faiss (and scipy?) are aimed at high dimensional data.

It would be nice to have the comparison code checked into git, say in a docs or contrib folder.

@djhoese
Copy link
Collaborator

djhoese commented Oct 5, 2023

Yes, I'm all for putting the code in the repository. Can L2 handle any dimensionality? Pykdtree's README used 3D data.

@redhog
Copy link
Author

redhog commented Oct 6, 2023

Yeah, faiss.IndexFlatL2 can handle any dimension. I think it's uncommon to use faiss for 2d and 3d data, much more common for > 100 dimensions (word embeddings etc). scipy.spatial.cKDTree can also handle multidimensional data, but I think it's more common to use it for spatial (2d, 3d) data, that's at least what I've been using it for.

@djhoese
Copy link
Collaborator

djhoese commented Oct 6, 2023

Yeah pykdtree is used heavily by the pyresample and satpy projects I maintain and they are entirely used for Earth geolocation (X, Y, Z). I don't think I'll have time to dedicate to running these comparisons and collecting the statistics for each variation so if someone else has time for this that'd be great. I think the timeit module should be used to run each case multiple time and each should probably be run in a separate script to avoid the memory usage by one persist and affecting the next.

That is of course only testing execution time. It'd be nice to have an estimate on memory usage too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants