I need to index the embeddings of a biencoder (as in BLINK). In the biencoder, the similarity between two vectors is computed as a dot product.
However, IndexHNSWFlat only supports L2 distance. To work around this, BLINK appends an extra dimension and applies a mathematical transformation that converts the maximum-inner-product search into an L2 search; here is the code.
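For reference, that transformation boils down to the standard MIPS-to-L2 reduction. Here is a minimal NumPy sketch of the idea (not BLINK's actual code; the function names are mine):

```python
import numpy as np

def augment_database(xb):
    # Append sqrt(M^2 - ||x||^2) to every database vector, where M is the
    # maximum norm over the corpus. For a query padded with a 0,
    # ||q' - x'||^2 = ||q||^2 + M^2 - 2<q, x>, which depends on x only
    # through -2<q, x>, so L2 nearest neighbors == dot-product top-k.
    norms = np.linalg.norm(xb, axis=1)
    M = norms.max()  # fixed once, at build time
    extra = np.sqrt(np.maximum(M * M - norms * norms, 0.0))
    return np.hstack([xb, extra[:, None]]).astype("float32")

def augment_queries(xq):
    # Queries just get a zero in the extra dimension.
    pad = np.zeros((xq.shape[0], 1), dtype="float32")
    return np.hstack([xq.astype("float32"), pad])
```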
The problem with this approach is that it is impossible to incrementally add new vectors to the index, because the maximum norm may change as new data arrives.
Therefore I instead L2-normalized all vectors, which gave the desired result.
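This works because, for unit-norm database vectors, minimizing L2 distance is equivalent to maximizing the dot product:

$$\|q - x\|^2 = \|q\|^2 + \|x\|^2 - 2\langle q, x\rangle = \|q\|^2 + 1 - 2\langle q, x\rangle,$$

and $\|q\|^2$ is the same for every candidate $x$, so the ranking depends only on $\langle q, x\rangle$.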
```python
import faiss

index = faiss.index_factory(768, "L2norm,IVF16384_HNSW32,Flat")
index.train(embeddings)
index.add(embeddings)
distances, ids = index.search(queries, 32)  # search returns (distances, ids)
# result is correct
```
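(One knob worth checking with any IVF index: recall depends heavily on nprobe, which defaults to 1. Because the L2norm step makes the factory wrap the IVF index in an IndexPreTransform, it has to be unwrapped first; the value 64 below is just an illustrative guess, not a recommendation.)

```python
ivf = faiss.extract_index_ivf(index)  # unwraps the L2norm pre-transform
ivf.nprobe = 64  # hypothetical setting; tune for your recall/speed trade-off
```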
However, after performing PCA dimensionality reduction on the vectors, I couldn’t retrieve the correct results.
```python
index = faiss.index_factory(768, "PCA256,L2norm,IVF16384_HNSW32,Flat")
index.train(embeddings)
index.add(embeddings)
distances, ids = index.search(queries, 32)
# result is wrong!
```
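One way to narrow down where this breaks (a diagnostic sketch, assuming embeddings and queries are contiguous float32 arrays): reproduce the PCA,L2norm preprocessing by hand and search exhaustively with flat indexes, comparing against exact dot-product ground truth. If the overlap is already low here, the transform itself, not the IVF/HNSW approximation, is changing the ranking.

```python
import numpy as np

# Exact dot-product ground truth on the original vectors
gt_index = faiss.IndexFlatIP(768)
gt_index.add(embeddings)
_, gt_ids = gt_index.search(queries, 32)

# Apply PCA then L2 normalization by hand, then search exhaustively
pca = faiss.PCAMatrix(768, 256)
pca.train(embeddings)
xb = pca.apply_py(embeddings)
xq = pca.apply_py(queries)
faiss.normalize_L2(xb)
faiss.normalize_L2(xq)
flat = faiss.IndexFlatL2(256)
flat.add(xb)
_, pca_ids = flat.search(xq, 32)

# Fraction of the exact top-32 recovered after PCA + normalization
overlap = np.mean([
    len(set(gt_ids[i]) & set(pca_ids[i])) / 32.0
    for i in range(len(queries))
])
print(f"top-32 overlap after PCA: {overlap:.3f}")
```

Note also that, as far as I know, PCAMatrix subtracts the training mean as part of the transform, so dot products after PCA are not just rotated versions of the original ones, and normalizing the mean-centered vectors changes the ranking further.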
Also, when I don't apply L2 normalization at all, the results with and without PCA are similar to each other: both are wrong, but they are wrong in the same way.
I'm not sure where the problem lies. Can anyone offer some insights?