I need to index the embeddings of a biencoder (as in BLINK). In the biencoder, the similarity between two vectors is computed as a dot product.
However, IndexHNSWFlat only supports L2 distance. To work around this, BLINK appends an extra dimension and applies a mathematical transformation that converts the maximum-inner-product search into an L2 search; here is the code.
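For reference, that transformation boils down to the standard MIPS-to-L2 reduction. Here is a minimal NumPy sketch of the idea (not BLINK's actual code; the function names are mine):

```python
import numpy as np

def augment_database(xb):
    # Append sqrt(M^2 - ||x||^2) to every database vector, where M is the
    # maximum norm over the corpus. For a query padded with a 0,
    # ||q' - x'||^2 = ||q||^2 + M^2 - 2<q, x>, which depends on x only
    # through -2<q, x>, so L2 nearest neighbors == dot-product top-k.
    norms = np.linalg.norm(xb, axis=1)
    M = norms.max()  # fixed once, at build time
    extra = np.sqrt(np.maximum(M * M - norms * norms, 0.0))
    return np.hstack([xb, extra[:, None]]).astype("float32")

def augment_queries(xq):
    # Queries just get a zero in the extra dimension.
    pad = np.zeros((xq.shape[0], 1), dtype="float32")
    return np.hstack([xq.astype("float32"), pad])
```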
The problem with this approach is that it is impossible to incrementally add new vectors to the index, because the maximum norm may change as new data arrives.
Therefore I instead L2-normalized all vectors, which gave the desired result.
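This works because, for unit-norm database vectors, minimizing L2 distance is equivalent to maximizing the dot product:

$$\|q - x\|^2 = \|q\|^2 + \|x\|^2 - 2\langle q, x\rangle = \|q\|^2 + 1 - 2\langle q, x\rangle,$$

and $\|q\|^2$ is the same for every candidate $x$, so the ranking depends only on $\langle q, x\rangle$.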
```python
import faiss

index = faiss.index_factory(768, "L2norm,IVF16384_HNSW32,Flat")
index.train(embeddings)
index.add(embeddings)
distances, ids = index.search(queries, 32)  # search returns (distances, ids)
# result is correct
```
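(One knob worth checking with any IVF index: recall depends heavily on nprobe, which defaults to 1. Because the L2norm step makes the factory wrap the IVF index in an IndexPreTransform, it has to be unwrapped first; the value 64 below is just an illustrative guess, not a recommendation.)

```python
ivf = faiss.extract_index_ivf(index)  # unwraps the L2norm pre-transform
ivf.nprobe = 64  # hypothetical setting; tune for your recall/speed trade-off
```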
However, after performing PCA dimensionality reduction on the vectors, I couldn’t retrieve the correct results.
```python
index = faiss.index_factory(768, "PCA256,L2norm,IVF16384_HNSW32,Flat")
index.train(embeddings)
index.add(embeddings)
distances, ids = index.search(queries, 32)
# result is wrong!
```
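One way to narrow down where this breaks (a diagnostic sketch, assuming embeddings and queries are contiguous float32 arrays): reproduce the PCA,L2norm preprocessing by hand and search exhaustively with flat indexes, comparing against exact dot-product ground truth. If the overlap is already low here, the transform itself, not the IVF/HNSW approximation, is changing the ranking.

```python
import numpy as np

# Exact dot-product ground truth on the original vectors
gt_index = faiss.IndexFlatIP(768)
gt_index.add(embeddings)
_, gt_ids = gt_index.search(queries, 32)

# Apply PCA then L2 normalization by hand, then search exhaustively
pca = faiss.PCAMatrix(768, 256)
pca.train(embeddings)
xb = pca.apply_py(embeddings)
xq = pca.apply_py(queries)
faiss.normalize_L2(xb)
faiss.normalize_L2(xq)
flat = faiss.IndexFlatL2(256)
flat.add(xb)
_, pca_ids = flat.search(xq, 32)

# Fraction of the exact top-32 recovered after PCA + normalization
overlap = np.mean([
    len(set(gt_ids[i]) & set(pca_ids[i])) / 32.0
    for i in range(len(queries))
])
print(f"top-32 overlap after PCA: {overlap:.3f}")
```

Note also that, as far as I know, PCAMatrix subtracts the training mean as part of the transform, so dot products after PCA are not just rotated versions of the original ones, and normalizing the mean-centered vectors changes the ranking further.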
Also, when I don't apply L2 normalization at all, the results with and without PCA are similar to each other: both are wrong, but they are wrong in the same way.
I'm not sure where the problem lies. Can anyone offer some insights?