How to increase the number of nodes sampled in training KMeans? #2563
I want to run k-means over 8M nodes with 10k clusters. By default, Faiss samples 2560000 points for training, but I'd like to use the entire dataset. What's the solution?
Replies: 1 comment
`faiss.Kmeans` has a property `max_points_per_centroid`, which is set to `256` by default. With `k` clusters, this means only `k * 256` datapoints can be used for fitting kmeans. In your case, this turns out to be `2560000` datapoints, which get subsampled from your full dataset. To use all 8M samples for fitting Kmeans, just pass `max_points_per_centroid=800` to the `faiss.Kmeans()` constructor.

For reference, see this link: https://github.com/facebookresearch/faiss/wiki/FAQ#can-i-ignore-warning-clustering-xxx-points-to-yyy-centroids