Better PAM initialization with BUILD #667
Hello @kno10, there is a k-means++ based algorithm that can be used for initialization, the one you are referring to in your article. It could be used to pick the initial medoids, and I expected that only the quality would be affected; nevertheless, your results make a good case for implementing it.

I also have a question for you: did you use an installed pyclustering version or just a cloned one? In the case of an installed version, the C++ implementation of the algorithm should be used; with a cloned version, the library uses the Python implementation due to the lack of binaries in the git repository.
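As an aside, one way to compare the two paths explicitly is the ccore constructor flag; a minimal sketch, assuming kmedoids exposes it the way other pyclustering algorithms do:

```python
# Minimal sketch: pyclustering algorithms accept a ccore flag (True by
# default); forcing it on or off lets you compare the C++ core against
# the pure Python fallback on the same input.
from pyclustering.cluster.kmedoids import kmedoids

data = [[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.2]]

cpp_instance = kmedoids(data, [0, 2], ccore=True)   # C++ core, if built
py_instance = kmedoids(data, [0, 2], ccore=False)   # pure Python path
cpp_instance.process()
py_instance.process()
print(cpp_instance.get_medoids(), py_instance.get_medoids())
```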
The number of iterations part is clear. But I have a question about the final loss after optimization: do you mean total deviation (TD)?
I tried using your k-means++ initialization, but I believe it only accepts a data matrix, not a distance matrix, as input. For that experiment I used "pip install pyclustering" on Google Colab, whatever that installs by default. Yes, TD (total deviation) is the loss.
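For reference, total deviation as used in the FasterPAM paper is the sum of dissimilarities from each point to its nearest medoid:

$$\mathrm{TD} = \sum_{i=1}^{n} \min_{m \in M} d(x_i, m)$$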
Thank you for the clarification. One more question: was a distance matrix used as the input for the algorithm in the experiment?
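If the experiment should run on precomputed distances instead, a sketch like the following would do it, assuming kmedoids honors the data_type='distance_matrix' keyword that several pyclustering algorithms accept:

```python
# Sketch: run kmedoids on a precomputed matrix; the data_type keyword is
# assumed to accept 'distance_matrix' here.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from pyclustering.cluster.kmedoids import kmedoids

points = np.random.rand(100, 8)
matrix = squareform(pdist(points, metric='cosine'))  # pairwise distances

instance = kmedoids(matrix.tolist(), [0, 1, 2],
                    data_type='distance_matrix')
instance.process()
print(instance.get_medoids())
```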
Hello @kno10, I have introduced the PAM BUILD algorithm in line with your article, and I have optimized the C++ version (which should be used by default) of the K-Medoids (PAM) algorithm. I have attached a pyclustering package (pyclustering-0.11.0.tar.gz) with all binaries, built from that code. Just in case: K-Means++ supports a distance matrix now; let me know if you need a code example for that as well. Here is an example of how to use K-Medoids with PAM BUILD:

```python
from pyclustering.cluster.kmedoids import kmedoids, build
from pyclustering.cluster import cluster_visualizer
from pyclustering.utils import read_sample
from pyclustering.samples.definitions import FCPS_SAMPLES
# Load list of points `Tetra` for cluster analysis.
sample = read_sample(FCPS_SAMPLES.SAMPLE_TETRA)
# Find the initial medoids using the PAM BUILD algorithm.
initial_medoids = build(sample, 4).initialize()
# Create instance of K-Medoids (PAM) algorithm.
kmedoids_instance = kmedoids(sample, initial_medoids)
# Run cluster analysis
kmedoids_instance.process()
# Display clustering results to console.
print("Clusters:", kmedoids_instance.get_clusters())
print("Labels:", kmedoids_instance.get_labels())
print("Medoids:", kmedoids_instance.get_medoids())
print("Total Deviation:", kmedoids_instance.get_total_deviation())
print("Iterations:", kmedoids_instance.get_iterations())
# Display clustering results.
visualizer = cluster_visualizer()
visualizer.append_clusters(kmedoids_instance.get_clusters(), sample)
visualizer.append_cluster(initial_medoids, sample, markersize=120, marker='*', color='dimgray')
visualizer.append_cluster(kmedoids_instance.get_medoids(), sample, markersize=120, marker='*', color='black')
visualizer.show()
```

Output example:
@kno10, I would be really pleased if you could provide the test code that you were using for performance testing.
@kno10, I decided to repeat your experiment. The results of the original PAM without a distance matrix (using a list of points) still aren't good:
PAM BUILD implementation: https://github.com/annoviko/pyclustering/blob/master/ccore/src/cluster/pam_build.cpp

I was using the following code to get the data and run PAM on it. Is it similar to what you were doing?

```python
import time
from pyclustering.cluster.kmedoids import kmedoids, build
from sklearn.decomposition import TruncatedSVD
from sklearn.datasets import fetch_20newsgroups_vectorized
data = fetch_20newsgroups_vectorized()
X, y = data['data'], data['target']
X = TruncatedSVD().fit_transform(X)
t_start = time.process_time()
initial_medoids = build(X, 20).initialize()
pam_build_time = time.process_time() - t_start
t_start = time.process_time()
kmedoids_instance = kmedoids(X, initial_medoids)
kmedoids_instance.process()
pam_time = time.process_time() - t_start
print("Medoids:", kmedoids_instance.get_medoids())
print("Total Deviation:", kmedoids_instance.get_total_deviation())
print("Iterations:", kmedoids_instance.get_iterations())
print("PAM BUILD: %f sec." % pam_build_time)
print("PAM: %f sec." % pam_time) |
Please see https://colab.research.google.com/drive/1DNzGbQns5-kiyTVkDvAorcxqXZb5ukEI?usp=sharing
@kno10, thanks; your performance test is different, and its results look much better. By default, pyclustering always uses the C++ code. I have run it on my local machine and got the following results. Could you please take a look at them?
Data transfer between Python and C++ has a big impact on performance. I wanted the C++ library to be absolutely independent of "Python.h", and I have to pay for that now. I have added additional time counters to the implementation to measure the data packing time and the result unpacking time on the Python side:
Thus, in the case of PAM BUILD: I think I will try to implement a Python-dependent interface and use templates in C++ in order to optimize this flow while keeping the C++ static library independent from Python. FastPAM impresses me with its performance; I will implement it when I finish the classical PAM optimizations.
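To make the packing cost concrete, here is a toy measurement (illustrative only, not the actual pyclustering bridge) of flattening a Python list of points into a C buffer, which is the kind of conversion a Python.h-free core forces:

```python
# Toy illustration of the packing overhead being discussed: converting a
# Python list of points into a flat C-compatible array can dominate the
# total runtime on large inputs.
import ctypes
import random
import time

points = [[random.random() for _ in range(10)] for _ in range(100000)]

t0 = time.process_time()
flat = (ctypes.c_double * (len(points) * 10))(
    *(value for row in points for value in row))
print("packing: %.3f sec." % (time.process_time() - t0))
```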
The resulting loss looks good, but the other PAMs used 3 iterations, not 2. This could be an off-by-one in counting, or it could be due to tolerance > 0 (IMHO, tolerance=0 is a reasonable default for "hard" clustering approaches such as k-medoids and k-means, just not for "soft" Gaussian Mixture Modeling).
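For illustration, setting the tolerance explicitly would look like this; a sketch, assuming the tolerance keyword of the kmedoids constructor (0.001 by default, if I read the signature correctly):

```python
# Sketch of the suggestion above: tolerance=0 keeps swapping until no
# improving swap remains at all, instead of stopping at a small gain.
from pyclustering.cluster.kmedoids import kmedoids

data = [[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.2]]
instance = kmedoids(data, [0, 2], tolerance=0)
instance.process()
print(instance.get_medoids())
```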
The runtime on Google Colab was:
So 43 minutes for BUILD + 92 minutes for SWAP. You may want to add a warning when the Python version is used on large data.
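Such a warning might look like the following sketch; the function and flag names here are hypothetical, not pyclustering's actual internals:

```python
# Hypothetical sketch of the suggested warning; names are illustrative.
import warnings

def warn_if_python_fallback(ccore_available, n_points, threshold=10000):
    """Warn when the pure Python implementation meets large input."""
    if not ccore_available and n_points > threshold:
        warnings.warn("C++ core unavailable: using the pure Python "
                      "implementation, which may be very slow for "
                      "%d points." % n_points)
```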
@kno10, thanks for these results. I think it is much better to optimize the Python code as well; the problem is that I didn't use numpy for the processing in PAM BUILD and PAM itself, but I think that is doable. I will let you know when I update both implementations of the algorithm. About the iterations, I am calculating them in the following way (pseudo-code):
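A toy reconstruction (not the original pseudo-code, which did not survive here) of how such counting could work:

```python
# Toy reconstruction: a pass whose improvement does not exceed the
# tolerance terminates the loop without being counted, which would
# report 2 where an implementation counting every executed pass
# reports 3.
tolerance = 0.001
improvements = [5.0, 1.2, 0.0005]   # stand-in gains of three SWAP passes

iterations = 0
for improvement in improvements:    # "run one SWAP pass"
    if improvement <= tolerance:
        break
    iterations += 1

print(iterations)  # 2, although three passes were executed
```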
So, I would say that I am counting the amount of …
The PAM/k-medoids implementation appears to implement SWAP, but not the BUILD part for initializing PAM. Instead you have to provide good starting medoids.

I tried benchmarking it on a larger data set (the well-known 20news data set in the sklearn.datasets.fetch_20newsgroups_vectorized version, with cosine distance and k=20), and the runtime of pyclustering kmedoids was extremely high (2000 sec), likely because of the poor random initialization. With BUILD from the kmedoids package, I can reduce the run time to 338 sec. Nevertheless, pam from kmedoids takes just 37 seconds, including 4.6 seconds for BUILD. The fasterpam variant finishes in 336 msec with random initialization.

It would also be good to get access to the final loss after optimization, as well as the number of iterations. Then I could check whether it ran into the maximum iteration limit, and whether at least the result quality is comparable.