Better PAM initialization with BUILD #667

Open
kno10 opened this issue Feb 5, 2021 · 11 comments
kno10 commented Feb 5, 2021

The PAM/k-medoids implementation appears to implement SWAP, but not the BUILD part for initializing PAM. Instead you have to provide good starting medoids.

I tried benchmarking it on a larger data set (the well-known 20news data set, in the sklearn.datasets.fetch_20newsgroups_vectorized version, with cosine distance and k=20), and the runtime of pyclustering kmedoids was extremely high (2000 sec), likely because of the poor random initialization.

With BUILD from the kmedoids package, I can reduce the run time to 338 sec. Nevertheless, pam from the kmedoids package takes just 37 seconds, including 4.6 seconds for BUILD. The fasterpam variant finishes in 336 msec with random initialization.

It would also be good to get access to the final loss after optimization as well as the number of iterations. Then I could check if it ran into the maximum iteration limit, and if at least the result quality is comparable.
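For reference, a rough sketch of the kind of benchmark described above, assuming pyclustering's kmedoids is given a precomputed cosine distance matrix (data_type='distance_matrix') and random starting medoids; note that the full 11314 x 11314 matrix takes roughly 1 GB of memory:

import time
import numpy as np
from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.metrics.pairwise import cosine_distances
from pyclustering.cluster.kmedoids import kmedoids

# Precompute a cosine distance matrix over the 20news TF-IDF vectors.
X = fetch_20newsgroups_vectorized()['data']
D = cosine_distances(X)

# Random starting medoids (the initialization criticized above).
rng = np.random.default_rng(0)
initial_medoids = rng.choice(D.shape[0], size=20, replace=False).tolist()

t0 = time.time()
instance = kmedoids(D, initial_medoids, data_type='distance_matrix')
instance.process()
print("PAM SWAP with random init: %.1f sec." % (time.time() - t0))
print("Medoids:", instance.get_medoids())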

annoviko (Owner) commented Feb 6, 2021

Hello @kno10 ,

There is a k-means++ based initialization algorithm, which you refer to in your article:

In the experiments, we will also study whether a clever sampling-based approach similar to k-means++ (Arthur and Vassilvitskii, 2007) that needs only O(nk) time but produces a worse starting point is an interesting alternative.

It could be used to pick the initial medoids, and I expected that only the quality would be affected. Nevertheless, your results are a good argument for implementing the PAM BUILD algorithm for initialization. I am not sure yet whether it should become part of the PAM algorithm's interface.

Also, I have a question for you: did you use an installed pyclustering version or just a cloned one? With an installed version, the C++ implementation of the algorithm should be used; with a cloned version, the library uses the Python implementation because the binaries are not in the git repository.

It would also be good to get access to the final loss after optimization as well as the number of iterations

The number of iterations is clear. But I have a question about the final loss after optimization: do you mean total deviation (TD)?
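For illustration, a short sketch of the k-means++ route mentioned above, assuming kmeans_plusplus_initializer with return_index=True returns point indices that kmedoids accepts as initial medoids:

from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from pyclustering.cluster.kmedoids import kmedoids
from pyclustering.utils import read_sample
from pyclustering.samples.definitions import FCPS_SAMPLES

# Load a small FCPS sample and pick 4 initial medoids with k-means++.
sample = read_sample(FCPS_SAMPLES.SAMPLE_TETRA)
initial_medoids = kmeans_plusplus_initializer(sample, 4).initialize(return_index=True)

# Refine the medoids with PAM (SWAP).
kmedoids_instance = kmedoids(sample, initial_medoids)
kmedoids_instance.process()
print("Medoids:", kmedoids_instance.get_medoids())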

annoviko added the Optimization and Proposal labels on Feb 6, 2021
annoviko added the Investigation label on Feb 6, 2021
annoviko self-assigned this on Feb 6, 2021
kno10 (Author) commented Feb 6, 2021

I tried using your k-means++ initialization, but I believe it only accepts a data matrix, not a distance matrix, as input.

For that experiment I used "pip install pyclustering" on Google Colab, whatever that does by default.

Yes, TD (total deviation) is the loss.
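For clarity, total deviation here is the usual k-medoids objective: the sum over all points of the distance to the nearest medoid. A minimal illustration on a precomputed distance matrix D (a hypothetical helper, not pyclustering API):

# Total deviation (TD): sum of each point's distance to its closest medoid.
def total_deviation(D, medoids):
    return sum(min(D[i][m] for m in medoids) for i in range(len(D)))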

annoviko (Owner) commented Feb 6, 2021

Thank you for the clarification. One more question:

I tried using your k-means++ initialization, but I believe it only accepts a data matrix, not a distance matrix, as input.

Was a distance matrix used as an input for the algorithm in the experiment?

annoviko (Owner) commented Feb 12, 2021

Hello @kno10 ,

I have implemented the PAM BUILD algorithm in line with your article and optimized the C++ version (which should be used by default) of the K-Medoids (PAM) algorithm. I have attached a pyclustering package (pyclustering-0.11.0.tar.gz) with all binaries, built from the master branch (not a release version), where these changes are available. Could you please check it again?

Just in case: K-Means++ supports a distance matrix now. Let me know if you need a code example for that as well.

Here is an example of how to use K-Medoids with PAM BUILD:

from pyclustering.cluster.kmedoids import kmedoids, build
from pyclustering.cluster import cluster_visualizer
from pyclustering.utils import read_sample
from pyclustering.samples.definitions import FCPS_SAMPLES

# Load list of points `Tetra` for cluster analysis.
sample = read_sample(FCPS_SAMPLES.SAMPLE_TETRA)

# Initialize initial medoids using PAM BUILD algorithm
initial_medoids = build(sample, 4).initialize()

# Create instance of K-Medoids (PAM) algorithm.
kmedoids_instance = kmedoids(sample, initial_medoids)

# Run cluster analysis
kmedoids_instance.process()

# Display clustering results to console.
print("Clusters:", kmedoids_instance.get_clusters())
print("Labels:", kmedoids_instance.get_labels())
print("Medoids:", kmedoids_instance.get_medoids())
print("Total Deviation:", kmedoids_instance.get_total_deviation())
print("Iterations:", kmedoids_instance.get_iterations())

# Display clustering results.
visualizer = cluster_visualizer()
visualizer.append_clusters(kmedoids_instance.get_clusters(), sample)
visualizer.append_cluster(initial_medoids, sample, markersize=120, marker='*', color='dimgray')
visualizer.append_cluster(kmedoids_instance.get_medoids(), sample, markersize=120, marker='*', color='black')
visualizer.show()

Output example:

Clusters: [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99], [100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199], [200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299], [300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399]]
Labels: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
Medoids: [0, 100, 201, 300]
Total Deviation: 239.25250279395595
Iterations: 3
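Since the benchmark above works on a precomputed distance matrix rather than on raw points, here is a sketch of the same pipeline in that mode; it assumes build accepts the same data_type='distance_matrix' argument that kmedoids supports, which should be checked against the actual interface:

from pyclustering.cluster.kmedoids import kmedoids, build
from pyclustering.utils import read_sample, calculate_distance_matrix
from pyclustering.samples.definitions import FCPS_SAMPLES

sample = read_sample(FCPS_SAMPLES.SAMPLE_TETRA)

# Precompute the full pairwise distance matrix and run BUILD + PAM on it.
matrix = calculate_distance_matrix(sample)
initial_medoids = build(matrix, 4, data_type='distance_matrix').initialize()  # data_type for build is assumed here

kmedoids_instance = kmedoids(matrix, initial_medoids, data_type='distance_matrix')
kmedoids_instance.process()
print("Total Deviation:", kmedoids_instance.get_total_deviation())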

annoviko (Owner) commented

@kno10 , I would be really pleased if you could provide the test code that you were using for performance testing.

annoviko (Owner) commented Feb 12, 2021

@kno10 , I decided to repeat your experiment using sklearn.datasets.fetch_20newsgroups_vectorized.

Results of the original PAM without a distance matrix (i.e. run on the list of points) still aren't good:

Medoids: [6953, 10558, 489, 1034, 6910, 3522, 10246, 2692, 1965, 9385, 6535, 7433, 6540, 2760, 459, 10860, 6233, 4006, 484, 11189]
Total Deviation: 29.767144356027746
Iterations: 102
PAM BUILD: 36.062500 sec.
PAM: 2187.406250 sec.

PAM BUILD implementation: https://github.com/annoviko/pyclustering/blob/master/ccore/src/cluster/pam_build.cpp
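For readers following along, BUILD itself is a simple greedy procedure; a compact numpy illustration of the idea (a simplified sketch, not the linked C++ implementation) on a precomputed distance matrix:

import numpy as np

def pam_build(D, k):
    # Greedy PAM BUILD on a precomputed n x n distance matrix D.
    n = D.shape[0]
    # First medoid: the point with the smallest total distance to all others.
    medoids = [int(np.argmin(D.sum(axis=1)))]
    # Distance from every point to its nearest chosen medoid so far.
    nearest = D[:, medoids[0]].copy()
    for _ in range(1, k):
        # Gain of each candidate c: how much adding c would reduce the
        # distance of every point to its nearest medoid (never negative).
        gains = np.maximum(nearest[:, None] - D, 0).sum(axis=0)
        gains[medoids] = -np.inf  # never re-pick an existing medoid
        c = int(np.argmax(gains))
        medoids.append(c)
        nearest = np.minimum(nearest, D[:, c])
    return medoids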

I was using the following code to get the data and run PAM on it. Is it similar to what you were doing?

import time
from pyclustering.cluster.kmedoids import kmedoids, build
from sklearn.decomposition import TruncatedSVD
from sklearn.datasets import fetch_20newsgroups_vectorized

# Fetch the vectorized 20news data and reduce its dimensionality
# (TruncatedSVD defaults to n_components=2).
data = fetch_20newsgroups_vectorized()
X, y = data['data'], data['target']
X = TruncatedSVD().fit_transform(X)

# Initialize medoids with PAM BUILD and measure its runtime.
t_start = time.process_time()
initial_medoids = build(X, 20).initialize()
pam_build_time = time.process_time() - t_start

# Run PAM (SWAP) from the BUILD medoids and measure its runtime.
t_start = time.process_time()
kmedoids_instance = kmedoids(X, initial_medoids)
kmedoids_instance.process()
pam_time = time.process_time() - t_start

print("Medoids:", kmedoids_instance.get_medoids())
print("Total Deviation:", kmedoids_instance.get_total_deviation())
print("Iterations:", kmedoids_instance.get_iterations())

print("PAM BUILD: %f sec." % pam_build_time)
print("PAM: %f sec." % pam_time)

kno10 (Author) commented Feb 12, 2021

Please see https://colab.research.google.com/drive/1DNzGbQns5-kiyTVkDvAorcxqXZb5ukEI?usp=sharing
It still appears to take a very long time. It does not appear to use ccore?
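One quick way to check whether the compiled core is available in a given environment (assuming pyclustering's ccore_library helper, which I believe is the documented way to test this):

from pyclustering.core.wrapper import ccore_library

# True if the compiled C++ core (ccore) can be loaded; otherwise pyclustering
# silently falls back to the pure-Python implementation.
print("ccore available:", ccore_library.workable())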

annoviko (Owner) commented Feb 13, 2021

@kno10 , thanks, your performance test is different and its results look much better. By default, pyclustering always uses the C++ code. I have run it on my local machine and got the following results. Could you please take a look at them?

(11314, 130107)
(11314, 11314) took 7.680890083312988 s
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_build__amount', '_build__calculate_first_medoid', '_build__calculate_next_medoid', '_build__ccore', '_build__create_distance_calculator', '_build__data', '_build__data_type', '_build__distance_calculator', '_build__initialize_by_ccore', '_build__initialize_by_python', '_build__metric', '_build__verify_arguments', 'initialize']
Build took: {} 37.16072940826416
<pyclustering.cluster.kmedoids.kmedoids object at 0x000001B85C41F780> ['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_kmedoids__calculate_swap_cost', '_kmedoids__ccore', '_kmedoids__clusters', '_kmedoids__create_distance_calculator', '_kmedoids__data_type', '_kmedoids__distance_calculator', '_kmedoids__distance_first_medoid', '_kmedoids__distance_second_medoid', '_kmedoids__erase_empty_clusters', '_kmedoids__initializer', '_kmedoids__iterations', '_kmedoids__itermax', '_kmedoids__labels', '_kmedoids__medoid_indexes', '_kmedoids__metric', '_kmedoids__pointer_data', '_kmedoids__swap_medoids', '_kmedoids__tolerance', '_kmedoids__total_deviation', '_kmedoids__update_clusters', '_kmedoids__verify_arguments', 'get_cluster_encoding', 'get_clusters', 'get_iterations', 'get_labels', 'get_medoids', 'get_total_deviation', 'predict', 'process']
Runtime: min=90973.16 mean=90973.16 ±nan
D:\programs\miniconda3\envs\pyclustering-environment\lib\site-packages\numpy\core\_methods.py:140: RuntimeWarning: Degrees of freedom <= 0 for slice
  keepdims=keepdims)
D:\programs\miniconda3\envs\pyclustering-environment\lib\site-packages\numpy\core\_methods.py:130: RuntimeWarning: invalid value encountered in true_divide
  ret, rcount, out=ret, casting='unsafe', subok=False)
Loss:    min= 4994.75 mean= 4994.75 ±nan
N Iter:  min=    2.00 mean=    2.00 ±nan

Data transfer between Python and C++ has a big impact on performance. I wanted the C++ library to be completely independent from "python.h", and I have to pay for that now. I have added an additional timer to the implementation to measure the data packing time and the results unpacking time on the Python side:

Data packing time for C++ on python side: 21.093750 sec.
Results unpacking from C++ on python side: 0.000000 sec.
Build took: {} 31.153797149658203

Thus, in the case of PAM BUILD, 21 seconds is data packing and about 10 seconds is processing on the C++ side. The same holds for the PAM algorithm itself, so roughly 42 seconds are spent on data transfer alone. In general, I would therefore expect about 90 - 42 = 48 seconds of actual processing.

I think I will try to implement a Python-dependent interface and use templates in C++ in order to optimize this flow while keeping the C++ static library independent from Python.

FastPAM's performance is impressive; I will implement it when I finish the classical PAM optimizations.

kno10 (Author) commented Feb 13, 2021

The resulting loss looks good, but the other PAMs used 3 iterations, not 2. This could be an off-by-one in counting, or it could be due to tolerance > 0 (IMHO, tolerance=0 is a reasonable default for "hard" clustering approaches such as k-medoids and k-means, just not for "soft" Gaussian Mixture Modeling).
When I try to run the notebook on Google Colab, it still takes forever, and when I interrupt it, it appears to run in Python, not in ccore, based on the trace (in __calculate_next_medoid(self)). But I am not an expert on Python, so I don't know how to resolve this; I can only mention that it does not appear to use ccore.
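As a side note on the tolerance remark: assuming the kmedoids constructor accepts a tolerance keyword (as its documentation suggests), forcing strict convergence would look roughly like this:

from pyclustering.cluster.kmedoids import kmedoids, build
from pyclustering.utils import read_sample
from pyclustering.samples.definitions import FCPS_SAMPLES

sample = read_sample(FCPS_SAMPLES.SAMPLE_TETRA)
initial_medoids = build(sample, 4).initialize()

# tolerance=0 keeps swapping until no improving swap remains.
kmedoids_instance = kmedoids(sample, initial_medoids, tolerance=0.0)
kmedoids_instance.process()
print("Iterations:", kmedoids_instance.get_iterations())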

kno10 (Author) commented Feb 13, 2021

The runtime on Google Colab was:

Build took: 2596859.432220459
Runtime: min=8129146.14 mean=8129146.14 ±nan
Loss:    min= 4994.75 mean= 4994.75 ±nan
N Iter:  min=    2.00 mean=    2.00 ±nan

So 43 minutes for BUILD + 92 min for SWAP. You may want to add a warning when the Python version is used on large data.
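A possible shape for such a warning (a purely hypothetical helper and threshold, not existing pyclustering code):

import warnings

def warn_if_python_fallback(n_points, ccore_used, threshold=5000):
    # Hypothetical helper: warn when the pure-Python PAM implementation is
    # about to be used on a data set large enough to make it very slow.
    if not ccore_used and n_points > threshold:
        warnings.warn(
            "ccore is not available; the pure-Python PAM implementation may "
            "be extremely slow for %d points." % n_points,
            RuntimeWarning)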

annoviko (Owner) commented

@kno10 , thanks for these results. I think it makes sense to optimize the Python code as well; the problem is that I wasn't using numpy for the processing in PAM BUILD and PAM itself, and I think that is doable. I will let you know when I update both implementations of the algorithm.

About the iterations, I am calculating them in the following way (pseudo-code):

if (itermax > 0) then update_clusters();

for (iterations = 0; iterations < itermax; )
    iterations++;
    cost = swap_medoids();
    if (cost != NOTHING_TO_SWAP) then
        update_clusters();

So, I would say that I am counting the number of swap_medoids() calls.
