kmedoids returns empty cluster lists for version 0.10.1 #659

laurenleesc · 2020-11-24T15:29:23Z

Hi,

Previously, my code ran on one server with version 0.9.3.1 and worked as expected. However, the same code run on a different server with version 0.10.1 returned some empty clusters for the same dataset and initial medoids.

initial_medoids=[0,1,2,3]
kmedoids_instance=kmedoids(df2,initial_medoids,metric=metric)
kmedoids_instance.process()
clusters=kmedoids_instance.get_clusters()
medoids=kmedoids_instance.get_medoids()
print(clusters)

The above returns indices for clusters 0 and 1 but empty lists for clusters 2 and 3, even though there are no missing values in my data df2. I would expect, at the very least, the medoids themselves to be in clusters 2 and 3.

Thank you, this is a great package, I really appreciate it.

Lauren
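
A quick way to confirm this symptom on any result is to scan the returned lists. The sketch below is plain Python and does not call pyclustering. `find_problem_clusters` is a hypothetical helper name, the sample result is made up, and the code assumes that `clusters[i]` holds the points assigned to `medoids[i]`, i.e. that the lists from `get_clusters()` and `get_medoids()` are parallel.

```python
def find_problem_clusters(clusters, medoids):
    """Return (empty, missing_medoid) cluster positions.

    Assumption: clusters[i] is the list of point indices assigned to medoids[i].
    """
    # Positions of clusters that captured no points at all.
    empty = [i for i, members in enumerate(clusters) if len(members) == 0]
    # Positions of non-empty clusters that do not contain their own medoid.
    missing_medoid = [
        i
        for i, (members, medoid) in enumerate(zip(clusters, medoids))
        if len(members) > 0 and medoid not in members
    ]
    return empty, missing_medoid


# Made-up result shaped like the one reported above: clusters 2 and 3 are empty.
clusters = [[0, 4, 5], [1, 6], [], []]
medoids = [0, 1, 2, 3]
empty, missing_medoid = find_problem_clusters(clusters, medoids)
print("empty clusters:", empty)
print("clusters without their medoid:", missing_medoid)
```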

annoviko · 2020-11-24T16:26:37Z

Hello @laurenleesc ,

K-Medoids was changed to align it with the following paper: Erich Schubert and Peter J. Rousseeuw. Faster k-medoids clustering: Improving the PAM, CLARA, and CLARANS algorithms. In Similarity Search and Applications, pages 171–187, Cham, 2019. Springer International Publishing.

I have reviewed the code and found the problem in the C++ implementation (which is used by default). I will release 0.10.1.1 with the hotfix as soon as possible. Thank you for reporting this, I really appreciate it.

As a workaround, you can pass the option ccore=False, which forces the library to use the Python implementation. Be aware that it is going to be slower than the C++ implementation.
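
As a minimal sketch of that workaround, the wrapper below passes `ccore=False` to the `kmedoids` constructor. `run_kmedoids_python` is a hypothetical helper name, not part of pyclustering, and the import is done inside the function so that merely defining the helper does not require the library to be installed.

```python
def run_kmedoids_python(data, initial_medoids, **kwargs):
    """Run K-Medoids with the pure-Python implementation (ccore=False).

    Extra keyword arguments (for example metric or data_type) are passed
    through to the kmedoids constructor unchanged.
    """
    # Imported lazily: defining this helper does not need pyclustering.
    from pyclustering.cluster.kmedoids import kmedoids

    instance = kmedoids(data, initial_medoids, ccore=False, **kwargs)
    instance.process()
    return instance.get_clusters(), instance.get_medoids()
```

Calling `run_kmedoids_python(df2, [0, 1, 2, 3], metric=metric)` would mirror the call from the report while bypassing the C++ core.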

annoviko · 2020-11-24T20:07:27Z

Hi @laurenleesc ,

I have just released version 0.10.1.1 of the library with the fix for the problem. Could you please try it?
Link on PyPI: https://pypi.org/project/pyclustering/
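
Because the behavior differs between servers, it can help to confirm which release is actually installed before retesting. The following is a sketch using only the standard library (Python 3.8+). `version_tuple` and `installed_version` are hypothetical helper names, and the parsing assumes purely numeric dotted versions such as the ones mentioned in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def version_tuple(text):
    """Turn '0.10.1.1' into (0, 10, 1, 1) so releases compare numerically.

    Assumption: every dot-separated part is a plain integer.
    """
    return tuple(int(part) for part in text.split("."))


def installed_version(package="pyclustering"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("pyclustering is not installed in this environment")
else:
    has_hotfix = version_tuple(found) >= version_tuple("0.10.1.1")
    print("pyclustering", found, "(>= 0.10.1.1)" if has_hotfix else "(older than 0.10.1.1)")
```

Comparing tuples rather than raw strings matters here: as strings, "0.9.3.1" sorts after "0.10.1", which is the wrong order.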

laurenleesc · 2020-11-24T22:15:32Z

@annoviko,

Unfortunately, it is still an issue. I have attached the code and the simulated dataset I used...

Thank you very much!

Lauren
troubleshoot.zip

annoviko · 2020-11-25T10:00:43Z

@laurenleesc ,

Thanks for the collaboration! I will investigate the issue using your dataset.

annoviko · 2020-11-25T22:54:44Z

Hello @laurenleesc ,

I have corrected the issue. The correction is available in 0.10.1.2 (pip3 install pyclustering). pyclustering now eliminates a cluster if nothing was 'captured' by it, in line with the greedy strategy described in the paper. Previously it kept medoids that were represented by totally identical points, which contradicts the paper a bit.
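
To illustrate why identical points can leave a cluster empty, here is a small plain-Python sketch. It is not pyclustering's implementation: `assign_to_medoids` and `drop_empty` are hypothetical helpers, the distance matrix is made up, and breaking ties in favor of the earliest medoid is an assumption chosen for the illustration.

```python
def assign_to_medoids(distances, medoids):
    """Assign every point to its nearest medoid.

    Assumption: on a tie, the medoid that comes first in `medoids` wins
    (min() returns the first minimal element).
    """
    clusters = [[] for _ in medoids]
    for point in range(len(distances)):
        nearest = min(range(len(medoids)), key=lambda j: distances[point][medoids[j]])
        clusters[nearest].append(point)
    return clusters


def drop_empty(clusters, medoids):
    """Remove empty clusters together with their medoids."""
    kept = [(c, m) for c, m in zip(clusters, medoids) if c]
    return [c for c, _ in kept], [m for _, m in kept]


# Points 0 and 1 are identical (distance 0); point 2 is far from both.
distances = [
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
medoids = [0, 1, 2]

clusters = assign_to_medoids(distances, medoids)
print(clusters)  # [[0, 1], [], [2]]

clusters, medoids = drop_empty(clusters, medoids)
print(clusters, medoids)  # [[0, 1], [2]] [0, 2]
```

Medoid 1 captures nothing because medoid 0 is equally close to every point, including point 1 itself, so three requested clusters collapse to two.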

For your code, the behavior is going to be the following:

import pandas as pd
import numpy as np

from pyclustering.cluster.kmedoids import kmedoids
import warnings
warnings.filterwarnings("ignore")

import nltk

def split(word):
    return [char for char in word]

def dist_matrix(data):
    for k in range(0,len(data)):
        #print(split(data[k]))
        for m in range(0,len(data)):
            Matrix[m,k]=nltk.jaccard_distance(set(split(data[m])),set(split(data[k])))

klist=[2,3,4,5,6,7,8,9,10,11,12,13,14,15]

df = pd.read_csv('df_train_sim13_exp0.csv')
df['true_index'] = df.index
df2 = df['String'].values
l = (len(df), len(df))
Matrix = np.zeros(l, dtype=float)  # np.float was removed in NumPy 1.24; the builtin float is equivalent
dist_matrix(df2)
# print(Matrix)
np.save('jaccard_exp0_simulated_set13.npy', Matrix)

data = np.load('jaccard_exp0_simulated_set13.npy')

for k in klist:
    initial_medoids=list(range(0,k))

    kmedoids_instance=kmedoids(data,initial_medoids,data_type='distance_matrix')
    kmedoids_instance.process()
    clusters=kmedoids_instance.get_clusters()
    medoids=kmedoids_instance.get_medoids()
    print("- Data length: %d" % len(data))
    print("- Amount clusters: %d" % len(clusters))
    print(clusters)

Output is the following:

- Data length: 912
- Amount clusters: 2
[[0, 1, 3, 4, 5, 6, 7, 8, 9, 10, ...
- Data length: 912
- Amount clusters: 3
[[0, 2, 4, 5, 6, 7, 8, 9, 10, ...
- Data length: 912
- Amount clusters: 4
[[0, 4, 5, 6, 7, 8, 9, 10, ...
- Data length: 912
- Amount clusters: 4
[[0, 4, 5, 6, 7, 8, 9, 10, ...
- Data length: 912
- Amount clusters: 4
[[0, 4, 5, 6, 7, 8, 9, 10, ...
- Data length: 912
- Amount clusters: 4
[[0, 4, 5, 6, 7, 8, 9, 10, ...
- Data length: 912
- Amount clusters: 4
[[0, 4, 5, 6, 7, 8, 9, 10, ...
- Data length: 912
- Amount clusters: 4
[[0, 4, 5, 6, 7, 8, 9, 10, ...
- Data length: 912
- Amount clusters: 4
[[0, 4, 5, 6, 7, 8, 9, 10, ...
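
The script above computes the Jaccard distance on the set of characters of each string, so two different strings that use the same characters are at distance 0 and count as identical points. The sketch below reproduces that distance in plain Python, without nltk or NumPy, and lists such pairs. The helper names and the sample strings are made up for illustration, and returning 0.0 for two empty strings is an assumption.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets: (|a ∪ b| - |a ∩ b|) / |a ∪ b|."""
    union = a | b
    if not union:
        # Assumption: two empty strings are treated as identical.
        return 0.0
    return (len(union) - len(a & b)) / len(union)


def jaccard_matrix(strings):
    """Build the symmetric distance matrix over the character sets of `strings`."""
    char_sets = [set(s) for s in strings]
    n = len(char_sets)
    matrix = [[0.0] * n for _ in range(n)]
    # The matrix is symmetric with a zero diagonal, so fill the upper half only.
    for i in range(n):
        for j in range(i + 1, n):
            d = jaccard_distance(char_sets[i], char_sets[j])
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


# Made-up strings, not the contents of df_train_sim13_exp0.csv.
strings = ["abc", "abd", "cab", "xyz"]
matrix = jaccard_matrix(strings)

# Pairs of distinct rows at distance 0: indistinguishable to the clustering.
zero_pairs = [
    (i, j)
    for i in range(len(strings))
    for j in range(i + 1, len(strings))
    if matrix[i][j] == 0.0
]
print(zero_pairs)  # [(0, 2)]
```

If the first k rows of a dataset contain such pairs, initial medoids chosen as `list(range(0, k))` can include points at distance 0 from each other, which is the case in which a cluster ends up empty and is eliminated.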

laurenleesc · 2020-11-26T14:26:42Z

Thanks! This works great.

I've attempted to read your documentation on the k-medoids but would you mind mentioning the paper you're referencing?

annoviko · 2020-11-26T14:48:06Z

@laurenleesc , all algorithms are accompanied by references to the corresponding papers. You have to check the namespace documentation (I should probably duplicate the references for the classes as well). Here is an example from the current documentation:

Make sure that you are reading the latest documentation: https://pyclustering.github.io/docs/0.10.1/html/

annoviko self-assigned this Nov 24, 2020

annoviko added the Bug (Tasks related to found bugs) label Nov 24, 2020

annoviko added a commit that referenced this issue Nov 24, 2020

#659: [Hotfix] kmedoids returns empty cluster lists for version 0.10.1.

a063ca9

annoviko added a commit that referenced this issue Nov 24, 2020

#659: [Hotfix] Meta-data for 0.10.1.1 release with the hotfix.

384baeb

annoviko added a commit that referenced this issue Nov 25, 2020

#659: [Hotfix] Empty clusters and their medoids are erased for K-Medoids.

19a1d6c

annoviko added a commit that referenced this issue Nov 25, 2020

#659: Unit-test 'utest_kmedoids.totally_similar_data' correction.

5b80a67

laurenleesc closed this as completed Nov 26, 2020
