kmedoids returns empty cluster lists for version 0.10.1 #659

Closed
laurenleesc opened this issue Nov 24, 2020 · 7 comments

@laurenleesc

Hi,

Previously, my code ran as expected on one server with version 0.9.3.1. However, the same code run on a different server with version 0.10.1 returned some empty clusters for the same dataset and initial medoids.

initial_medoids=[0,1,2,3]
kmedoids_instance=kmedoids(df2,initial_medoids,metric=metric)
kmedoids_instance.process()
clusters=kmedoids_instance.get_clusters()
medoids=kmedoids_instance.get_medoids()
print(clusters)

The above returns indices for clusters 0 and 1 but empty lists for clusters 2 and 3, even though there are no missing values in my data df2. I would expect, at the very least, the medoids themselves to be in clusters 2 and 3.

Thank you, this is a great package, I really appreciate it.

Lauren

@annoviko annoviko self-assigned this Nov 24, 2020
@annoviko annoviko added the Bug Tasks related to found bugs label Nov 24, 2020
@annoviko
Owner

annoviko commented Nov 24, 2020

Hello @laurenleesc ,

K-Medoids was changed to align it with the following paper: Erich Schubert and Peter J. Rousseeuw. Faster k-medoids clustering: Improving the PAM, CLARA, and CLARANS algorithms. In Similarity Search and Applications, pages 171–187, Cham, 2019. Springer International Publishing.

I have reviewed the code and found the problem in the C++ implementation (which is used by default). I will release 0.10.1.1 with the hotfix as soon as possible. Thank you for reporting this, I really appreciate it.

As a workaround, you can use the option ccore=False, which forces the library to use the Python implementation. Be aware that it is going to be slower than the C++ implementation.
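
A minimal sketch of that workaround, assuming the input is a precomputed distance matrix. The helper name `run_kmedoids_python_backend` and the toy matrix are hypothetical and only for illustration; `ccore` and `data_type` are the constructor options mentioned in this thread.

```python
def run_kmedoids_python_backend(distance_matrix, initial_medoids):
    """Run pyclustering's k-medoids with the pure-Python implementation."""
    # Imported inside the function so the helper can be defined
    # even on a machine where pyclustering is not installed.
    from pyclustering.cluster.kmedoids import kmedoids

    instance = kmedoids(
        distance_matrix,
        initial_medoids,
        data_type="distance_matrix",  # input holds pairwise distances
        ccore=False,                  # bypass the C++ core (slower)
    )
    instance.process()
    return instance.get_clusters(), instance.get_medoids()


# Hypothetical toy data: points 0 and 1 are close, points 2 and 3 are close.
toy_matrix = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.1],
    [0.8, 0.9, 0.1, 0.0],
]
```

Calling `run_kmedoids_python_backend(toy_matrix, [0, 2])` returns the cluster index lists and the final medoid indices, which can then be compared with the result obtained with the default `ccore=True`.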

@annoviko
Owner

annoviko commented Nov 24, 2020

Hi @laurenleesc ,

I have just released version 0.10.1.1 of the library with the correction for the problem. Could you please try it?
Link on PyPI: https://pypi.org/project/pyclustering/
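
Before retrying, it can help to confirm which release is actually installed on the server. The following is a sketch using only the standard library; the helper names `parse_version` and `is_at_least` are hypothetical, and the parser assumes purely numeric dotted versions such as `0.10.1.1`.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn a dotted numeric version like '0.10.1.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_at_least(minimum, package="pyclustering"):
    """Return True if `package` is installed at version `minimum` or newer."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return False  # package is not installed at all
    return parse_version(installed) >= parse_version(minimum)
```

For example, `is_at_least("0.10.1.1")` reports whether the hotfix release or a newer one is installed in the current environment.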

@laurenleesc
Author

@annoviko,

Unfortunately, it is still an issue. I have attached the code and the simulated dataset I used...

Thank you very much!

Lauren
troubleshoot.zip

@annoviko
Owner

@laurenleesc ,

Thank you for the collaboration! I will investigate the issue using your dataset.

@annoviko
Owner

annoviko commented Nov 25, 2020

Hello @laurenleesc ,

I have corrected the issue. The correction is available in 0.10.1.2 (pip3 install pyclustering). Now pyclustering eliminates empty clusters if nothing was 'captured' by them, in line with the greedy strategy described in the paper. Previously it kept medoids that were represented by completely identical points, but that contradicts the paper a bit.
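
To illustrate why identical points can leave a cluster empty, here is a small standalone sketch. It is not pyclustering's implementation: the function names, the toy matrix, and the tie-breaking rule (ties go to the earliest medoid) are assumptions made for the example.

```python
def assign_to_nearest_medoid(distance_matrix, medoids):
    """Assign each point to its nearest medoid; ties go to the earliest medoid."""
    clusters = [[] for _ in medoids]
    for point in range(len(distance_matrix)):
        distances = [distance_matrix[point][m] for m in medoids]
        clusters[distances.index(min(distances))].append(point)
    return clusters


def drop_empty_clusters(clusters, medoids):
    """Keep only the (cluster, medoid) pairs whose cluster captured a point."""
    kept = [(c, m) for c, m in zip(clusters, medoids) if c]
    return [c for c, _ in kept], [m for _, m in kept]


# Made-up data: points 0 and 1 are identical (distance 0), point 2 is far away.
toy = [
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
clusters = assign_to_nearest_medoid(toy, medoids=[0, 1, 2])
print(clusters)                                  # [[0, 1], [], [2]]
print(drop_empty_clusters(clusters, [0, 1, 2]))  # ([[0, 1], [2]], [0, 2])
```

Medoid 1 is a duplicate of medoid 0, so even medoid 1 itself is captured by the first cluster and the second cluster stays empty. Dropping it leaves fewer clusters than initial medoids.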

For your code, the behavior is going to be the following:

import pandas as pd
import numpy as np

from pyclustering.cluster.kmedoids import kmedoids
import warnings
warnings.filterwarnings("ignore")

import nltk

def split(word):
    return [char for char in word]

def dist_matrix(data):
    for k in range(0,len(data)):
        #print(split(data[k]))
        for m in range(0,len(data)):
            Matrix[m,k]=nltk.jaccard_distance(set(split(data[m])),set(split(data[k])))

klist=[2,3,4,5,6,7,8,9,10,11,12,13,14,15]

df = pd.read_csv('df_train_sim13_exp0.csv')
df['true_index'] = df.index
df2 = df['String'].values
l = (len(df), len(df))
Matrix = np.zeros(l, dtype=float)  # builtin float; the np.float alias was removed in NumPy 1.24
dist_matrix(df2)
# print(Matrix)
np.save('jaccard_exp0_simulated_set13.npy', Matrix)

data = np.load('jaccard_exp0_simulated_set13.npy')

for k in klist:
    initial_medoids=list(range(0,k))

    kmedoids_instance=kmedoids(data,initial_medoids,data_type='distance_matrix')
    kmedoids_instance.process()
    clusters=kmedoids_instance.get_clusters()
    medoids=kmedoids_instance.get_medoids()
    print("- Data length: %d" % len(data))
    print("- Amount clusters: %d" % len(clusters))
    print(clusters)

The output is the following:

- Data length: 912
- Amount clusters: 2
[[0, 1, 3, 4, 5, 6, 7, 8, 9, 10, ...
- Data length: 912
- Amount clusters: 3
[[0, 2, 4, 5, 6, 7, 8, 9, 10, ...
- Data length: 912
- Amount clusters: 4
[[0, 4, 5, 6, 7, 8, 9, 10, ...
- Data length: 912
- Amount clusters: 4
[[0, 4, 5, 6, 7, 8, 9, 10, ...
- Data length: 912
- Amount clusters: 4
[[0, 4, 5, 6, 7, 8, 9, 10, ...
- Data length: 912
- Amount clusters: 4
[[0, 4, 5, 6, 7, 8, 9, 10, ...
- Data length: 912
- Amount clusters: 4
[[0, 4, 5, 6, 7, 8, 9, 10, ...
- Data length: 912
- Amount clusters: 4
[[0, 4, 5, 6, 7, 8, 9, 10, ...
- Data length: 912
- Amount clusters: 4
[[0, 4, 5, 6, 7, 8, 9, 10, ...
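
The script above fills the distance matrix through a global array and nltk. As a sketch of the same computation without nltk, the character-set Jaccard distance can be written directly. The function names are hypothetical, and returning 0.0 for two empty strings is an assumption of this sketch.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are treated as identical
    return (len(union) - len(a & b)) / len(union)


def jaccard_matrix(strings):
    """Symmetric matrix of character-set Jaccard distances, as nested lists."""
    char_sets = [set(s) for s in strings]
    n = len(char_sets)
    matrix = [[0.0] * n for _ in range(n)]
    # The distance is symmetric, so only the upper triangle is computed.
    for i in range(n):
        for j in range(i + 1, n):
            d = jaccard_distance(char_sets[i], char_sets[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


print(jaccard_matrix(["abc", "abd", "xyz"]))
# [[0.0, 0.5, 1.0], [0.5, 0.0, 1.0], [1.0, 1.0, 0.0]]
```

Strings with the same set of characters get distance 0 even when the strings differ, which is one way a dataset can contain the 'totally identical' points described above.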

@laurenleesc
Author

Thanks! This works great.

I've attempted to read your documentation on k-medoids, but would you mind mentioning the paper you're referencing?

@annoviko
Owner

annoviko commented Nov 26, 2020

@laurenleesc , every algorithm is accompanied by references to the corresponding papers. You have to check the namespace documentation (I should probably duplicate it for the classes as well). Here is an example from the current documentation:

Make sure that you are reading the latest documentation: https://pyclustering.github.io/docs/0.10.1/html/

[two screenshots from the documentation]
