ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all() #570

Closed
nabilEM opened this issue Dec 22, 2019 · 13 comments
Labels: Bug, Question

@nabilEM

nabilEM commented Dec 22, 2019

Thank you for your library, it is very useful for me and the data mining community. I wanted to run the BIRCH algorithm, but I got this error from cftree.py: if (merged_entry.get_diameter() > self.__threshold): ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

Also, when I pass the diameter parameter while instantiating the BIRCH algorithm, I get this error: birch_instance = birch(x,3,diameter=0.1)
TypeError: __init__() got an unexpected keyword argument 'diameter'.

One last question, would it be possible to leave the parameter number_clusters optional to let the user use other clustering algorithms in the last step of birch instead of the hierarchical method?

@annoviko
Owner

annoviko commented Dec 23, 2019

Hi, @nabilEM ,

It looks like you have tried to use the new API with an old version of the library.

There is a new version, 0.9.3, that includes many changes related to BIRCH, and I would strongly recommend using it. Here is the documentation with an example: https://pyclustering.github.io/docs/0.9.3/html/d6/d00/classpyclustering_1_1cluster_1_1birch_1_1birch.html

An example from the documentation:

from pyclustering.cluster.birch import birch
from pyclustering.cluster import cluster_visualizer
from pyclustering.utils import read_sample
from pyclustering.samples.definitions import FAMOUS_SAMPLES

# Sample for cluster analysis (represented by list)
sample = read_sample(FAMOUS_SAMPLES.SAMPLE_OLD_FAITHFUL)

# Create BIRCH algorithm
birch_instance = birch(sample, 2, diameter=3.0)

# Cluster analysis
birch_instance.process()

# Obtain results of clustering
clusters = birch_instance.get_clusters()

# Visualize allocated clusters
visualizer = cluster_visualizer()
visualizer.append_clusters(clusters, sample)
visualizer.show()

Regarding:

One last question, would it be possible to leave the parameter number_clusters optional to let the user use other clustering algorithms in the last step of birch instead of the hierarchical method?

I will think about it. But in this case it seems much more logical to me to use the CF-tree directly, because BIRCH stores the data in a CF-tree (rescaling it if required) in the first phase and then applies a hierarchical algorithm. You can also use the get_cf_entries() method to get all CF-entries and cluster them with another algorithm. If you need an example of how to apply another algorithm to CF-entries, I can provide it.

@annoviko annoviko self-assigned this Dec 23, 2019
@annoviko annoviko added the Question Tasks that are questions from users label Dec 23, 2019
@nabilEM
Author

nabilEM commented Dec 23, 2019

Thank you for your response. I finally installed pyclustering version 0.9.3. But when I ran the Birch algorithm, I got this error: lib\site-packages\pyclustering\container\cftree.py", line 878, in insert
node = leaf_node(entry, None, [entry], None)
TypeError: __init__() takes 4 positional arguments but 5 were given

annoviko added a commit that referenced this issue Dec 23, 2019
@annoviko annoviko added the Bug Tasks related to found bugs label Dec 23, 2019
@nabilEM
Author

nabilEM commented Dec 23, 2019

I installed the version 0.9.3.1 but I got the same error as in my first question of this issue: birch_instance.process()
File "C:\ProgramData\Anaconda3\envs\ha\lib\site-packages\pyclustering\cluster\birch.py", line 160, in process
self.__insert_data()
File "C:\ProgramData\Anaconda3\envs\ha\lib\site-packages\pyclustering\cluster\birch.py", line 279, in __insert_data
self.__tree.insert_point(point)
File "C:\ProgramData\Anaconda3\envs\ha\lib\site-packages\pyclustering\container\cftree.py", line 866, in insert_point
self.insert(entry)
File "C:\ProgramData\Anaconda3\envs\ha\lib\site-packages\pyclustering\container\cftree.py", line 888, in insert
child_node_updation = self.__recursive_insert(entry, self.__root)
File "C:\ProgramData\Anaconda3\envs\ha\lib\site-packages\pyclustering\container\cftree.py", line 938, in __recursive_insert
return self.__insert_for_leaf_node(entry, search_node)
File "C:\ProgramData\Anaconda3\envs\ha\lib\site-packages\pyclustering\container\cftree.py", line 960, in __insert_for_leaf_node
if merged_entry.get_diameter() > self.__threshold:
File "C:\ProgramData\Anaconda3\envs\ha\lib\site-packages\pyclustering\container\cftree.py", line 292, in get_diameter
if diameter_part < 0.000000001:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

It seems that diameter_part variable is of type <class 'numpy.ndarray'>. In my case it contains this value when I ran Birch: [-484852. -467572. -540116. -463808. -526004. -506580. -466084. -532588.
-514772. -541428. -541428. -537008. -527316. -509644. -488884. -463012.]
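
For readers hitting the same message, here is a minimal sketch of why a check like `if diameter_part < 0.000000001:` fails once the value is an array. It assumes only NumPy and uses a shortened copy of the values reported above; it does not reproduce the library's internals.

```python
import numpy as np

# Shortened stand-in for the array-valued intermediate reported above.
diameter_part = np.array([-484852.0, -467572.0, -540116.0])

# Comparing an array with a scalar yields one boolean per element ...
mask = diameter_part < 0.000000001

# ... and Python cannot reduce several booleans to a single truth value.
try:
    if mask:
        pass
except ValueError as error:
    message = str(error)
    print(message)

# With a scalar, which is what the check expects, the comparison works.
scalar_part = float(diameter_part[0])
is_small = scalar_part < 0.000000001
```

The error therefore points at the type of the intermediate value (an array where a scalar was expected) rather than at the threshold itself.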

@annoviko
Owner

@nabilEM , I have just uploaded a hotfix to PyPI. You can upgrade to it, but it only helps with the second problem. For the first one, I need to see your code to understand the problem. Could you please show how you use the algorithm, and what kind of data you use?

@nabilEM
Author

nabilEM commented Dec 23, 2019

My code is below. I used the pendigits data downloaded from https://archive.ics.uci.edu/ml/machine-learning-databases/pendigits/

from pyclustering.cluster.birch import birch
import numpy as np
import pandas as ps

def load_data():
    data1=ps.read_csv("datasets/pendigits.tes",sep=",",header=None)
    data2=ps.read_csv("datasets/pendigits.tra",sep=",",header=None)
    data=ps.concat([data1,data2])
    #print(data)
    labels = data.iloc[:,-1]
    data.drop(data.columns[len(data.columns)-1], axis=1, inplace=True)
    x=np.array(data)
    y=np.array(labels)
    return x,y
x,y=load_data()
# Create BIRCH algorithm
birch_instance = birch(x,3,diameter=0.1)
# Cluster analysis
birch_instance.process()

# Obtain results of clustering
clusters = birch_instance.get_clusters()

# Obtain information how does the 'Lsun' sample is encoded in the CF-tree.
cf_entries = birch_instance.get_cf_entries()
cf_clusters = birch_instance.get_cf_cluster()

cf_centroids = [entry.get_centroid() for entry in cf_entries]

# Visualize allocated clusters
visualizer = cluster_visualizer(2, 2, titles=["Encoded data by CF-entries", "Data clusters"])
visualizer.append_clusters(cf_clusters, cf_centroids, canvas=0)
visualizer.append_clusters(clusters, sample, canvas=1)
visualizer.show()

@annoviko
Owner

annoviko commented Dec 23, 2019

@nabilEM , you have to convert x to list, numpy.array is not supported for BIRCH.

[in] data (list): An input data represented as a list of points (objects) where each point is represented by a list of coordinates.
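
A minimal sketch of that conversion, assuming the data is held in a NumPy array as in the code above. The small array and the `run_birch` helper are hypothetical and only for illustration; the `birch(...)` call inside it mirrors the one used earlier in this thread.

```python
import numpy as np

# Hypothetical stand-in for the (n_samples, n_features) array built by load_data().
x = np.array([[47, 100, 27], [0, 89, 27], [0, 57, 31]])

# ndarray.tolist() returns nested Python lists with native Python numbers.
points = x.tolist()


def run_birch(points, number_clusters=3, diameter=0.1):
    """Hypothetical helper: run BIRCH on a list of points."""
    # Imported lazily so the conversion above can be tried without pyclustering.
    from pyclustering.cluster.birch import birch

    instance = birch(points, number_clusters, diameter=diameter)
    instance.process()
    return instance.get_clusters()
```

With the real dataset this would be `x, y = load_data()` followed by `clusters = run_birch(x.tolist())`.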

@nabilEM
Author

nabilEM commented Dec 23, 2019

A very big thank you, it worked. Thank you again for your wonderful library!

@nabilEM
Author

nabilEM commented Dec 23, 2019

It would be interesting if you could add the point indices contained in each entry of the leaf nodes. This will allow users to directly manipulate the micro-clusters in addition to the aggregated calculations such as for example the linear sum.

@annoviko
Owner

annoviko commented Dec 24, 2019

@nabilEM ,

That feature used to exist, but it was useless: you shouldn't rely on these indexes, because the clustering results would be wrong. It is much better to calculate the distance to the CF-entries and choose the shortest one (apply K-Means, X-Means or G-Means). This is why BIRCH performs cluster analysis at the end.
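
The "choose the shortest distance" step can be sketched in plain Python as below. This is an illustration under assumptions, not the library's code: `assign_to_nearest` is a hypothetical helper and the centroids are made up; in practice they could come from `entry.get_centroid()` on the entries returned by `get_cf_entries()`, as in the code earlier in this thread.

```python
import math


def assign_to_nearest(points, centroids):
    """Return, for each centroid, the indexes of the points closest to it."""
    clusters = [[] for _ in centroids]
    for index, point in enumerate(points):
        # Euclidean distance from this point to every centroid.
        distances = [math.dist(point, centroid) for centroid in centroids]
        clusters[distances.index(min(distances))].append(index)
    return clusters


# Hypothetical data: four points and two made-up CF-entry centroids.
points = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]]
centroids = [[0.1, 0.05], [5.05, 4.95]]
clusters = assign_to_nearest(points, centroids)
```

Each point is assigned on its own here, so the result does not depend on which CF-entry happened to absorb the point while the tree was being built.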

@annoviko
Owner

@nabilEM ,

But if you need it, I can provide you a patch with these changes.

@nabilEM
Author

nabilEM commented Dec 24, 2019

@annoviko Thanks for your help. I would be interested to know why using the point indexes contained in the entries, instead of the aggregated values of the entries (LS, SS), distorts the clustering. Perhaps not taking the points into account individually means that outliers are not identified correctly.
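
For context on the aggregated values mentioned here, the sketch below computes a clustering feature (N, LS, SS) and derives the centroid and average pairwise diameter from it. It follows the textbook BIRCH definitions and is not necessarily how pyclustering implements them; all function names are hypothetical.

```python
import math


def cf_summary(points):
    """Return (N, LS, SS): count, per-dimension linear sum, scalar square sum."""
    n = len(points)
    dims = len(points[0])
    linear_sum = [sum(p[d] for p in points) for d in range(dims)]
    square_sum = sum(c * c for p in points for c in p)
    return n, linear_sum, square_sum


def cf_centroid(n, linear_sum):
    """Centroid of the summarized points: LS / N."""
    return [value / n for value in linear_sum]


def cf_diameter(n, linear_sum, square_sum):
    """Average pairwise distance: sqrt((2*N*SS - 2*|LS|^2) / (N*(N-1)))."""
    if n < 2:
        return 0.0
    ls_norm_sq = sum(value * value for value in linear_sum)
    part = (2.0 * n * square_sum - 2.0 * ls_norm_sq) / (n * (n - 1))
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(part, 0.0))


n, ls, ss = cf_summary([[0.0, 0.0], [2.0, 0.0]])
centroid = cf_centroid(n, ls)
diameter = cf_diameter(n, ls, ss)
```

In this definition SS is a single scalar, so the quantity under the square root is a scalar as well.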

@annoviko
Owner

annoviko commented Jan 8, 2020

Hi, @nabilEM ,

If it is still relevant, here is a patch for the '0.9.3.rel' branch that makes it possible to get indexes from CF-entries:

birch_instance.process()
cf_entries = birch_instance.get_cf_entries()

for entry in cf_entries:
    print(entry.indexes)

cf_tree_index_patch_for_private_usage.zip

@annoviko annoviko closed this as completed Jan 9, 2020
@nabilEM
Author

nabilEM commented Jan 20, 2020

Thank you @annoviko !
