Inconsistent attribution of individuals to clusters #120

Open
aloboa opened this issue Jul 29, 2024 · 5 comments
@aloboa

aloboa commented Jul 29, 2024

Given

library(dendextend)  # provides the cutree() method for dendrogram objects
hc <- hclust(dist(USArrests[c(1, 6, 13, 20, 23), ]), "ave")
dend <- as.dendrogram(hc)
plot(dend)

I think the following difference should be considered a bug:

a <- cutree(dend, h=50)
b <- cutree(dend, h=50, order_clusters_as_data = FALSE)
table(a)
a
1 2 3 
3 1 1 

table(b)
b
1 2 3 
1 1 3 

One thing is changing the order of the labels in the vector, and another one is changing the cluster to which a given element has been classified. In this example, Minnesota should be in the same cluster in both cases, and the number of individuals within each cluster should be the same.

@aloboa aloboa added the bug label Jul 29, 2024
@talgalili
Owner

talgalili commented Jul 29, 2024 via email

@aloboa
Author

aloboa commented Jul 30, 2024

If you do not fix this issue, please clarify the documentation of dendextend::cutree() as soon as possible.
It should be:
order_clusters_as_data
logical, defaults to TRUE. There are two ways by which to name and order the clusters: 1) By the order of the original data. 2) by the order of the labels in the dendrogram.
TRUE: clusters are named and ordered according to their sequence in the data.
FALSE: clusters are named and ordered according to their sequence in the dendrogram.

If you fix the issue, you probably want to create a new function named cutdend(dend), where dend should be a dendrogram (e.g. dend <- as.dendrogram(hc)), to avoid confusion with base R.

The documentation would be:
order_clusters_as_data
logical, defaults to TRUE. Clusters are always named according to their sequence in the dendrogram. There are two ways by which to order the clusters: 1) By the order of the original data. 2) by the order of the labels in the dendrogram.
TRUE: clusters are ordered according to their sequence in the data.
FALSE: clusters are ordered according to their sequence in the dendrogram.

In that case,
b <- cutdend(dend,order_clusters_as_data = TRUE)
b would be the same as
a <- cutdend(dend,order_clusters_as_data = FALSE)
except that a would be reordered.

The fix is very simple; just look at this example:

d <- USArrests[c(1, 6, 13, 20, 23), ]
d
          Murder Assault UrbanPop Rape
Alabama     13.2     236       58 21.2
Colorado     7.9     204       78 38.7
Illinois    10.4     249       83 24.0
Maryland    11.3     300       67 27.8
Minnesota    2.7      72       66 14.9

hc <- hclust(dist(d), "ave")
dend <- as.dendrogram(hc)
a <- cutree(dend, h = 50, order_clusters_as_data = FALSE)
y <- row.names(d)
x <- names(a)
y
[1] "Alabama"   "Colorado"  "Illinois"  "Maryland"  "Minnesota"

a
Minnesota  Maryland  Colorado   Alabama  Illinois 
        1         2         3         3         3 

a[order(match(x,y))]
  Alabama  Colorado  Illinois  Maryland Minnesota 
        3         3         3         2         1 
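
The same reordering can probably be written more directly by indexing the named vector with the row names. This is a minimal base R sketch, with `a` typed in by hand from the output above rather than computed with dendextend; `by_match` and `by_name` are illustrative names only:

```r
# Membership vector in dendrogram order, copied by hand from the output above
a <- c(Minnesota = 1L, Maryland = 2L, Colorado = 3L, Alabama = 3L, Illinois = 3L)
# Row names of the data, in their original order
y <- row.names(USArrests[c(1, 6, 13, 20, 23), ])

by_match <- a[order(match(names(a), y))]  # the reordering used above
by_name  <- a[y]                          # index the named vector by label

identical(by_match, by_name)
```

Indexing by name only works when the labels are unique, which is the case for row names of a data frame.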

@jefferis
Collaborator

@aloboa sorry to see that things did not behave as you expected, but I am slightly confused by your opening description of this issue.

One thing is changing the order of the labels in the vector, and another one is changing the cluster to which a given element has been classified. In this example, Minnesota should be in the same cluster in both cases, and the number of individuals within each cluster should be the same.

The whole point of the option (order_clusters_as_data = FALSE) is to assign the clusters (labelled 1 ... k) as they appear in the dendrogram. This means that the numeric labels of the clusters must be different, and therefore that the membership of the cluster identified by a given label in 1 ... k will be different.

Now I agree that for some purposes you might wish to return the integer cluster membership vector for each individual observation ordered by the input data rather than by the dendrogram. But that is a choice, and because this is doing something different to base R, I don't think you can say that one behaviour or another is a bug. I suppose one could add yet another argument asking to return the cluster membership in data order (e.g. order_return_as_data, order_membership_as_data or similar).

In other words, Minnesota should be in a different cluster in the two cases. But you could discuss the ordering of the return vector.
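
To make the distinction concrete, here is a small base R sketch (no dendextend needed). It takes the memberships from stats::cutree() and renumbers the clusters in the order they first appear along the dendrogram. The names `cl_data` and `cl_dend` are illustrative, and the renumbering step is an assumption made to mimic order_clusters_as_data = FALSE, not the package's actual code:

```r
d  <- USArrests[c(1, 6, 13, 20, 23), ]
hc <- hclust(dist(d), "ave")

# Base R: clusters are numbered by first appearance in the data
cl_data <- cutree(hc, h = 50)

# Renumber the clusters by first appearance along the dendrogram leaves
cl_dend <- match(cl_data, unique(cl_data[hc$order]))
names(cl_dend) <- names(cl_data)

# Each row and each column has exactly one non-zero cell:
# the groups are the same, only their numeric labels differ
table(cl_data, cl_dend)
```

The partition is identical in both vectors; only the numbers attached to the groups change, which is why table() on each vector separately lists the same sizes in a different order.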

@jefferis
Collaborator

Also, although I understand the intent behind your suggestion to change the docs:

clusters are named and ordered according to their sequence in the data.

I don't think it works because, for the clusters, naming and ordering are the same thing. What you want is to change the sort order of the returned cluster membership vector for observations.

If you want some ideas about proposing changes to the docs to avoid surprise then maybe take a look at dendroextras::slice which does the same thing as cutree(order_clusters_as_data = FALSE). Note also the example

slice(hc,k=5)[order(hc$order)] 

which gives the output you want. The group membership vector is not just an ascending or descending set of cluster ids as in your smaller examples. This may help to highlight that observations != clusters.
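
The order(hc$order) idiom can be checked with base R alone. This is a minimal sketch, assuming (as the example above implies) that the membership vector comes back in dendrogram order; `v_data`, `v_dend` and `restored` are illustrative names:

```r
d  <- USArrests[c(1, 6, 13, 20, 23), ]
hc <- hclust(dist(d), "ave")

v_data <- cutree(hc, k = 3)   # memberships in data (row) order
v_dend <- v_data[hc$order]    # same memberships, rearranged into dendrogram order

# order(hc$order) is the inverse permutation of hc$order,
# so it maps dendrogram order back to data order
restored <- v_dend[order(hc$order)]
identical(restored, v_data)
```

Only the position of each observation in the vector changes here; the cluster id attached to each observation stays the same.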

As a side note, I have to say that anyone I have ever tried to teach clustering methods to finds it very strange that clusters in base R are not assigned in the order they appear in the dendrogram.

@talgalili
Owner

talgalili commented Jul 30, 2024 via email
