Inconsistent attribution of individuals to clusters #120

aloboa · 2024-07-29T09:35:34Z

Given

library(dendextend)  # provides the cutree() method for dendrogram objects
hc <- hclust(dist(USArrests[c(1, 6, 13, 20, 23), ]), "ave")
dend <- as.dendrogram(hc)
plot(dend)

I think the following difference should be considered as a bug:

a <- cutree(dend, h=50)
b <- cutree(dend, h=50, order_clusters_as_data = FALSE)
table(a)
a
1 2 3 
3 1 1 

table(b)
b
1 2 3 
1 1 3

Changing the order of the labels in the returned vector is one thing; changing the cluster to which a given element is assigned is another. In this example, Minnesota should be in the same cluster in both cases, and the number of individuals within each cluster should be the same.

talgalili · 2024-07-29T19:38:57Z

Thanks. I'm not likely to address this in the near future. But if you propose a fix, I'd be happy to review it. Thanks.

aloboa · 2024-07-30T16:46:59Z

If you do not fix this issue, please clarify the documentation of dendextend::cutree() as soon as possible.
It should be:
order_clusters_as_data
logical, defaults to TRUE. There are two ways by which to name and order the clusters: 1) By the order of the original data. 2) by the order of the labels in the dendrogram.
TRUE: clusters are named and ordered according to their sequence in the data.
FALSE: clusters are named and ordered according to their sequence in the dendrogram.

If you fix the issue, you probably want to create a new function named cutdend(dend), where dend should be a dendrogram (e.g. dend <- as.dendrogram(hc)), to avoid confusion with base R.

The documentation would be:
order_clusters_as_data
logical, defaults to TRUE. Clusters are always named according to their sequence in the dendrogram. There are two ways by which to order the clusters: 1) By the order of the original data. 2) by the order of the labels in the dendrogram.
TRUE: clusters are ordered according to their sequence in the data.
FALSE: clusters are ordered according to their sequence in the dendrogram.

In case
b <- cutdend(dend,order_clusters_as_data = TRUE)
b would be the same as
a <- cutdend(dend,order_clusters_as_data = FALSE)
except that a would be reordered.

The fix is very simple, just look at this example:

d <- USArrests[c(1, 6, 13, 20, 23), ]
d
          Murder Assault UrbanPop Rape
Alabama     13.2     236       58 21.2
Colorado     7.9     204       78 38.7
Illinois    10.4     249       83 24.0
Maryland    11.3     300       67 27.8
Minnesota    2.7      72       66 14.9

hc <- hclust(dist(d), "ave")
dend <- as.dendrogram(hc)
a  <-  cutree(dend,h=50,order_clusters_as_data = FALSE)
y <- row.names(d)
x <- names(a)
y
[1] "Alabama"   "Colorado"  "Illinois"  "Maryland"  "Minnesota"

a
Minnesota  Maryland  Colorado   Alabama  Illinois 
        1         2         3         3         3 

a[order(match(x,y))]
  Alabama  Colorado  Illinois  Maryland Minnesota 
        3         3         3         2         1
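
Assuming the membership vector is named (as it is in the output above), the same reordering can probably be written more directly by indexing the vector with the row names of the data. The sketch below hard-codes the values printed above so that it runs without dendextend; `by_name` and `by_match` are illustrative names only.

```r
# Cluster ids in dendrogram order, copied from the output above
a <- c(Minnesota = 1, Maryland = 2, Colorado = 3, Alabama = 3, Illinois = 3)

# Row order of the original data
y <- c("Alabama", "Colorado", "Illinois", "Maryland", "Minnesota")

# Indexing a named vector by name returns its elements in the requested order
by_name <- a[y]

# The order(match()) form used above, for comparison
by_match <- a[order(match(names(a), y))]

identical(by_name, by_match)
```

Either form only changes the order of the observations in the returned vector; the cluster ids themselves are left untouched.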

jefferis · 2024-07-30T18:41:10Z

@aloboa sorry to see that things did not behave as you expected, but I am slightly confused by your opening description of this issue.

One thing is changing the order of the labels in the vector, and another one is changing the cluster to which a given element has been classified. In this example, Minnesota should be in the same cluster in both cases, and the number of individuals within each cluster should be the same.

The whole point of the option (order_clusters_as_data = FALSE) is to assign the clusters (labelled 1 ... k) as they appear in the dendrogram. This means that the numeric labels of the clusters must be different and therefore that the membership of the clusters identified by a given label in 1...k will be different.

Now I agree that for some purposes you might wish to return the integer cluster membership vector for each individual observation ordered by the input data rather than by the dendrogram. But that is a choice and because this is doing something different to base R I don't think you can say that one behaviour or another is a bug. I suppose one could add yet another argument asking to return the cluster membership in data order (e.g. order_return_as_data, order_membership_as_data or similar)

In other words Minnesota should be in a different cluster in the two cases. But you could discuss the ordering of the return vector.
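
To make the distinction concrete, here is a minimal base-R sketch (not dendextend's implementation) that cross-tabulates the two labelings for the five states in this issue. The data-order ids come from stats::cutree(); the dendrogram-order ids are hard-coded from the output posted earlier in this thread. The names `xt` and `same_partition` are made up for the illustration.

```r
# Data-order cluster ids from base R for the five states in this issue
d <- USArrests[c(1, 6, 13, 20, 23), ]
hc <- hclust(dist(d), "ave")
a <- stats::cutree(hc, h = 50)

# Dendrogram-order ids as printed earlier in this thread
b <- c(Minnesota = 1, Maryland = 2, Colorado = 3, Alabama = 3, Illinois = 3)

# Align b to the names of a, then cross-tabulate the two labelings
xt <- table(data_order = a, dendrogram_order = b[names(a)])
xt

# Same partition: every row and every column has exactly one non-empty cell
same_partition <- all(rowSums(xt > 0) == 1) && all(colSums(xt > 0) == 1)
same_partition
```

If the check holds, the two calls group the states identically and differ only in which integer is attached to each group.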

jefferis · 2024-07-30T18:42:36Z

Also although I understand the intent behind your suggestion to change the docs:

clusters are named and ordered according to their sequence in the data.

I don't think it works because, for the clusters, naming and ordering are the same thing. What you want is to change the sort order of the returned cluster membership vector for observations.

If you want some ideas about proposing changes to the docs to avoid surprise then maybe take a look at dendroextras::slice (https://rdrr.io/cran/dendroextras/man/slice.html), which does the same thing as cutree(order_clusters_as_data = FALSE). Note also the example

slice(hc,k=5)[order(hc$order)]

which gives the output you want. The group membership vector is not just an ascending or descending set of cluster ids as in your smaller examples. This may help to highlight that observations != clusters.
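
For readers without dendroextras installed, the idea can be approximated in base R. This is only a sketch under the assumption that "dendrogram order" means numbering clusters by their first appearance along hc$order; it is not the code used by slice() or by dendextend, and `dend_ids` is an illustrative name.

```r
hc <- hclust(dist(USArrests), "ave")

# Base R numbers clusters by first appearance in the data
cl <- stats::cutree(hc, k = 5)

# Renumber clusters by first appearance along the dendrogram (left to right)
dend_ids <- match(cl, unique(cl[hc$order]))
names(dend_ids) <- names(cl)

# Walking along the dendrogram, the ids never decrease ...
!is.unsorted(dend_ids[hc$order])

# ... while the same ids listed in data order need not be sorted at all
head(dend_ids)
```

The vector is returned in data order here, one entry per observation, even though the cluster numbering follows the dendrogram.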

As a side note, I have to say that anyone I have ever tried to teach clustering methods to finds it very strange that clusters in base R are not assigned in the order they appear in the dendrogram.

talgalili · 2024-07-30T19:00:31Z

Thanks Aloboa. Related to what Gregory wrote, I don't think it's a bug but rather a behaviour which is not documented well enough to avoid all possible confusion. I'll keep this issue open and take a look at it in the coming weeks (assuming nothing critical would stop me from taking a look).

aloboa added the bug label Jul 29, 2024
