Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HierarchicalClustering Losing Values #11

Closed
Telofy opened this issue Dec 6, 2013 · 7 comments
Closed

HierarchicalClustering Losing Values #11

Telofy opened this issue Dec 6, 2013 · 7 comments
Labels

Comments

@Telofy
Copy link

Telofy commented Dec 6, 2013

Hi, and thanks for the library!

I’m losing values though:

In [1]: from cluster import HierarchicalClustering
In [2]: HierarchicalClustering([10, 30, 10, 35], lambda x,y: abs(x-y)).getlevel(10)
Out[2]: [[35], [10, 10, 10]]

I’ll try to understand the algorithm to see where the 30 went, but maybe you’re quicker.

In fluffiness,
Telofy

@Telofy
Copy link
Author

Telofy commented Dec 6, 2013

The bug occurs when there are two equal values in the list (unless they’re consecutive, so that the second one gets the index of the first when the first is removed).

Instead of this (cluster.py around line 637):

self._data.remove(self._data[max(smallestpair[0],
    smallestpair[1])])  # remove item 1
self._data.remove(self._data[min(smallestpair[0],
    smallestpair[1])])  # remove item 2

Try this:

del self._data[max(smallestpair[0], smallestpair[1])]
del self._data[min(smallestpair[0], smallestpair[1])]

You could also obtain both the values first and only then delete them, but the above seems simpler.

@exhuma
Copy link
Owner

exhuma commented Dec 6, 2013

Thanks for the update. I'll see to it in the near future.

Have you tested your proposed solution? I'm not on my dev-box right now. Also... this is a fairly old piece of library. I'll have to dig bag in an understand my own code again... with any luck it should not take all too long ^_^

Also, you could propose a pull-request to speed things up ;)

@Telofy
Copy link
Author

Telofy commented Dec 6, 2013

It’s working for me, but I have not thoroughly tested it, which is also the reason why I chose to propose it here rather than filing a pull request right away.

@Telofy
Copy link
Author

Telofy commented Dec 9, 2013

I think you could also optimize the performance if you found a way to not recalculate the parts of the matrix that could not have changed.

@exhuma exhuma added the bug label Aug 28, 2014
@exhuma
Copy link
Owner

exhuma commented Aug 28, 2014

I tried your proposed fix, but that caused another test to fail. I have been looking into this and I find that my initial implementation of the algorithm could be revised. And more importantly should be better tested. I am looking into this.

exhuma added a commit that referenced this issue Aug 29, 2014
@exhuma
Copy link
Owner

exhuma commented Sep 8, 2014

Sorry for the delay... We had a death in the family, and I still am a bit shaken.

But this problem in keeps me awake at night... I have been working on a better unit-test (see http://michel.albert.lu). Alas, implementing a test with that data set did not reproduce the error!

I have now taken up the original values from the unit-test, and while writing up I realised that one value is duplicated (365) which might explain the error. I'm on it...

@Telofy
Copy link
Author

Telofy commented Sep 8, 2014

Oh, I’m sorry to hear that, my condolences.

exhuma added a commit that referenced this issue Sep 10, 2014
References #11
@exhuma exhuma closed this as completed in dc9f779 Sep 10, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants