Add max_comm_size to constrain the community size. #46

Merged: 8 commits into vtraag:master on Sep 23, 2020

Conversation

orenbenkiki (Contributor)

First, thanks for publishing this package, and for supporting the "Surprise" goal function. I was working along these lines and just having it all be ready-made (and published) helped a lot.

I'm re-implementing the Metacell algorithm for analyzing single-cell RNA sequencing data, and scaling it up to efficiently deal with very large data (many millions of cells) by converting it to a divide-and-conquer algorithm. I'm working on this as my PhD project in Amos Tanay's lab at the Weizmann Institute of Science.

This pull request implements a minor but important (for us) tweak to the basic clustering algorithm: restricting the maximal community size (the sum of node sizes). Since metacells aren't "classical" clusters (in particular, they are defined as the minimal-sized community that satisfies some conditions), this tweak makes it efficient to generate candidate metacells from a KNN graph as part of the complete (complex) pipeline. I found that using this modified version saved CPU time and gave good results.

I tried to make this as seamless as possible with the existing code. Two points where I had to tweak things were: ensuring all vertex partitions allow specifying node sizes, even if they do not directly use them, so that they can be passed on to the algorithm itself; and changing the parameter order of create_graph_from_py (since node_sizes became mandatory).
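
For illustration only (this snippet is not part of the PR itself): a minimal sketch of how the new max_comm_size option might be used from Python via find_partition, assuming a leidenalg build that includes this change. The graph, node sizes, and parameter values below are placeholders.

```python
import igraph as ig
import leidenalg as la

graph = ig.Graph.Erdos_Renyi(n=1000, p=0.01)   # stand-in for a KNN graph
node_sizes = [1] * graph.vcount()              # e.g. per-cell node sizes

partition = la.find_partition(
    graph,
    la.CPMVertexPartition,
    node_sizes=node_sizes,
    resolution_parameter=0.05,                 # illustrative value
    max_comm_size=120,                         # cap on total node size per community
)
# With unit node sizes, no community should contain more than 120 nodes.
print(max(len(community) for community in partition))
```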

@orenbenkiki (Contributor, author)

Just fixed a typo in the documentation: it had "used used" in several places (the dangers of copy-and-paste); I changed it to a single "used".

Happy new (Jewish) year!

@vtraag (Owner) commented Sep 18, 2020

Hi, thanks! You too! Thanks for the PR, by the way. I am working on a review; I didn't get around to finishing it earlier. I hope to get to it sometime this weekend...

@vtraag (Owner) left a review comment

Thanks for submitting this! It generally looks good, but I would change a few things, and I have some other minor comments.

Perhaps also a more general question: what do you think should happen if a node ends up inside a community that is larger than max_comm_size? At the moment this is not caught, because moving it to a smaller community might actually decrease the quality. I have some ideas for how to solve it, but I would be interested in hearing what you think.

Review comments (all since resolved) were left on:
include/python_partition_interface.h
src/python_optimiser_interface.cpp
src/Optimiser.cpp
src/Optimiser.py
src/functions.py
tests/test_Optimiser.py
@orenbenkiki (Contributor, author)

Fixed all issues (I think) as far as this "weak form" pull request goes.
As for the "strong form", I suggest that if we want to pursue it, it should be done not here but in a dedicated "issue".

Thanks for the review!

@vtraag (Owner) commented Sep 20, 2020

Looks good, thanks for the quick response!

I'll think about your comment on the "strong form"; perhaps the solution I have in mind also does not work well. The idea I had was quite simple, in fact: we do not set the current community as the maximum one, we set the initial maximum improvement to -infinity, and we add the current community to the set of communities to consider. That way, if the current community is too large it will not be considered. There will always be the possibility of moving the node to an empty community (assuming that option is turned on). What do you think?
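
Not the actual implementation (the real logic lives in Optimiser.cpp), but to make the idea concrete, here is an illustrative Python-style sketch of the selection step described above. The names diff_move, comm_size, and node_size are placeholders, and candidate_comms is assumed to already include the node's current community and, if allowed, an empty one.

```python
import math

def pick_community(v, v_comm, candidate_comms, comm_size, node_size,
                   diff_move, max_comm_size):
    # Do not pre-select the current community as the best option; start from
    # -infinity so that an oversized current community cannot "win" by default.
    best_comm, best_improvement = v_comm, -math.inf
    for comm in candidate_comms:
        new_size = comm_size[comm] + (0 if comm == v_comm else node_size[v])
        if new_size > max_comm_size:
            continue  # this community would (or already does) exceed the cap
        improvement = diff_move(v, comm)   # quality gain of moving v to comm
        if improvement > best_improvement:
            best_comm, best_improvement = comm, improvement
    return best_comm
```

Starting best_comm at v_comm means the node simply stays put when no admissible candidate exists, which matches the v_comm initialization mentioned later in this thread.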

We also have to deal with too-small clusters in our own work (clustering scientific publications). In that case, we post-process to join too-small communities with larger communities, in a way that minimises the loss from forcing the solution to exclude too-small communities. If you are interested, I can talk you through this. It is more difficult to enforce this while the algorithm is running, however, since initially all communities are too small.

@orenbenkiki (Contributor, author) commented Sep 20, 2020

Thanks for the quick reply.

> I'll think about your comment on the "strong form"; perhaps the solution I have in mind also does not work well. The idea I had was quite simple, in fact: we do not set the current community as the maximum one, we set the initial maximum improvement to -infinity, and we add the current community to the set of communities to consider. That way, if the current community is too large it will not be considered. There will always be the possibility of moving the node to an empty community (assuming that option is turned on). What do you think?

So, if I get this right, this effectively splits the first several nodes off from the too-large community, forcing them to either join other existing communities or create new ones. An elegant solution, and easy to implement.

EDIT: I can add this to the code if you want, assuming you are OK with the caveats below.

EDIT 2: OK, I did implement this; it was too simple and elegant to pass up. The following caveats still bother me, though:

There's a nasty edge case when all nodes belong to a single too-large community and we are not allowed to create new communities.

EDIT 2: Turns out this isn't an issue if I initialize max_comm to v_comm.

That aside, I'm a bit worried that the nodes split off from the too-large community are essentially chosen at random, and this may impact the optimality / efficiency of the algorithm.

A more serious potential concern is when we have "very" too-large communities (e.g., several times the maximal size). Might it be more efficient to run the overall algorithm (or a simplified version of it) on the too-large community to split it into several smaller ones? If that is overkill, perhaps even a simple "split into random subsets of roughly equal size" step might end up being more efficient?

EDIT 2: I still think the above two points are a concern, but I agree it is more consistent to enforce max_comm_size even if it is done in a sub-optimal way. I'd still like to hear what you think about possible improvements to this.
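
Purely as an illustration of the recursive-split idea above (not something in this PR): one could re-run find_partition on the induced subgraph of an oversized community with the same cap. The function name and parameter values are placeholders, unit node sizes are assumed for brevity, and this assumes the max_comm_size option from this PR is available.

```python
import igraph as ig
import leidenalg as la

def split_oversized(graph, membership, comm, max_comm_size):
    # Relabel an oversized community by re-clustering its induced subgraph
    # with the same size cap; other communities are left untouched.
    nodes = [v for v, c in enumerate(membership) if c == comm]
    if len(nodes) <= max_comm_size:            # unit node sizes assumed
        return membership
    sub = graph.subgraph(nodes)
    parts = la.find_partition(
        sub,
        la.CPMVertexPartition,
        resolution_parameter=0.05,             # illustrative value
        max_comm_size=max_comm_size,
    )
    new_membership = list(membership)
    next_label = max(membership) + 1
    for i, v in enumerate(nodes):
        new_membership[v] = next_label + parts.membership[i]
    return new_membership
```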

> We also have to deal with too-small clusters in our own work (clustering scientific publications). In that case, we post-process to join too-small communities with larger communities, in a way that minimises the loss from forcing the solution to exclude too-small communities. If you are interested, I can talk you through this. It is more difficult to enforce this while the algorithm is running, however, since initially all communities are too small.

What I do right now is convert each too-small community into a node, and run the algorithm on the resulting graph (containing only the too-small community nodes) to cluster them into larger ones. This seems to be "in the spirit" of the overall algorithm. Typically it solves the problem; sometimes I am left with "stubborn" too-small communities that really resist being clustered with anything else, and I basically consider these to contain "outlier" nodes.

I would be interested in a better approach - as you say, I discovered it is difficult to enforce this during the algorithm (creating new communities becomes an issue).
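
To make the merge-small-communities step concrete, here is a rough sketch of the approach just described (contract each too-small community to a node and re-cluster). This is not the author's actual code; the quality function, the unit node sizes, and the handling of edge weights are illustrative assumptions.

```python
import igraph as ig
import leidenalg as la

def merge_small_communities(graph, membership, min_size):
    # Count community sizes (unit node sizes assumed for brevity).
    comm_sizes = {}
    for c in membership:
        comm_sizes[c] = comm_sizes.get(c, 0) + 1
    small = sorted(c for c, s in comm_sizes.items() if s < min_size)
    if len(small) < 2:
        return membership

    # Induced subgraph on the nodes of the too-small communities only.
    small_nodes = [v for v, c in enumerate(membership) if c in small]
    sub = graph.subgraph(small_nodes)

    # Contract each too-small community to a single node.
    comm_index = {c: i for i, c in enumerate(small)}
    sub.contract_vertices([comm_index[membership[v]] for v in small_nodes])
    sub.simplify(combine_edges="sum")   # sum weights of parallel edges, if any

    # Cluster the contracted graph; each resulting cluster groups several
    # formerly too-small communities ("stubborn" ones may remain alone).
    grouping = la.find_partition(sub, la.ModularityVertexPartition)

    new_membership = list(membership)
    next_label = max(membership) + 1
    for v in small_nodes:
        group = grouping.membership[comm_index[membership[v]]]
        new_membership[v] = next_label + group
    return new_membership
```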

@vtraag (Owner) left a review comment

Thanks again for submitting the fixes.

I think it is good to also try to enforce max_comm_size on the current community, just for consistency. I agree that enforcing this constraint works better when starting from a singleton partition and growing the communities up to the max_comm_size limit than when having to shave off single nodes one by one from a too-large community. Thanks for including this.

There is one minor issue with the correct type of max_comm_size. After that, I'd be happy to merge this in! Once it is merged, I will also make a new release.

Review comments (since resolved) were left on src/python_optimiser_interface.cpp.
@vtraag (Owner) commented Sep 23, 2020

Thanks! I will merge after the CI has finished.

vtraag merged commit 9a3e365 into vtraag:master on Sep 23, 2020
@orenbenkiki (Contributor, author)

Thanks!
