Add max_comm_size to constrain the community size. #46

Merged: 8 commits into vtraag:master on Sep 23, 2020

Conversation

orenbenkiki (Contributor)

First, thanks for publishing this package, and for supporting the "Surprise" goal function. I was working along these lines and just having it all be ready-made (and published) helped a lot.

I'm re-implementing the Metacell algorithm for analyzing single-cell RNA sequencing data, and scaling it up to efficiently deal with very large data (many millions of cells) by converting it to a divide-and-conquer algorithm. I'm working on this as my PhD project in Amos Tanay's lab at the Weizmann Institute of Science.

This pull request implements a minor but important (for us) tweak to the basic clustering algorithm: restricting the maximal community size (the sum of node sizes). Since metacells aren't "classical" clusters (in particular, they are defined as the minimal-sized community that satisfies some conditions), this tweak makes it efficient to generate candidate metacells from a KNN graph as part of the complete (complex) pipeline. I found that using this modified version saved CPU time and gave good results.

I tried to make this as seamless as possible with the existing code. Two points where I had to tweak things were: ensuring all vertex partitions allow specifying node sizes, even if they do not directly use them, so that they can be passed on to the algorithm itself; and changing the parameter order of create_graph_from_py (since node_sizes became mandatory).
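
For illustration only (this snippet is not part of the PR itself): a minimal sketch of how the new max_comm_size option might be used from Python via find_partition, assuming a leidenalg build that includes this change. The graph, node sizes, and parameter values below are placeholders.

```python
import igraph as ig
import leidenalg as la

graph = ig.Graph.Erdos_Renyi(n=1000, p=0.01)   # stand-in for a KNN graph
node_sizes = [1] * graph.vcount()              # e.g. per-cell node sizes

partition = la.find_partition(
    graph,
    la.CPMVertexPartition,
    node_sizes=node_sizes,
    resolution_parameter=0.05,                 # illustrative value
    max_comm_size=120,                         # cap on total node size per community
)
# With unit node sizes, no community should contain more than 120 nodes.
print(max(len(community) for community in partition))
```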

@orenbenkiki (Contributor, author)

Just fixed a typo in the documentation: it had "used used" in several places (the dangers of copy-and-paste); I changed it to a single "used".

Happy new (Jewish) year!

@vtraag (Owner) commented Sep 18, 2020

Hi, thanks! You too! Thanks for the PR, by the way. I am working on a review; I didn't get around to finishing it earlier. I hope to get to it sometime this weekend...

@vtraag (Owner) left a review comment

Thanks for submitting this! It generally looks good, but I would change a few things, and I have some other minor comments.

Perhaps also a more general question: what do you think should happen if a node ends up inside a community that is larger than max_comm_size? At the moment this is not caught, because moving it to a smaller community might actually decrease the quality. I have some ideas for how to solve it, but I would be interested in hearing what you think.

Review comments (all since resolved) were left on:
include/python_partition_interface.h
src/python_optimiser_interface.cpp
src/Optimiser.cpp
src/Optimiser.py
src/functions.py
tests/test_Optimiser.py
@orenbenkiki (Contributor, author)

Fixed all issues (I think) as far as this "weak form" pull request goes.
As for the "strong form", I suggest that if we want to pursue it, it should be done not here but in a dedicated "issue".

Thanks for the review!

@vtraag (Owner) commented Sep 20, 2020

Looks good, thanks for the quick response!

I'll think about your comment on the "strong form"; perhaps the solution I have in mind also does not work well. The idea I had was quite simple, in fact: we do not set the current community as the maximum one, we set the initial maximum improvement to -infinity, and we add the current community to the set of communities to consider. That way, if the current community is too large it will not be considered. There will always be the possibility of moving the node to an empty community (assuming that option is turned on). What do you think?
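
Not the actual implementation (the real logic lives in Optimiser.cpp), but to make the idea concrete, here is an illustrative Python-style sketch of the selection step described above. The names diff_move, comm_size, and node_size are placeholders, and candidate_comms is assumed to already include the node's current community and, if allowed, an empty one.

```python
import math

def pick_community(v, v_comm, candidate_comms, comm_size, node_size,
                   diff_move, max_comm_size):
    # Do not pre-select the current community as the best option; start from
    # -infinity so that an oversized current community cannot "win" by default.
    best_comm, best_improvement = v_comm, -math.inf
    for comm in candidate_comms:
        new_size = comm_size[comm] + (0 if comm == v_comm else node_size[v])
        if new_size > max_comm_size:
            continue  # this community would (or already does) exceed the cap
        improvement = diff_move(v, comm)   # quality gain of moving v to comm
        if improvement > best_improvement:
            best_comm, best_improvement = comm, improvement
    return best_comm
```

Starting best_comm at v_comm means the node simply stays put when no admissible candidate exists, which matches the v_comm initialization mentioned later in this thread.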

We also have to deal with too-small clusters in our own work (clustering scientific publications). In that case, we post-process to join too-small communities with larger communities, in a way that minimises the loss from forcing the solution to exclude too-small communities. If you are interested, I can talk you through this. It is more difficult to enforce this while the algorithm is running, however, since initially all communities are too small.

@orenbenkiki (Contributor, author) commented Sep 20, 2020

Thanks for the quick reply.

> I'll think about your comment on the "strong form"; perhaps the solution I have in mind also does not work well. The idea I had was quite simple, in fact: we do not set the current community as the maximum one, we set the initial maximum improvement to -infinity, and we add the current community to the set of communities to consider. That way, if the current community is too large it will not be considered. There will always be the possibility of moving the node to an empty community (assuming that option is turned on). What do you think?

So, if I get this right, this effectively splits the first several nodes off from the too-large community, forcing them to either join other existing communities or create new ones. An elegant solution, and easy to implement.

EDIT: I can add this to the code if you want, assuming you are OK with the caveats below.

EDIT 2: OK, I did implement this; it was too simple and elegant to pass up. The following caveats still bother me, though:

There's a nasty edge case when all nodes belong to a single too-large community and we are not allowed to create new communities.

EDIT 2: Turns out this isn't an issue if I initialize max_comm to v_comm.

That aside, I'm a bit worried that the nodes split off from the too-large community are essentially chosen at random, and this may impact the optimality / efficiency of the algorithm.

A more serious potential concern is when we have "very" too-large communities (e.g., several times the maximal size). Might it be more efficient to run the overall algorithm (or a simplified version of it) on the too-large community to split it into several smaller ones? If that is overkill, perhaps even a simple "split into random subsets of roughly equal size" step might end up being more efficient?

EDIT 2: I still think the above two points are a concern, but I agree it is more consistent to enforce max_comm_size even if it is done in a sub-optimal way. I'd still like to hear what you think about possible improvements to this.
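
Purely as an illustration of the recursive-split idea above (not something in this PR): one could re-run find_partition on the induced subgraph of an oversized community with the same cap. The function name and parameter values are placeholders, unit node sizes are assumed for brevity, and this assumes the max_comm_size option from this PR is available.

```python
import igraph as ig
import leidenalg as la

def split_oversized(graph, membership, comm, max_comm_size):
    # Relabel an oversized community by re-clustering its induced subgraph
    # with the same size cap; other communities are left untouched.
    nodes = [v for v, c in enumerate(membership) if c == comm]
    if len(nodes) <= max_comm_size:            # unit node sizes assumed
        return membership
    sub = graph.subgraph(nodes)
    parts = la.find_partition(
        sub,
        la.CPMVertexPartition,
        resolution_parameter=0.05,             # illustrative value
        max_comm_size=max_comm_size,
    )
    new_membership = list(membership)
    next_label = max(membership) + 1
    for i, v in enumerate(nodes):
        new_membership[v] = next_label + parts.membership[i]
    return new_membership
```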

> We also have to deal with too-small clusters in our own work (clustering scientific publications). In that case, we post-process to join too-small communities with larger communities, in a way that minimises the loss from forcing the solution to exclude too-small communities. If you are interested, I can talk you through this. It is more difficult to enforce this while the algorithm is running, however, since initially all communities are too small.

What I do right now is convert each too-small community into a node, and run the algorithm on the resulting graph (containing only the too-small community nodes) to cluster them into larger ones. This seems to be "in the spirit" of the overall algorithm. Typically it solves the problem; sometimes I am left with "stubborn" too-small communities that really resist being clustered with anything else, and I basically consider these to contain "outlier" nodes.

I would be interested in a better approach - as you say, I discovered it is difficult to enforce this during the algorithm (creating new communities becomes an issue).
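
To make the merge-small-communities step concrete, here is a rough sketch of the approach just described (contract each too-small community to a node and re-cluster). This is not the author's actual code; the quality function, the unit node sizes, and the handling of edge weights are illustrative assumptions.

```python
import igraph as ig
import leidenalg as la

def merge_small_communities(graph, membership, min_size):
    # Count community sizes (unit node sizes assumed for brevity).
    comm_sizes = {}
    for c in membership:
        comm_sizes[c] = comm_sizes.get(c, 0) + 1
    small = sorted(c for c, s in comm_sizes.items() if s < min_size)
    if len(small) < 2:
        return membership

    # Induced subgraph on the nodes of the too-small communities only.
    small_nodes = [v for v, c in enumerate(membership) if c in small]
    sub = graph.subgraph(small_nodes)

    # Contract each too-small community to a single node.
    comm_index = {c: i for i, c in enumerate(small)}
    sub.contract_vertices([comm_index[membership[v]] for v in small_nodes])
    sub.simplify(combine_edges="sum")   # sum weights of parallel edges, if any

    # Cluster the contracted graph; each resulting cluster groups several
    # formerly too-small communities ("stubborn" ones may remain alone).
    grouping = la.find_partition(sub, la.ModularityVertexPartition)

    new_membership = list(membership)
    next_label = max(membership) + 1
    for v in small_nodes:
        group = grouping.membership[comm_index[membership[v]]]
        new_membership[v] = next_label + group
    return new_membership
```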

@vtraag (Owner) left a review comment

Thanks again for submitting the fixes.

I think it is good to also try to enforce max_comm_size on the current community, just for consistency. I agree that enforcing this constraint works better when starting from a singleton partition and growing the communities up to the max_comm_size limit than when having to shave off single nodes one by one from a too-large community. Thanks for including this.

There is one minor issue with the correct type of max_comm_size. After that, I'd be happy to merge this in! Once it is merged, I will also make a new release.

Review comments (since resolved) were left on src/python_optimiser_interface.cpp.
@vtraag (Owner) commented Sep 23, 2020

Thanks! I will merge after the CI has finished.

vtraag merged commit 9a3e365 into vtraag:master on Sep 23, 2020
@orenbenkiki (Contributor, author)

Thanks!
