-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add max_comm_size to constrain the community size. #46
Conversation
Just fixed a typo in the documentation - had "used used" in several places (the dangers of copy-and-paste), changed it to a single "used". Happy new (Jewish) year! |
Hi, thanks! You too! Thanks for the PR btw, I am working on a review, I didn't get around to finishing it earlier on. I hope to get around to it this weekend somewhere... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for submitting this! It generally looks good, but I would change a few things, and I have some minor other comments.
Perhaps also a more general question: what do you think should happen if a node would be inside a community that is larger than max_comm_size
? At the moment this is not caught, because it might be that moving it to a smaller community would actually decrease the quality. What do you think of this? I have some ideas for how to solve it, but I would be interested in hearing what you think.
Fixed all issues (I think) as far as this "weak form" pull request goes. Thanks for the review! |
Looks good, thanks for the quick response! I'll think about your comment on the "strong form", perhaps the solution I have in mind also does not work well. The idea I had was quite simple in fact: we do not set the current community as the maximum one, and we set the initial maximum improvement to -infinity, while adding the current community to the set of communities to consider. That way, if the current community is too large it will not be considered. There will always be a possibility to move it to an empty community (assuming that option is turned on). What do you think? We also have to deal with too small clusters in our work (clustering scientific publications). In that case, we post-process to join too small communities with larger communities, such that is minimises the loss of forcing the solution to exclude too small communities. If you are interested, I can talk you through this. It is more difficult to enforce this during the running of the algorithm however, since initially all communities are too small. |
Thanks for the quick reply.
So, if I get this right, this effectively will split the 1st several nodes from the too-large community, forcing them to either join other existing or create new communities. An elegant solution and easy to implement. EDIT: I can add this to the code if you want, assuming you are OK with the caveats below. EDIT 2: Ok, I did implement this, it was too simple and elegant to pass. The following caveats still bother me, though: There's a nasty edge case when all nodes belong to a single too-large community and we are not allowed to create new communities, though. EDIT 2: Turns out this isn't an issue if I initialize That aside, I'm a bit worried in that the nodes chosen to be split from the too-large community are essentially chosen at random, and this may impact the optimality / efficiency of the algorithm. A more serious potential concern is if we have "very" too-large communities (e.g., several times the maximal size). It might be more efficient to run the overall algorithm (or a simplified version of it) on the too-large community to split it into several smaller ones? If that is an overkill, perhaps even a simple "split to random subsets of roughly equal size" step might end up being more efficient? EDIT 2: I still think the above two are a concern, but I agree it is more consistent to enforce
What I do right now is to convert each too-small community into a node, and run the algorithm on the resulting graph (containing only the too-small community nodes) to cluster them into larger ones. This seems to be "in the spirit" of the overall algorithm. Typically it solves the problem; sometimes I am left with "stubborn" too-small communities which really resist being clustered with anything else, which I basically consider to contain "outlier" nodes. I would be interested in a better approach - as you say, I discovered it is difficult to enforce this during the algorithm (creating new communities becomes an issue). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again for submitting the fixes.
I think it is good to also try to enforce the max_comm_size
on the current community, just for consistency. I agree that enforcing this constraints works better when starting from a singleton partition, growing the communities up until the max_comm_size
limit, than when having to shave off single nodes one by one from a too large community. Thanks for including this.
There is one small minor issue dealing with the correct type of the max_comm_size
issue. After that, I'd be happy to merge this in! After this is merged in, I will also make a new release.
Thanks! I will merge after the CI has finished. |
Thanks! |
First, thanks for publishing this package, and for supporting the "Surprise" goal function. I was working along these lines and just having it all be ready-made (and published) helped a lot.
I'm re-implementing the Metacell algorithm for analyzing single-cell RNA sequence data, and scaling it up to efficiently deal with very large data (many millions of cells) by converting it to a divide-and-conquer algorithm. I'm working on this as my PhD project in the Weizmann Institure for Science in Amos Tanay's lab.
This pull request implements a minor but important (for us) tweak to the basic clustering algorithm - restricting the maximal community size (sum of node sizes). Since metacells aren't "classical" clusters (in particular, they are defined as "the minimal sized community" that satisfies some conditions), this tweak makes it efficient to generate candidate metacells based on some KNN-graph, as part of the complete (complex) pipeline. I found that using this modified version saved CPU time and gave good results.
I tried to make this as seamless as possible with the existing code; two points where I had to tweak things were ensuring all vertex partitions allowed specifying node sizes, even if they did not directly use them, to allow passing them on to the algorithm itself; and change the parameters order for create_graph_from_py (since node_sizes became mandatory).