GroupSingletons behavior #1655

Closed
andrewhill157 opened this issue Jun 7, 2019 · 3 comments

Comments

@andrewhill157

I have two related questions about GroupSingletons:

  1. For some larger datasets we have come across cases where, in addition to a reasonable number of singletons, there are also a number of clusters composed of only 2 or 3 cells, and these are not easily fixed by adjusting the resolution. Ideally, I'd like to merge these into their closest clusters in the same way as singletons (preferably at the same time as singletons). I was wondering if you would consider one of two options:
  • Option 1: Add an extra argument to FindClusters that allows you to retain singleton assignments. This would then allow people to deal with them however they want.
  • Option 2: Add something like a min.cluster.size argument to FindClusters that would be 2 by default to match current behavior, and would merge all cells in clusters smaller than this size in the manner currently used to merge singletons. I have tested this out myself and would be happy to submit a PR if it sounds like something you'd consider.
  2. At https://github.com/satijalab/seurat/blob/master/R/clustering.R#L453, presumably you are setting up a new variable, "new.ids", so that you can change singleton ids while maintaining the original cluster assignments when assessing connectivity. But then at https://github.com/satijalab/seurat/blob/master/R/clustering.R#L469 you are using "ids" when presumably you meant to use "new.ids". The way it works now, it seems that as singletons are processed they are counted towards the connectivity calculations for the clusters they join, which could, in theory, make the results dependent on the ordering of the data. This is probably very rarely an issue, but I wanted to mention it as it seems like it might not have been intentional.
@andrewwbutler
Collaborator

Hi,

  1. Option 1 has recently been added to the develop branch as a group.singletons parameter which, if set to FALSE, will assign all singletons to their own "singleton" cluster. You could then process them however you want. We would be happy to add a min.cluster.size parameter if you put together a PR.
  2. I believe this was intentional when this was initially implemented several years ago as I was interested in testing out the stability of singleton assignments. I also might have been trying to prevent errors in the case where every cell was a singleton (datasets used to have far fewer cells :) ). However, thinking about this again now, it probably makes more sense to not include the processed singletons in the connectivity calculations of future singletons. That would also allow for a more efficient implementation of that function.
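
To illustrate the second point, here is a minimal sketch of an order-independent, vectorized merge. It is not Seurat code: the function name `merge_small_clusters`, the `min.size` argument, and the toy data are all hypothetical. It assumes `SNN` is a dense base R matrix with cell names as row and column names, and it breaks ties by taking the first cluster rather than sampling at random. Because connectivity is computed only from the original assignments, the order of the cells cannot affect the result.

```r
# Hypothetical helper (not part of Seurat): merge clusters with at most
# `min.size` cells into the large cluster they are most connected to.
merge_small_clusters <- function(ids, SNN, min.size = 1) {
  ids <- setNames(as.character(ids), names(ids))
  sizes <- table(ids)
  small <- names(sizes)[sizes <= min.size]
  large <- setdiff(names(sizes), small)
  if (length(small) == 0 || length(large) == 0) {
    return(ids)
  }
  cells <- names(ids)
  # Cell-by-cluster 0/1 indicator matrix built from the ORIGINAL assignments.
  indicator <- function(clusters) {
    m <- outer(ids, clusters, FUN = "==") * 1
    dimnames(m) <- list(cells, clusters)
    m
  }
  S <- indicator(small)  # cells x small clusters
  L <- indicator(large)  # cells x large clusters
  A <- SNN[cells, cells]
  # Summed SNN weight between each small and each large cluster, divided by
  # the number of cell pairs, gives the mean connectivity.
  totals <- t(S) %*% A %*% L
  pairs <- outer(colSums(S), colSums(L))
  connectivity <- totals / pairs
  # Ties go to the first large cluster (an assumption made for determinism).
  closest <- large[max.col(connectivity, ties.method = "first")]
  names(closest) <- small
  new.ids <- ids
  is.small <- ids %in% small
  new.ids[is.small] <- unname(closest[ids[is.small]])
  new.ids
}

# Toy data: c5 is a singleton that is far better connected to cluster B.
cells <- paste0("c", 1:5)
ids <- setNames(c("A", "A", "B", "B", "C"), cells)
SNN <- matrix(0, 5, 5, dimnames = list(cells, cells))
SNN["c5", c("c1", "c3", "c4")] <- c(0.1, 0.6, 0.4)
SNN <- pmax(SNN, t(SNN))  # make the toy graph symmetric
diag(SNN) <- 1

merged <- merge_small_clusters(ids, SNN, min.size = 1)
stopifnot(merged[["c5"]] == "B")
```

For a real sparse graph the same products could presumably be done with sparse matrices instead of a dense one, but that is left out of this sketch.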

@andrewhill157
Author

andrewhill157 commented Jul 5, 2019 via email

@zacharyzale

Hey, I modified the function as described in this post. I didn't have time to make a formal PR.

# Function header reconstructed from the usage example below; the defaults for
# small_size, group.singletons and verbose are assumptions.
GroupSmall_clus <- function(ids, SNN, small_size = 1, group.singletons = TRUE, verbose = TRUE) {
  # identify small clusters (at most small_size cells); the variable keeps
  # the name "singletons" from the original function
  singletons <- names(x = which(x = table(ids) <= small_size))
  singletons <- intersect(x = unique(x = ids), singletons)
  if (!group.singletons) {
    ids[which(ids %in% singletons)] <- "singleton"
    return(ids)
  }
  # calculate connectivity of singletons to other clusters, add singleton
  # to cluster it is most connected to
  cluster_names <- as.character(x = unique(x = ids))
  cluster_names <- setdiff(x = cluster_names, y = singletons)
  connectivity <- vector(mode = "numeric", length = length(x = cluster_names))
  names(x = connectivity) <- cluster_names
  # write reassignments to a copy so that connectivity is always computed
  # from the original assignments and does not depend on processing order
  new.ids <- ids
  for (i in singletons) {
    i.cells <- names(ids[ids == i])
    for (j in cluster_names) {
      j.cells <- names(ids[ids == j])
      subSNN <- SNN[i.cells, j.cells]
      set.seed(1) # to match previous behavior, random seed being set in WhichCells
      if (is.object(x = subSNN)) {
        connectivity[j] <- sum(subSNN) / (nrow(x = subSNN) * ncol(x = subSNN))
      } else {
        connectivity[j] <- mean(x = subSNN)
      }
    }
    # pick the most connected cluster, breaking ties at random
    m <- max(connectivity, na.rm = TRUE)
    mi <- which(x = connectivity == m, arr.ind = TRUE)
    closest_cluster <- sample(x = names(x = connectivity[mi]), 1)
    new.ids[i.cells] <- closest_cluster
  }
  if (length(x = singletons) > 0 && verbose) {
    message(paste(
      length(x = singletons),
      "singletons identified.",
      length(x = unique(x = new.ids)),
      "final clusters."
    ))
  }
  return(new.ids)
}

# example usage

group_small_out <- GroupSmall_clus(ids = merged_pbmc$seurat_clusters, SNN = merged_pbmc@graphs$harm_wsnn)
