What do you recommend for single cell RNAseq data: counts, normalized counts in log scale, other? #21

Open
apeleraux opened this issue Oct 25, 2022 · 5 comments

Comments

@apeleraux

You indicated in your publication that your method is relatively robust to various preprocessing and normalization steps. However, when I tested it on a single-cell RNA-seq dataset using either counts or normalized, log-transformed counts as the input matrix, I found quite different cell type prioritization results. What would you generally recommend using?

@skinnider
Collaborator

We almost exclusively run Augur on raw counts. The exception is for very acute perturbations (e.g. mice walking on a treadmill for 15 min prior to sample collection), where we found that running Augur on RNA velocity estimates provides more information than running it on raw counts.

@apeleraux
Author

Thanks for your fast answer. I understand the need for RNA velocity estimates in certain cases. We are mostly interested in longer time frames, so raw counts would be our choice. In that case, does Augur include normalization by total counts per cell, or a similar normalization optimized for single-cell RNA-seq data? Intuitively, it seems to me that classification between two conditions should perform better on normalized data, and that Augur may therefore work better with normalized data. But of course I may be wrong! Have you investigated this question, or do you know of relevant papers on this topic?

@skinnider
Collaborator

It's important to consider that 'better classification' is not really the goal of Augur. Instead, we are trying to identify cell types that show a transcriptional response to a perturbation, so what's really of interest are the relative differences in classification accuracy between cell types. In our initial experiments, we saw minimal changes in the relative rankings of individual cell types when normalizing gene expression (e.g. with log-TP10K). However, we did find that there was less separation between cell types when running Augur on normalized gene expression values, and so we generally run Augur on untransformed counts.
In terms of understanding why Augur is so robust to running on untransformed counts, Extended Data Fig. 10 in the Nat. Biotechnol. paper might be useful in thinking about the kinds of scenarios that would be required for sequencing depth to be a confounding factor in the analysis.
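
For readers unfamiliar with the log-TP10K normalization mentioned above, here is a minimal sketch in Python. It assumes the common definition: scale each cell's counts to 10,000 total transcripts, then take log(1 + x). It is not taken from Augur's code, and the function name `log_tp10k` and the toy matrix are made up for illustration.

```python
import numpy as np

def log_tp10k(counts):
    """Scale a genes x cells count matrix to transcripts per 10,000, then apply log1p.

    Assumed definition of log-TP10K; not taken from Augur's source.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=0, keepdims=True)  # total counts per cell (one column per cell)
    totals[totals == 0] = 1.0                   # leave empty cells as all zeros instead of dividing by zero
    tp10k = counts / totals * 1e4               # transcripts per 10,000
    return np.log1p(tp10k)                      # log(1 + x)

# Illustrative toy matrix: 3 genes x 2 cells; the second cell has the
# same composition as the first but was sequenced at 10x the depth.
toy = np.array([[1, 10],
                [3, 30],
                [6, 60]])
normalized = log_tp10k(toy)
```

After normalization the two toy cells become identical, which is the kind of sequencing-depth difference this transformation removes and which raw counts retain.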

@apeleraux
Author

Thanks a lot for your answer. When I was speaking of better classification, I actually meant higher accuracy of classification between unperturbed and perturbed cells. So I believe that we are on the same page. Thanks for pointing me to Fig 10 of the extended data, I will have a further look at it.

@kaizen89

@skinnider Looking at the Augur code, it seems that when a Seurat object is used, the default slot is data, which corresponds to normalized data and not the raw counts you recommend. Might it be worth changing the default behaviour?
