scanvi user guide and fix bullets #1193

Merged
merged 13 commits into from
Oct 2, 2021
100 changes: 36 additions & 64 deletions docs/user_guide/background/differential_expression.rst
@@ -5,11 +5,11 @@ Differential Expression
Under construction.

Problem statement
==========================================
==================

Differential expression analyses aim to quantify and detect expression differences of some quantity between conditions, e.g., cell types.
In single-cell experiments, this quantity can be transcript abundance, protein expression, or chromatin accessibility.
A central notion when comparing expression levels of two cell states
is the log fold-change

.. math::
@@ -19,18 +19,18 @@ is the log fold-change
\beta_g := \log h_{g}^B - \log h_{g}^A,
\end{align}

where
:math:`\log h_{g}^A, \log h_{g}^B`
respectively denote the mean expression levels in subpopulations :math:`A`
and
:math:`B`.



Motivations to use scVI-tools for differential expression
======================================================================
Motivation
==========

In the particular case of single-cell RNA-seq data, existing differential expression models often model the mean expression level
:math:`\log h_{g}^C`
as a linear function of the cell-state and batch assignments.
These models face two notable limitations when detecting differences in expression between cell states in large-scale scRNA-seq datasets.
@@ -49,11 +49,9 @@ This guide has two objectives.
First, it aims to provide insight as to how scVI-tools' differential expression module works for transcript expression (``scVI``), surface protein expression (``TOTALVI``), or chromatin accessibility (``PeakVI``).
More precisely, we explain how it can:

+ approximate population-specific normalized expression levels

+ detect biologically relevant features

+ provide easy-to-interpret predictions
- approximate population-specific normalized expression levels
- detect biologically relevant features
- provide easy-to-interpret predictions

More importantly, this guide explains the function of the hyperparameters of the ``differential_expression`` method.

@@ -70,64 +68,42 @@ More importantly, this guide explains the function of the hyperparameters of the
* - ``idx1``, ``idx2``
- Mask or queries for the compared populations :math:`A` and :math:`B`.
- yes
-
-
-
-
* - ``mode``
- Characterizes the null hypothesis.
-
-
- yes
-
-
* - ``delta``
- composite hypothesis characteristics (when ``mode="change"``).
-
-
- yes
-
-
* - ``fdr_target``
- desired FDR significance level
-
-
-
-
- yes
* - ``importance_sampling``
- Specifies whether expression levels are estimated using importance sampling
- yes
-
-
-
-

Notations and model assumptions
======================================================================
================================
Although they target different modalities, scVI, TOTALVI, and PeakVI share similar properties, which allows us to perform differential expression of transcripts, surface proteins, or chromatin accessibility in the same way.
We first introduce some notation that will be useful in the remainder of this guide.
In particular, we consider a deep generative model where a latent variable with prior :math:`z_n \sim \mathcal{N}_d(0, I_d)` represents cell :math:`n`'s identity.
In turn, a neural network :math:`f^h_\theta` maps this low-dimensional representation to normalized expression levels.
The following table recaps which names the scVI-tools codebase uses.

.. list-table::
:widths: 20 50 15 15
:header-rows: 1

* - Model
- Type of expression
- latent variable name
- Normalized expression name
* - scVI
- Gene expression.
- ``z``
- ``px_scale``
* - TOTALVI
- Gene & surface protein expression.
- ``z``
- ``px_scale`` (gene) and ``py_scale`` (surface protein)
* - PeakVI
- Chromatin accessibility.
- ``z``
- ``p``


Approximating population-specific normalized expression levels
====================================================================================
===============================================================

A first step in characterizing differences in expression consists in estimating state-specific expression levels.
For several reasons, most ``scVI-tools`` models do not explicitly model discrete cell types.
A given cell's state is often unknown in the first place and is inferred with ``scvi-tools``.
In some cases, states may also have an intricate structure that would be difficult to model.
The class of models we consider here assumes that a latent variable :math:`z` characterizes cells' biological identity.
@@ -142,7 +118,7 @@ In particular, we will represent state :math:`C` latent representation with the
\begin{align}
\hat P^C(
Z
) =
\frac
{1}
{
@@ -161,7 +137,7 @@ We note :math:`h^A_f, h^B_f` the respective expression levels in states :math:`A


Detecting biologically relevant features
========================================================
========================================
Once we have expression-level distributions for each condition, scvi-tools constructs an effect size that characterizes expression differences.
When considering gene or surface protein expression, log-normalized counts are a traditional choice to characterize expression levels.
Consequently, the canonical effect size for feature :math:`f` is the log fold-change, defined as the difference in log expression between conditions,
@@ -171,22 +147,23 @@ When considering gene or surface protein expression, log-normalized counts are a

\begin{align}
\beta_f
=
\log_2 h_^B{f} - \log_2 h_^A{f}.
=
\log_2 h_{f}^B - \log_2 h_{f}^A.
\end{align}
As chromatin accessibility cannot be interpreted in the same way, we take :math:`\beta_f = h_^B{f}- h_^A{f}` instead.

As chromatin accessibility cannot be interpreted in the same way, we take :math:`\beta_f = h_{f}^B- h_{f}^A` instead.
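
The two effect sizes above can be sketched in a few lines of NumPy. This is only an illustration, not the scvi-tools implementation: the function name ``effect_size``, the ``kind`` argument, and the pseudocount ``eps`` are assumptions introduced here.

```python
import numpy as np


def effect_size(h_a, h_b, kind="lfc", eps=1e-8):
    """Per-feature effect size between populations A and B.

    kind="lfc":        log2 fold-change (gene or surface protein expression).
    kind="difference": plain difference (chromatin accessibility).
    eps is a hypothetical pseudocount that guards against log2(0).
    """
    h_a = np.asarray(h_a, dtype=float)
    h_b = np.asarray(h_b, dtype=float)
    if kind == "lfc":
        return np.log2(h_b + eps) - np.log2(h_a + eps)
    if kind == "difference":
        return h_b - h_a
    raise ValueError(f"unknown kind: {kind!r}")


# Toy normalized expression levels for three features.
h_a = np.array([1.0, 2.0, 0.5])
h_b = np.array([2.0, 2.0, 0.125])

lfc = effect_size(h_a, h_b, eps=0.0)             # log2 h^B - log2 h^A
diff = effect_size(h_a, h_b, kind="difference")  # h^B - h^A
```

With these toy values, the first feature doubles from A to B (log fold-change of 1), the second is unchanged, and the third drops fourfold (log fold-change of -2).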

scVI-tools provides several ways to formulate the competing hypotheses from the effect sizes to detect DE features.
When ``mode = "vanilla"``, we consider point null hypotheses of the form :math:`\mathcal{H}_{0f}: \beta_f = 0`.
To avoid detecting features of little practical interest, e.g., when expression differences between conditions are significant but very subtle, we recommend using ``mode = "change"`` instead.
In this formulation, we instead consider composite null hypotheses of the form

.. math::
:nowrap:

\begin{align}
\lvert \beta_f \rvert
\leq
\delta.
\end{align}
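
As a rough sketch of how the composite hypothesis can be evaluated, assume we already hold posterior samples of the effect size :math:`\beta_f` for each feature. The posterior probability of the alternative :math:`\lvert \beta_f \rvert > \delta` is then the fraction of samples falling outside the null region. The helper name ``prob_change`` and the value ``delta=0.25`` are illustrative assumptions, not taken from the text above.

```python
import numpy as np


def prob_change(beta_samples, delta=0.25):
    """Monte Carlo estimate of P(|beta_f| > delta) for each feature.

    beta_samples: array of shape (n_samples, n_features) holding
    posterior draws of the effect size beta_f.
    """
    beta_samples = np.asarray(beta_samples, dtype=float)
    # Fraction of draws that fall outside the null region [-delta, delta].
    return (np.abs(beta_samples) > delta).mean(axis=0)


# Four toy posterior draws for two features.
beta_samples = np.array(
    [
        [0.5, 0.1],
        [-0.4, 0.2],
        [0.3, -0.1],
        [0.1, 0.0],
    ]
)
p = prob_change(beta_samples, delta=0.25)
```

In this toy input, three of the four draws for the first feature exceed the threshold in absolute value, while none do for the second.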

@@ -196,7 +173,7 @@ A straightforward decision consists in detecting genes for which the posterior d


Providing easy-to-interpret predictions
========================================================
=======================================
The obtained gene sets may be difficult to interpret for some users.
For this reason, we provide a data-supported way to select :math:`\epsilon`, such that the posterior expected False Discovery Proportion (FDP) is below a significance level :math:`\alpha`.
To clarify how to compute the posterior expectation, we introduce two notations.
@@ -217,7 +194,7 @@ the decision rule tagging :math:`k` features of highest :math:`p_f` as DE.
We also note :math:`d^f` the binary random variable taking value 1 if feature :math:`f` is differentially expressed.

The False Discovery Proportion is a random variable corresponding to the ratio of the number of false positives over the total number of predicted positives.
For the specific family of decision rules :math:`\mu^k, k` that we consider here, the FDP can be written as

.. math::
:nowrap:
@@ -230,20 +207,15 @@ For the specific family of decision rules :math:`\mu^k, k` that we consider here
{\sum_f \mu_f^k}
.
\end{align}
However, note that the posterior expectation of :math:`d^f`, denoted as :math:`\mathbb{E}_{post}[]`, verifies :math:`\mathbb{E}_{post}[FDP_{d^f}] = p^f`.
Hence, by linearity of the expectation, we can estimate the false discovery rate corresponding to :math:`k` detected features as

However, note that the posterior expectation, denoted as :math:`\mathbb{E}_{post}[\cdot]`, satisfies :math:`\mathbb{E}_{post}[d^f] = p^f`.
Hence, by linearity of the expectation, we can estimate the false discovery rate corresponding to :math:`k` detected features as

.. math::
:nowrap:

\begin{align}
\mathbb{E}_{post}[FDP_{\mu^k}]
=
\frac
{\sum_f (1 - p^f) \mu_f^k}
{\sum_f \mu_f^k}
.
\mathbb{E}_{post}[FDP_{\mu^k}] = \frac{\sum_f (1 - p^f) \mu_f^k}{\sum_f \mu_f^k}.
\end{align}

Hence, for a given significance level :math:`\alpha`, we select the largest number of detections :math:`k^*` such that :math:`\mathbb{E}_{post}[FDP_{\mu^k}] \leq \alpha`, as illustrated below.
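
The selection of :math:`k^*` can be sketched as follows, assuming the per-feature posterior DE probabilities :math:`p^f` are already available as an array. The function name ``select_k_star`` is hypothetical, and ``alpha`` plays the role of the ``fdr_target`` hyperparameter; this is a minimal illustration of the formula above rather than the library's code.

```python
import numpy as np


def select_k_star(p_de, alpha=0.05):
    """Largest k whose posterior expected FDP is at most alpha.

    p_de: array of posterior DE probabilities p^f, one per feature.
    Returns (k_star, is_de), where is_de flags the k_star features
    with the highest p^f.
    """
    p_de = np.asarray(p_de, dtype=float)
    order = np.argsort(-p_de)  # features by decreasing p^f
    sorted_p = p_de[order]
    # Expected FDP of rule mu^k: average of (1 - p^f) over the top-k features.
    expected_fdp = np.cumsum(1.0 - sorted_p) / np.arange(1, p_de.size + 1)
    admissible = np.nonzero(expected_fdp <= alpha)[0]
    k_star = int(admissible[-1]) + 1 if admissible.size else 0
    is_de = np.zeros(p_de.size, dtype=bool)
    is_de[order[:k_star]] = True
    return k_star, is_de


p_de = np.array([0.99, 0.2, 0.95, 0.6])
k_star, is_de = select_k_star(p_de, alpha=0.05)
```

Because features are sorted by decreasing :math:`p^f`, the expected FDP is non-decreasing in :math:`k`, so the largest admissible :math:`k` is well defined. In the toy input, the top two features give an expected FDP of about 0.03, while adding the third raises it to about 0.15, so two features are tagged as DE.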
3 changes: 3 additions & 0 deletions docs/user_guide/index.rst
@@ -15,6 +15,9 @@ scRNA-seq analysis
* - :doc:`/user_guide/models/scvi`
- [Lopez18]_
- Dimensionality reduction, removal of unwanted variation, integration across replicates, donors, and technologies, differential expression, imputation, normalization of other cell- and sample-level confounding factors
* - :doc:`/user_guide/models/scanvi`
- [Xu21]_
- scVI tasks with cell type transfer from reference, seed labeling
* - :doc:`/user_guide/models/linearscvi`
- [Svensson20]_
- scVI tasks with linear decoder
10 changes: 4 additions & 6 deletions docs/user_guide/models/cellassign.rst
@@ -10,15 +10,13 @@ such that CellAssign will scale to very large datasets.

The advantages of CellAssign are:

+ Lightweight model that can be fit quickly.

+ Ability to control for nuisance factors.
- Lightweight model that can be fit quickly.
- Ability to control for nuisance factors.

The limitations of CellAssign include:

+ Requirement for a cell types by gene markers binary matrix.

+ The simple linear model may not handle non-linear batch effects.
- Requirement for a cell types by gene markers binary matrix.
- The simple linear model may not handle non-linear batch effects.


.. topic:: Tutorials:
9 changes: 4 additions & 5 deletions docs/user_guide/models/destvi.rst
@@ -8,13 +8,12 @@ can be used to explore the spatial organization of a tissue and understanding ge

The advantages of DestVI are:

+ Can stratify cells into discrete cell types and model continuous sub-cell-type variation.

+ Scalable to very large datasets (>1 million cells).
- Can stratify cells into discrete cell types and model continuous sub-cell-type variation.
- Scalable to very large datasets (>1 million cells).

The limitations of DestVI include:

+ Effectively requires a GPU for fast inference.
- Effectively requires a GPU for fast inference.

.. topic:: Tutorial:

@@ -240,4 +239,4 @@ can be found in the DestVI paper.
`bioRxiv <https://doi.org/10.1101/2021.05.10.443517>`__.
.. [#ref2] Jakub Tomczak, Max Welling (2018),
*VAE with a VampPrior*,
`International Conference on Artificial Intelligence and Statistics <http://proceedings.mlr.press/v84/tomczak18a/tomczak18a.pdf>`__.
Binary file added docs/user_guide/models/figures/fdr_control.png
Binary file added docs/user_guide/models/figures/scanvi_pgm.png
10 changes: 4 additions & 6 deletions docs/user_guide/models/linearscvi.rst
@@ -7,15 +7,13 @@ is a flavor of scVI with a linear decoder.

The advantages of LDVAE are:

+ Can be used to interpret latent dimensions with factor loading matrix.

+ Scalable to very large datasets (>1 million cells).
- Can be used to interpret latent dimensions with factor loading matrix.
- Scalable to very large datasets (>1 million cells).

The limitations of LDVAE include:

+ Less capacity than scVI, which uses a neural network decoder.

+ Less capable of integrating data with complex batch effects.
- Less capacity than scVI, which uses a neural network decoder.
- Less capable of integrating data with complex batch effects.


.. topic:: Tutorials: