diff --git a/docs/user_guide/background/differential_expression.rst b/docs/user_guide/background/differential_expression.rst
index d9e19f98a1..956da047b7 100644
--- a/docs/user_guide/background/differential_expression.rst
+++ b/docs/user_guide/background/differential_expression.rst
@@ -5,11 +5,11 @@ Differential Expression
 Under construction.
 
 Problem statement
-==========================================
+==================
 
 Differential expression analyses aim to quantify and detect expression differences of some quantity between conditions, e.g., cell types.
-In single-cell experiments, such quantity can correspond to transcripts, protein expression, or chromatin accessibility.
-A central notion when comparing expression levels of two cell states 
+In single-cell experiments, such a quantity can correspond to transcripts, protein expression, or chromatin accessibility.
+A central notion when comparing expression levels of two cell states
 is the log fold-change
 
 .. math::
@@ -19,7 +19,7 @@ is the log fold-change
 
    \begin{align}
       \beta_g := \log h_{g}^B - \log h_{g}^A,
   \end{align}
 
-where 
+where
 :math:`\log h_{g}^A,
 \log h_{g}^B`
 respectively denote the mean expression levels in subpopulations
 :math:`A`
 and
 :math:`B`.
@@ -27,10 +27,10 @@ and
 
 
 
-Motivations to use scVI-tools for differential expression
-======================================================================
+Motivation
+==========
 
-In the particular case of single-cell RNA-seq data, existing differential expression models often model that the mean expression level 
-:math:`\log h_{g}^C`.
-as a linear function of the cell-state and batch assignments.
-These models face two notable limitations to detect differences in expression between cell-states in large-scale scRNA-seq datasets.
+In the particular case of single-cell RNA-seq data, existing differential expression models often model the mean expression level
+:math:`\log h_{g}^C`
+as a linear function of the cell-state and batch assignments.
+These models face two notable limitations in detecting differences in expression between cell-states in large-scale scRNA-seq datasets.
@@ -49,11 +49,9 @@ This guide has two objectives.
-First, it aims to provide insight as to how scVI-tools' differential expression module works for transcript expression (``scVI``), surface protein expression (``TOTALVI``), or chromatin accessibility (``PeakVI``).
+First, it aims to provide insight into how scVI-tools' differential expression module works for transcript expression (``scVI``), surface protein expression (``TOTALVI``), or chromatin accessibility (``PeakVI``).
 More precisely, we explain how it can:
 
-    + approximate population-specific normalized expression levels
-
-    + detect biologically relevant features
-
-    + provide easy-to-interpret predictions
+- approximate population-specific normalized expression levels
+- detect biologically relevant features
+- provide easy-to-interpret predictions
 
 More importantly, this guide explains the function of the hyperparameters of the
 ``differential_expression`` method.
@@ -70,64 +68,42 @@ More importantly, this guide explains the function of the hyperparameters of the
 
   * - ``idx1``, ``idx2``
     - Mask or queries for the compared populations :math:`A` and :math:`B`.
    - yes
-    - 
-    - 
+    -
+    -
  * - ``mode``
    - Characterizes the null hypothesis.
-    - 
+    -
    - yes
-    - 
+    -
  * - ``delta``
-    - composite hypothesis characteristics (when ``mode="change"``).
-    - 
+    - Composite hypothesis characteristics (when ``mode="change"``).
+    -
    - yes
-    - 
+    -
  * - ``fdr_target``
-    - desired FDR significance level
-    - 
-    - 
+    - Desired FDR significance level.
+    -
+    -
    - yes
  * - ``importance_sampling``
-    - Precises if expression levels are estimated using importance sampling
+    - Specifies whether expression levels are estimated using importance sampling.
    - yes
-    - 
-    - 
+    -
+    -
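+
+For illustration, a typical call on a trained model, combining these hyperparameters, may look as follows. This is a hedged sketch: the population queries are hypothetical, and exact argument defaults may differ across scvi-tools versions.
+
+    >>> de_results = model.differential_expression(
+    ...     idx1="cell_type == 'B cells'",
+    ...     idx2="cell_type == 'T cells'",
+    ...     mode="change",
+    ...     delta=0.25,
+    ...     fdr_target=0.05,
+    ... )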
 
 Notations and model assumptions
-======================================================================
+================================
 
-While considering different modalities, scVI, TOTALVI, and PeakVI share similar properties, allowing us to perform differential expression of transcripts, surface proteins, or chromatin accessibility, similarly.
+While considering different modalities, scVI, TOTALVI, and PeakVI share similar properties, allowing us to perform differential expression of transcripts, surface proteins, or chromatin accessibility in the same way.
 We first introduce some notations that will be useful in the remainder of this guide.
 In particular, we consider a deep generative model where a latent variable with prior :math:`z_n \sim \mathcal{N}_d(0, I_d)` represents cell :math:`n`'s identity.
-In turn, a neural network :math:`f^h_\theta` maps this low-dimensional representation to normalized, expression levels.
+In turn, a neural network :math:`f^h_\theta` maps this low-dimensional representation to normalized expression levels.
-The following table recaps which names the scVI-tools codebase uses.
-
-.. list-table::
-   :widths: 20 50 15 15
-   :header-rows: 1
-
-   * - Model
-     - Type of expression
-     - latent variable name
-     - Normaled expression name
-   * - scVI
-     - Gene expression.
-     - ``z``
-     - ``px_scale``
-   * - TOTALVI
-     - Gene & surface protein expression.
-     - ``z``
-     - ``px_scale`` (gene) and ``py_scale`` (surface protein)
-   * - PEAKVI
-     - Chromatin accessibility.
-     - ``z``
-     - ``p``
 
 
 Approximating population-specific normalized expression levels
-====================================================================================
+===============================================================
 
-A first step to characterize differences in expression consists in estimating state-specific expression levels.
-For several reasons, most ``scVI-tools`` models do not explicitly model discrete cell types. 
-A given cell's state often is unknown in the first place, and inferred with ``scvi-tools``.
+A first step to characterize differences in expression consists of estimating state-specific expression levels.
+For several reasons, most ``scVI-tools`` models do not explicitly model discrete cell types.
+A given cell's state is often unknown in the first place and must be inferred with ``scvi-tools``.
 In some cases, states may also have an intricate structure that would be difficult to model.
 The class of models we consider here assumes that a latent variable :math:`z` characterizes cells' biological identity.
@@ -142,7 +118,7 @@ In particular, we will represent state :math:`C` latent representation with the
 
    \begin{align}
       \hat P^C(
          Z
-      ) = 
+      ) =
      \frac
      {
          1
      }
      {
@@ -161,7 +137,7 @@ We note :math:`h^A_f, h^B_f` the respective expression levels in states :math:`A
 
 Detecting biologically relevant features
-========================================================
+========================================
 
-Once we have expression levels distributions for each condition, scvi-tools constructs an effect size, which will characterize expression differences.
-When considering gene or surface protein expression, log-normalized counts are a traditional choice to characterize expression levels. .
-Consequently, the canonical effect size for feature :math:`f` is the log fold-change, defined as the difference between log expression between conditions,
+Once we have expression level distributions for each condition, scvi-tools constructs an effect size, which will characterize expression differences.
+When considering gene or surface protein expression, log-normalized counts are a traditional choice to characterize expression levels.
+Consequently, the canonical effect size for feature :math:`f` is the log fold-change, defined as the difference in log expression between conditions,
@@ -171,22 +147,23 @@ When considering gene or surface protein expression, log-normalized counts are a
 
    \begin{align}
       \beta_f
-      = 
-      \log_2 h_^B{f} - \log_2 h_^A{f}.
+      =
+      \log_2 h_{f}^B - \log_2 h_{f}^A.
   \end{align}
 
+
-As chromatin accessibility cannot be interpreted in the same way, we take :math:`\beta_f = h_^B{f}- h_^A{f}` instead.
+As chromatin accessibility cannot be interpreted in the same way, we take :math:`\beta_f = h_{f}^B - h_{f}^A` instead.
 
 scVI-tools provides several ways to formulate the competing hypotheses from the effect sizes to detect DE features.
 When ``mode = "vanilla"``, we consider point null hypotheses of the form :math:`\mathcal{H}_{0f}: \beta_f = 0`.
-To avoid detecting features of little practical interest, e.g., when expression differences between conditions are significant but very subtle, we recommend users to use ``mode = "change"`` instead.
-In this formulation, we consider null hypotheses instead, such that 
+To avoid detecting features of little practical interest, e.g., when expression differences between conditions are significant but very subtle, we recommend that users use ``mode = "change"`` instead.
+In this formulation, we instead consider composite null hypotheses of the form
 
 .. math::
   :nowrap:
 
   \begin{align}
      \lvert \beta_f \rvert
-     \leq 
+     \leq
      \delta.
  \end{align}
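+
+Concretely, the posterior probability of the alternative, :math:`p^f = P(\lvert \beta_f \rvert > \delta \mid x)`, can be estimated from Monte Carlo samples of the log fold-change. A minimal sketch, assuming an array of posterior samples (placeholder values):
+
+    >>> import numpy as np
+    >>> lfc_samples = np.random.randn(500, 2000)  # (n_samples, n_features) posterior LFC draws
+    >>> delta = 0.25
+    >>> p_f = (np.abs(lfc_samples) > delta).mean(axis=0)  # per-feature P(|beta_f| > delta)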
 
@@ -196,7 +173,7 @@ A straightforward decision consists in detecting genes for which the posterior d
 
 Providing easy-to-interpret predictions
-========================================================
+=======================================
 
 The obtained gene sets may be difficult to interpret for some users.
 For this reason, we provide a data-supported way to select :math:`\epsilon`, such that the posterior expected False Discovery Proportion (FDP) is below a significance level :math:`\alpha`.
 To clarify how to compute the posterior expectation, we introduce two notations.
@@ -217,7 +194,7 @@ the decision rule tagging :math:`k` features of highest :math:`p_f` as DE.
 We also note :math:`d^f` the binary random variable taking value 1 if feature :math:`f` is differentially expressed.
 The False Discovery Proportion is a random variable corresponding to the ratio of the number of false positives over the total number of predicted positives.
-For the specific family of decision rules :math:`\mu^k, k` that we consider here, the FDP can be written as 
+For the specific family of decision rules :math:`\mu^k, k` that we consider here, the FDP can be written as
 
 .. math::
   :nowrap:
 
@@ -230,20 +207,15 @@ For the specific family of decision rules :math:`\mu^k, k` that we consider here
      {\sum_f \mu_f^k}
      .
  \end{align}
- 
-However, note that the posterior expectation of :math:`d^f`, denoted as :math:`\mathbb{E}_{post}[]`, verifies :math:`\mathbb{E}_{post}[FDP_{d^f}] = p^f`.
-Hence, by linearity of the expectation, we can estimate the false discovery rate corresponding to :math:`k` detected features as 
+
+However, note that the posterior expectation :math:`\mathbb{E}_{post}[\cdot]` of the binary variable :math:`d^f` satisfies :math:`\mathbb{E}_{post}[d^f] = p^f`.
+Hence, by linearity of the expectation, we can estimate the false discovery rate corresponding to :math:`k` detected features as
 
 .. math::
   :nowrap:
 
  \begin{align}
-      \mathbb{E}_{post}[FDP_{\mu^k}]
-      =
-      \frac
-      {\sum_f (1 - p^f) \mu_f^k}
-      {\sum_f \mu_f^k}
-      .
+      \mathbb{E}_{post}[FDP_{\mu^k}] = \frac{\sum_f (1 - p^f) \mu_f^k}{\sum_f \mu_f^k}.
  \end{align}
 
-Hence, for a given significance level :math:`\alpha`, we select the maximum detections :math:`k^*`, such that :math:`\mathbb{E}_{post}[FDP_{\mu^k}] \leq \alpha`, as illustrated below.
+Then, for a given significance level :math:`\alpha`, we select the maximum number of detections :math:`k^*`, such that :math:`\mathbb{E}_{post}[FDP_{\mu^k}] \leq \alpha`, as illustrated below.
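+
+A minimal sketch of this selection rule, assuming an array ``p_f`` of posterior DE probabilities as above:
+
+    >>> import numpy as np
+    >>> order = np.argsort(-p_f)  # rank features by decreasing p_f
+    >>> expected_fdp = np.cumsum(1 - p_f[order]) / np.arange(1, p_f.size + 1)
+    >>> k_star = int(np.sum(expected_fdp <= 0.05))  # largest k with posterior FDR <= alpha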
 
diff --git a/docs/user_guide/index.rst b/docs/user_guide/index.rst
index 5188dd1ca8..ec85ca2da5 100644
--- a/docs/user_guide/index.rst
+++ b/docs/user_guide/index.rst
@@ -15,6 +15,9 @@ scRNA-seq analysis
    * - :doc:`/user_guide/models/scvi`
      - [Lopez18]_
      - Dimensionality reduction, removal of unwanted variation, integration across replicates, donors, and technologies, differential expression, imputation, normalization of other cell- and sample-level confounding factors
+   * - :doc:`/user_guide/models/scanvi`
+     - [Xu21]_
+     - scVI tasks with cell type transfer from reference, seed labeling
    * - :doc:`/user_guide/models/linearscvi`
      - [Svensson20]_
      - scVI tasks with linear decoder
diff --git a/docs/user_guide/models/cellassign.rst b/docs/user_guide/models/cellassign.rst
index f6b1839f98..e878eb9a45 100644
--- a/docs/user_guide/models/cellassign.rst
+++ b/docs/user_guide/models/cellassign.rst
@@ -10,15 +10,13 @@ such that CellAssign will scale to very large datasets.
 
 The advantages of CellAssign are:
 
-    + Lightweight model that can be fit quickly.
-
-    + Ability to control for nuisance factors.
+- Lightweight model that can be fit quickly.
+- Ability to control for nuisance factors.
 
 The limitations of CellAssign include:
 
-    + Requirement for a cell types by gene markers binary matrix.
-
-    + The simple linear model may not handle non-linear batch effects.
+- Requires a binary cell-type-by-marker-gene matrix as input.
+- The simple linear model may not handle non-linear batch effects.
 
 .. topic:: Tutorials:
 
diff --git a/docs/user_guide/models/destvi.rst b/docs/user_guide/models/destvi.rst
index 381255cf34..993d2b245c 100644
--- a/docs/user_guide/models/destvi.rst
+++ b/docs/user_guide/models/destvi.rst
@@ -8,13 +8,12 @@ can be used to explore the spatial organization of a tissue and understanding ge
 
 The advantages of DestVI are:
 
-    + Can stratify cells into discrete cell types and model continuous sub-cell-type variation.
-
-    + Scalable to very large datasets (>1 million cells).
+- Can stratify cells into discrete cell types and model continuous sub-cell-type variation.
+- Scalable to very large datasets (>1 million cells).
 
 The limitations of DestVI include:
 
-    + Effectively requires a GPU for fast inference.
+- Effectively requires a GPU for fast inference.
 
 .. topic:: Tutorial:
 
@@ -240,4 +239,4 @@ can be found in the DestVI paper.
     `bioRxiv `__.
 
 .. [#ref2] Jakub Tomczak, Max Welling (2018),
     *VAE with a VampPrior*,
-    `International Conference on Artificial Intelligence and Statistics <http://proceedings.mlr.press/v84/tomczak18a.html>`__.
\ No newline at end of file
+    `International Conference on Artificial Intelligence and Statistics <http://proceedings.mlr.press/v84/tomczak18a.html>`__.
diff --git a/docs/user_guide/models/linearscvi.rst b/docs/user_guide/models/linearscvi.rst
--- a/docs/user_guide/models/linearscvi.rst
+++ b/docs/user_guide/models/linearscvi.rst
@@ -8,13 +8,11 @@ The advantages of LDVAE are:
 
-    + Can be used to interpret latent dimensions with factor loading matrix.
-
-    + Scalable to very large datasets (>1 million cells).
+- Can be used to interpret latent dimensions with factor loading matrix.
+- Scalable to very large datasets (>1 million cells).
 
 The limitations of LDVAE include:
 
-    + Less capacity than scVI, which uses a neural network decoder.
-
-    + Less capable of integrating data with complex batch effects.
+- Less capacity than scVI, which uses a neural network decoder.
+- Less capable of integrating data with complex batch effects.
 
 .. topic:: Tutorials:
 
diff --git a/docs/user_guide/models/scanvi.rst b/docs/user_guide/models/scanvi.rst
new file mode 100644
index 0000000000..be471fb29b
--- /dev/null
+++ b/docs/user_guide/models/scanvi.rst
@@ -0,0 +1,324 @@
+======
+scANVI
+======
+
+**scANVI** [#ref1]_ (single-cell ANnotation using Variational Inference; Python class :class:`~scvi.model.SCANVI`) is a semi-supervised model for single-cell transcriptomics data.
+In a sense, it can be seen as an extension of scVI that leverages the cell type annotations available for a subset of the cells to infer the states of the remaining cells.
+For this reason, scANVI can help annotate an unlabelled data set using manually annotated atlases, e.g., Tabula Sapiens [#refTS]_.
+
+The advantages of scANVI are:
+
+- Comprehensive in capabilities.
+- Scalable to very large datasets (>1 million cells).
+
+The limitations of scANVI include:
+
+- Effectively requires a GPU for fast inference.
+- Latent space is not interpretable, unlike that of a linear method.
+- May not scale to a very large number of cell types.
+
+
+.. topic:: Tutorials:
+
+   - :doc:`/tutorials/notebooks/harmonization`
+   - :doc:`/tutorials/notebooks/scarches_scvi_tools`
+
+
+Preliminaries
+==============
+scANVI takes as input a scRNA-seq gene expression matrix :math:`X` with :math:`N` cells and :math:`G` genes,
+as well as a vector :math:`\mathbf{c}` containing the partially observed cell type annotations.
+Let :math:`C` be the number of observed cell types in the data.
+Additionally, a design matrix :math:`S` containing :math:`p` observed covariates, such as day, donor, etc., is an optional input.
+While :math:`S` can include both categorical covariates and continuous covariates, in the following, we assume it contains only one
+categorical covariate with :math:`K` categories, which represents the common case of having multiple batches of data.
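+
+As a brief illustration, a typical workflow may look as follows. This is a hedged sketch: the ``labels_key``, ``unlabeled_category``, and ``batch_key`` values are hypothetical, and the exact setup API depends on the scvi-tools version.
+
+    >>> import scvi
+    >>> scvi.model.SCANVI.setup_anndata(
+    ...     adata, labels_key="cell_type", unlabeled_category="Unknown", batch_key="batch"
+    ... )
+    >>> model = scvi.model.SCANVI(adata)
+    >>> model.train()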
+
+
+
+Generative process
+============================
+
+scANVI extends the scVI model by making use of the observed cell types :math:`c_n`, following a
+graphical model inspired by works on semi-supervised VAEs [#ref2]_.
+
+
+.. math::
+   :nowrap:
+
+   \begin{align}
+   c_n &\sim \mathrm{Categorical}(1/C, ..., 1/C) \\
+   u_n &\sim \mathcal{N}(0, I) \\
+   z_n &\sim \mathcal{N}(f_z^\mu(c_n, u_n), f_z^\sigma(c_n, u_n) \odot I) \\
+   \ell_n &\sim \mathrm{LogNormal}\left( \ell_\mu^\top s_n ,\ell_{\sigma^2}^\top s_n \right) \\
+   \rho_n &= f_w\left( z_n, s_n \right) \\
+   \pi_{ng} &= f_h^g(z_n, s_n) \\
+   x_{ng} &\sim \mathrm{ObservationModel}(\ell_n \rho_n, \theta_g, \pi_{ng})
+   \end{align}
+
+We assume no prior knowledge of the distribution of cell types in the data (i.e.,
+uniform probabilities for the categorical distribution on :math:`c_n`).
+This modeling choice helps ensure a proper handling of rare cell types in the data.
+We assume that the within-cell-type characterization of a cell follows a Normal distribution, such that :math:`u_n \sim \mathcal{N}(0, I_d)`.
+The distribution over the random vector :math:`z_n` contains learnable parameters in the form of
+the neural networks :math:`f_z^\mu`, :math:`f_z^\sigma`. Qualitatively, :math:`z_n` characterizes each
+cell's state as a continuous, low-dimensional random variable, and has the same interpretation as in the scVI model.
+However, the prior for this variable takes into account the partial cell-type information to better structure the latent space.
+
+The rest of the model closely follows scVI. In particular, it represents the library size as a random variable,
+and gene expression likelihoods as negative binomial distributions parameterized by functions of :math:`z_n, \ell_n`,
+conditioned on the batch assignments :math:`s_n`.
+
+.. figure:: figures/scanvi_pgm.png
+   :class: img-fluid
+   :align: center
+   :alt: scANVI graphical model
+
+   scANVI graphical model for the ZINB likelihood model. Note that this graphical model contains more latent variables than the presentation above. Marginalization of these latent variables leads to the ZINB observation model (math shown in publication supplement).
+
+
+In addition to the table in :doc:`/user_guide/models/scvi`,
+we have the following in scANVI.
+
+.. list-table::
+   :widths: 20 90 15
+   :header-rows: 1
+
+   * - Latent variable
+     - Description
+     - Code variable (if different)
+   * - :math:`c_n \in \Delta^{C-1}`
+     - Cell type.
+     - ``y``
+   * - :math:`z_n \in \mathbb{R}^{d}`
+     - Latent cell state.
+     - ``z_1``
+   * - :math:`u_n \in \mathbb{R}^{d}`
+     - Latent cell-type-specific state.
+     - ``z_2``
+
+Inference
+========================
+
+scANVI assumes the following factorization for the inference model:
+
+.. math::
+   :nowrap:
+
+   \begin{align}
+   q_\eta(z_n, \ell_n, u_n, c_n \mid x_n)
+   =
+   q_\eta(z_n \mid x_n)
+   q_\eta(\ell_n \mid x_n)
+   q_\eta(c_n \mid z_n)
+   q_\eta(u_n \mid c_n, z_n)
+   \end{align}
+
+We make several observations here.
+First, each of these variational distributions is parameterized by a neural network.
+Second, while :math:`q_\eta(z_n \mid x_n)` and :math:`q_\eta(u_n \mid c_n, z_n)` are assumed Gaussian, :math:`q_\eta(c_n \mid z_n)` corresponds to a Categorical distribution over cell types.
+In particular, the variational distribution :math:`q_\eta(c_n \mid z_n)` can predict cell types for any cell.
+
+Behind the scenes, scANVI's classifier uses the mean of a cell's variational distribution :math:`q_\eta(z_n \mid x_n)`
+for classification.
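+
+The sketch below summarizes this factorization schematically. It is a simplified toy illustration, not the actual scvi-tools implementation (whose architectures, variable names, and shapes differ):
+
+.. code-block:: python
+
+    import torch
+    from torch import nn
+
+    n_genes, n_latent, n_types = 2000, 10, 12
+
+    encoder_z = nn.Linear(n_genes, 2 * n_latent)  # q(z | x): outputs a mean and a log-variance
+    classifier = nn.Sequential(nn.Linear(n_latent, n_types), nn.Softmax(dim=-1))  # q(c | z)
+    encoder_u = nn.Linear(n_latent + n_types, 2 * n_latent)  # q(u | z, c)
+
+    x = torch.rand(128, n_genes)  # a toy batch of expression values
+    qz_mean, qz_logvar = encoder_z(x).chunk(2, dim=-1)
+    c_probs = classifier(qz_mean)  # classification uses the mean of q(z | x)
+    qu_params = encoder_u(torch.cat([qz_mean, c_probs], dim=-1))  # simplified: soft assignments as input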
+
+Training details
+----------------
+
+scANVI optimizes evidence lower bounds (ELBOs) on the log evidence.
+For the sake of clarity, we ignore the library size and batch assignments below.
+We note that the evidence, and hence the ELBO, have a different expression for cells with observed and unobserved cell types.
+
+First, assume that we observe both the gene expressions :math:`x_n` and the type assignment :math:`c_n`.
+In that case, we bound the log evidence as
+
+.. math::
+   :nowrap:
+
+   \begin{align}
+   \log p_\theta(x_n, c_n)
+   \geq
+   \mathbb{E}_{q_\eta(z_n \mid x_n)
+   q_\eta(u_n \mid z_n, c_n)}
+   \left[
+   \log
+   \frac
+   {
+   p_\theta(x_n, c_n, z_n, u_n)
+   }
+   {
+   q_\eta(z_n \mid x_n)
+   q_\eta(u_n \mid z_n, c_n)
+   }
+   \right]
+   =: \mathcal{L}_S
+   \end{align}
+
+We aim to optimize the right-hand side of this equation with respect to :math:`\theta, \eta` using stochastic gradient descent.
+Gradient updates for the generative model parameters :math:`\theta` are easy to obtain:
+the gradient of the expectation corresponds to the expectation of the gradient, since the sampling distribution does not depend on :math:`\theta`.
+
+However, this is not the case when we differentiate with respect to :math:`\eta`.
+The reparameterization trick solves this issue and applies to the (Gaussian) distributions :math:`q_\eta(z_n \mid x_n), q_\eta(u_n \mid z_n, c_n)`.
+In particular, we can write :math:`\mathcal{L}_S` as an expectation under noise distributions independent of :math:`\eta`.
+For convenience, we will write expectations of the form :math:`\mathbb{E}_{\epsilon_v}` to denote expectations under the variational distribution using the reparameterization trick.
+We refer the reader to [#ref3]_ for additional insight on the reparameterization trick.
+
+.. math::
+   :nowrap:
+
+   \begin{align}
+   \nabla_\eta \mathcal{L}_S
+   =
+   \mathbb{E}_{\epsilon_z, \epsilon_u}
+   \left[
+   \nabla_\eta
+   \log
+   \frac
+   {
+   p_\theta(x_n, c_n, z_n, u_n)
+   }
+   {
+   q_\eta(z_n \mid x_n)
+   q_\eta(u_n \mid z_n, c_n)
+   }
+   \right]
+   \end{align}
+
+Things get trickier in the unobserved cell type case.
+In this setup, the ELBO corresponds to the right-hand side of
+
+.. math::
+   :nowrap:
+
+   \begin{align}
+   \log p_\theta(x_n)
+   \geq
+   \mathbb{E}_{
+   q_\eta(z_n \mid x_n)
+   q_\eta(c_n \mid z_n)
+   q_\eta(u_n \mid z_n, c_n)
+   }
+   \left[
+   \log
+   \frac
+   {
+   p_\theta(x_n, c_n, z_n, u_n)
+   }
+   {
+   q_\eta(z_n \mid x_n)
+   q_\eta(c_n \mid z_n)
+   q_\eta(u_n \mid z_n, c_n)
+   }
+   \right]=:\mathcal{L}_u
+   \end{align}
+
+Unfortunately, the reparameterization trick does not apply naturally to :math:`q_\eta(c_n \mid z_n)`.
+As an alternative, we observe that
+
+.. math::
+   :nowrap:
+
+   \begin{align}
+   \mathcal{L}_u
+   =
+   \mathbb{E}_{
+   \epsilon_z
+   }
+   \left[
+   \sum_{c=1}^C
+   q_\eta(c_n=c \mid z_n)
+   \mathbb{E}_{\epsilon_u}
+   \left[
+   \log
+   \frac
+   {
+   p_\theta(x_n, c_n=c, z_n, u_n)
+   }
+   {
+   q_\eta(z_n \mid x_n)
+   q_\eta(c_n=c \mid z_n)
+   q_\eta(u_n \mid z_n, c_n=c)
+   }
+   \right]
+   \right]
+   \end{align}
+
+In this form, we can differentiate :math:`\mathcal{L}_u` with respect to the inference network parameters, as
+
+.. math::
+   :nowrap:
+
+   \begin{align}
+   \nabla_\eta \mathcal{L}_u
+   =
+   \mathbb{E}_{
+   \epsilon_z
+   }
+   \left[
+   \sum_{c=1}^C
+   \nabla_\eta
+   \left(
+   q_\eta(c_n=c \mid z_n)
+   \mathbb{E}_{\epsilon_u}
+   \left[
+   \log
+   \frac
+   {
+   p_\theta(x_n, c_n=c, z_n, u_n)
+   }
+   {
+   q_\eta(z_n \mid x_n)
+   q_\eta(c_n=c \mid z_n)
+   q_\eta(u_n \mid z_n, c_n=c)
+   }
+   \right]
+   \right)
+   \right]
+   \end{align}
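+
+A minimal sketch of this marginalization, with illustrative shapes only; ``logratio_per_class[i, c]`` stands for the bracketed log-ratio term of cell ``i`` with its type fixed to ``c``:
+
+.. code-block:: python
+
+    import torch
+
+    logratio_per_class = torch.randn(128, 12)  # placeholder values for the per-class term
+    c_probs = torch.softmax(torch.randn(128, 12), dim=-1)  # q(c | z) for each cell
+
+    # Sum over all cell types instead of sampling one, so that gradients
+    # can flow through q(c | z) without reparameterization.
+    elbo_unlabelled = (c_probs * logratio_per_class).sum(dim=-1).mean()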
+
+In other words, we will need to marginalize :math:`c_n` out to circumvent the fact that categorical distributions cannot use the reparameterization trick.
+
+
+Overall, we optimize :math:`\mathcal{L} = \mathcal{L}_u + \mathcal{L}_S` to train the model on both labelled and unlabelled data.
+
+
+
+
+Tasks
+=====
+
+scANVI can perform all the same tasks as scVI (see :doc:`/user_guide/models/scvi`). In addition,
+scANVI can do the following:
+
+
+Prediction
+----------
+
+For prediction, scANVI relies on :math:`q_\eta(c_n \mid z_n)`; the most likely cell type is returned by the following function:
+
+
+    >>> adata.obs["scanvi_prediction"] = model.predict()
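+
+Soft predictions, i.e., the matrix of posterior cell type probabilities, may also be available. A hedged example (check the ``predict`` signature of your scvi-tools version):
+
+    >>> probabilities = model.predict(soft=True)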
+
+
+
+.. topic:: References:
+
+   .. [#ref1] Chenling Xu, Romain Lopez, Edouard Mehlman, Jeffrey Regier, Michael I. Jordan, Nir Yosef (2021),
+       *Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models*,
+       `Molecular systems biology 17.1 <https://doi.org/10.15252/msb.20209620>`__.
+
+   .. [#refTS] Tabula Sapiens Consortium (2021),
+       *The Tabula Sapiens: a single cell transcriptomic atlas of multiple organs from individual human donors*,
+       `BioRxiv `__.
+
+
+   .. [#ref2] Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling (2014),
+       *Semi-supervised learning with deep generative models*,
+       `Advances in neural information processing systems `__.
+
+
+   .. [#ref3] Diederik P. Kingma, Max Welling (2014),
+       *Auto-Encoding Variational Bayes*,
+       `arXiv <https://arxiv.org/abs/1312.6114>`__.
diff --git a/docs/user_guide/models/scvi.rst b/docs/user_guide/models/scvi.rst
index 0570985f14..0907486be5 100644
--- a/docs/user_guide/models/scvi.rst
+++ b/docs/user_guide/models/scvi.rst
@@ -7,15 +7,13 @@ be used for many common downstream tasks.
 
 The advantages of scVI are:
 
-    + Comprehensive in capabilities.
-
-    + Scalable to very large datasets (>1 million cells).
+- Comprehensive in capabilities.
+- Scalable to very large datasets (>1 million cells).
 
 The limitations of scVI include:
 
-    + Effectively requires a GPU for fast inference.
-
-    + Latent space is not interpretable, unlike that of a linear method.
+- Effectively requires a GPU for fast inference.
+- Latent space is not interpretable, unlike that of a linear method.
 
 .. topic:: Tutorials:
 
@@ -107,13 +105,13 @@ Inference
 scVI uses variational inference and specifically auto-encoding variational bayes (see :doc:`/user_guide/background/variational_inference`) to learn both
 the model parameters (the neural network params, dispersion params, etc.) and an approximate posterior distribution with the following factorization:
 
-    .. math::
-       :nowrap:
+.. math::
+   :nowrap:
 
-       \begin{align}
-        q_\eta(z_n, \ell_n \mid x_n) :=
-        q_\eta(z_n \mid x_n, s_n)q_\eta(\ell_n \mid x_n).
-       \end{align}
+   \begin{align}
+    q_\eta(z_n, \ell_n \mid x_n) :=
+    q_\eta(z_n \mid x_n, s_n)q_\eta(\ell_n \mid x_n).
+   \end{align}
 
 Here :math:`\eta` is a set of parameters corresponding to inference neural networks (encoders), which we do not describe in detail here, but are described in the scVI paper.
 The underlying class used as the encoder for scVI is :class:`~scvi.nn.Encoder`.
diff --git a/docs/user_guide/models/solo.rst b/docs/user_guide/models/solo.rst
index 413d588398..220ac134ff 100644
--- a/docs/user_guide/models/solo.rst
+++ b/docs/user_guide/models/solo.rst
@@ -7,13 +7,13 @@ be used for many common downstream tasks.
 
 The advantages of Solo are:
 
-    + Can perform doublet detection on pre-trained :class:`~scvi.model.SCVI` models
+- Can perform doublet detection on pre-trained :class:`~scvi.model.SCVI` models.
+- Scalable to very large datasets (>1 million cells).
-
-    + Scalable to very large datasets (>1 million cells).
 
 The limitations of Solo include:
 
-    + For an analysis seeking to only do doublet detection, Solo will be slower than other methods.
+- For an analysis that only requires doublet detection, Solo will be slower than other methods.
diff --git a/docs/user_guide/models/stereoscope.rst b/docs/user_guide/models/stereoscope.rst
index bc04b6b8f1..2bad1bde24 100644
--- a/docs/user_guide/models/stereoscope.rst
+++ b/docs/user_guide/models/stereoscope.rst
@@ -7,13 +7,12 @@ method for the deconvoluton of cell type profiles using a single-cell RNA sequen
 
 The advantages of Stereoscope are:
 
-    + Can stratify cells into discrete cell types.
-
-    + Scalable to very large datasets (>1 million cells).
+- Can stratify cells into discrete cell types.
+- Scalable to very large datasets (>1 million cells).
 
 The limitations of Stereoscope include:
 
-    + Effectively requires a GPU for fast inference.
+- Effectively requires a GPU for fast inference.
 
 .. topic:: Tutorial:
 
@@ -23,7 +22,7 @@ The limitations of Stereoscope include:
 Preliminaries
 =============
 Stereoscope requires training two latent variable models (LVMs): one for the single-cell reference
-dataset and one for the spatial transcriptomics dataset, which incorporates the learned parameters of the 
+dataset and one for the spatial transcriptomics dataset, which incorporates the learned parameters of the
 single-cell reference LVM.
-The first LVM takes in as input a scRNA-seq gene expression matrix of UMI counts :math:`Y` with :math:`N` cells
+The first LVM takes as input a scRNA-seq gene expression matrix of UMI counts :math:`Y` with :math:`N` cells
 and :math:`G` genes, along with a vector of cell type labels :math:`\vec{z}`.
 Subsequently, the second LVM takes in the learned parameters of the first LVM, along with a spatial gene
@@ -183,4 +182,4 @@ Subsequently for a given cell type, users can plot a heatmap of the cell type pr
 
 .. [#ref1] Alma Andersson, Joseph Bergenstråhle, Michaela Asp, Ludvig Bergenstråhle, Aleksandra Jurek, José Fernández Navarro & Joakim Lundeberg (2020),
     *Single-cell and spatial transcriptomics enables probabilistic inference of cell type topography*,
-    `Communications Biology `__.
\ No newline at end of file
+    `Communications Biology `__.
diff --git a/docs/user_guide/models/totalvi.rst b/docs/user_guide/models/totalvi.rst
index 7263003c2b..1b633021cb 100644
--- a/docs/user_guide/models/totalvi.rst
+++ b/docs/user_guide/models/totalvi.rst
@@ -7,15 +7,13 @@ be used for many common downstream tasks.
 
 The advantages of totalVI are:
 
-    + Comprehensive in capabilities.
-
-    + Scalable to very large datasets (>1 million cells).
+- Comprehensive in capabilities.
+- Scalable to very large datasets (>1 million cells).
 
 The limitations of totalVI include:
 
-    + Effectively requires a GPU for fast inference.
-
-    + Difficult to understand the balance between RNA and protein data in the low-dimensional representation of cells.
+- Effectively requires a GPU for fast inference.
+- Difficult to understand the balance between RNA and protein data in the low-dimensional representation of cells.
 
 .. topic:: Tutorials:
 
@@ -139,13 +137,13 @@ totalVI uses variational inference, and specifically auto-encoding variational b
 neural network params, dispersion params, etc.), and an approximate posterior distribution with the
 following factorization:
 
-    .. math::
-       :nowrap:
+.. math::
+   :nowrap:
 
-       \begin{align}
-        q_\eta(\beta_n, z_n, l_n \mid x_n, y_n, s_n) :=
-        q_\eta(\beta_n \mid z_n,s_n)q_\eta(z_n \mid x_n, y_n,s_n)q_\eta(l_n \mid x_n, y_n, s_n).
-       \end{align}
+   \begin{align}
+    q_\eta(\beta_n, z_n, l_n \mid x_n, y_n, s_n) :=
+    q_\eta(\beta_n \mid z_n,s_n)q_\eta(z_n \mid x_n, y_n,s_n)q_\eta(l_n \mid x_n, y_n, s_n).
+   \end{align}
 
 Here :math:`\eta` is a set of parameters corresponding to inference neural networks, which we do not describe in detail here, but are described in the totalVI paper.
 totalVI can also handle missing proteins (i.e., a dataset comprised of