.italic["The brain has about $10^{14}$ synapses and we only live for about $10^9$
seconds. So we have a lot more parameters than data. This
motivates the idea that we must do a lot of unsupervised learning
since the perceptual input (including proprioception) is the only
place we can get $10^5$ dimensions of constraint per second."]
.pull-right[Geoffrey Hinton, 2014.]
.italic["We need tremendous amount of information to build machines that have common sense and generalize."]
.pull-right[Yann LeCun, 2016.]
class: middle
Deep unsupervised learning
Deep unsupervised learning is about learning a model of the data, explicitly or implicitly, without requiring labels.
Generative models: recreate the raw data distribution (e.g., the distribution of natural images).
Self-supervised learning: solve puzzle tasks that require semantic understanding (e.g., predict a missing word in a sequence).
class: middle
Generative models
A (deep) generative model is a probabilistic model $p_\theta$ that can be used as a simulator of the data.
Formally, a generative model defines a probability distribution $p_\theta(\mathbf{x})$ over the data $\mathbf{x} \in \mathcal{X}$, where the parameters $\theta$ are learned to match the (unknown) data distribution $p(\mathbf{x})$.
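As a toy illustration (an assumption of these notes, not part of the lecture), the sketch below fits the simplest possible $p_\theta$, a single Gaussian, to data by maximum likelihood and then uses it as a simulator; deep generative models replace this Gaussian with a far more flexible family.
```python
import numpy as np

# Toy data drawn from some unknown distribution p(x) (here: 2D, 1000 samples).
rng = np.random.default_rng(0)
data = rng.normal(loc=[2.0, -1.0], scale=[0.5, 1.5], size=(1000, 2))

# Fit the simplest generative model p_theta(x) = N(mu, Sigma) by maximum likelihood:
# theta = (mu, Sigma) are just the empirical mean and covariance of the data.
mu = data.mean(axis=0)
Sigma = np.cov(data, rowvar=False)

# Use the fitted model as a simulator of new data.
new_samples = rng.multivariate_normal(mu, Sigma, size=5)
print(new_samples)
```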
.center.width-60[]
???
This is conceptually identical to what we already did in Lecture 10 when we wanted to learn $p(y|x)$. We still want to learn a distribution, but this time it is the distribution of the input data itself.
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
count: false
.center.width-90[]
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
count: false
.center.width-90[]
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
Auto-encoders
An auto-encoder is a composite function made of
an encoder $f$ from the original space $\mathcal{X}$ to a latent space $\mathcal{Z}$,
a decoder $g$ to map back to $\mathcal{X}$,
such that $g \circ f$ is close to the identity on the data (a minimal code sketch follows below).
.center.width-80[]
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
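A minimal sketch of such a pair $(f, g)$ in PyTorch, with layer sizes chosen arbitrarily for illustration:
```python
import torch
from torch import nn

class AutoEncoder(nn.Module):
    """Minimal auto-encoder: encoder f maps X = R^p to Z = R^d, decoder g maps back."""
    def __init__(self, p=784, d=32):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(p, 256), nn.ReLU(), nn.Linear(256, d))  # encoder
        self.g = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, p))  # decoder

    def forward(self, x):
        return self.g(self.f(x))  # g o f, trained to be close to the identity on the data

x = torch.rand(16, 784)       # a batch of (flattened) inputs
x_hat = AutoEncoder()(x)      # reconstruction, same shape as x
```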
class: middle
Let $p(\mathbf{x})$ be the data distribution over $\mathcal{X}$. A good auto-encoder can be characterized by a small reconstruction loss,
$$\mathbb{E}_{\mathbf{x} \sim p(\mathbf{x})} \left[ || \mathbf{x} - g \circ f(\mathbf{x}) ||^2 \right] \approx 0.$$
Given two parameterized mappings $f(\cdot; \theta_f)$ and $g(\cdot;\theta_g)$, training consists of minimizing an empirical estimate of that loss,
$$\theta_f, \theta_g = \arg \min_{\theta_f, \theta_g} \frac{1}{N} \sum_{i=1}^N || \mathbf{x}_i - g(f(\mathbf{x}_i; \theta_f); \theta_g) ||^2.$$
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
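A possible training loop for this empirical objective (synthetic data and hyper-parameters are placeholders):
```python
import torch
from torch import nn

p, d = 784, 32
f = nn.Sequential(nn.Linear(p, 256), nn.ReLU(), nn.Linear(256, d))  # encoder, parameters theta_f
g = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, p))  # decoder, parameters theta_g

X = torch.rand(1024, p)  # stand-in for the training data x_1, ..., x_N
optimizer = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)

for epoch in range(10):
    for x in X.split(64):                                  # mini-batches
        loss = ((x - g(f(x))) ** 2).sum(dim=1).mean()      # empirical reconstruction loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```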
class: middle
For example, when the auto-encoder is linear,
$$
\begin{aligned}
f: \mathbf{z} &= \mathbf{U}^T \mathbf{x} \\
g: \hat{\mathbf{x}} &= \mathbf{U} \mathbf{z},
\end{aligned}
$$
with $\mathbf{U} \in \mathbb{R}^{p\times d}$, the reconstruction error reduces to
$$\mathbb{E}_{\mathbf{x} \sim p(\mathbf{x})} \left[ || \mathbf{x} - \mathbf{U}\mathbf{U}^T \mathbf{x} ||^2 \right].$$
In this case, an optimal solution is given by PCA.
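This can be checked numerically: taking $\mathbf{U}$ as the top-$d$ principal directions of the (centered) data yields the PCA reconstruction $\mathbf{U}\mathbf{U}^T \mathbf{x}$.
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
X -= X.mean(axis=0)              # center the data (implicit in the PCA solution)

d = 3
_, _, Vt = np.linalg.svd(X, full_matrices=False)
U = Vt[:d].T                     # U in R^{p x d}: top-d principal directions

Z = X @ U                        # encoder f: z = U^T x (applied row-wise)
X_hat = Z @ U.T                  # decoder g: x_hat = U z

print(np.mean(np.sum((X - X_hat) ** 2, axis=1)))  # empirical reconstruction error
```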
class: middle
Deep auto-encoders
.center.width-80[ ]
Better results can be achieved with more sophisticated classes of mappings than linear projections: use deep neural networks for $f$ and $g$.
For instance,
by combining a multi-layer perceptron encoder $f : \mathbb{R}^p \to \mathbb{R}^d$ with a multi-layer perceptron decoder $g: \mathbb{R}^d \to \mathbb{R}^p$.
by combining a convolutional network encoder $f : \mathbb{R}^{w\times h \times c} \to \mathbb{R}^d$ with a decoder $g : \mathbb{R}^d \to \mathbb{R}^{w\times h \times c}$ made of transposed convolutional layers that mirror the encoder (a sketch follows below).
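A possible convolutional instantiation, with shapes chosen for $28 \times 28$ grayscale images purely as an example:
```python
import torch
from torch import nn

d = 32  # latent dimension

# Encoder f: 1x28x28 images -> R^d
f = nn.Sequential(
    nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 1x28x28 -> 16x14x14
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 16x14x14 -> 32x7x7
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, d),
)

# Decoder g: R^d -> 1x28x28, mirroring the encoder with transposed convolutions
g = nn.Sequential(
    nn.Linear(d, 32 * 7 * 7), nn.ReLU(),
    nn.Unflatten(1, (32, 7, 7)),
    nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),  # -> 16x14x14
    nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1),              # -> 1x28x28
)

x = torch.rand(8, 1, 28, 28)
assert g(f(x)).shape == x.shape
```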
class: middle
.center.width-60[]
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
.center.width-60[]
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
.center.width-60[]
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
Interpolation
To get an intuition of the learned latent representation, we can pick two samples $\mathbf{x}$ and $\mathbf{x}'$ at random and decode points interpolated along the line between their latent codes in $\mathcal{Z}$.
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
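Assuming a trained encoder `f` and decoder `g` (the modules below are untrained stand-ins), this amounts to decoding convex combinations of the two latent codes:
```python
import torch
from torch import nn

# Stand-ins for a trained encoder/decoder, for illustration only.
f = nn.Linear(784, 32)
g = nn.Linear(32, 784)

x, x_prime = torch.rand(784), torch.rand(784)
z, z_prime = f(x), f(x_prime)

# Decode points along the line segment between the two latent codes.
for t in torch.linspace(0.0, 1.0, steps=8):
    x_t = g((1 - t) * z + t * z_prime)
```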
class: middle
.center.width-60[]
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
.center.width-60[]
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
Denoising auto-encoders
Besides dimension reduction, auto-encoders can capture dependencies between signal components to restore degraded or noisy signals.
In this case, the composition $$h = g \circ f : \mathcal{X} \to \mathcal{X}$$ is a denoising auto-encoder.
The goal is to optimize $h$ such that a perturbation $\tilde{\mathbf{x}}$ of the signal $\mathbf{x}$ is restored to $\mathbf{x}$, hence $$h(\tilde{\mathbf{x}}) \approx \mathbf{x}.$$
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
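Training differs from the plain auto-encoder only in that the model sees a corrupted input $\tilde{\mathbf{x}}$ but is penalized against the clean $\mathbf{x}$; a sketch with (arbitrarily chosen) Gaussian corruption:
```python
import torch
from torch import nn

p, d = 784, 32
h = nn.Sequential(                                    # h = g o f, the denoising auto-encoder
    nn.Linear(p, 256), nn.ReLU(), nn.Linear(256, d),  # encoder f
    nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, p),  # decoder g
)

X = torch.rand(1024, p)                               # stand-in training data
optimizer = torch.optim.Adam(h.parameters(), lr=1e-3)

for x in X.split(64):
    x_tilde = x + 0.3 * torch.randn_like(x)           # perturbation of the signal
    loss = ((x - h(x_tilde)) ** 2).sum(dim=1).mean()  # restore the clean x from x_tilde
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```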
class: middle
.center.width-60[]
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
.center.width-60[]
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
.center.width-60[]
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
.center.width-60[]
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
A fundamental weakness of denoising auto-encoders is that the posterior $p(\mathbf{x}|\tilde{\mathbf{x}})$ is possibly multi-modal.
If we train an auto-encoder with the quadratic loss (i.e., implicitly assuming a Gaussian likelihood), then the best reconstruction is
$$h(\tilde{\mathbf{x}}) = \mathbb{E}[\mathbf{x}|\tilde{\mathbf{x}}],$$
which may be very unlikely under $p(\mathbf{x}|\tilde{\mathbf{x}})$.
.center.width-60[]
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
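As a sanity check, minimizing the expected quadratic loss pointwise in $\tilde{\mathbf{x}}$ indeed gives the conditional mean:
$$\begin{aligned}
h^\star(\tilde{\mathbf{x}}) &= \arg \min_{\mathbf{r}} \mathbb{E}_{\mathbf{x} \sim p(\mathbf{x}|\tilde{\mathbf{x}})} \left[ || \mathbf{x} - \mathbf{r} ||^2 \right], \\
0 &= \nabla_{\mathbf{r}} \, \mathbb{E}_{\mathbf{x} \sim p(\mathbf{x}|\tilde{\mathbf{x}})} \left[ || \mathbf{x} - \mathbf{r} ||^2 \right] = 2 \left( \mathbf{r} - \mathbb{E}[\mathbf{x}|\tilde{\mathbf{x}}] \right),
\end{aligned}$$
hence $h^\star(\tilde{\mathbf{x}}) = \mathbb{E}[\mathbf{x}|\tilde{\mathbf{x}}]$, which can fall between the modes of a multi-modal $p(\mathbf{x}|\tilde{\mathbf{x}})$.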
???
Also, the quadratic loss leads to blurry and unrealistic reconstructions, because its minimizer, the conditional mean, may be very unlikely under the posterior.
Sampling from an AE's latent space
The generative capability of the decoder $g$ in an auto-encoder can be assessed by introducing a (simple) density model $q$ over the latent space $\mathcal{Z}$, sampling from it, and mapping the samples into the data space $\mathcal{X}$ with $g$.
.center.width-80[]
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
For instance, a factored Gaussian model with a diagonal covariance matrix,
$$q(\mathbf{z}) = \mathcal{N}(\hat{\mu}, \hat{\Sigma}),$$
where both $\hat{\mu}$ and $\hat{\Sigma}$ are estimated on the training data.
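In code, this amounts to encoding the training data, estimating $\hat{\mu}$ and $\hat{\Sigma}$ from the latent codes, and decoding samples drawn from $q$ (the encoder and decoder below are untrained stand-ins for a trained model):
```python
import torch
from torch import nn

# Stand-ins for a trained encoder f and decoder g.
p, d = 784, 32
f = nn.Sequential(nn.Linear(p, 256), nn.ReLU(), nn.Linear(256, d))
g = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, p))

X = torch.rand(1024, p)                    # stand-in for the training data

with torch.no_grad():
    Z = f(X)                               # latent codes of the training data
    mu_hat = Z.mean(dim=0)                 # estimated mean
    sigma_hat = Z.std(dim=0)               # estimated (diagonal) standard deviations

    q = torch.distributions.Normal(mu_hat, sigma_hat)  # factored Gaussian q(z)
    z = q.sample((16,))                    # sample in latent space
    x_new = g(z)                           # map the samples back to data space with g
```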
class: middle
.center.width-60[]
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
These results are not satisfactory because the density model on the latent space is too simple to be adequate.
Building a good model in latent space amounts to our original problem of modeling an empirical distribution, although it may now be in a lower-dimensional space.
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
count: false
Variational inference
???
Switch to BB.
class: middle
Latent variable model
.center.width-20[]
Consider for now a prescribed latent variable model that relates a set of observable variables $\mathbf{x} \in \mathcal{X}$ to a set of unobserved variables $\mathbf{z} \in \mathcal{Z}$.
The probabilistic model defines a joint probability distribution $p_\theta(\mathbf{x}, \mathbf{z})$, which decomposes as
$$p_\theta(\mathbf{x}, \mathbf{z}) = p_\theta(\mathbf{x}|\mathbf{z}) p(\mathbf{z}).$$
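For illustration only (these modeling choices are assumptions, not the lecture's), such a prescribed model could combine a standard Gaussian prior $p(\mathbf{z})$ with a network-parameterized Gaussian likelihood $p_\theta(\mathbf{x}|\mathbf{z})$:
```python
import torch
from torch import nn

d, p = 2, 10
prior = torch.distributions.Normal(torch.zeros(d), torch.ones(d))        # p(z)
decoder = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, p))   # parameterizes p_theta(x|z)

def joint_log_prob(x, z, sigma=0.1):
    """log p_theta(x, z) = log p_theta(x|z) + log p(z)."""
    likelihood = torch.distributions.Normal(decoder(z), sigma)           # p_theta(x|z)
    return likelihood.log_prob(x).sum(-1) + prior.log_prob(z).sum(-1)

# Ancestral sampling from the joint: z ~ p(z), then x ~ p_theta(x|z).
z = prior.sample()
x = torch.distributions.Normal(decoder(z), 0.1).sample()
print(joint_log_prob(x, z))
```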
???
The probabilistic model is given and motivated by domain knowledge assumptions.