R: Code the GMM example
R: Code the NF with coupling layers and visualize the transformations
.italic["Every time a scientific paper presents a bit of data, it's accompanied
by an .bold[error bar] – a quiet but insistent reminder that no knowledge is complete or perfect. It's a .bold[calibration of how much we trust what we think we know]."]
Knowledge is an artefact. It is a mental construct.
Uncertainty is how much we trust this construct.
Uncertainty refers to situations where there is .bold[imperfect or unknown information]. It can arise in predictions of future events, in physical measurements, or in situations where information is unknown.
Accounting for uncertainty is necessary for making optimal decisions. Not accounting for uncertainty can lead to suboptimal, wrong, or even catastrophic decisions.
.italic[Case 1]. The first assisted-driving fatality, in May 2016: the perception system mistook the trailer's white side for the bright sky.
]
class: middle
.center.width-60[ ]
.italic[Case 2]. An image classification system erroneously identifies two African Americans as gorillas, raising concerns of racial discrimination.
.footnote[Credits: Kendall and Gal, What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? , 2017.]
class: middle
.alert[The systems that made these errors were likely confident in their predictions. They did not account for uncertainty.]
class: middle
class: middle
Aleatoric uncertainty refers to the uncertainty arising from the inherent stochasticity of the true data generating process. This uncertainty .bold[cannot be reduced] with more data.
A common example is observational noise due to the limitations of the measurement devices. Collecting more data will not reduce the noise.
class: middle
Assumptions about the data generating process can help in distinguishing between different types of aleatoric uncertainty:
Homoscedastic uncertainty, which is constant across the input space.
Heteroscedastic uncertainty, which varies across the input space.
.center.width-90[![](figures/lec10/homo-vs-hetero.png)]
Neural density estimation
Consider training data $(\mathbf{x}, y) \sim p(\mathbf{x}, y)$ , with
$\mathbf{x} \in \mathbb{R}^p$ ,
$y \in \mathbb{R}$ .
We do not wish to learn a function $\hat{y} = f(\mathbf{x})$ , which would only produce point estimates.
Instead we want to learn the full conditional density $$p(y|\mathbf{x}).$$
class: middle
NN with Gaussian output layer
We can model aleatoric uncertainty in the output by modelling the conditional distribution as a Gaussian distribution,
$$p(y|\mathbf{x}) = \mathcal{N}(y; \mu(\mathbf{x}), \sigma^2(\mathbf{x})),$$
where $\mu(\mathbf{x})$ and $\sigma^2(\mathbf{x})$ are parametric functions to be learned, such as neural networks.
Note: The Gaussian distribution is a modelling choice. Other parametric distributions can be used.
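
As a concrete illustration, here is a minimal PyTorch sketch of such a network with two output heads, one for $\mu(\mathbf{x})$ and one for $\log \sigma^2(\mathbf{x})$ (class and layer names are our own choices, not a reference implementation):

```python
import torch
import torch.nn as nn

class GaussianRegressor(nn.Module):
    # Hypothetical two-head network: one head for mu(x), one for log sigma^2(x).
    def __init__(self, in_features, hidden=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_features, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, 1)
        self.log_var_head = nn.Linear(hidden, 1)  # predicting log sigma^2 keeps sigma^2 > 0

    def forward(self, x):
        h = self.backbone(x)
        return self.mu_head(h), self.log_var_head(h)
```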
class: middle
.center.width-80[ ]
.center[Case 1: Homoscedastic aleatoric uncertainty]
class: middle
We have,
$$\begin{aligned}
&\arg \max_{\theta,\sigma^2} p(\mathbf{d}|\theta,\sigma^2) \\
&= \arg \max_{\theta,\sigma^2} \prod_{\mathbf{x}_i, y_i \in \mathbf{d}} p(y_i|\mathbf{x}_i, \theta,\sigma^2) \\
&= \arg \max_{\theta,\sigma^2} \prod_{\mathbf{x}_i, y_i \in \mathbf{d}} \frac{1}{\sqrt{2\pi} \sigma} \exp\left(-\frac{(y_i-\mu(\mathbf{x}_i))^2}{2\sigma^2}\right) \\
&= \arg \min_{\theta,\sigma^2} \sum_{\mathbf{x}_i, y_i \in \mathbf{d}} \frac{(y_i-\mu(\mathbf{x}_i))^2}{2\sigma^2} + \log(\sigma) + C
\end{aligned}$$
.question[What if $\sigma^2$ was fixed?]
class: middle
.center.width-80[ ]
.center[Case 2: Heteroscedastic aleatoric uncertainty]
class: middle
Same as for the homoscedastic case, except that $\sigma^2$ is now a function of $\mathbf{x}_i$:
$$\begin{aligned}
&\arg \max_{\theta} p(\mathbf{d}|\theta) \\
&= \arg \max_{\theta} \prod_{\mathbf{x}_i, y_i \in \mathbf{d}} p(y_i|\mathbf{x}_i, \theta) \\
&= \arg \max_{\theta} \prod_{\mathbf{x}_i, y_i \in \mathbf{d}} \frac{1}{\sqrt{2\pi} \sigma(\mathbf{x}_i)} \exp\left(-\frac{(y_i-\mu(\mathbf{x}_i))^2}{2\sigma^2(\mathbf{x}_i)}\right) \\
&= \arg \min_{\theta} \sum_{\mathbf{x}_i, y_i \in \mathbf{d}} \frac{(y_i-\mu(\mathbf{x}_i))^2}{2\sigma^2(\mathbf{x}_i)} + \log(\sigma(\mathbf{x}_i)) + C
\end{aligned}$$
.question[What is the purpose of $2\sigma^2(\mathbf{x}_i)$ ? What about $\log(\sigma(\mathbf{x}_i))$ ?]
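
A sketch of the corresponding training loss, parametrizing $\sigma^2(\mathbf{x})$ through its logarithm so that it stays positive (it assumes the two-head `GaussianRegressor` sketch above):

```python
import torch

def gaussian_nll(y, mu, log_var):
    # Per-sample negative log-likelihood (up to the constant C):
    #   (y - mu)^2 / (2 sigma^2(x)) + log sigma(x), with log sigma(x) = 0.5 * log_var.
    return 0.5 * ((y - mu) ** 2 * torch.exp(-log_var) + log_var)

# One training step, assuming model and optimizer are defined as above:
# mu, log_var = model(x)
# loss = gaussian_nll(y, mu, log_var).mean()
# loss.backward(); optimizer.step()
```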
???
Take care of properly parametrizing $\sigma^2(\mathbf{x}_i)$ to ensure that it is positive.
class: middle
Modelling $p(y|\mathbf{x})$ as a unimodal (Gaussian) distribution can be inadequate since the conditional distribution may be .bold[multimodal].
???
Illustrate on the blackboard.
class: middle
A Gaussian mixture model (GMM) defines instead $p(y|\mathbf{x})$ as a mixture of $K$ Gaussian components,
$$p(y|\mathbf{x}) = \sum_{k=1}^K \pi_k \mathcal{N}(y;\mu_k, \sigma_k^2),$$
where $0 \leq \pi_k \leq 1$ for all $k$ and $\sum_{k=1}^K \pi_k = 1$ .
.center.width-60[ ]
class: middle
A .bold[mixture density network] (MDN) is a neural network implementation of the Gaussian mixture model.
.center.width-100[ ]
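
A minimal PyTorch sketch of a mixture density network and its negative log-likelihood (the architecture, head names, and use of log-sum-exp for numerical stability are our own choices):

```python
import math
import torch
import torch.nn as nn

class MDN(nn.Module):
    # Hypothetical mixture density network with K Gaussian components.
    def __init__(self, in_features, K=5, hidden=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_features, hidden), nn.Tanh())
        self.pi_head = nn.Linear(hidden, K)          # mixing coefficient logits
        self.mu_head = nn.Linear(hidden, K)          # component means
        self.log_sigma_head = nn.Linear(hidden, K)   # log of component standard deviations

    def forward(self, x):
        h = self.backbone(x)
        return self.pi_head(h), self.mu_head(h), self.log_sigma_head(h)

def mdn_nll(y, pi_logits, mu, log_sigma):
    # Negative log-likelihood of y (shape (N, 1)) under the mixture, via log-sum-exp.
    log_pi = torch.log_softmax(pi_logits, dim=-1)
    log_normal = -0.5 * ((y - mu) / log_sigma.exp()) ** 2 - log_sigma - 0.5 * math.log(2 * math.pi)
    return -torch.logsumexp(log_pi + log_normal, dim=-1)
```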
class: middle
Let us consider training data generated randomly as
$$y_i = \mathbf{x}_i + 0.3\sin(4\pi \mathbf{x}_i) + \epsilon_i$$
with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ for some small noise level $\sigma^2$.
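
A possible way to generate such a dataset (the sample size and noise scale below are assumptions, not values taken from the slide):

```python
import math
import torch

n = 2500
x = torch.rand(n, 1)                            # inputs, here drawn uniformly in [0, 1]
eps = 0.05 * torch.randn(n, 1)                  # assumed noise scale
y = x + 0.3 * torch.sin(4 * math.pi * x) + eps

# The multimodal variant discussed next simply swaps the roles of x and y:
x_flipped, y_flipped = y, x
```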
class: middle
.center[
.width-55[ ]
The data can be fit with a 2-layer network producing point estimates for $y$
(demo ).
]
.footnote[Credits: David Ha, Mixture Density Networks , 2015.]
class: middle
.center[
.width-55[ ]
If we flip $\mathbf{x}_i$ and $y_i$, the network runs into trouble: each input now admits several valid outputs, and the network ends up predicting their average
(demo ).
]
.footnote[Credits: David Ha, Mixture Density Networks , 2015.]
class: middle
.center[
.width-55[ ]
A mixture density network models the data correctly, as it predicts for each input a distribution for the output, rather than a point estimate
(demo ).
]
.footnote[Credits: David Ha, Mixture Density Networks , 2015.]
Gaussian mixture models are a flexible way to model multimodal distributions, but they are limited by the number of components $K$ , which must be large to model complex distributions.
Normalizing flows are a more flexible way to model complex distributions.
class: middle
.center.width-80[ ]
Assume $\mathbf{z}$ is uniformly distributed over the unit cube in $\mathbb{R}^3$ and $\mathbf{x} = f(\mathbf{z}) = 2\mathbf{z}$.
Since the total probability mass must be conserved,
$$p(\mathbf{x})=p(\mathbf{x}=f(\mathbf{z})) = p(\mathbf{z})\frac{V_\mathbf{z}}{V_\mathbf{x}}=p(\mathbf{z}) \frac{1}{8},$$
where $\frac{1}{8} = \left| \det \left( \begin{matrix}
2 & 0 & 0 \\
0 & 2 & 0 \\
0 & 0 & 2
\end{matrix} \right)\right|^{-1}$ represents the inverse determinant of the Jacobian of the linear transformation $f$ .
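
A quick numeric check of this scaling factor (a sketch only):

```python
import torch

J = 2.0 * torch.eye(3)               # Jacobian of f(z) = 2z in R^3
print(1.0 / torch.det(J).abs())      # tensor(0.1250), i.e. the 1/8 scaling factor
```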
???
Motivate that picking a parametric family of distributions is not always easy. We want something more flexible.
class: middle
What if $f$ is non-linear?
.center.width-70[ ]
.footnote[Image credits: Simon J.D. Prince, Understanding Deep Learning , 2023.]
class: middle
Change of variables theorem
If $f$ is non-linear,
the Jacobian $J_f(\mathbf{z})$ of $\mathbf{x} = f(\mathbf{z})$ represents the infinitesimal linear transformation in the neighborhood of $\mathbf{z}$ ;
if the function is a bijective map, then the mass must be conserved locally.
Therefore, the local change of density yields
$$p(\mathbf{x}=f(\mathbf{z})) = p(\mathbf{z})\left| \det J_f(\mathbf{z}) \right|^{-1}.$$
Similarly, for $g = f^{-1}$ , we have $$p(\mathbf{x})=p(\mathbf{z}=g(\mathbf{x}))\left| \det J_g(\mathbf{x}) \right|.$$
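
A small one-dimensional sanity check of the theorem, using $f(z) = \exp(z)$ with a standard normal base distribution (so that $x$ should be log-normal):

```python
import torch
from torch.distributions import Normal, LogNormal

z = torch.linspace(-2.0, 2.0, 5)
x = torch.exp(z)                                  # f(z) = exp(z) is a bijection onto (0, inf)

# p(x) = p(z) |det J_f(z)|^{-1}, with df/dz = exp(z) in one dimension.
p_x = Normal(0.0, 1.0).log_prob(z).exp() / torch.exp(z)

# x should follow a standard log-normal distribution.
print(torch.allclose(p_x, LogNormal(0.0, 1.0).log_prob(x).exp()))  # True
```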
???
The Jacobian matrix of a function f: R^n -> R^m at a point z in R^n is an m x n matrix that represents the linear transformation induced by the function at that point. Geometrically, the Jacobian matrix can be thought of as a matrix of partial derivatives that describes how the function locally stretches or shrinks areas and volumes in the vicinity of the point z.
The determinant of the Jacobian matrix of f at z has a geometric interpretation as the factor by which the function locally scales areas or volumes. Specifically, if the determinant is positive, then the function locally expands areas and volumes, while if it is negative, the function locally contracts areas and volumes. The absolute value of the determinant gives the factor by which the function scales the areas or volumes.
class: middle
Assume $\mathbf{z} = (\mathbf{z}_a, \mathbf{z}_b)$ and $\mathbf{x} = (\mathbf{x}_a, \mathbf{x}_b)$ . Then,
Forward mapping $\mathbf{x} = f(\mathbf{z})$ :
$$\mathbf{x}_a = \mathbf{z}_a, \quad \mathbf{x}_b = \mathbf{z}_b \odot \exp(s(\mathbf{z}_a)) + t(\mathbf{z}_a),$$
Inverse mapping $\mathbf{z} = g(\mathbf{x})$ :
$$\mathbf{z}_a = \mathbf{x}_a, \quad \mathbf{z}_b = (\mathbf{x}_b - t(\mathbf{x}_a)) \odot \exp(-s(\mathbf{x}_a)),$$
where $s$ and $t$ are arbitrary neural networks.
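
A minimal PyTorch sketch of such an affine coupling layer (the class name, hidden sizes, and choice of MLPs for $s$ and $t$ are ours):

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    # Affine coupling layer sketch: x_a = z_a and x_b = z_b * exp(s(z_a)) + t(z_a).
    def __init__(self, dim_a, dim_b, hidden=64):
        super().__init__()
        self.s = nn.Sequential(nn.Linear(dim_a, hidden), nn.ReLU(), nn.Linear(hidden, dim_b))
        self.t = nn.Sequential(nn.Linear(dim_a, hidden), nn.ReLU(), nn.Linear(hidden, dim_b))

    def forward(self, z_a, z_b):
        s, t = self.s(z_a), self.t(z_a)
        log_det = s.sum(dim=-1)                  # log |det J_f| = sum_i s(z_a)_i
        return z_a, z_b * torch.exp(s) + t, log_det

    def inverse(self, x_a, x_b):
        s, t = self.s(x_a), self.t(x_a)
        return x_a, (x_b - t) * torch.exp(-s)
```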
???
Draw the coupling layer on the blackboard.
class: middle
For $\mathbf{x} = (\mathbf{x}_a, \mathbf{x}_b)$ , the log-likelihood is
$$\begin{aligned}\log p(\mathbf{x}) &= \log p(\mathbf{z}) - \log \left| \det J_f(\mathbf{z}) \right|\end{aligned}$$
where the Jacobian $J_f(\mathbf{z}) = \frac{\partial \mathbf{x}}{\partial \mathbf{z}}$ is a lower triangular matrix $$\left( \begin{matrix}
\mathbf{I} & 0 \\
\frac{\partial \mathbf{x}_b}{\partial \mathbf{z}_a} & \text{diag}(\exp(s(\mathbf{z}_a))) \end{matrix} \right),$$
such that $\left| \det J_f(\mathbf{z}) \right| = \prod_i \exp(s(\mathbf{z}_a))_i = \exp(\sum_i s(\mathbf{z}_a)_i)$ .
Therefore, the log-likelihood is
$$\begin{aligned}\log p(\mathbf{x}) &= \log p(\mathbf{z}) - \sum_i s(\mathbf{z}_a)_i\end{aligned}$$
class: middle
A normalizing flow is a change of variable $f$ that transforms a base distribution $p(\mathbf{z})$ into $p(\mathbf{x})$ through a discrete sequence of invertible transformations.
.center.width-100[![](figures/lec10/FlowTransformLayers.svg)]
.footnote[Image credits: Simon J.D. Prince, Understanding Deep Learning , 2023.]
class: middle
Formally,
$$\begin{aligned}
&\mathbf{z}_0 \sim p(\mathbf{z}) \\
&\mathbf{z}_k = f_k(\mathbf{z}_{k-1}), \quad k=1,...,K \\
&\mathbf{x} = \mathbf{z}_K = f_K \circ ... \circ f_1(\mathbf{z}_0).
\end{aligned}$$
The change of variable theorem yields
$$\log p(\mathbf{x}) = \log p(\mathbf{z}_0) - \sum_{k=1}^K \log \left| \det J_{f_k}(\mathbf{z}_{k-1}) \right|.$$
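
A sketch of a small flow for density estimation, stacking the `AffineCoupling` layers from above: it maps $\mathbf{x}$ back to $\mathbf{z}$ through the inverse couplings and accumulates the log-determinants. The 2D split and the swap between layers are implementation choices, not part of the formula.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class Flow(nn.Module):
    # Sketch: stack of coupling layers over a 2D variable split into two halves.
    def __init__(self, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([AffineCoupling(dim_a=1, dim_b=1) for _ in range(n_layers)])
        self.base = Normal(0.0, 1.0)

    def log_prob(self, x):
        # Map x back to z through the inverse couplings; each step contributes
        # log |det J_g(x)| = -sum_i s(x_a)_i, matching the change of variables formula.
        x_a, x_b = x[:, :1], x[:, 1:]
        log_det = torch.zeros(x.shape[0])
        for layer in self.layers:
            log_det = log_det - layer.s(x_a).sum(dim=-1)
            x_a, x_b = layer.inverse(x_a, x_b)
            x_a, x_b = x_b, x_a                  # swap halves so every dimension gets transformed
        z = torch.cat([x_a, x_b], dim=-1)
        return self.base.log_prob(z).sum(dim=-1) + log_det

# Training then maximizes the average log-likelihood:
# flow = Flow(); loss = -flow.log_prob(x_batch).mean()
```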
class: middle
.center.width-90[ ]
.center[Normalizing flows can fit complex multimodal discontinuous densities.]
.footnote[Image credits: Wehenkel and Louppe , 2019.]
class: middle
Conditional normalizing flows
Normalizing flows can also estimate densities $p(\mathbf{x} | c)$ conditioned on a context $c$ .
Transformations are made conditional by taking $c$ as an additional input. For example, in a coupling layer, the networks can be upgraded to $s(\mathbf{z}, c)$ and $t(\mathbf{z}, c)$ .
Optionally, the base distribution $p(\mathbf{z})$ can also be made conditional on $c$ .
(Accordingly, aleatoric uncertainty of some output $y$ conditioned on an input $\mathbf{x}$ can be modelled by a conditional normalizing flow $p(y|\mathbf{x})$ where the context $c$ is the input $\mathbf{x}$ .)
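
A sketch of the conditional variant, where the context $c$ is simply concatenated to the input of $s$ and $t$ (one common choice among several):

```python
import torch
import torch.nn as nn

class ConditionalAffineCoupling(nn.Module):
    # As the coupling layer above, but s and t also receive the context c.
    def __init__(self, dim_a, dim_b, dim_c, hidden=64):
        super().__init__()
        self.s = nn.Sequential(nn.Linear(dim_a + dim_c, hidden), nn.ReLU(), nn.Linear(hidden, dim_b))
        self.t = nn.Sequential(nn.Linear(dim_a + dim_c, hidden), nn.ReLU(), nn.Linear(hidden, dim_b))

    def forward(self, z_a, z_b, c):
        h = torch.cat([z_a, c], dim=-1)
        s, t = self.s(h), self.t(h)
        return z_a, z_b * torch.exp(s) + t, s.sum(dim=-1)
```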
class: middle
.center.width-100[ ]
.footnote[Image credits: Winkler et al , 2019.]
class: middle
Continuous-time normalizing flows
.grid[
.kol-1-2[
Replace the discrete sequence of transformations with a neural ODE with reversible dynamics such that
$$\begin{aligned}
&\mathbf{z}_0 \sim p(\mathbf{z})\\
&\frac{d\mathbf{z}(t)}{dt} = f(\mathbf{z}(t), t, \theta)\\
&\mathbf{x} = \mathbf{z}(1) = \mathbf{z}_0 + \int_0^1 f(\mathbf{z}(t), t, \theta)\, dt.
\end{aligned}$$
]
.kol-1-2.center[
]
]
The instantaneous change of variable yields
$$\log p(\mathbf{x}) = \log p(\mathbf{z}(0)) - \int_0^1 \text{Tr} \left( \frac{\partial f(\mathbf{z}(t), t, \theta)}{\partial \mathbf{z}(t)} \right) dt.$$
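
A rough sketch of evaluating $\log p(\mathbf{x})$ with explicit Euler steps and an exact Jacobian trace. This is for illustration only: practical implementations such as FFJORD (Grathwohl et al., 2018) use adaptive ODE solvers, the adjoint method, and stochastic trace estimators.

```python
import torch

def trace_jacobian(f_out, z):
    # Exact trace of df/dz via autograd (fine for small dimensions only).
    tr = 0.0
    for i in range(z.shape[1]):
        tr = tr + torch.autograd.grad(f_out[:, i].sum(), z, retain_graph=True)[0][:, i]
    return tr

def cnf_log_prob(f, x, base_log_prob, n_steps=100):
    # Integrate dz/dt = f(z, t) backwards from t=1 (x) to t=0 (z) with Euler steps,
    # accumulating the integral of the Jacobian trace along the trajectory.
    # Evaluation only: training would backprop through the solver or use the adjoint method.
    dt = 1.0 / n_steps
    z, trace_integral = x, torch.zeros(x.shape[0])
    for k in range(n_steps):
        t = 1.0 - k * dt
        z = z.detach().requires_grad_(True)
        f_out = f(z, t)
        trace_integral = trace_integral + trace_jacobian(f_out, z).detach() * dt
        z = z - f_out.detach() * dt
    return base_log_prob(z) - trace_integral
```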
.footnote[Image credits: Grathwohl et al , 2018.]
class: middle
class: middle
Epistemic uncertainty accounts for uncertainty in the model or in its parameters.
It captures our ignorance about which model can best explain the collected data. It .bold[can be explained away] given enough data.
.footnote[Credits: Kendall and Gal, What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? , 2017.]
???
Once we have decided on a model of the true data generating process, we face uncertainty in how much we can trust the model or its parameters.
To capture epistemic uncertainty in a neural network, we model our ignorance with a prior distribution $p(\mathbf{\omega})$ over its weights and estimate the posterior distribution $p(\mathbf{\omega}|\mathbf{d})$ given the training set $\mathbf{d}$ .
.center[
.width-60[ ] .circle.width-30[ ]
]
class: middle
The prior predictive distribution at $\mathbf{x}$ is given by integrating over all possible weight configurations,
$$p(y|\mathbf{x}) = \int p(y|\mathbf{x}, \mathbf{\omega}) p(\mathbf{\omega}) d\mathbf{\omega}.$$
Given training data $\mathbf{d}=\{(\mathbf{x}_1, y_1), ..., (\mathbf{x}_N, y_N)\}$, a Bayesian update results in the posterior
$$p(\mathbf{\omega}|\mathbf{d}) = \frac{p(\mathbf{d}|\mathbf{\omega})p(\mathbf{\omega})}{p(\mathbf{d})}$$
where the likelihood $p(\mathbf{d}|\omega) = \prod_i p(y_i | \mathbf{x}_i, \omega).$
The posterior predictive distribution is then given by
$$p(y|\mathbf{x},\mathbf{d}) = \int p(y|\mathbf{x}, \mathbf{\omega}) p(\mathbf{\omega}|\mathbf{d}) d\mathbf{\omega}.$$
class: middle
Bayesian neural networks are easy to formulate, but notoriously .bold[difficult] to perform inference in.
$p(\mathbf{d})$ is intractable to evaluate, which results in the posterior $p(\mathbf{\omega}|\mathbf{d})$ not being tractable either.
Therefore, we must rely on approximations.
Variational inference can be used for building an approximation $q(\mathbf{\omega};\nu)$ of the posterior $p(\mathbf{\omega}|\mathbf{d})$ .
We can show that minimizing
$$\text{KL}(q(\mathbf{\omega};\nu) || p(\mathbf{\omega}|\mathbf{d}))$$
with respect to the variational parameters $\nu$ , is identical to maximizing the evidence lower bound objective (ELBO)
$$\text{ELBO}(\nu) = \mathbb{E}_{q(\mathbf{\omega};\nu)} \left[\log p(\mathbf{d}| \mathbf{\omega})\right] - \text{KL}(q(\mathbf{\omega};\nu) || p(\mathbf{\omega})).$$
???
Do it on the blackboard.
class: middle
The integral in the ELBO is not tractable for almost all $q$ , but it can be maximized with stochastic gradient ascent:
Sample $\hat{\omega} \sim q(\mathbf{\omega};\nu)$ .
Do one step of maximization with respect to $\nu$ on
$$\hat{L}(\nu) = \log p(\mathbf{d}|\hat{\omega}) - \log\frac{q(\hat{\omega};\nu)}{p(\hat{\omega})} $$
In the context of Bayesian neural networks, this procedure is also known as Bayes by backprop (Blundell et al, 2015).
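
A minimal sketch in the spirit of Bayes by backprop, for a single Bayesian linear layer with a factorized Gaussian posterior and a standard normal prior (toy regression with an assumed unit observation noise; all names are ours):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal

class BayesianLinear(nn.Module):
    # Factorized Gaussian posterior q(w; nu) = N(mu, sigma^2) per weight, prior p(w) = N(0, 1).
    def __init__(self, in_features, out_features):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(out_features, in_features))
        self.rho = nn.Parameter(torch.full((out_features, in_features), -3.0))  # sigma = softplus(rho)

    def forward(self, x):
        sigma = F.softplus(self.rho)
        w = self.mu + sigma * torch.randn_like(sigma)      # reparametrized sample w ~ q(w; nu)
        # Single-sample estimate of log q(w; nu) - log p(w), as in the objective above.
        self.kl = (Normal(self.mu, sigma).log_prob(w) - Normal(0.0, 1.0).log_prob(w)).sum()
        return x @ w.t()

# One stochastic step on the negative ELBO:
layer = BayesianLinear(5, 1)
opt = torch.optim.Adam(layer.parameters(), lr=1e-2)
x, y = torch.randn(32, 5), torch.randn(32, 1)
opt.zero_grad()
nll = 0.5 * (y - layer(x)).pow(2).sum()                    # -log p(d | w) up to a constant
(nll + layer.kl).backward()
opt.step()
```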
Dropout is an empirical technique that was first proposed to avoid overfitting in neural networks.
At each training step :
Remove each node in the network with a probability $p$ .
Update the weights of the remaining nodes with backpropagation.
.center.width-70[ ]
???
Remind the students we used Dropout in Lec 8 when implementing a Transformer.
class: middle
At test time , either:
Make predictions using the trained network without dropout but rescaling the weights by the dropout probability $p$ (fast and standard).
Sample $T$ neural networks using dropout and average their predictions (slower but better principled).
class: middle, center
.width-100[ ]
class: middle
It makes the learned weights of a node less sensitive to the weights of the other nodes.
This forces the network to learn several independent representations of the patterns and thus decreases overfitting.
It approximates Bayesian model averaging .
class: middle
Dropout does variational inference
What variational family $q$ would correspond to dropout?
Let us split the weights $\omega$ per layer,
$\omega = \{ \mathbf{W}_1, ..., \mathbf{W}_L \},$
where $\mathbf{W}_i$ is further split per unit
$\mathbf{W}_i = \{ \mathbf{w}_{i,1}, ..., \mathbf{w}_{i,q_i} \}.$
Variational parameters $\nu$ are split similarly into $\nu = \{ \mathbf{M}_1, ..., \mathbf{M}_L \}$ , with $\mathbf{M}_i = \{ \mathbf{m}_{i,1}, ..., \mathbf{m}_{i,q_i} \}$ .
Then, the proposed $q(\omega;\nu)$ is defined as follows:
$$
\begin{aligned}
q(\omega;\nu) &= \prod_{i=1}^L q(\mathbf{W}_i; \mathbf{M}_i) \\
q(\mathbf{W}_i; \mathbf{M}_i) &= \prod_{k=1}^{q_i} q(\mathbf{w}_{i,k}; \mathbf{m}_{i,k}) \\
q(\mathbf{w}_{i,k}; \mathbf{m}_{i,k}) &= p\delta_0(\mathbf{w}_{i,k}) + (1-p)\delta_{\mathbf{m}_{i,k}}(\mathbf{w}_{i,k})
\end{aligned}
$$
where $\delta_a(x)$ denotes a (multivariate) Dirac distribution centered at $a$ .
???
Note that this assumes the parameterization $\mathbf{h} = \mathbf{W}\mathbf{x}$ , without the transpose on $\mathbf{W}$ .
class: middle
Given the previous definition for $q$ , sampling parameters $\hat{\omega} = \{ \hat{\mathbf{W}}_1, ..., \hat{\mathbf{W}}_L \}$ is done as follows:
Draw binary $z_{i,k} \sim \text{Bernoulli}(1-p)$ for each layer $i$ and unit $k$ .
Compute $\hat{\mathbf{W}}_i = \mathbf{M}_i \text{diag}([z_{i,k}]_{k=1}^{q_{i-1}})$ ,
where $\mathbf{M}_i$ denotes a matrix composed of the columns $\mathbf{m}_{i,k}$ .
.grid[
.kol-3-5[
That is, $\hat{\mathbf{W}}_i$ are obtained by setting columns of $\mathbf{M}_i$ to zero with probability $p$ .
This is strictly equivalent to dropout , i.e. removing units from the network with probability $p$ .
]
.kol-2-5[.center.width-100[ ]]
]
class: middle
Therefore, one step of stochastic gradient descent on the ELBO becomes:
Sample $\hat{\omega} \sim q(\mathbf{\omega};\nu)$ $\Leftrightarrow$ Randomly set units of the network to zero $\Leftrightarrow$ Dropout.
Do one step of maximization with respect to $\nu = \{ \mathbf{M}_i \}$ on
$$\hat{L}(\nu) = \log p(\mathbf{d}|\hat{\omega}) - \text{KL}(q(\mathbf{\omega};\nu) || p(\mathbf{\omega})).$$
class: middle
Maximizing $\hat{L}(\nu)$ is equivalent to minimizing
$$-\hat{L}(\nu) = -\log p(\mathbf{d}|\hat{\omega}) + \text{KL}(q(\mathbf{\omega};\nu) || p(\mathbf{\omega})) $$
This is also equivalent to one minimization step of a standard classification or regression objective:
The first term is the typical objective (such as the cross-entropy).
The second term forces $q$ to remain close to the prior $p(\omega)$ .
If $p(\omega)$ is Gaussian, minimizing the $\text{KL}$ is equivalent to $\ell_2$ regularization.
If $p(\omega)$ is Laplacian, minimizing the $\text{KL}$ is equivalent to $\ell_1$ regularization.
class: middle
Conversely, this shows that when training a network with dropout with a standard classification or regression objective, one is actually implicitly doing variational inference to match the posterior distribution of the weights.
class: middle
Uncertainty estimates from dropout
Proper uncertainty estimates at $\mathbf{x}$ , accounting for both the aleatoric and epistemic uncertainties, can be obtained in a principled way using Monte-Carlo integration:
Draw $T$ sets of network parameters $\hat{\omega}_t$ from $q(\omega;\nu)$ .
Compute the predictions for the $T$ networks, $\{ f(\mathbf{x};\hat{\omega}_t) \}_{t=1}^T$ .
Approximate the predictive mean and variance as
$$
\begin{aligned}
\mathbb{E}_{p(y|\mathbf{x},\mathbf{d})}\left[y\right] &\approx \frac{1}{T} \sum_{t=1}^T f(\mathbf{x};\hat{\omega}_t) \\
\mathbb{V}_{p(y|\mathbf{x},\mathbf{d})}\left[y\right] &\approx \sigma^2 + \frac{1}{T} \sum_{t=1}^T f(\mathbf{x};\hat{\omega}_t)^2 - \hat{\mathbb{E}}\left[y\right]^2,
\end{aligned}
$$
where $\sigma^2$ is the assumed level of noise in the observational model.
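
A sketch of these Monte-Carlo estimates with dropout kept active at test time (`noise_var` plays the role of $\sigma^2$):

```python
import torch

@torch.no_grad()
def mc_dropout_predict(model, x, T=100, noise_var=0.0):
    # Sample T stochastic forward passes by keeping dropout active at test time.
    # (Switching the whole model to train mode is a shortcut: it also toggles layers
    # such as batch norm, so a careful implementation re-enables only the dropout modules.)
    model.train()
    preds = torch.stack([model(x) for _ in range(T)])      # shape (T, N, ...)
    mean = preds.mean(dim=0)
    var = noise_var + preds.pow(2).mean(dim=0) - mean.pow(2)
    return mean, var
```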
class: middle, center
.center.width-80[ ]
(demo )
class: middle
Pixel-wise depth regression
.center.width-80[ ]
.footnote[Credits: Kendall and Gal, What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? , 2017.]
exclude: true
Bayesian Infinite Networks
Consider the 1-layer MLP with a hidden layer of size $q$ and a bounded activation function $\sigma$ :
$$\begin{aligned}
f(x) &= b + \sum_{j=1}^q v_j h_j(x)\\
h_j(x) &= \sigma\left(a_j + \sum_{i=1}^p u_{i,j}x_i\right)
\end{aligned}$$
Assume Gaussian priors $v_j \sim \mathcal{N}(0, \sigma_v^2)$ , $b \sim \mathcal{N}(0, \sigma_b^2)$ , $u_{i,j} \sim \mathcal{N}(0, \sigma_u^2)$ and $a_j \sim \mathcal{N}(0, \sigma_a^2)$ .
exclude: true
class: middle
For a fixed value $x^{(1)}$ , let us consider the prior distribution of $f(x^{(1)})$ implied by
the prior distributions for the weights and biases.
We have
$$\mathbb{E}[v_j h_j(x^{(1)})] = \mathbb{E}[v_j] \mathbb{E}[h_j(x^{(1)})] = 0,$$
since $v_j$ and $h_j(x^{(1)})$ are statistically independent and $v_j$ has zero mean by hypothesis.
The variance of the contribution of each hidden unit $h_j$ is
$$\begin{aligned}
\mathbb{V}[v_j h_j(x^{(1)})] &= \mathbb{E}[(v_j h_j(x^{(1)}))^2] - \mathbb{E}[v_j h_j(x^{(1)})]^2 \\
&= \mathbb{E}[v_j^2] \mathbb{E}[h_j(x^{(1)})^2] \\
&= \sigma_v^2 \mathbb{E}[h_j(x^{(1)})^2],
\end{aligned}$$
which must be finite since $h_j$ is bounded by its activation function.
We define $V(x^{(1)}) = \mathbb{E}[h_j(x^{(1)})^2]$, which is the same for all $j$.
exclude: true
class: middle
By the Central Limit Theorem, as $q \to \infty$ , the total contribution
of the hidden units, $\sum_{j=1}^q v_j h_j(x^{(1)})$, to the value of $f(x^{(1)})$ becomes a Gaussian with variance $q \sigma_v^2 V(x^{(1)})$.
The bias $b$ is also Gaussian, of variance $\sigma_b^2$ , so for large $q$ , the prior
distribution of $f(x^{(1)})$ is a Gaussian of variance $\sigma_b^2 + q \sigma_v^2 V(x^{(1)})$.
exclude: true
class: middle
Accordingly, for $\sigma_v = \omega_v q^{-\frac{1}{2}}$, for some fixed $\omega_v$, the prior distribution of $f(x^{(1)})$ converges to a Gaussian of mean zero and variance $\sigma_b^2 + \omega_v^2 V(x^{(1)})$ as $q \to \infty$.
For two or more fixed values $x^{(1)}, x^{(2)}, ...$ , a similar argument shows that,
as $q \to \infty$ , the joint distribution of the outputs converges to a multivariate Gaussian
with means of zero and covariances of
$$\begin{aligned}
\mathbb{E}[f(x^{(1)})f(x^{(2)})] &= \sigma_b^2 + \sum_{j=1}^q \sigma_v^2 \mathbb{E}[h_j(x^{(1)}) h_j(x^{(2)})] \\
&= \sigma_b^2 + \omega_v^2 C(x^{(1)}, x^{(2)})
\end{aligned}$$
where $C(x^{(1)}, x^{(2)}) = \mathbb{E}[h_j(x^{(1)}) h_j(x^{(2)})]$ and is the same for all $j$ .
exclude: true
class: middle
This result states that for any set of fixed points $x^{(1)}, x^{(2)}, ...$ ,
the joint distribution of $f(x^{(1)}), f(x^{(2)}), ...$ is a multivariate
Gaussian.
In other words, the infinitely wide 1-layer MLP converges towards
a Gaussian process .
.center.width-80[ ]
.center[(Neal, 1995)]
class: end-slide, center
count: false
The end.