Benjamin Noah Beal
To understand mip-NeRF [2], one first needs to understand NeRF [1].
NeRF is a method for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. The algorithm uses a fully-connected deep neural network whose input is a single continuous 5D coordinate (spatial location $(x, y, z)$ and viewing direction $(\theta, \phi)$) and whose output is the volume density and the view-dependent emitted radiance at that location.
The method trains a neural network to parameterize the neural radiance field of a scene, which really means that the network learns to encode the scene's 3D structure and appearance within its weights. The scene can then be rendered from any arbitrary viewpoint.
The input of the NeRF model is the 5D coordinate as previously stated; however, these coordinates are generated by sampling points along camera rays, so in actuality the input of the NeRF model is effectively a batch of rays. A ray is defined as $\bold{r}(t) = \bold{o} + t\bold{d}$, where $\bold{o}$ is the ray's origin (the camera center) and $\bold{d}$ is its direction.
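As a concrete illustration (a minimal sketch, not the authors' code; shapes and names are illustrative), sampling points along a batch of rays is just a broadcasted evaluation of $\bold{r}(t) = \bold{o} + t\bold{d}$:

```python
import numpy as np

# Illustrative shapes: a batch of 1024 rays, 64 samples per ray.
origins = np.zeros((1024, 3))           # ray origins o, one per ray
directions = np.random.randn(1024, 3)   # ray directions d, one per ray
t_vals = np.linspace(2.0, 6.0, 64)      # sample distances t along each ray

# r(t) = o + t * d for every ray and every t -> shape [num_rays, num_samples, 3]
points = origins[:, None, :] + t_vals[None, :, None] * directions[:, None, :]
```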
The model does two iterations of sampling points along the rays, reasoning that an initial, coarse set of samples along each ray can be used to estimate where the visible scene content lies, so that a second, finer round of samples can be concentrated in those regions (the paper's hierarchical volume sampling).
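A minimal sketch of that second-pass (importance) resampling, assuming the coarse pass has produced per-interval compositing weights (function and variable names are illustrative, not taken from the released code):

```python
import numpy as np

def sample_pdf(bins, weights, n_samples, rng=np.random.default_rng(0)):
    """Draw new sample distances from the piecewise-constant PDF that the
    coarse pass's weights define over the intervals between `bins`
    (bins has one more entry than weights)."""
    pdf = weights / (weights.sum() + 1e-8)            # normalize coarse weights
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])     # CDF at the bin edges
    u = rng.uniform(size=n_samples)                   # uniform draws in [0, 1)
    idx = np.searchsorted(cdf, u, side="right") - 1   # bin each draw falls into
    idx = np.clip(idx, 0, len(weights) - 1)
    # Invert the CDF by interpolating linearly inside the chosen bin.
    span = np.maximum(cdf[idx + 1] - cdf[idx], 1e-8)
    frac = (u - cdf[idx]) / span
    return bins[idx] + frac * (bins[idx + 1] - bins[idx])
```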
The samples along the rays are encoded with a positional encoding, similar to the positional encoding used in Transformer models.
The positional encoding takes each dimension of the input coordinate and maps it into a higher-dimensional space using high-frequency functions. This is simply the concatenation of the sines and cosines of each dimension of the 3D position, evaluated at a set of exponentially growing frequencies.
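A minimal sketch of this encoding in Python (the number of frequencies $L$ is a hyperparameter; the NeRF paper uses $L = 10$ for positions and $L = 4$ for viewing directions):

```python
import numpy as np

def positional_encoding(x, num_freqs):
    """Concatenate sin(2^k * pi * x) and cos(2^k * pi * x) for k = 0..L-1,
    applied to every dimension of x."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi          # 2^0 pi, ..., 2^(L-1) pi
    scaled = x[..., None] * freqs                          # [..., 3, L]
    return np.concatenate([np.sin(scaled), np.cos(scaled)],
                          axis=-1).reshape(*x.shape[:-1], -1)
```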
Once encoded, the features are fed into an MLP that produces two distinct predictions: a volume density prediction $\sigma$ and a view-dependent RGB color prediction $\bold{c}$.
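A schematic sketch of that two-headed prediction (a simplification: the real network is eight 256-wide ReLU layers with a skip connection, as in Figure 2; the parameter names here are hypothetical):

```python
import numpy as np

def nerf_mlp(x_enc, d_enc, params):
    """Density depends only on the encoded position; the RGB color additionally
    depends on the encoded viewing direction."""
    h = x_enc
    for W, b in params["trunk"]:                          # fully-connected trunk
        h = np.maximum(W @ h + b, 0.0)                    # ReLU
    sigma = np.maximum(params["W_sigma"] @ h + params["b_sigma"], 0.0)  # density >= 0
    h = np.concatenate([h, d_enc])                        # inject viewing direction
    W, b = params["head"]
    h = np.maximum(W @ h + b, 0.0)
    rgb = 1.0 / (1.0 + np.exp(-(params["W_rgb"] @ h + params["b_rgb"])))  # sigmoid
    return sigma, rgb
```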
Finally, volume rendering techniques are used to composite these values into an image.
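The compositing step itself is a simple quadrature: each sample's density is turned into an opacity over its interval, and colors are alpha-composited front to back. A minimal sketch for a single ray:

```python
import numpy as np

def composite(sigmas, rgbs, t_vals):
    """Alpha-composite per-sample densities and colors along one ray into a
    single pixel color."""
    deltas = np.append(np.diff(t_vals), 1e10)              # distances between samples
    alphas = 1.0 - np.exp(-sigmas * deltas)                # opacity of each interval
    trans = np.cumprod(np.append(1.0, 1.0 - alphas[:-1]))  # accumulated transmittance T_i
    weights = alphas * trans                                # contribution of each sample
    return (weights[:, None] * rgbs).sum(axis=0)            # expected color of the ray
```

These `weights` are also what the hierarchical sampling step above reuses to place its second round of samples.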
Figure 2: NeRF Model Architecture
- Input vectors are shown in green
- Positional Encoding layers are represented by $\gamma(\bold{x})$
- Hidden vectors are shown in blue
- Output vectors are shown in red
- The final RGB output vector is produced following a sigmoid activation
mip-NeRF takes inspiration from the mipmapping approach used to prevent aliasing in computer-graphics rendering pipelines, and extends NeRF to simultaneously represent the prefiltered radiance field for a continuous space of scales. It accomplishes this by effectively shooting cones into the scene instead of rays and using Gaussians to approximate the conical frustums corresponding to each pixel. This improvement addresses a flaw in the NeRF model regarding aliasing and how it deals with scale.
Figure 3: mip-NeRF
The mip-NeRF pipeline is identical to that of the NeRF pipeline, with the following exceptions:
Where NeRF shoots a ray into the scene,
Figure 4: NeRF Ray
mip-NeRF shoots a cone into the scene and then slices the cone into conical frustums.
Figure 5: mip-NeRF anti-aliased conical frustums
Then a multivariate Gaussian is fit to approximate each conical frustum. This allows for an efficient approximation of the set of points that lie within the frustum, since working with the conical frustum directly would involve computing an integral that has no closed-form solution.
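The paper gives closed-form expressions for the frustum's mean distance along the ray and its variances along and perpendicular to the ray; given those three scalars, turning them into a world-space mean and covariance is straightforward. A sketch of that last step, with illustrative names (the per-frustum moment formulas themselves are in the mip-NeRF paper):

```python
import numpy as np

def lift_gaussian(origin, direction, t_mean, t_var, r_var):
    """Convert a Gaussian expressed in the ray's frame (mean distance t_mean,
    variance t_var along the ray, variance r_var perpendicular to it) into a
    mean and covariance in world coordinates."""
    mean = origin + t_mean * direction
    d_outer = np.outer(direction, direction)
    perp = np.eye(3) - d_outer / np.dot(direction, direction)  # projector orthogonal to d
    cov = t_var * d_outer + r_var * perp
    return mean, cov
```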
Figure 6: mip-NeRF Gaussian Approximation
Lastly, another major difference between the two is that mip-NeRF uses an integrated positional encoding (IPE) to encode a coordinate distributed according to the aforementioned Gaussian. This is a generalization of NeRF's positional encoding. To understand it, it is helpful to rewrite the PE as a Fourier feature:
$$ \bold{P} = \begin{bmatrix} 1 & 0 & 0 & 2 & 0 & 0 & \cdots & 2^{L-1} & 0 & 0 \\ 0 & 1 & 0 & 0 & 2 & 0 & \cdots & 0 & 2^{L-1} & 0 \\ 0 & 0 & 1 & 0 & 0 & 2 & \cdots & 0 & 0 & 2^{L-1} \end{bmatrix}^{T}, \quad \gamma(\bold{x}) = \begin{bmatrix} \sin(\bold{Px}) \\ \cos(\bold{Px}) \end{bmatrix} $$
This reparameterization allows us to derive a closed form for the IPE. Using the fact that the covariance of a linear transformation of a variable is a linear transformation of that variable's covariance, the lifted (encoded) Gaussian has mean and covariance:
$$ \bold{\mu}_{\gamma} = \bold{P\mu}, \quad \bold{\Sigma}_{\gamma} = \bold{P\Sigma P}^T $$
The final step in producing an IPE feature is computing the expectation over this lifted multivariate Gaussian, modulated by the sine and the cosine of position. These expectations have simple closed-form expressions:
$$ E_{x \sim \mathcal{N}(\mu, \sigma^2)}[\sin(x)] = \sin(\mu)\exp\left(-\tfrac{1}{2}\sigma^2\right), \quad E_{x \sim \mathcal{N}(\mu, \sigma^2)}[\cos(x)] = \cos(\mu)\exp\left(-\tfrac{1}{2}\sigma^2\right) $$
We see that this expected sine or cosine is simply the sine or cosine of the mean attenuated by a Gaussian function of the variance. With this we can compute our final IPE feature as the expected sines and cosines of the mean and the diagonal of the covariance matrix:
$$ \gamma(\bold{\mu}, \bold{\Sigma}) = E_{\bold{x} \sim \mathcal{N}(\bold{\mu}_{\gamma}, \bold{\Sigma}_{\gamma})}[\gamma(\bold{x})] = \begin{bmatrix} \sin(\bold{\mu}_{\gamma}) \circ \exp\left(-\frac{1}{2}\,\text{diag}(\bold{\Sigma}_{\gamma})\right) \\ \cos(\bold{\mu}_{\gamma}) \circ \exp\left(-\frac{1}{2}\,\text{diag}(\bold{\Sigma}_{\gamma})\right) \end{bmatrix} $$
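In code the IPE only ever needs the diagonal of $\bold{\Sigma}_{\gamma}$, so the full lifted covariance never has to be formed. A minimal sketch, assuming the simple frequency basis $\bold{P}$ with entries $2^0, \ldots, 2^{L-1}$ shown above:

```python
import numpy as np

def integrated_pos_enc(mean, cov, num_freqs):
    """Expected sines and cosines of the lifted Gaussian N(P mean, P cov P^T),
    using only the diagonal of the lifted covariance."""
    scales = 2.0 ** np.arange(num_freqs)                               # 2^0 ... 2^(L-1)
    mu = (mean[None, :] * scales[:, None]).reshape(-1)                 # P @ mean
    var = (np.diag(cov)[None, :] * scales[:, None] ** 2).reshape(-1)   # diag(P cov P^T)
    damping = np.exp(-0.5 * var)                                       # variance attenuation
    return np.concatenate([np.sin(mu) * damping, np.cos(mu) * damping])
```

Compared with NeRF's positional encoding, the extra `exp(-0.5 * var)` factor smoothly zeros out frequencies whose period is small relative to the frustum, which is what gives mip-NeRF its anti-aliasing behavior.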
The architecture of mip-NeRF largely follows that of NeRF.
Figure 7: mip-NeRF Model Architecture
- The final volume-density output vector is produced following a softplus activation
- The original NeRF code actually uses two fully-connected deep neural networks, one referred to as the "coarse"-grained network and the other as the "fine"-grained network.
- The coarse-grained network is used during the first iteration of sampling points along the rays
- The fine-grained network is used for all subsequent samples
- mip-NeRF uses a single network, which more accurately follows the process described in this summary
Many of the equations, derivations, and explanations are taken directly from the papers/project pages referenced below.
[1] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, NeRF Project Page
[2] Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields, mip-NeRF Project Page