Benjamin Noah Beal
To understand mip-NeRF [2], one first needs to understand NeRF [1].
NeRF is a method for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. The algorithm uses a fully-connected deep neural network whose input is a single continuous 5D coordinate (spatial location $(x, y, z)$ and viewing direction $(\theta, \phi)$) and whose output is the volume density and the view-dependent emitted radiance at that location.
The method trains a neural network to parameterize the neural radiance field of a scene, which really means that the network learns to encode the scene's 3D structure and appearance within its weights. The scene can then be rendered from any arbitrary viewpoint.
The input of the NeRF model is the 5D coordinate as previously stated; however, these coordinates are generated by sampling points along camera rays, so in actuality the input of the NeRF model is effectively a batch of rays. A ray is defined as $\bold{r}(t) = \bold{o} + t\bold{d}$, where $\bold{o}$ is the ray's origin (the camera center) and $\bold{d}$ is its direction.
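As a concrete illustration (a minimal sketch, not the authors' code; shapes and names are illustrative), sampling points along a batch of rays is just a broadcasted evaluation of $\bold{r}(t) = \bold{o} + t\bold{d}$:

```python
import numpy as np

# Illustrative shapes: a batch of 1024 rays, 64 samples per ray.
origins = np.zeros((1024, 3))           # ray origins o, one per ray
directions = np.random.randn(1024, 3)   # ray directions d, one per ray
t_vals = np.linspace(2.0, 6.0, 64)      # sample distances t along each ray

# r(t) = o + t * d for every ray and every t -> shape [num_rays, num_samples, 3]
points = origins[:, None, :] + t_vals[None, :, None] * directions[:, None, :]
```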
The model does two iterations of sampling points along the rays, reasoning that an initial, coarse set of samples along each ray can be used to estimate where the visible scene content lies, so that a second, finer round of samples can be concentrated in those regions (the paper's hierarchical volume sampling).
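A minimal sketch of that second-pass (importance) resampling, assuming the coarse pass has produced per-interval compositing weights (function and variable names are illustrative, not taken from the released code):

```python
import numpy as np

def sample_pdf(bins, weights, n_samples, rng=np.random.default_rng(0)):
    """Draw new sample distances from the piecewise-constant PDF that the
    coarse pass's weights define over the intervals between `bins`
    (bins has one more entry than weights)."""
    pdf = weights / (weights.sum() + 1e-8)            # normalize coarse weights
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])     # CDF at the bin edges
    u = rng.uniform(size=n_samples)                   # uniform draws in [0, 1)
    idx = np.searchsorted(cdf, u, side="right") - 1   # bin each draw falls into
    idx = np.clip(idx, 0, len(weights) - 1)
    # Invert the CDF by interpolating linearly inside the chosen bin.
    span = np.maximum(cdf[idx + 1] - cdf[idx], 1e-8)
    frac = (u - cdf[idx]) / span
    return bins[idx] + frac * (bins[idx + 1] - bins[idx])
```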
The samples along the rays are encoded with a positional encoding, similar to the positional encoding used in Transformer models.
The positional encoding takes each dimension of the input coordinate and maps it into a higher-dimensional space using high-frequency functions. This is simply the concatenation of the sines and cosines of each dimension of the 3D position, evaluated at a set of exponentially growing frequencies.
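A minimal sketch of this encoding in Python (the number of frequencies $L$ is a hyperparameter; the NeRF paper uses $L = 10$ for positions and $L = 4$ for viewing directions):

```python
import numpy as np

def positional_encoding(x, num_freqs):
    """Concatenate sin(2^k * pi * x) and cos(2^k * pi * x) for k = 0..L-1,
    applied to every dimension of x."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi          # 2^0 pi, ..., 2^(L-1) pi
    scaled = x[..., None] * freqs                          # [..., 3, L]
    return np.concatenate([np.sin(scaled), np.cos(scaled)],
                          axis=-1).reshape(*x.shape[:-1], -1)
```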
Once encoded, the features are fed into an MLP that produces two distinct predictions: a volume density prediction $\sigma$ and a view-dependent RGB color prediction $\bold{c}$.
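A schematic sketch of that two-headed prediction (a simplification: the real network is eight 256-wide ReLU layers with a skip connection, as in Figure 2; the parameter names here are hypothetical):

```python
import numpy as np

def nerf_mlp(x_enc, d_enc, params):
    """Density depends only on the encoded position; the RGB color additionally
    depends on the encoded viewing direction."""
    h = x_enc
    for W, b in params["trunk"]:                          # fully-connected trunk
        h = np.maximum(W @ h + b, 0.0)                    # ReLU
    sigma = np.maximum(params["W_sigma"] @ h + params["b_sigma"], 0.0)  # density >= 0
    h = np.concatenate([h, d_enc])                        # inject viewing direction
    W, b = params["head"]
    h = np.maximum(W @ h + b, 0.0)
    rgb = 1.0 / (1.0 + np.exp(-(params["W_rgb"] @ h + params["b_rgb"])))  # sigmoid
    return sigma, rgb
```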
Finally, volume rendering techniques are used to composite these values into an image.
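The compositing step itself is a simple quadrature: each sample's density is turned into an opacity over its interval, and colors are alpha-composited front to back. A minimal sketch for a single ray:

```python
import numpy as np

def composite(sigmas, rgbs, t_vals):
    """Alpha-composite per-sample densities and colors along one ray into a
    single pixel color."""
    deltas = np.append(np.diff(t_vals), 1e10)              # distances between samples
    alphas = 1.0 - np.exp(-sigmas * deltas)                # opacity of each interval
    trans = np.cumprod(np.append(1.0, 1.0 - alphas[:-1]))  # accumulated transmittance T_i
    weights = alphas * trans                                # contribution of each sample
    return (weights[:, None] * rgbs).sum(axis=0)            # expected color of the ray
```

These `weights` are also what the hierarchical sampling step above reuses to place its second round of samples.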
Figure 2: NeRF Model Architecture
- Input vectors are shown in green
- Positional Encoding layers are represented by $\gamma(\bold{x})$
- Hidden vectors are shown in blue
- Output vectors are shown in red
- The final RGB output vector is produced following a sigmoid activation
mip-NeRF takes inspiration from the mipmapping approach used to prevent aliasing in computer-graphics rendering pipelines, and extends NeRF to simultaneously represent the prefiltered radiance field for a continuous space of scales. It accomplishes this by effectively shooting cones into the scene instead of rays and using Gaussians to approximate the conical frustums corresponding to each pixel. This improvement addresses a flaw in the NeRF model regarding aliasing and how it deals with scale.
Figure 3: mip-NeRF
The mip-NeRF pipeline is identical to that of the NeRF pipeline, with the following exceptions:
Where NeRF shoots a ray into the scene,
Figure 4: NeRF Ray
mip-NeRF shoots a cone into the scene and then slices the cone into conical frustums.
Figure 5: mip-NeRF anti-aliased conical frustums
Then a multivariate Gaussian is fit to approximate each conical frustum. This allows for an efficient approximation of the set of points that lie within the frustum, since working with the conical frustum directly would involve computing an integral that has no closed-form solution.
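The paper gives closed-form expressions for the frustum's mean distance along the ray and its variances along and perpendicular to the ray; given those three scalars, turning them into a world-space mean and covariance is straightforward. A sketch of that last step, with illustrative names (the per-frustum moment formulas themselves are in the mip-NeRF paper):

```python
import numpy as np

def lift_gaussian(origin, direction, t_mean, t_var, r_var):
    """Convert a Gaussian expressed in the ray's frame (mean distance t_mean,
    variance t_var along the ray, variance r_var perpendicular to it) into a
    mean and covariance in world coordinates."""
    mean = origin + t_mean * direction
    d_outer = np.outer(direction, direction)
    perp = np.eye(3) - d_outer / np.dot(direction, direction)  # projector orthogonal to d
    cov = t_var * d_outer + r_var * perp
    return mean, cov
```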
Figure 6: mip-NeRF Gaussian Approximation
Lastly, another major difference between the two is that mip-NeRF uses an integrated positional encoding (IPE) to encode a coordinate distributed according to the aforementioned Gaussian. This is a generalization of NeRF's positional encoding. To understand it, it is helpful to rewrite the PE as a Fourier feature:
$$ \bold{P} = \begin{bmatrix} 1 & 0 & 0 & 2 & 0 & 0 & \cdots & 2^{L-1} & 0 & 0 \\ 0 & 1 & 0 & 0 & 2 & 0 & \cdots & 0 & 2^{L-1} & 0 \\ 0 & 0 & 1 & 0 & 0 & 2 & \cdots & 0 & 0 & 2^{L-1} \end{bmatrix}^{T}, \quad \gamma(\bold{x}) = \begin{bmatrix} \sin(\bold{Px}) \\ \cos(\bold{Px}) \end{bmatrix} $$
This reparameterization allows us to derive a closed form for the IPE. Using the fact that the covariance of a linear transformation of a variable is a linear transformation of that variable's covariance, the lifted (encoded) Gaussian has mean and covariance:
$$ \bold{\mu}_{\gamma} = \bold{P\mu}, \quad \bold{\Sigma}_{\gamma} = \bold{P\Sigma P}^T $$
The final step in producing an IPE feature is computing the expectation over this lifted multivariate Gaussian, modulated by the sine and the cosine of position. These expectations have simple closed-form expressions:
$$ E_{x \sim \mathcal{N}(\mu, \sigma^2)}[\sin(x)] = \sin(\mu)\exp\left(-\tfrac{1}{2}\sigma^2\right), \quad E_{x \sim \mathcal{N}(\mu, \sigma^2)}[\cos(x)] = \cos(\mu)\exp\left(-\tfrac{1}{2}\sigma^2\right) $$
We see that this expected sine or cosine is simply the sine or cosine of the mean attenuated by a Gaussian function of the variance. With this we can compute our final IPE feature as the expected sines and cosines of the mean and the diagonal of the covariance matrix:
$$ \gamma(\bold{\mu}, \bold{\Sigma}) = E_{\bold{x} \sim \mathcal{N}(\bold{\mu}_{\gamma}, \bold{\Sigma}_{\gamma})}[\gamma(\bold{x})] = \begin{bmatrix} \sin(\bold{\mu}_{\gamma}) \circ \exp\left(-\frac{1}{2}\,\text{diag}(\bold{\Sigma}_{\gamma})\right) \\ \cos(\bold{\mu}_{\gamma}) \circ \exp\left(-\frac{1}{2}\,\text{diag}(\bold{\Sigma}_{\gamma})\right) \end{bmatrix} $$
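In code the IPE only ever needs the diagonal of $\bold{\Sigma}_{\gamma}$, so the full lifted covariance never has to be formed. A minimal sketch, assuming the simple frequency basis $\bold{P}$ with entries $2^0, \ldots, 2^{L-1}$ shown above:

```python
import numpy as np

def integrated_pos_enc(mean, cov, num_freqs):
    """Expected sines and cosines of the lifted Gaussian N(P mean, P cov P^T),
    using only the diagonal of the lifted covariance."""
    scales = 2.0 ** np.arange(num_freqs)                               # 2^0 ... 2^(L-1)
    mu = (mean[None, :] * scales[:, None]).reshape(-1)                 # P @ mean
    var = (np.diag(cov)[None, :] * scales[:, None] ** 2).reshape(-1)   # diag(P cov P^T)
    damping = np.exp(-0.5 * var)                                       # variance attenuation
    return np.concatenate([np.sin(mu) * damping, np.cos(mu) * damping])
```

Compared with NeRF's positional encoding, the extra `exp(-0.5 * var)` factor smoothly zeros out frequencies whose period is small relative to the frustum, which is what gives mip-NeRF its anti-aliasing behavior.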
The architecture of mip-NeRF largely follows that of NeRF.
Figure 7: mip-NeRF Model Architecture
- The final volume-density output vector is produced following a softplus activation
- The original NeRF code actually uses two fully-connected deep neural networks, one referred to as the "coarse"-grained network and the other as the "fine"-grained network.
- The coarse-grained network is used during the first iteration of sampling points along the rays
- The fine-grained network is used for all subsequent samples
- mip-NeRF uses a single network, which more accurately follows the process described in this summary
Many of the equations, derivations, and explanations are taken directly from the papers/project pages referenced below.
[1] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, NeRF Project Page
[2] Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields, mip-NeRF Project Page