
3D Reconstruction from Single Image Using Occupancy Network with Vision Transformer Architecture

The project aims to improve the performance of the Occupancy Network by incorporating the Vision Transformer (ViT) architecture into the model.

Please view our slides here:
Please view our report here:

This project was developed for the course Machine Learning for 3D Geometry (IN2392) at TU Munich.

1. How to run this project?

Please refer to the Jupyter notebook for training, mesh generation, and evaluation:

  • ViTOcNet.ipynb

2. Dataset

Download the preprocessed ShapeNet dataset and make sure enough disk space is available (roughly 73.4 GB or more).
A GPU environment is preferred. The dataset itself is not published here due to its size, but a download script is included in the Jupyter notebook mentioned above.

1. Occupancy Networks: Learning 3D Reconstruction in Function Space

Approach

Instead of reconstructing a 3D shape in the form of a (discrete) voxel grid, point cloud, or mesh from the input data, occupancy networks return a function that predicts an occupancy probability for a continuous 3D point.

The trick is the following: a function that takes an observation x as input and returns a function mapping a 3D point p to the occupancy probability can be replaced by a function that takes a pair (p, x) and returns the occupancy probability.
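In the notation of the Occupancy Networks paper, writing o_x for the occupancy function of the object observed in x, this reparameterization reads:

$$
o_x \colon \mathbb{R}^3 \to [0,1]
\quad\Longrightarrow\quad
f_\theta \colon \mathbb{R}^3 \times \mathcal{X} \to [0,1],
\qquad f_\theta(p, x) \approx o_x(p),
$$

where x ∈ X is the observation (e.g. an RGB image) and p ∈ R^3 is a query point.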

Limitations of existing 3D data representations

  • Voxel grids: memory-heavy, limited in practice to roughly 128^3–256^3 resolution
  • Point clouds: need post-processing to generate a (mesh) surface
  • Meshes: existing approaches often require additional regularization, can only generate meshes with simple topology, need a reference template of the same object class, or cannot guarantee closed surfaces

Pipeline

  • Training: the occupancy network fθ(p, x) takes as input the task-specific observation (for example, an RGB image for single-image 3D reconstruction) and a batch of 3D points randomly sampled from the ground-truth 3D representation. For each 3D point the network predicts an occupancy probability, which is compared with the ground truth to compute the mini-batch loss (see the training-step sketch after this list).

  • Inference: to extract a 3D mesh from the learned fθ(p, x), the paper uses the Multiresolution IsoSurface Extraction (MISE) algorithm, which first builds an octree by progressively refining the regions where neighboring occupancy predictions disagree. The Marching Cubes algorithm is then applied to the resulting high-resolution grid to extract the mesh surface.
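A minimal PyTorch sketch of one training step, assuming a model with signature model(points, images) that returns one occupancy logit per point (names and shapes are illustrative, not the repo's exact interface):

```python
import torch.nn.functional as F

def training_step(model, images, points, occ_gt, optimizer):
    # images: (B, 3, H, W) input observations
    # points: (B, T, 3) 3D points sampled from the ground-truth shape
    # occ_gt: (B, T) ground-truth occupancies in {0, 1}, as floats
    logits = model(points, images)                              # (B, T) occupancy logits
    loss = F.binary_cross_entropy_with_logits(logits, occ_gt)   # mini-batch loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```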

Occupancy Network Architecture

  • The network architecture is generally the same across different tasks (e.g. single image 3D reconstruction or point cloud completion) with the task-specific encoder being the only changing element.

  • Once the task encoder has produced the embedding c, it is passed as input along with the batch of T sampled 3D points, which are processed by 5 sequential ResNet blocks. To condition the network output on the input embedding c, Conditional Batch Normalization is used (see the sketch after this list).

  • For single image 3D reconstruction, the original network uses a ResNet-18 encoder with an altered final layer to produce 256-dimensional embeddings c.
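A sketch of how Conditional Batch Normalization can condition the point features on the embedding c, assuming point features of shape (B, F, T); the layer sizes are illustrative and may differ from the paper's implementation:

```python
import torch
import torch.nn as nn

class CondBatchNorm1d(nn.Module):
    """BatchNorm whose scale and shift are predicted from the embedding c,
    so every point feature is conditioned on the input observation."""
    def __init__(self, c_dim=256, f_dim=256):
        super().__init__()
        self.bn = nn.BatchNorm1d(f_dim, affine=False)  # normalize without its own affine params
        self.gamma = nn.Linear(c_dim, f_dim)           # per-channel scale, predicted from c
        self.beta = nn.Linear(c_dim, f_dim)            # per-channel shift, predicted from c

    def forward(self, feats, c):                       # feats: (B, F, T), c: (B, C)
        gamma = self.gamma(c).unsqueeze(2)             # (B, F, 1)
        beta = self.beta(c).unsqueeze(2)               # (B, F, 1)
        return gamma * self.bn(feats) + beta
```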

2. ViT: Vision Transformers

Approach

Vision Transformer (ViT) is a state-of-the-art image recognition model using the Transformer architecture. It treats the input image as a sequence of flattened patches and processes them using Transformer's self-attention mechanism.

Network Architecture

ViT has two major components:

  • A Patch Embedding Module
  • A Transformer Encoder

  • ViT takes an input image and divides it into fixed-size patches, which are then linearly embedded to obtain a sequence of embeddings (see the sketch after this list). Positional encodings are added to these embeddings to provide spatial information.
  • The sequence of embeddings is then fed into the Transformer encoder, which processes them using multi-head self-attention layers.
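A minimal sketch of the patch-embedding step, implemented as a strided convolution (patch size, image size, and embedding dimension are assumptions, not the repo's exact values):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and linearly embed each patch.
    A strided convolution is equivalent to flattening each patch and applying
    a shared linear projection."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=256):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, images):                  # (B, 3, H, W)
        x = self.proj(images)                   # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)     # (B, N, D) sequence of patch embeddings

# the positional encodings (next subsection) are added to this sequence
# before it enters the Transformer encoder
```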

(Figure: dividing the image into patches and adding positional embeddings.)

2D Positional Encoding

  • ViT uses a fixed positional encoding similar to the one used in the original Transformer.
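A sketch of a fixed 2D sine-cosine positional encoding in that spirit: half of the channels encode the patch row, half the patch column, each using the sinusoidal formula from the original Transformer (the repo's exact scheme may differ):

```python
import torch

def sincos_pos_encoding_2d(h, w, dim):
    """Fixed 2D positional encoding for an h x w grid of patches, shape (h*w, dim)."""
    assert dim % 4 == 0, "dim must be divisible by 4"

    def encode_1d(pos, d):                       # standard 1D sinusoidal encoding
        omega = 1.0 / 10000 ** (torch.arange(d // 2, dtype=torch.float32) / (d // 2))
        angles = pos[:, None].float() * omega[None, :]               # (n, d/2)
        return torch.cat([torch.sin(angles), torch.cos(angles)], 1)  # (n, d)

    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pe = torch.cat([encode_1d(ys.reshape(-1), dim // 2),             # row part
                    encode_1d(xs.reshape(-1), dim // 2)], 1)         # column part
    return pe                                                        # (h*w, dim)

# example: a 14x14 patch grid with 256-dimensional embeddings
pos = sincos_pos_encoding_2d(14, 14, 256)        # -> (196, 256)
```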

Transformers

  • The Transformer encoder enriches the embeddings through multi-head self-attention layers. One notable difference from the original Transformer encoder is that ViT adds the positional encodings to the input of each multi-head attention layer (a sketch of such a block follows this list).
  • ViT does not have a decoder as it is primarily used for image classification tasks. However, the output embeddings can be used for various downstream tasks.
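Following the description above, a single encoder block could look like the sketch below, with the positional encodings re-added to the attention input at every layer (the dimensions and pre-norm layout are assumptions):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer encoder block: multi-head self-attention plus an MLP,
    each with a residual connection and layer normalization."""
    def __init__(self, dim=256, num_heads=8, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)), nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x, pos):                   # x, pos: (B, N, D)
        h = self.norm1(x + pos)                  # re-inject positional information
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out                         # residual connection around attention
        x = x + self.mlp(self.norm2(x))          # residual connection around MLP
        return x
```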

3. Proposed Architecture (ViTOcNet)

The proposed network for 3D reconstruction from a single image makes the following changes to the Occupancy Network:

  • The ResNet backbone used for feature extraction in the Occupancy Network is replaced with a Vision Transformer.
  • Conditional Batch Normalization is used to fuse the sampled 3D points with the image embedding.
  • A standard CNN is employed instead of the fully-connected ResNet blocks to generate the occupancy probabilities.

By incorporating the Vision Transformer into the Occupancy Network, we aimed to leverage the powerful capabilities of self-attention and improve the performance of 3D reconstruction tasks.
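A compact end-to-end sketch of this combination, with a plain Transformer encoder standing in for the ViT backbone and a simple concatenation MLP standing in for the CBN-based decoder described above (all names and sizes are illustrative assumptions, not the repo's implementation):

```python
import torch
import torch.nn as nn

class ViTOcNetSketch(nn.Module):
    def __init__(self, embed_dim=256, patch_size=16, img_size=224, depth=4):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # ViT-style image encoder: patch embedding + Transformer encoder
        self.patch_embed = nn.Conv2d(3, embed_dim, patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # point decoder: maps (point feature, image embedding) to an occupancy logit
        self.point_fc = nn.Linear(3, embed_dim)
        self.decoder = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, 1),
        )

    def forward(self, points, images):            # points: (B, T, 3), images: (B, 3, H, W)
        x = self.patch_embed(images).flatten(2).transpose(1, 2) + self.pos_embed
        c = self.encoder(x).mean(dim=1)            # (B, D) 256-dim image embedding
        p = self.point_fc(points)                  # (B, T, D) per-point features
        c = c.unsqueeze(1).expand_as(p)            # broadcast the embedding to every point
        return self.decoder(torch.cat([p, c], dim=-1)).squeeze(-1)   # (B, T) occupancy logits
```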

Results: the original network turned out to be the more powerful model for 3D reconstruction. However, due to limited hardware and time resources, the full potential of our model could not be evaluated.
