# Large Spatial Model: End-to-end Unposed Images to Semantic 3D
Demo videos: `output_fmap_video.mp4`, `output_images_video.mp4`.
## Installation

- Download the repo:

```bash
git clone --recurse-submodules https://github.com/NVlabs/LSM.git
```
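If the repo was cloned without `--recurse-submodules`, the submodules can still be fetched afterwards (standard git, nothing repo-specific):

```bash
git submodule update --init --recursive
```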
- Create and activate the conda environment:

```bash
conda create -n lsm python=3.10
conda activate lsm
```
- Install PyTorch and related packages:

```bash
conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia -y
conda install pytorch-cluster pytorch-scatter pytorch-sparse -c pyg -y
```
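Before building the CUDA extensions below, it is worth confirming that the GPU build of PyTorch is active (an optional sanity check):

```bash
# Should print 2.1.0 and True on a machine with a working CUDA 12.1 setup
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```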
- Install other Python dependencies:

```bash
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```
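Note that `pip` may compile flash-attn from source, which requires an `nvcc` matching your PyTorch CUDA version on the PATH; to confirm the install succeeded:

```bash
python -c "import flash_attn; print(flash_attn.__version__)"
```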
- Install PointTransformerV3:

```bash
cd submodules/PointTransformerV3/Pointcept/libs/pointops
python setup.py install
cd ../../../../..
```
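To verify the CUDA ops built correctly (the library is imported as `pointops`, following Pointcept's convention):

```bash
python -c "import pointops; print('pointops OK')"
```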
- Install the 3D Gaussian Splatting modules:

```bash
pip install submodules/3d_gaussian_splatting/diff-gaussian-rasterization
pip install submodules/3d_gaussian_splatting/simple-knn
```
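Both are CUDA extensions; a quick import check (module names as exposed by the upstream 3DGS codebase):

```bash
python -c "from diff_gaussian_rasterization import GaussianRasterizer; from simple_knn._C import distCUDA2; print('3DGS modules OK')"
```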
- Install OpenAI CLIP:

```bash
pip install git+https://github.com/openai/CLIP.git
```
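CLIP downloads its weights on first use; to confirm the package itself is importable:

```bash
python -c "import clip; print(clip.available_models())"
```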
- Build the croco model:

```bash
cd submodules/dust3r/croco/models/curope
python setup.py build_ext --inplace
cd ../../../../..
```
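`build_ext --inplace` leaves the compiled extension next to its sources, so a simple file check confirms the build:

```bash
ls submodules/dust3r/croco/models/curope/*.so
```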
- Download the pre-trained models. The following three model weights need to be downloaded:

```bash
# 1. Create directory for checkpoints
mkdir -p checkpoints/pretrained_models

# 2. DUSt3R model weights
wget -P checkpoints/pretrained_models https://download.europe.naverlabs.com/ComputerVision/DUSt3R/DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth

# 3. LSeg demo model weights
gdown 1FTuHY1xPUkM-5gaDtMfgCl3D0gR89WV7 -O checkpoints/pretrained_models/demo_e200.ckpt

# 4. LSM final checkpoint
gdown 1q57nbRJpPhrdf1m7XZTkBfUIskpgnbri -O checkpoints/pretrained_models/checkpoint-final.pth
```
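The `gdown` tool handles the Google Drive downloads (install it with `pip install gdown` if it is not already pulled in by `requirements.txt`). Afterwards the checkpoint directory should contain all three files:

```bash
ls -lh checkpoints/pretrained_models
# Expect: DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth, demo_e200.ckpt, checkpoint-final.pth
```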
## Data Preparation
- Prepare any two images of a scene (preferably indoor, as the model is trained on indoor scene datasets).
- Place your images in a directory of your choice.
Example directory structure:
```
demo_images/
└── indoor/
    ├── scene1/
    │   ├── image1.jpg
    │   └── image2.jpg
    └── scene2/
        ├── room1.png
        └── room2.png
```
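For instance, to stage your own pair (paths below are placeholders):

```bash
mkdir -p demo_images/indoor/my_scene
cp /path/to/first_image.jpg /path/to/second_image.jpg demo_images/indoor/my_scene/
```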
## Commands

```bash
# Reconstruct 3D scene and generate video using two images
bash scripts/infer.sh
```
Optional parameters in `scripts/infer.sh` (default settings recommended):

```bash
# Path to your input images
--file_list "demo_images/indoor/scene2/image1.jpg" "demo_images/indoor/scene2/image2.jpg"

# Output directory for Gaussian points and rendered video
--output_path "outputs/indoor/scene2"

# Image resolution for processing
--resolution "256"
```
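To run on your own images, edit those flags inside `scripts/infer.sh` and re-run it; results are written under `--output_path` (the listing below assumes the default paths above):

```bash
bash scripts/infer.sh
ls outputs/indoor/scene2   # Gaussian points and the rendered video
```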
## Acknowledgements

This work builds on many amazing research works and open-source projects; many thanks to all the authors for sharing!
- Gaussian-Splatting and diff-gaussian-rasterization
- DUSt3R
- Language-Driven Semantic Segmentation (LSeg)
- Point Transformer V3
- pixelSplat
- Feature 3DGS
## Citation

If you find our work useful in your research, please consider giving a star ⭐ and citing the following paper 📝.
```bibtex
@misc{fan2024largespatialmodelendtoend,
  title={Large Spatial Model: End-to-end Unposed Images to Semantic 3D},
  author={Zhiwen Fan and Jian Zhang and Wenyan Cong and Peihao Wang and Renjie Li and Kairun Wen and Shijie Zhou and Achuta Kadambi and Zhangyang Wang and Danfei Xu and Boris Ivanovic and Marco Pavone and Yue Wang},
  year={2024},
  eprint={2410.18956},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2410.18956},
}
```