Important
We will be presenting our poster at CoRL'23 on Wednesday, Nov. 8, from 5:15 to 6:00 pm in session 4. Hope to see you in person!
Warning
For those who cloned this repo before Oct 25, 2023, please update the repo by running git pull and git submodule update --init --recursive. We fixed a major bug that caused very bad segmentation for ScanNet200. We reran the results for ScanNet200 and the prediction files can be found here. Sorry for the inconvenience.
OVIR-3D: Open-Vocabulary 3D Instance Retrieval Without Training on 3D Data.
Shiyang Lu, Haonan Chang, Eric Jing, Yu Wu, Abdeslam Boularias, Kostas Bekris
Full paper presented at CoRL'23. Extended abstract presented at OpenSUN3D (ICCV-W).
[Full Paper][Extended Abstract][CoRL Poster]
Recent progress on open-vocabulary (language-driven, without a predefined set of categories) 3D segmentation addresses the problem mainly at the semantic level (as of mid-2023). Nevertheless, robotic applications such as manipulation and navigation often require 3D object geometries at the instance level. This work provides a straightforward yet effective solution for open-vocabulary 3D instance retrieval, which returns a ranked set of 3D instance segments given a 3D point cloud reconstructed from an RGB-D video and a language query.
Directly training an open-vocabulary 3D segmentation model is hard due to the lack of annotated 3D data with sufficient category variety. Instead, this work treats the problem as fusing language-guided 2D region proposals, which can be trained on extensive 2D datasets, and provides a method to project and fuse 2D instance information into 3D space for fast retrieval.
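As a rough illustration of the fusion idea, here is a minimal sketch (not the repo's API; names are illustrative) of back-projecting a 2D instance mask into world-frame 3D points using the depth image, camera intrinsics, and camera pose. Points lifted this way from many frames are what get associated into 3D instances.

import numpy as np

def backproject_mask(mask, depth, cam_intr, cam_pose, depth_scale=1000.0):
    """Lift the depth pixels covered by a 2D instance mask into world-frame 3D points."""
    fx, fy = cam_intr[0, 0], cam_intr[1, 1]
    cx, cy = cam_intr[0, 2], cam_intr[1, 2]
    v, u = np.nonzero(mask)                              # pixel coordinates inside the mask
    z = depth[v, u].astype(np.float32) / depth_scale     # raw depth values -> meters
    u, v, z = u[z > 0], v[z > 0], z[z > 0]               # drop pixels with missing depth
    pts_cam = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z], axis=1)
    return pts_cam @ cam_pose[:3, :3].T + cam_pose[:3, 3]  # camera frame -> world frame, (N, 3)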
git clone [email protected]:shiyoung77/OVIR-3D.git --recurse-submodules
conda create -n ovir3d python=3.10
conda activate ovir3d
# install pytorch
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# install other dependencies
pip install -r requirements.txt
# Download Detic pretrained model
cd Detic
mkdir models
wget https://dl.fbaipublicfiles.com/detic/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth -O models/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth
A preprocessed sample scene from the YCB-Video dataset can be downloaded here (~1.3G). Extract it in this repo and then run ./demo.sh.
For a custom dataset, organize your RGB-D video data in the following format. We have open-sourced our video recording scripts for RealSense cameras and a Python implementation of KinectFusion to help you record and reconstruct your custom 3D scene.
{dataset_path}/
    {video_name}/
        color/
            0000-color.jpg
            0001-color.jpg
            ...
        depth/
            0000-depth.png
            0001-depth.png
            ...
        poses/
            0000-pose.txt
            0001-pose.txt
            ...
        config.json  # camera information
        scan-{resolution}.pcd  # reconstructed point cloud, e.g. scan-0.005.pcd
config.json should contain the camera information. An example config.json is as follows.
{
    "id": "video0",
    "im_w": 640,
    "im_h": 480,
    "depth_scale": 1000,
    "cam_intr": [
        [ 1066.778, 0, 312.9869 ],
        [ 0, 1067.487, 241.3109 ],
        [ 0, 0, 1 ]
    ]
}
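As a sanity check for a custom dataset, a single frame in this layout could be loaded roughly as follows. This is a hedged sketch: the pose file is assumed to be a 4x4 camera-to-world matrix, and the paths are placeholders to fill in.

import json
import cv2
import numpy as np

video_dir = "{dataset_path}/{video_name}"  # placeholders: fill in your own paths
with open(f"{video_dir}/config.json") as f:
    cfg = json.load(f)

cam_intr = np.array(cfg["cam_intr"])                        # 3x3 intrinsic matrix
color = cv2.imread(f"{video_dir}/color/0000-color.jpg")
depth = cv2.imread(f"{video_dir}/depth/0000-depth.png", cv2.IMREAD_UNCHANGED)
depth_m = depth.astype(np.float32) / cfg["depth_scale"]     # raw depth -> meters
cam_pose = np.loadtxt(f"{video_dir}/poses/0000-pose.txt")   # assumed 4x4 camera-to-world matrix
assert color.shape[:2] == (cfg["im_h"], cfg["im_w"])        # color and depth share this resolution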
For the YCB-Video dataset, we have already processed the validation set (0048-0059), which you can download directly from here (~16G).
For the ScanNet200 dataset, please follow the instructions on their website to download the data and extract it into the following format. The color images were captured at a higher resolution and have to be resized to 480p to match the depth images (they are already aligned, so a simple cv2.resize is enough; see the short sketch after the directory layout below).
{dataset_path}/
    {video_name}/
        color/
            0000-color.jpg
            0001-color.jpg
            ...
        depth/
            0000-depth.png
            0001-depth.png
            ...
        poses/
            0000-pose.txt
            0001-pose.txt
            ...
        config.json
        {video_name}.txt
        {video_name}_clean_2.ply
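The resize mentioned above is straightforward since color and depth are already aligned. A minimal example (paths are placeholders; scannet_preprocess.py listed below handles this step over the whole dataset):

import cv2

color = cv2.imread("color/0000-color.jpg")
depth = cv2.imread("depth/0000-depth.png", cv2.IMREAD_UNCHANGED)

# Downsample the color frame to the depth resolution (640x480); no re-alignment is needed.
color_480p = cv2.resize(color, (depth.shape[1], depth.shape[0]), interpolation=cv2.INTER_AREA)
cv2.imwrite("color/0000-color.jpg", color_480p)  # overwrite, or write into the target layout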
You need the following files for ScanNet200 preprocessing/evaluation; they are included in this repo for your convenience.
scannet_preprocess.py # copy files, resize images, and generate config.json
scannet200_constants.py
scannet200_splits.py
scannetv2-labels.combined.tsv
scannet200_instance_gt/
    validation/
        {video_name}.txt
        ...
You can visualize the ground truth annotations via visualize_{scannet200/ycb_video}_gt.py.
This work adopts Detic as the backbone 2D region proposal network. This repo contains a modified copy of the original Detic repo as a submodule. To generate region proposals, cd Detic, change the dataset path in file.py, and then run python fire.py. This script supports multi-GPU inference to process multiple videos in parallel. By default, it queries all the categories in imagenet21k with a confidence threshold of 0.3. The output masks and text-aligned features for each frame are stored in the {dataset_path}/{video_name}/detic_output folder. You can also save the 2D visualizations using the --save_vis option, but this makes inference much slower.
cd Detic
python fire.py --dataset {dataset_path}
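The exact on-disk format under detic_output is defined by the scripts above. Purely as an illustration, each frame's proposals can be thought of as a set of (binary mask, confidence score, text-aligned feature) triples; a hypothetical inspection loop might look like this (file names and keys are assumptions, not the repo's format):

import glob
import pickle
import numpy as np

# Hypothetical layout: one file per frame holding per-proposal masks, scores, and features.
for path in sorted(glob.glob("{dataset_path}/{video_name}/detic_output/*.pkl")):
    with open(path, "rb") as f:
        frame = pickle.load(f)
    masks = np.asarray(frame["masks"])        # assumed (num_proposals, H, W) boolean masks
    scores = np.asarray(frame["scores"])      # assumed confidences (already thresholded at 0.3)
    features = np.asarray(frame["features"])  # assumed (num_proposals, feat_dim) CLIP-space features
    print(path, masks.shape, scores.shape, features.shape)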
Once 2D region proposals are generated, you can fuse the results for the 3D scan using the proposed algorithm, which is implemented in src/proposed_fusion.py. Again, there is a script src/fire.py that supports parallel fusion of multiple 3D scenes if you have multiple GPUs. The output is stored in the {dataset_path}/{video_name}/detic_output/{vocab}/predictions folder. It is recommended to have at least 11GB of GPU memory (e.g., a 2080Ti) to run this algorithm; otherwise you may run into memory issues for large scenes.
cd src
python fire.py --dataset {dataset_path}
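For intuition, below is a simplified sketch of the kind of merge rule used when aggregating projected 2D proposals into 3D instances: a proposal is fused into an existing instance when their 3D point overlap and CLIP-space feature similarity are both high enough. The thresholds mirror the iou-0.25 and feature-0.75 defaults in the output filename below; the actual rule (which also involves a recall criterion, cf. the recall-0.50 default) lives in src/proposed_fusion.py, so treat this only as a sketch.

import numpy as np

def try_merge(instance, proposal, iou_thresh=0.25, feat_thresh=0.75):
    """Simplified greedy merge: fuse a projected 2D proposal into an existing 3D instance
    when their point-set IoU and feature cosine similarity are both high enough."""
    inter = len(instance["point_ids"] & proposal["point_ids"])   # point indices on the 3D scan
    union = len(instance["point_ids"] | proposal["point_ids"])
    iou = inter / max(union, 1)
    sim = float(np.dot(instance["feature"], proposal["feature"]) /
                (np.linalg.norm(instance["feature"]) * np.linalg.norm(proposal["feature"])))
    if iou > iou_thresh and sim > feat_thresh:
        instance["point_ids"] |= proposal["point_ids"]                           # grow the segment
        instance["feature"] = (instance["feature"] + proposal["feature"]) / 2.0  # running feature
        return True
    return False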
Once fusion is done, you will be able to interactively query 3D instances via src/instance_query.py. Here out_filename is the file output by the last step; the default is proposed_fusion_detic_iou-0.25_recall-0.50_feature-0.75_interval-300.pkl.
python src/instance_query.py -d {dataset_path} -v {video_name} --prediction_file {out_filename}
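Under the hood, retrieval ranks the fused instances by CLIP feature similarity against the text query (see the note below). A minimal sketch of that ranking, assuming the OpenAI clip package and that the prediction file holds a list of instances with CLIP-space features (the file format and the ViT-B/32 choice here are assumptions; the CLIP variant must match the one used to produce the features):

import pickle
import numpy as np
import torch
import clip  # OpenAI CLIP

model, _ = clip.load("ViT-B/32", device="cpu")
query = "a coffee mug"

with open("proposed_fusion_detic_iou-0.25_recall-0.50_feature-0.75_interval-300.pkl", "rb") as f:
    instances = pickle.load(f)  # hypothetical: a list of dicts, each with a CLIP-space "feature"

with torch.no_grad():
    text_feat = model.encode_text(clip.tokenize([query])).squeeze(0).numpy()
text_feat = text_feat / np.linalg.norm(text_feat)

feats = np.stack([inst["feature"] for inst in instances]).astype(np.float32)
feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
ranking = np.argsort(-feats @ text_feat)  # instance indices, most relevant to the query first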
You may wonder why we call it instance retrieval instead of instance segmentation. The reason is that we formulate this problem as an information retrieval problem, i.e., given a query, retrieve relevant documents (ranked instances) from a database (a 3D scene). The proposed method first tries to find all 3D instances in a scene (without knowing the test categories), and then ranks them based on the language query using CLIP feature similarity. This is also how we evaluate our method and the baselines, i.e., with the standard mAP for information retrieval. We believe this is a more reasonable metric given our open-vocabulary problem setting, though it is slightly different from the mAP metric commonly used for closed-set instance segmentation, where each predicted instance has to be assigned a category label and a confidence score. Nevertheless, if you use OVIR-3D as a baseline, feel free to use any metric you like on the prediction files that we provide for ScanNet200, which contain all 3D instance segments (likely more than what ScanNet200 annotated) and their corresponding CLIP features.
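For reference, a minimal implementation of average precision in the information-retrieval sense (precision averaged at the ranks of the relevant results, divided by the total number of relevant ground-truth instances); mAP then averages this quantity over queries:

def retrieval_average_precision(ranked_relevance, num_relevant):
    """AP for one query: mean of precision@k over the ranks k of the relevant results.
    ranked_relevance: 0/1 flags for the ranked predictions; num_relevant: total ground-truth matches."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / num_relevant if num_relevant else 0.0

# e.g. relevant instances retrieved at ranks 1 and 3, two relevant in total -> (1/1 + 2/3) / 2 ≈ 0.83
print(retrieval_average_precision([1, 0, 1, 0], num_relevant=2))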
We have a follow-up work, Context-Aware Entity Grounding with Open-Vocabulary 3D Scene Graphs (OVSG), which will also appear at CoRL'23. It uses OVIR-3D as the backbone method to get all 3D instances in a scene, and then builds a 3D scene graph for more precise object retrieval with natural language by considering object relationships. Please take a look if you are interested.
For OVIR-3D:
@inproceedings{lu2023ovir,
    title={OVIR-3D: Open-Vocabulary 3D Instance Retrieval Without Training on 3D Data},
    author={Lu, Shiyang and Chang, Haonan and Jing, Eric Pu and Boularias, Abdeslam and Bekris, Kostas},
    booktitle={7th Annual Conference on Robot Learning},
    year={2023}
}
For OVSG:
@inproceedings{chang2023context,
    title={Context-Aware Entity Grounding with Open-Vocabulary 3D Scene Graphs},
    author={Chang, Haonan and Boyalakuntla, Kowndinya and Lu, Shiyang and Cai, Siwei and Jing, Eric Pu and Keskar, Shreesh and Geng, Shijie and Abbas, Adeeb and Zhou, Lifeng and Bekris, Kostas and others},
    booktitle={7th Annual Conference on Robot Learning},
    year={2023}
}