CrossOver: 3D Scene Cross-Modal Alignment

Sayan Deb Sarkar¹ · Ondrej Miksik² · Marc Pollefeys²,³ · Dániel Béla Baráth³ · Iro Armeni¹

¹Stanford University · ²Microsoft Spatial AI Lab · ³ETH Zürich

📃 Abstract

Multi-modal 3D object understanding has gained significant attention, yet current approaches often rely on rigid object-level modality alignment or assume complete data availability across all modalities. We present CrossOver, a novel framework for cross-modal 3D scene understanding via flexible, scene-level modality alignment. Unlike traditional methods that require paired data for every object instance, CrossOver learns a unified, modality-agnostic embedding space for scenes by aligning modalities - RGB images, point clouds, CAD models, floorplans, and text descriptions - without explicit object semantics. Leveraging dimensionality-specific encoders, a multi-stage training pipeline, and emergent cross-modal behaviors, CrossOver supports robust scene retrieval and object localization, even with missing modalities. Evaluations on ScanNet and 3RScan datasets show its superior performance across diverse metrics, highlighting CrossOver’s adaptability for real-world applications in 3D scene understanding.

🚀 Features

  • Flexible Scene-Level Alignment 🌐 - Aligns RGB, point clouds, CAD, floorplans, and text at the scene level - no perfectly paired data needed!
  • Emergent Cross-Modal Behaviors 🤯 - Learns unseen modality pairs (e.g., floorplan ↔ text) without explicit pairwise training.
  • Real-World Applications 🌍 - AR/VR, robotics, construction; handles temporal changes (e.g., object rearrangement) effortlessly.

📋 Table of Contents

  1. Data Download
  2. Installation
  3. Data Preprocessing
  4. Demo
  5. Training
  6. Evaluation
  7. Acknowledgements
  8. Citation

📰 News

  • [2025-02] We release the preprocessed + generated embedding data. Fill out the form for the download link!
  • [2025-02] We release CrossOver on arXiv. Check out our paper and website.

⬇️ Data Download

Preprocessed Data

We release required preprocessed data + meta-data and provide instructions for data download & preparation with scripts for ScanNet + 3RScan.

  • For dataset download (single inference setup), please look at README.MD in data_prepare/ directory.
  • For preprocessed data download (training + evaluation only), please refer to Data Preprocessing.

By downloading our hosted data, you agree to the terms of the ScanNet, 3RScan, ShapeNet, Scan2CAD, and SceneVerse datasets.

Generated Embedding Data

We release the embeddings created with CrossOver on the datasets used (embed_data/ in GDrive), which can be used for cross-modal retrieval with a custom dataset.

  • embed_scannet.pt: scene embeddings for all modalities (point cloud, RGB, floorplan, referral) in ScanNet
  • embed_scan3r.pt: scene embeddings for all modalities (point cloud, RGB, referral) in 3RScan

File structure below:

{
  "scene": [
    {
      "scan_id": "the ID of the scan",
      "scene_embeds": {
        "modality_name": "modality_embedding"
      },
      "mask": {
        "modality_name": "True/False - whether the modality was present in the scan"
      }
    },
    ...
  ]
}
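As a sketch of how the file can be consumed (the key names follow the structure above; mask values are assumed to be booleans, and the helper name is illustrative):

```python
def scenes_with_modality(data, modality="point"):
    """Return the scan IDs whose mask marks `modality` as present.

    `data` is the dict stored in embed_scannet.pt / embed_scan3r.pt,
    following the file structure shown above.
    """
    return [scan["scan_id"] for scan in data["scene"]
            if scan["mask"].get(modality, False)]

# Usage (assumes PyTorch and the downloaded file):
# import torch
# data = torch.load("./release_data/embed_scannet.pt", map_location="cpu")
# print(scenes_with_modality(data, "floorplan"))
```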

🛠️ Installation

The code has been tested on:

Ubuntu: 22.04 LTS
Python: 3.9.20
CUDA: 12.1
GPU: GeForce RTX 4090/RTX 3090

📦 Setup

Clone the repo and setup as follows:

git clone [email protected]:GradientSpaces/CrossOver.git
cd CrossOver
conda env create -f req.yml
conda activate crossover

Next, install MinkowskiEngine and Pointnet2_PyTorch as follows:

git clone --recursive "https://github.com/EthenJ/MinkowskiEngine"
conda install openblas-devel -c anaconda
cd MinkowskiEngine/
python setup.py install --blas_include_dirs=${CONDA_PREFIX}/include --force_cuda --blas=openblas
cd ..
git clone https://github.com/erikwijmans/Pointnet2_PyTorch.git
cd Pointnet2_PyTorch
pip install pointnet2_ops_lib/.

Since we use CUDA 12.1, we use the above MinkowskiEngine fork; for other CUDA versions, please refer to the official repo.

📽️ Demo

This demo script allows users to process a custom scene and retrieve the closest match from ScanNet/3RScan using different modalities. Detailed usage can be found inside the script. Example usage below:

python demo/demo_scene_retrieval.py

Various configurable parameters:

  • --query_path: Path to the query scene file (e.g., ./example_data/dining_room/scene_cropped.ply).
  • --database_path: Path to the precomputed embeddings of the database scenes downloaded above (e.g., ./release_data/embed_scannet.pt).
  • --query_modality: Modality of the query scene. Options: point, rgb, floorplan, referral.
  • --database_modality: Modality used for retrieval. Same options as above.
  • --ckpt: Path to the pre-trained scene CrossOver model checkpoint (details here), e.g., ./checkpoints/scene_crossover_scannet+scan3r.pth.

For pre-trained model download, refer to data download and checkpoints sections.

We also provide scripts for inference on a single scan from ScanNet/3RScan. Details in the Single Inference section.
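Under the hood, scene retrieval reduces to nearest-neighbour search in the shared embedding space: the query embedding is compared against database embeddings of the chosen database modality. A minimal pure-Python sketch (the demo script itself may batch this with PyTorch; the function names here are illustrative, and embeddings are assumed to be non-zero vectors):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve_closest(query_emb, database, db_mod="rgb", k=1):
    """Return the scan IDs of the k most similar scenes in `db_mod`.

    `database` follows the embedding file structure shown earlier;
    scenes missing the database modality are skipped via the mask.
    """
    scored = [(cosine(query_emb, s["scene_embeds"][db_mod]), s["scan_id"])
              for s in database["scene"] if s["mask"].get(db_mod, False)]
    scored.sort(reverse=True)
    return [scan_id for _, scan_id in scored[:k]]
```

Because the embedding space is modality-agnostic, the query embedding may come from a different modality (e.g., a point cloud query against RGB database embeddings).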

🔧 Data Preprocessing

To speed up training and inference, we preprocess 1D (referral), 2D (RGB + floorplan), and 3D (point cloud + CAD) data for both object instances and scenes. Note that since the 3RScan dataset does not provide frame-wise RGB segmentations, we project the 3D data to 2D and store it in .pt format for every scan. We provide the projection scripts and release the resulting data.

Please refer to PREPROCESS.MD for details.

🏋️ Training

Train Instance Baseline

Adjust path parameters in configs/train/train_instance_baseline.yaml and run the following:

bash scripts/train/train_instance_baseline.sh

Train Instance Retrieval Pipeline

Adjust path parameters in configs/train/train_instance_crossover.yaml and run the following:

bash scripts/train/train_instance_crossover.sh

Train Scene Retrieval Pipeline

Adjust the path/configuration parameters in configs/train/train_scene_crossover.yaml. You can also add your own dataset, or train on ScanNet, 3RScan, or both. Run the following:

bash scripts/train/train_scene_crossover.sh

The scene retrieval pipeline uses the trained weights from the instance retrieval pipeline (for object feature calculation); make sure to update task:UnifiedTrain:object_enc_ckpt in the config file accordingly.
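The relevant config entry might look like the following (the checkpoint path is an example; the surrounding keys are assumed from the task:UnifiedTrain:object_enc_ckpt naming):

```yaml
task:
  UnifiedTrain:
    # weights from the instance retrieval (instance_crossover) training run
    object_enc_ckpt: ./checkpoints/instance_crossover_scannet+scan3r.pth
```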

Checkpoints

We provide all available checkpoints on G-Drive here. Detailed descriptions in the table below:

| Model Type | Description | Checkpoint |
| --- | --- | --- |
| instance_baseline | Instance Baseline trained on 3RScan | 3RScan |
| instance_baseline | Instance Baseline trained on ScanNet | ScanNet |
| instance_baseline | Instance Baseline trained on ScanNet + 3RScan | ScanNet+3RScan |
| instance_crossover | Instance CrossOver trained on 3RScan | 3RScan |
| instance_crossover | Instance CrossOver trained on ScanNet | ScanNet |
| instance_crossover | Instance CrossOver trained on ScanNet + 3RScan | ScanNet+3RScan |
| scene_crossover | Unified CrossOver trained on ScanNet + 3RScan | ScanNet+3RScan |

🛡️ Single Inference

We release a script to perform inference (generate scene-level embeddings) on a single scan of 3RScan/ScanNet. Detailed usage in the file. Quick instructions below:

python single_inference/scene_inference.py

Various configurable parameters:

  • --dataset: Dataset name. Options: Scannet, Scan3R.
  • --data_dir: Data directory (e.g., ./datasets/Scannet; assumes the same structure as in preprocess.md).
  • --floorplan_dir: Directory containing the rasterized floorplans (this can point to the downloaded preprocessed directory); ScanNet only.
  • --ckpt: Path to the pre-trained scene CrossOver model checkpoint (details here), e.g., ./checkpoints/scene_crossover_scannet+scan3r.pth.
  • --scan_id: The scan ID from the dataset to compute embeddings for (if not provided, embeddings for all scans are computed).

The script will output embeddings in the same format as provided here.
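To check that a generated file matches the released format, a small validation helper (a sketch based on the file structure shown earlier; the function name is illustrative) could look like:

```python
def validate_scene_entry(entry):
    """Check one scene entry against the released embedding format."""
    required = {"scan_id", "scene_embeds", "mask"}
    if required - entry.keys():
        return False  # a top-level key is missing
    # every embedded modality should also be flagged in the mask
    return set(entry["scene_embeds"]) <= set(entry["mask"])
```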

📊 Evaluation

Cross-Modal Object Retrieval

Run the following script (refer to the script to run instance baseline/instance crossover). Detailed usage inside the script.

bash scripts/evaluation/eval_instance_retrieval.sh

This will also report scene retrieval results using the instance-based methods.

Cross-Modal Scene Retrieval

Run the following script (for scene crossover). Detailed usage inside the script.

bash scripts/evaluation/eval_instance_retrieval.sh

🚧 TODO List

  • Release evaluation on temporal instance matching
  • Release inference code on single image-based scene retrieval
  • Release inference on single scan cross-modal object retrieval
  • Release inference using baselines

🙏 Acknowledgements

We thank the authors of 3D-VisTA, SceneVerse, and SceneGraphLoc for open-sourcing their codebases.

📄 Citation

@article{

}