Sayan Deb Sarkar1 . Ondrej Miksik2 . Marc Pollefeys2, 3 . Dániel Béla Baráth3 . Iro Armeni1
1Stanford University · 2Microsoft Spatial AI Lab · 3ETH Zürich
Multi-modal 3D object understanding has gained significant attention, yet current approaches often rely on rigid object-level modality alignment or assume complete data availability across all modalities. We present CrossOver, a novel framework for cross-modal 3D scene understanding via flexible, scene-level modality alignment. Unlike traditional methods that require paired data for every object instance, CrossOver learns a unified, modality-agnostic embedding space for scenes by aligning modalities - RGB images, point clouds, CAD models, floorplans, and text descriptions - without explicit object semantics. Leveraging dimensionality-specific encoders, a multi-stage training pipeline, and emergent cross-modal behaviors, CrossOver supports robust scene retrieval and object localization, even with missing modalities. Evaluations on ScanNet and 3RScan datasets show its superior performance across diverse metrics, highlighting CrossOver’s adaptability for real-world applications in 3D scene understanding.
- Flexible Scene-Level Alignment 🌐 - Aligns RGB, point clouds, CAD, floorplans, and text at the scene level— no perfect data needed!
- Emergent Cross-Modal Behaviors 🤯 - Learns unseen modality pairs (e.g., floorplan ↔ text) without explicit pairwise training.
- Real-World Applications 🌍 - AR/VR, robotics, construction: handles temporal changes (e.g., object rearrangement) effortlessly.
- [2025-02] We release the preprocessed + generated embedding data. Fill out the form for the download link!
- [2025-02] We release CrossOver on arXiv. Check out our paper and website.
We release required preprocessed data + meta-data and provide instructions for data download & preparation with scripts for ScanNet + 3RScan.
- For dataset download (single inference setup), please look at `README.MD` in the `data_prepare/` directory.
- For preprocessed data download (training + evaluation only), please refer to Data Preprocessing.
By downloading our hosted data, you agree to the terms of the ScanNet, 3RScan, ShapeNet, Scan2CAD and SceneVerse datasets.
We release the embeddings created with CrossOver on the datasets used (`embed_data/` in GDrive), which can be used for cross-modal retrieval with a custom dataset.

- `embed_scannet.pt`: scene embeddings for all modalities (point cloud, RGB, floorplan, referral) in ScanNet
- `embed_scan3r.pt`: scene embeddings for all modalities (point cloud, RGB, referral) in 3RScan
File structure below:
{
    "scene": [{
        "scan_id": "the ID of the scan",
        "scene_embeds": {
            "modality_name": "modality_embedding"
        },
        "mask": {
            "modality_name": "True/False whether modality was present in the scan"
        }
    },
    {
        ...
    }, ...
    ]
}
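A released embedding file can be loaded with `torch.load`, and retrieval then reduces to cosine similarity between a query embedding and the stored scene embeddings, skipping scenes whose mask marks the modality as absent. Below is a minimal pure-Python sketch of that lookup; the `database` dictionary is synthetic data shaped like the format above (the real files store tensors), and `retrieve` is an illustrative helper, not part of the released code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Synthetic database in the documented file format (real files store torch tensors).
database = {
    "scene": [
        {"scan_id": "scene_a",
         "scene_embeds": {"point": [1.0, 0.0], "rgb": [0.9, 0.1]},
         "mask": {"point": True, "rgb": True}},
        {"scan_id": "scene_b",
         "scene_embeds": {"point": [0.0, 1.0]},
         "mask": {"point": True, "rgb": False}},
    ]
}

def retrieve(query_embed, modality, db):
    """Return scan_ids ranked by cosine similarity, skipping scenes missing the modality."""
    scored = [(cosine(query_embed, s["scene_embeds"][modality]), s["scan_id"])
              for s in db["scene"] if s["mask"].get(modality, False)]
    return [sid for _, sid in sorted(scored, reverse=True)]

print(retrieve([1.0, 0.1], "rgb", database))  # scene_b lacks rgb and is skipped
```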
The code has been tested on:
Ubuntu: 22.04 LTS
Python: 3.9.20
CUDA: 12.1
GPU: GeForce RTX 4090/RTX 3090
Clone the repo and setup as follows:
git clone git@github.com:GradientSpaces/CrossOver.git
cd CrossOver
conda env create -f req.yml
conda activate crossover
Further installation is required for `MinkowskiEngine` and `Pointnet2_PyTorch`. Set up as follows:
git clone --recursive "https://github.com/EthenJ/MinkowskiEngine"
conda install openblas-devel -c anaconda
cd MinkowskiEngine/
python setup.py install --blas_include_dirs=${CONDA_PREFIX}/include --force_cuda --blas=openblas
cd ..
git clone https://github.com/erikwijmans/Pointnet2_PyTorch.git
cd Pointnet2_PyTorch
pip install pointnet2_ops_lib/.
Since we use CUDA 12.1, we use the above `MinkowskiEngine` fork; for other CUDA drivers, please refer to the official repo.
This demo script allows users to process a custom scene and retrieve the closest match from ScanNet/3RScan using different modalities. Detailed usage can be found inside the script. Example usage below:
python demo/demo_scene_retrieval.py
Various configurable parameters:

- `--query_path`: path to the query scene file (e.g., `./example_data/dining_room/scene_cropped.ply`).
- `--database_path`: path to the precomputed embeddings of the database scenes downloaded before (e.g., `./release_data/embed_scannet.pt`).
- `--query_modality`: modality of the query scene; options: `point`, `rgb`, `floorplan`, `referral`.
- `--database_modality`: modality used for retrieval; same options as above.
- `--ckpt`: path to the pre-trained scene crossover model checkpoint (details here), e.g., `./checkpoints/scene_crossover_scannet+scan3r.pth`.
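Putting the flags together, a hypothetical end-to-end invocation might look like the following; all paths are illustrative defaults taken from the parameter descriptions above, so adjust them to your local setup:

```shell
# Retrieve the closest ScanNet scene for a point-cloud query, matched against RGB embeddings.
# Paths below are illustrative; point them at your downloaded data and checkpoint.
python demo/demo_scene_retrieval.py \
    --query_path ./example_data/dining_room/scene_cropped.ply \
    --database_path ./release_data/embed_scannet.pt \
    --query_modality point \
    --database_modality rgb \
    --ckpt ./checkpoints/scene_crossover_scannet+scan3r.pth
```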
For pre-trained model download, refer to data download and checkpoints sections.
We also provide scripts for inference on a single scan of ScanNet/3RScan data. Details in the Single Inference section.
To speed up training and inference, we preprocess 1D (referral), 2D (RGB + floorplan) & 3D (point cloud + CAD) data for both object instances and scenes. Note that since the 3RScan dataset does not provide frame-wise RGB segmentations, we project the 3D data to 2D and store it in `.pt` format for every scan. We provide the scripts for projection and release the data.
Please refer to `PREPROCESS.MD` for details.
Adjust path parameters in `configs/train/train_instance_baseline.yaml` and run the following:
bash scripts/train/train_instance_baseline.sh
Adjust path parameters in `configs/train/train_instance_crossover.yaml` and run the following:
bash scripts/train/train_instance_crossover.sh
Adjust path/configuration parameters in `configs/train/train_scene_crossover.yaml`. You can also add your own dataset, or choose to train on ScanNet, 3RScan, or both. Run the following:
bash scripts/train/train_scene_crossover.sh
The scene retrieval pipeline uses the trained weights from the instance retrieval pipeline (for object feature calculation); please ensure you update `task:UnifiedTrain:object_enc_ckpt` in the config file.
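For instance, the relevant entry might look like the fragment below. This is a hypothetical sketch of the key path named above, not the actual config file; the checkpoint filename is an assumption, so substitute whichever instance CrossOver checkpoint you trained or downloaded:

```yaml
# Hypothetical fragment of configs/train/train_scene_crossover.yaml.
# Key path follows task:UnifiedTrain:object_enc_ckpt referenced above;
# the checkpoint filename is illustrative.
task:
  UnifiedTrain:
    object_enc_ckpt: ./checkpoints/instance_crossover_scannet+scan3r.pth
```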
We provide all available checkpoints on G-Drive here. Detailed descriptions in the table below:
| Model Type | Description | Checkpoint |
| --- | --- | --- |
| `instance_baseline` | Instance Baseline trained on 3RScan | 3RScan |
| `instance_baseline` | Instance Baseline trained on ScanNet | ScanNet |
| `instance_baseline` | Instance Baseline trained on ScanNet + 3RScan | ScanNet+3RScan |
| `instance_crossover` | Instance CrossOver trained on 3RScan | 3RScan |
| `instance_crossover` | Instance CrossOver trained on ScanNet | ScanNet |
| `instance_crossover` | Instance CrossOver trained on ScanNet + 3RScan | ScanNet+3RScan |
| `scene_crossover` | Unified CrossOver trained on ScanNet + 3RScan | ScanNet+3RScan |
We release a script to perform inference (i.e., generate scene-level embeddings) on a single scan of ScanNet/3RScan. Detailed usage in the file. Quick instructions below:
python single_inference/scene_inference.py
Various configurable parameters:

- `--dataset`: dataset name, `Scannet`/`Scan3R`.
- `--data_dir`: data directory (e.g., `./datasets/Scannet`; assumes a similar structure as in `preprocess.md`).
- `--floorplan_dir`: directory containing the rasterized floorplans (this can point to the downloaded preprocessed directory); ScanNet only.
- `--ckpt`: path to the pre-trained scene crossover model checkpoint (details here), e.g., `./checkpoints/scene_crossover_scannet+scan3r.pth`.
- `--scan_id`: the scan ID from the dataset you'd like to calculate embeddings for (if not provided, embeddings for all scans are calculated).
The script will output embeddings in the same format as provided here.
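As a concrete example, generating embeddings for one ScanNet scan might look like the following; the scan ID and directory paths are hypothetical placeholders for your local setup:

```shell
# Generate scene-level embeddings for a single ScanNet scan.
# Directory layout and scan ID below are illustrative placeholders.
python single_inference/scene_inference.py \
    --dataset Scannet \
    --data_dir ./datasets/Scannet \
    --floorplan_dir ./preprocessed/floorplans \
    --ckpt ./checkpoints/scene_crossover_scannet+scan3r.pth \
    --scan_id scene0000_00
```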
Run the following script (refer to the script to run the instance baseline or instance CrossOver). Detailed usage inside the script.
bash scripts/evaluation/eval_instance_retrieval.sh
This will also show you scene retrieval results using the instance based methods.
Run the following script (for scene crossover). Detailed usage inside the script.
bash scripts/evaluation/eval_scene_retrieval.sh
- Release evaluation on temporal instance matching
- Release inference code on single image-based scene retrieval
- Release inference on single scan cross-modal object retrieval
- Release inference using baselines
We thank the authors of 3D-VisTA, SceneVerse and SceneGraphLoc for open-sourcing their codebases.
@article{
}