Zeqi Xiao
Tai Wang
Jingbo Wang
Jinkun Cao
Wenwei Zhang
Bo Dai
Dahua Lin
Jiangmiao Pang*
Shanghai AI Laboratory · Nanyang Technological University · Carnegie Mellon University
- [2024-04] The data is released.
- [2024-03] The code is released.
- [2024-01] UniHSI is accepted as ICLR 2024 spotlight. Thanks for the recognition!
- [2023-09] We release the UniHSI paper. Please check the 👉 webpage 👈 and view our demos! 🎇
Download Isaac Gym from the website, then follow the installation instructions.
Once Isaac Gym is installed, install the external dependencies for this repo:
pip install -r requirements.txt
- Download PartNet and ShapeNet V2.
- Save them in the following structure:
data/
├── partnet_origin
│ ├── obj_id1
│ ├── obj_id2
│ ├── ...
├── shapenet_origin
│ ├── class_id1
│ │ ├── obj_id1
│ │ ├── ...
│ ├── class_id2
│ │ ├── obj_id1
│ │ ├── ...
│ ├── ...
- Extract the objects used in `sceneplan` by running:
python cp_partnet_train.py
python cp_partnet_test.py
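Before running the extraction scripts above, it can help to sanity-check that the downloaded assets match the expected layout. This is a minimal sketch, not part of the repo; the directory names come from the tree above, and the `data` root is an assumption you may need to adjust.

```python
# Sketch: verify the top-level data layout expected by the extraction scripts.
from pathlib import Path

def check_data_layout(root="data"):
    """Return the list of missing top-level asset directories."""
    required = [
        Path(root) / "partnet_origin",
        Path(root) / "shapenet_origin",
    ]
    return [str(path) for path in required if not path.is_dir()]

if __name__ == "__main__":
    missing = check_data_layout()
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Data layout looks good.")
```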
- Download ScanNet.
- Save it in the following structure:
data/
├── scan_origin
│ ├── scans
│ │ ├── scans_1
│ │ ├── scans_2
│ │ ├── ...
- Extract the objects used in `sceneplan` by running:
python cp_scannet_test.py
We select and process motion clips from SAMP and CIRCLE.
We adopt step-by-step training: run the scripts below in order, from simple to hard.
sh train_partnet_simple.sh
sh train_partnet_mid.sh
sh train_partnet_hard.sh
sh demo_scannet.sh
sh test_partnet_simple.sh
sh test_partnet_mid.sh
sh test_partnet_hard.sh
sh test_scannet_simple.sh
sh test_scannet_mid.sh
sh test_scannet_hard.sh
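To evaluate all settings in one go, the test scripts above can be driven from a small Python wrapper. This is a convenience sketch, not part of the repo; the script names are taken from the list above, and `subprocess` invocation via `sh` is an assumption about your shell setup.

```python
# Sketch: run every evaluation script in sequence and stop on the first
# failure. Script names mirror the list above.
import subprocess

TEST_SCRIPTS = [
    "test_partnet_simple.sh", "test_partnet_mid.sh", "test_partnet_hard.sh",
    "test_scannet_simple.sh", "test_scannet_mid.sh", "test_scannet_hard.sh",
]

def run_all(scripts=TEST_SCRIPTS, dry_run=False):
    """Build the `sh <script>` commands; execute them unless dry_run is set."""
    commands = [["sh", script] for script in scripts]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # raises if a script exits non-zero
    return commands

if __name__ == "__main__":
    run_all()
```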
| Source  | Success Rate (%) (Simple / Mid / Hard) | Contact Error (Simple / Mid / Hard) | Success Steps (Simple / Mid / Hard) |
|---------|----------------------------------------|-------------------------------------|-------------------------------------|
| PartNet | 85.5 / 67.9 / 40.5                     | 0.035 / 0.037 / 0.040               | 2.13 / 4.11 / 4.84                  |
| ScanNet | 73.2 / 43.1 / 22.3                     | 0.061 / 0.072 / 0.062               | 2.21 / 3.47 / 4.78                  |
The results will be saved in the `output` folder.
- Expect roughly 10% variance in the metrics across runs due to randomness in sampling.
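Given that variance, a simple way to report numbers is to average success rates over several runs. A minimal sketch; the rates below are hypothetical placeholders, not reported results, and how you read metrics from the `output` folder is up to you.

```python
# Sketch: aggregate per-run success rates (%) into a mean and spread,
# to smooth out the ~10% run-to-run variance from sampling randomness.
import statistics

def summarize(rates):
    """Return (mean, sample standard deviation) over per-run success rates."""
    return statistics.mean(rates), statistics.stdev(rates)

# Hypothetical example: three runs of the same benchmark.
mean, std = summarize([84.1, 86.9, 85.3])
```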
If you find our work helpful, please cite:
@inproceedings{
xiao2024unified,
title={Unified Human-Scene Interaction via Prompted Chain-of-Contacts},
author={Zeqi Xiao and Tai Wang and Jingbo Wang and Jinkun Cao and Wenwei Zhang and Bo Dai and Dahua Lin and Jiangmiao Pang},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=1vCnDyQkjg}
}
This work is under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
- ASE: Our codebase is built upon the AMP implementation in ASE.
- PartNet and ShapeNet: We use objects from PartNet and ShapeNet for training and evaluation.
- ScanNet: We use scenarios from ScanNet for evaluation.
- SAMP: We use motion clips from SAMP for training.
- CIRCLE: We use motion clips from CIRCLE for training.