Meng Wei
Xiaoyu Yue
Wenwei Zhang
Xihui Liu
Shu Kong
Jiangmiao Pang*
Shanghai AI Laboratory, The University of Hong Kong, The University of Sydney, University of Macau, Texas A&M University
OV-PARTS is a benchmark for Open-Vocabulary Part Segmentation that builds on the capabilities of large-scale Vision-Language Models (VLMs).
- Benchmark Datasets: Refined versions of two publicly available datasets: Pascal-Part-116 and ADE20K-Part-234.
- Benchmark Tasks: Three tasks that provide insight into the analogical reasoning, open granularity and few-shot adapting abilities of models.
  - Generalized Zero-Shot Part Segmentation: assesses the model's capability to generalize part segmentation from seen objects to related unseen objects.
  - Cross-Dataset Part Segmentation: beyond zero-shot generalization, assesses the model's capability to generalize part segmentation across datasets with varying granularity levels.
  - Few-Shot Part Segmentation: assesses the model's fast adaptation capability.
- Benchmark Baselines: Baselines based on existing two-stage and one-stage object-level open-vocabulary segmentation methods, including ZSseg, CLIPSeg and CATSeg.
We organize the Open Vocabulary Part Segmentation (OV-PARTS) Challenge in the Visual Perception via Learning in an Open World (VPLOW) Workshop. Please check our website!
- Clone this repository

      git clone https://github.com/OpenRobotLab/OV_PARTS.git
      cd OV_PARTS

- Create a conda environment with Python 3.8+ and install the Python requirements

      conda create -n ovparts python=3.8
      conda activate ovparts
      pip install -r requirements.txt

After downloading the two benchmark datasets, extract the archives with the following commands and place the extracted folders under the "Datasets" directory.
tar -xzf PascalPart116.tar.gz
tar -xzf ADE20KPart234.tar.gz
The Datasets folder should follow this structure:
Datasets/
├─Pascal-Part-116/
│ ├─train_16shot.json
│ ├─images/
│ │ ├─train/
│ │ └─val/
│ ├─annotations_detectron2_obj/
│ │ ├─train/
│ │ └─val/
│ └─annotations_detectron2_part/
│   ├─train/
│   └─val/
└─ADE20K-Part-234/
  ├─images/
  │ ├─training/
  │ └─validation/
  ├─train_16shot.json
  ├─ade20k_instance_train.json
  ├─ade20k_instance_val.json
  └─annotations_detectron2_part/
    ├─training/
    └─validation/
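
Before moving on, it can help to confirm that the extracted folders match the structure above. The following is a minimal sketch, not part of the repository: the helper name `find_missing` and the `EXPECTED_PATHS` list are hypothetical, and the paths are copied from the directory tree in this README.

```python
from pathlib import Path

# Relative paths copied from the directory tree in this README.
EXPECTED_PATHS = [
    "Pascal-Part-116/train_16shot.json",
    "Pascal-Part-116/images/train",
    "Pascal-Part-116/images/val",
    "Pascal-Part-116/annotations_detectron2_obj/train",
    "Pascal-Part-116/annotations_detectron2_obj/val",
    "Pascal-Part-116/annotations_detectron2_part/train",
    "Pascal-Part-116/annotations_detectron2_part/val",
    "ADE20K-Part-234/images/training",
    "ADE20K-Part-234/images/validation",
    "ADE20K-Part-234/train_16shot.json",
    "ADE20K-Part-234/ade20k_instance_train.json",
    "ADE20K-Part-234/ade20k_instance_val.json",
    "ADE20K-Part-234/annotations_detectron2_part/training",
    "ADE20K-Part-234/annotations_detectron2_part/validation",
]


def find_missing(root="Datasets"):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing entries:")
        for path in missing:
            print("  " + path)
    else:
        print("Dataset layout looks complete.")
```

The check only tests that each path exists; it does not validate the contents of the images or annotation files.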
Create the {train/val}_{obj/part}_label_count.json files for Pascal-Part-116:
python baselines/data/datasets/mask_cls_collect.py Datasets/Pascal-Part-116/annotations_detectron2_{obj/part}/{train/val} Datasets/Pascal-Part-116/annotations_detectron2_part/{train/val}_{obj/part}_label_count.json
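
The `{obj/part}` and `{train/val}` parts of the command above are placeholders rather than shell syntax. Assuming they stand for every combination of annotation level and split (our reading, not something the script documents here), the four concrete commands can be generated with a short sketch. The function name `label_count_commands` is hypothetical; the script path and output location are copied from the command above.

```python
from itertools import product

ROOT = "Datasets/Pascal-Part-116"
SCRIPT = "baselines/data/datasets/mask_cls_collect.py"


def label_count_commands(root=ROOT):
    """Build one argv list per (split, level) combination."""
    commands = []
    for split, level in product(("train", "val"), ("obj", "part")):
        mask_dir = f"{root}/annotations_detectron2_{level}/{split}"
        # The output location mirrors the README command, which writes every
        # JSON file into annotations_detectron2_part/.
        out_json = f"{root}/annotations_detectron2_part/{split}_{level}_label_count.json"
        commands.append(["python", SCRIPT, mask_dir, out_json])
    return commands


if __name__ == "__main__":
    for cmd in label_count_commands():
        # Only prints the commands; pass `cmd` to subprocess.run(cmd, check=True)
        # from the repository root to execute them.
        print(" ".join(cmd))
```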
- Training the two-stage baseline ZSseg+.

  Please first download the CLIP model fine-tuned with CPTCoOp. Then run the training command:

      python train_net.py --num-gpus 8 --config-file configs/${SETTING}/zsseg+_R50_coop_${DATASET}.yaml

- Training the one-stage baselines CLIPSeg and CATSeg.

  Please first download the pre-trained object models of CLIPSeg and CATSeg and place them under the "pretrain_weights" directory.

  Models | Pre-trained checkpoint |
  ---|---|
  CLIPSeg | download |
  CATSeg | download |

  Then run the training command:

      # For CATSeg.
      python train_net.py --num-gpus 8 --config-file configs/${SETTING}/catseg_${DATASET}.yaml
      # For CLIPSeg.
      python train_net.py --num-gpus 8 --config-file configs/${SETTING}/clipseg_${DATASET}.yaml

We provide the trained weights for the three baseline models reported in the paper.
Models | Setting | Pascal-Part-116 checkpoint | ADE20K-Part-234 checkpoint |
---|---|---|---|
ZSSeg+ | Zero-shot | download | download |
CLIPSeg | Zero-shot | download | download |
CATSeg | Zero-shot | download | download |
CLIPSeg | Few-shot | download | download |
CLIPSeg | Cross-dataset | - | download |
To evaluate the trained models, add --eval-only to the training command. For example:
python train_net.py --num-gpus 8 --config-file configs/${SETTING}/catseg_${DATASET}.yaml --eval-only MODEL.WEIGHTS ${WEIGHT_PATH}
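
The training and evaluation commands differ only in the config file name and the trailing `--eval-only MODEL.WEIGHTS` arguments, so they can be assembled programmatically. The sketch below is illustrative only: `build_command` and `CONFIG_PATTERNS` are hypothetical names, the config patterns are copied from the commands in this README, and the setting, dataset and weight values are placeholders that need to be replaced with names that actually exist under `configs/`.

```python
# Config file name patterns copied from the training commands in this README.
CONFIG_PATTERNS = {
    "zsseg+": "zsseg+_R50_coop_{dataset}.yaml",
    "catseg": "catseg_{dataset}.yaml",
    "clipseg": "clipseg_{dataset}.yaml",
}


def build_command(baseline, setting, dataset, num_gpus=8, weights=None):
    """Return the train_net.py argv list; pass `weights` to evaluate instead of train."""
    config = f"configs/{setting}/" + CONFIG_PATTERNS[baseline].format(dataset=dataset)
    cmd = ["python", "train_net.py", "--num-gpus", str(num_gpus), "--config-file", config]
    if weights is not None:
        # Evaluation reuses the training command and appends the checkpoint path.
        cmd += ["--eval-only", "MODEL.WEIGHTS", weights]
    return cmd


if __name__ == "__main__":
    # "SETTING", "DATASET" and "WEIGHT_PATH" are placeholders, not real names.
    print(" ".join(build_command("catseg", "SETTING", "DATASET")))
    print(" ".join(build_command("catseg", "SETTING", "DATASET", weights="WEIGHT_PATH")))
```

Returning a list of arguments rather than a single string keeps the result safe to hand to `subprocess.run` without shell quoting.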
- Zero-shot performance of the two-stage and one-stage baselines on Pascal-Part-116

  Group | Model | Backbone | Finetuning | Oracle-Obj Seen | Oracle-Obj Unseen | Oracle-Obj Harmonic | Pred-Obj Seen | Pred-Obj Unseen | Pred-Obj Harmonic |
  ---|---|---|---|---|---|---|---|---|---|
  Fully-Supervised | MaskFormer | ResNet-50 | - | 55.28 | 52.14 | - | 53.07 | 47.82 | - |
  Two-Stage Baselines | ZSseg | ResNet-50 | - | 49.35 | 12.57 | 20.04 | 40.80 | 12.07 | 18.63 |
  Two-Stage Baselines | ZSseg+ | ResNet-50 | CPTCoOp | 55.33 | 19.17 | 28.48 | 54.23 | 17.10 | 26.00 |
  Two-Stage Baselines | ZSseg+ | ResNet-50 | CPTCoCoOp | 54.43 | 19.04 | 28.21 | 53.31 | 16.08 | 24.71 |
  Two-Stage Baselines | ZSseg+ | ResNet-101c | CPTCoOp | 57.88 | 21.93 | 31.81 | 56.87 | 20.29 | 29.91 |
  One-Stage Baselines | CATSeg | ResNet-101 & ViT-B/16 | - | 14.89 | 10.29 | 12.17 | 13.65 | 7.73 | 9.87 |
  One-Stage Baselines | CATSeg | ResNet-101 & ViT-B/16 | B+D | 43.97 | 26.11 | 32.76 | 41.65 | 26.08 | 32.07 |
  One-Stage Baselines | CLIPSeg | ViT-B/16 | - | 22.33 | 19.73 | 20.95 | 14.32 | 10.52 | 12.13 |
  One-Stage Baselines | CLIPSeg | ViT-B/16 | VA+L+F+D | 48.68 | 27.37 | 35.04 | 44.57 | 27.79 | 34.24 |

- Zero-shot performance of the two-stage and one-stage baselines on ADE20K-Part-234

  Group | Model | Backbone | Finetuning | Oracle-Obj Seen | Oracle-Obj Unseen | Oracle-Obj Harmonic | Pred-Obj Seen | Pred-Obj Unseen | Pred-Obj Harmonic |
  ---|---|---|---|---|---|---|---|---|---|
  Fully-Supervised | MaskFormer | ResNet-50 | - | 46.25 | 47.86 | - | 35.52 | 16.56 | - |
  Two-Stage Baselines | ZSseg+ | ResNet-50 | CPTCoOp | 43.19 | 27.84 | 33.85 | 21.30 | 5.60 | 8.87 |
  Two-Stage Baselines | ZSseg+ | ResNet-50 | CPTCoCoOp | 39.67 | 25.15 | 30.78 | 19.52 | 2.98 | 5.17 |
  Two-Stage Baselines | ZSseg+ | ResNet-101c | CPTCoOp | 43.41 | 25.70 | 32.28 | 21.42 | 3.33 | 5.76 |
  One-Stage Baselines | CATSeg | ResNet-101 & ViT-B/16 | - | 11.49 | 8.56 | 9.81 | 6.30 | 3.79 | 4.73 |
  One-Stage Baselines | CATSeg | ResNet-101 & ViT-B/16 | B+D | 31.40 | 25.77 | 28.31 | 20.23 | 8.27 | 11.74 |
  One-Stage Baselines | CLIPSeg | ViT-B/16 | - | 15.27 | 18.01 | 16.53 | 5.00 | 3.36 | 4.02 |
  One-Stage Baselines | CLIPSeg | ViT-B/16 | VA+L+F+D | 38.96 | 29.65 | 33.67 | 24.80 | 6.24 | 9.98 |

- Cross-Dataset performance of models trained on the source dataset ADE20K-Part-234 and tested on the target dataset Pascal-Part-116.

  Model | Source Oracle-Obj | Source Pred-Obj | Target Oracle-Obj | Target Pred-Obj |
  ---|---|---|---|---|
  CATSeg | 27.95 | 17.22 | 16.00 | 14.72 |
  CLIPSeg VA+L+F | 35.01 | 21.74 | 16.18 | 11.70 |
  CLIPSeg VA+L+F+D | 37.76 | 21.87 | 19.69 | 13.88 |

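
The Harmonic columns in the zero-shot results above appear to be the harmonic mean of the Seen and Unseen scores. This is our reading of the numbers rather than a definition stated in this README: the rows we checked reproduce to within 0.01. A minimal sketch, with the hypothetical helper name `harmonic_mean`:

```python
def harmonic_mean(seen, unseen):
    """Harmonic mean of the seen and unseen scores (0.0 if both are 0)."""
    total = seen + unseen
    return 2 * seen * unseen / total if total else 0.0


if __name__ == "__main__":
    # Oracle-Obj scores from the Pascal-Part-116 zero-shot results.
    print(round(harmonic_mean(49.35, 12.57), 2))  # ZSseg row → 20.04
    print(round(harmonic_mean(48.68, 27.37), 2))  # fine-tuned CLIPSeg row → 35.04
```

Because the harmonic mean is dominated by the smaller of its two inputs, a model only scores well on this column when it performs well on both seen and unseen categories.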
If you find our work helpful, please cite:
@inproceedings{wei2023ov,
title={OV-PARTS: Towards Open-Vocabulary Part Segmentation},
author={Wei, Meng and Yue, Xiaoyu and Zhang, Wenwei and Kong, Shu and Liu, Xihui and Pang, Jiangmiao},
booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2023}
}
We would like to express our gratitude to the open-source projects and their contributors, including ZSSeg, CATSeg and CLIPSeg. Their valuable work has greatly contributed to the development of our codebase.