Hao Wang1,2, Pengzhen Ren1, Zequn Jie3, Xiao Dong1, Chengjian Feng3, Yinlong Qian3,
Lin Ma3, Dongmei Jiang2, Yaowei Wang2,4, Xiangyuan Lan2✉, Xiaodan Liang1,2✉
1 Sun Yat-sen University, 2 Pengcheng Lab, 3 Meituan Inc, 4 HIT, Shenzhen
✉ corresponding author.
[Paper] [HuggingFace] [Demo] [BibTex]
- `15/08/2024`: Have a look! We release the pre-training code and log on the O365 dataset. You can try to reproduce our results.
- `06/08/2024`: Awesome! OV-SAM = OV-DINO + SAM2. We update OV-SAM, marrying OV-DINO with SAM2, on the online demo.
- `16/07/2024`: We provide the online demo, click and enjoy! NOTE: Your uploaded images will be stored for failure analysis.
- `16/07/2024`: We release the web inference demo, try to deploy it by yourself.
- `15/07/2024`: We release the fine-tuning code, try to fine-tune on your custom dataset. Feel free to raise an issue if you encounter any problems.
- `15/07/2024`: We release the local inference demo, try to deploy OV-DINO on your local machine and run inference on images.
- `14/07/2024`: We release the pre-trained models and the evaluation code.
- `11/07/2024`: We release the OV-DINO paper on arXiv. Code and pre-trained models are coming soon.
This project contains the official PyTorch implementation, pre-trained models, fine-tuning code, and inference demo for OV-DINO.
- OV-DINO is a novel unified open-vocabulary detection approach that offers superior performance and effectiveness for practical real-world applications.
- OV-DINO entails a Unified Data Integration pipeline that integrates diverse data sources for end-to-end pre-training, and a Language-Aware Selective Fusion module to improve the vision-language understanding of the model.
- OV-DINO shows significant performance improvements on the COCO and LVIS benchmarks compared to previous methods, achieving relative improvements of +2.5% AP on COCO and +12.7% AP on LVIS over G-DINO in zero-shot evaluation.
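To make the Language-Aware Selective Fusion idea concrete, below is a minimal PyTorch-style sketch of language-aware gated cross-attention between detection queries and text embeddings. All module names, dimensions, and the gating design here are illustrative assumptions, not the official implementation, which lives under ovdino/projects.

```python
# Illustrative sketch of a language-aware selective fusion style block.
# Names, dimensions, and gating scheme are assumptions for exposition only.
import torch
import torch.nn as nn


class LanguageAwareSelectiveFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Cross-attention: image queries attend to text (prompt) embeddings.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate that decides how much language-conditioned signal to inject.
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_queries: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # img_queries: (B, Nq, dim), text_embeds: (B, Nt, dim)
        fused, _ = self.cross_attn(img_queries, text_embeds, text_embeds)
        gate = self.gate(fused)                       # values in [0, 1]
        return self.norm(img_queries + gate * fused)  # selectively inject language cues


if __name__ == "__main__":
    lsf = LanguageAwareSelectiveFusion()
    out = lsf(torch.randn(2, 900, 256), torch.randn(2, 80, 256))
    print(out.shape)  # torch.Size([2, 900, 256])
```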
| Model | Pre-Train Data | APmv | APr | APc | APf | APval | APr | APc | APf | APcoco | Weights |
|---|---|---|---|---|---|---|---|---|---|---|---|
| OV-DINO1 | O365 | 24.4 | 15.5 | 20.3 | 29.7 | 18.7 | 9.3 | 14.5 | 27.4 | 49.5 / 57.5 | CKPT / LOG |
| OV-DINO2 | O365,GoldG | 39.4 | 32.0 | 38.7 | 41.3 | 32.2 | 26.2 | 30.1 | 37.3 | 50.6 / 58.4 | CKPT |
| OV-DINO3 | O365,GoldG,CC1M‡ | 40.1 | 34.5 | 39.5 | 41.5 | 32.9 | 29.1 | 30.4 | 37.4 | 50.2 / 58.2 | CKPT |
NOTE: APmv denotes the zero-shot evaluation results on LVIS MiniVal, APval denotes the zero-shot evaluation results on LVIS Val, and APcoco denotes the (zero-shot / fine-tuned) evaluation results on COCO, respectively.
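If you want to sanity-check a downloaded checkpoint before running evaluation, a plain torch.load inspection like the one below works; the nesting under a "model" key is an assumption about how detectron2/detrex-style checkpoints are usually saved, so treat the key names as illustrative.

```python
# Quick checkpoint inspection (illustrative; the exact top-level keys may differ).
import torch

ckpt_path = "inits/ovdino/ovdino_swint_og-coco50.6_lvismv39.4_lvis32.2.pth"
ckpt = torch.load(ckpt_path, map_location="cpu")

# Many detectron2/detrex checkpoints nest the weights under a "model" key.
state_dict = ckpt.get("model", ckpt)
print("top-level keys:", list(ckpt.keys())[:5])
print("parameter tensors:", len(state_dict))
```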
OV-DINO
├── datas
│   ├── o365
│   │   ├── annotations
│   │   ├── train
│   │   ├── val
│   │   └── test
│   ├── coco
│   │   ├── annotations
│   │   ├── train2017
│   │   └── val2017
│   ├── lvis
│   │   ├── annotations
│   │   ├── train2017
│   │   └── val2017
│   └── custom
│       ├── annotations
│       ├── train
│       └── val
├── docs
├── inits
│   ├── huggingface
│   ├── ovdino
│   ├── sam2
│   └── swin
├── ovdino
│   ├── configs
│   ├── demo
│   ├── detectron2-717ab9
│   ├── detrex
│   ├── projects
│   ├── scripts
│   └── tools
└── wkdrs
    └── ...
# clone this project
git clone https://github.com/wanghao9610/OV-DINO.git
cd OV-DINO
export root_dir=$(realpath ./)
cd $root_dir/ovdino
# Optional: set CUDA_HOME for cuda11.6.
# OV-DINO uses cuda11.6 by default; if your CUDA version is not 11.6, export the CUDA_HOME env manually first.
export CUDA_HOME="your_cuda11.6_path"
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
echo -e "cuda version:\n$(nvcc -V)"
# create conda env for ov-dino
conda create -n ovdino -y
conda activate ovdino
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia -y
conda install gcc=9 gxx=9 -c conda-forge -y # Optional: install gcc9
python -m pip install -e detectron2-717ab9
pip install -e ./
# Optional: create a separate conda env for ov-sam (ov-sam = ov-dino + sam2);
# it may not be compatible with the ov-dino env, so we create a new one.
conda create -n ovsam -y
conda activate ovsam
conda install pytorch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 pytorch-cuda=12.1 -c pytorch -c nvidia -y
# install sam2 following the sam2 project:
# https://github.com/facebookresearch/segment-anything-2.git
# download the sam2 checkpoints and put them in inits/sam2.
python -m pip install -e detectron2-717ab9
pip install -e ./
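Before moving on to data preparation, a quick generic check (not part of the OV-DINO scripts) confirms that PyTorch was installed with the intended CUDA build and can see your GPUs:

```python
# Generic environment check; not part of the OV-DINO scripts.
import torch
import torchvision

print("torch:", torch.__version__, "| torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA build:", torch.version.cuda,
          "| GPUs:", torch.cuda.device_count(),
          "| device 0:", torch.cuda.get_device_name(0))
```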
- Download COCO from the official website, and put it in the datas/coco folder.
cd $root_dir
mkdir -p datas/coco
wget http://images.cocodataset.org/zips/train2017.zip -O datas/coco/train2017.zip
wget http://images.cocodataset.org/zips/val2017.zip -O datas/coco/val2017.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip -O datas/coco/annotations_trainval2017.zip
- Extract the zipped files, and remove them:
cd $root_dir
unzip datas/coco/train2017.zip -d datas/coco
unzip datas/coco/val2017.zip -d datas/coco
unzip datas/coco/annotations_trainval2017.zip -d datas/coco
rm datas/coco/train2017.zip datas/coco/val2017.zip datas/coco/annotations_trainval2017.zip
- Download LVIS annotation files:
cd $root_dir
mkdir -p datas/lvis/annotations
wget https://huggingface.co/hao9610/OV-DINO/resolve/main/lvis_v1_minival_inserted_image_name.json -O datas/lvis/annotations/lvis_v1_minival_inserted_image_name.json
wget https://huggingface.co/hao9610/OV-DINO/resolve/main/lvis_v1_val_inserted_image_name.json -O datas/lvis/annotations/lvis_v1_val_inserted_image_name.json
- Soft-link COCO to LVIS:
cd $root_dir
ln -s $(realpath datas/coco/train2017) datas/lvis
ln -s $(realpath datas/coco/val2017) datas/lvis
- Refer to OpenDataLab for the Objects365V1 download; it provides detailed download instructions.
cd $root_dir
mkdir -p datas/o365/annotations
# Suppose you have downloaded the Objects365 raw files and put them in datas/o365/raw; extract the tarred files and reorganize them.
cd datas/o365/raw
tar -xvf Objects365_v1.tar.gz
cd 2019-08-02
for file in *.zip; do unzip -o "$file"; done
mv *.json $root_dir/datas/o365/annotations
mv train val test $root_dir/datas/o365
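After the datasets are in place, you can optionally verify that the annotation files load correctly. The snippet below uses pycocotools (installed with detectron2) and assumes the datas/ layout shown above; the LVIS v1 json files share the same top-level structure, so the COCO API can index them too.

```python
# Optional sanity check on annotation files; assumes the datas/ layout above.
from pycocotools.coco import COCO

ann_files = [
    "datas/coco/annotations/instances_val2017.json",
    "datas/lvis/annotations/lvis_v1_minival_inserted_image_name.json",
]
for ann in ann_files:
    api = COCO(ann)  # builds an index over images, annotations, and categories
    print(f"{ann}: {len(api.getImgIds())} images, {len(api.getCatIds())} categories")
```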
Download the pre-trained models from the Model Zoo, and put them in the inits/ovdino directory.
cd $root_dir/ovdino
bash scripts/eval.sh path_to_eval_config_file path_to_pretrained_model output_directory
cd $root_dir/ovdino
# Evaluate mAP on the COCO dataset.
bash scripts/eval.sh \
projects/ovdino/configs/ovdino_swin_tiny224_bert_base_eval_coco.py \
../inits/ovdino/ovdino_swint_og-coco50.6_lvismv39.4_lvis32.2.pth \
../wkdrs/eval_ovdino
cd $root_dir/ovdino
# Evaluate fixed AP on the LVIS MiniVal dataset.
bash scripts/eval.sh \
projects/ovdino/configs/ovdino_swin_tiny224_bert_base_eval_lvismv.py \
../inits/ovdino/ovdino_swint_ogc-coco50.2_lvismv40.1_lvis32.9.pth \
../wkdrs/eval_ovdino
# Evaluate fixed AP on the LVIS Val dataset.
# This requires about 250 GB of memory due to the large number of samples in the LVIS Val dataset, so please ensure that your machine has enough memory.
bash scripts/eval.sh \
projects/ovdino/configs/ovdino_swin_tiny224_bert_base_eval_lvis.py \
../inits/ovdino/ovdino_swint_ogc-coco50.2_lvismv40.1_lvis32.9.pth \
../wkdrs/eval_ovdino
cd $root_dir/ovdino
bash scripts/finetune.sh \
projects/ovdino/configs/ovdino_swin_tiny224_bert_base_ft_coco_24ep.py \
../inits/ovdino/ovdino_swint_og-coco50.6_lvismv39.4_lvis32.2.pth
- Prepare your custom dataset in the COCO annotation format, following the instructions in custom_ovd.py (a minimal annotation skeleton is sketched below).
- Refer to the following command to run fine-tuning.
cd $root_dir/ovdino
bash scripts/finetune.sh \
projects/ovdino/configs/ovdino_swin_tiny224_bert_base_ft_custom_24ep.py \
../inits/ovdino/ovdino_swint_ogc-coco50.2_lvismv40.1_lvis32.9.pth
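For reference, the sketch below writes out a minimal COCO-style annotation file with plain Python. The file name, image entries, and category names are placeholders; custom_ovd.py remains the authoritative description of how the custom dataset is registered.

```python
# Minimal COCO-style annotation skeleton for a custom dataset.
# File name, images, and categories are placeholders; see custom_ovd.py for
# how the dataset is actually registered.
import json

annotation = {
    "images": [
        {"id": 1, "file_name": "000001.jpg", "height": 480, "width": 640},
    ],
    "annotations": [
        # bbox follows the COCO convention: [x, y, width, height] in pixels.
        {"id": 1, "image_id": 1, "category_id": 1,
         "bbox": [100, 120, 50, 80], "area": 50 * 80, "iscrowd": 0},
    ],
    "categories": [
        {"id": 1, "name": "class0"},
        {"id": 2, "name": "class1"},
    ],
}

with open("datas/custom/annotations/custom_train.json", "w") as f:
    json.dump(annotation, f)
```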
- Download the dataset following the Objects365 Data Preparing section.
- Refer to the following command to run pre-training.
On the first machine:
cd $root_dir/ovdino
# Replace $MASTER_PORT and $MASTER_ADDR with your actual machine settings.
NNODES=2 NODE_RANK=0 MASTER_PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR \
bash scripts/pretrain.sh \
projects/ovdino/configs/ovdino_swin_tiny224_bert_base_pretrain_o365_24ep.py
On the second machine:
cd $root_dir/ovdino
# Replace $MASTER_PORT and $MASTER_ADDR with your actual machine settings.
NNODES=2 NODE_RANK=1 MASTER_PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR \
bash scripts/pretrain.sh \
projects/ovdino/configs/ovdino_swin_tiny224_bert_base_pretrain_o365_24ep.py
NOTE: The default batch size for O365 pre-training is 64 in our experiments, running on 2 nodes with 8 A100 GPUs per node. If you encounter an out-of-memory error, you can reduce the batch size and scale the learning rate and total steps linearly.
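As a concrete illustration of the linear scaling mentioned above, the base learning rate and step count below are placeholders rather than values from the released config; only the scaling rule itself is the point.

```python
# Linear-scaling illustration; base_lr and base_steps are placeholder values.
base_batch, base_lr, base_steps = 64, 1e-4, 180_000

new_batch = 32                        # e.g. dropping from 16 GPUs to 8 GPUs
scale = new_batch / base_batch
new_lr = base_lr * scale              # scale the learning rate with the batch size
new_steps = int(base_steps / scale)   # keep the total number of seen samples roughly constant

print(f"batch={new_batch}: lr={new_lr:.1e}, steps={new_steps}")
# batch=32: lr=5.0e-05, steps=360000
```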
Coming soon ...
Coming soon ...
NOTE: We will release all the pre-training code after our paper is accepted.
- Local inference on an image or a folder, given the category names.
# for ovdino: conda activate ovdino
# for ovsam: conda activate ovsam
cd $root_dir/ovdino
bash scripts/demo.sh demo_config.py pretrained_model category_names input_images_or_directory output_directory
Examples:
cd $root_dir/ovdino
# single image inference
bash scripts/demo.sh \
projects/ovdino/configs/ovdino_swin_tiny224_bert_base_infer_demo.py \
../inits/ovdino/ovdino_swint_ogc-coco50.2_lvismv40.1_lvis32.9.pth \
"class0 class1 class2 ..." img0.jpg output_dir/img0_vis.jpg
# multi-image inference
bash scripts/demo.sh \
projects/ovdino/configs/ovdino_swin_tiny224_bert_base_infer_demo.py \
../inits/ovdino/ovdino_swint_ogc-coco50.2_lvismv40.1_lvis32.9.pth \
"class0 long_class1 long_class2 ..." "img0.jpg img1.jpg" output_dir
# image folder inference
bash scripts/demo.sh \
projects/ovdino/configs/ovdino_swin_tiny224_bert_base_infer_demo.py \
../inits/ovdino/ovdino_swint_ogc-coco50.2_lvismv40.1_lvis32.9.pth \
"class0 long_class1 long_class2 ..." image_dir output_dir
NOTE: The input category_names are separated by spaces, and the words of a multi-word class name are joined by underscores (_).
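If your class names contain spaces, a small helper like the hypothetical one below (not part of the repository) converts a list of names into the string format scripts/demo.sh expects:

```python
# Hypothetical helper: build the space-separated, underscore-joined
# category_names string expected by scripts/demo.sh.
def to_demo_category_names(names):
    return " ".join(name.strip().replace(" ", "_") for name in names)

print(to_demo_category_names(["traffic light", "fire hydrant", "person"]))
# -> traffic_light fire_hydrant person
```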
- Web inference demo.
cd $root_dir/ovdino
bash scripts/app.sh \
projects/ovdino/configs/ovdino_swin_tiny224_bert_base_infer_demo.py \
../inits/ovdino/ovdino_swint_ogc-coco50.2_lvismv40.1_lvis32.9.pth
After deploying the web demo, you can open it in your browser.
We also provide the online demo, click and enjoy.
- Release the pre-trained model.
- Release the fine-tuning and evaluation code.
- Support the local inference demo.
- Support the web inference demo.
- Release the pre-training code on O365 dataset.
- Support ONNX export of the OV-DINO model.
- Support OV-DINO in HuggingFace Transformers.
- Release all the pre-training code.
This project references several excellent open-source repos (Detectron2, detrex, GLIP, G-DINO, YOLO-World). Thanks for their wonderful work and contributions to the community.
If you find OV-DINO helpful for your research or applications, please consider giving us a star and citing it with the following BibTeX entry.
@article{wang2024ovdino,
title={OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion},
author={Hao Wang and Pengzhen Ren and Zequn Jie and Xiao Dong and Chengjian Feng and Yinlong Qian and Lin Ma and Dongmei Jiang and Yaowei Wang and Xiangyuan Lan and Xiaodan Liang},
journal={arXiv preprint arXiv:2407.07844},
year={2024}
}