init

DAMO-NLP-SG · Dec 23, 2024 · commit 39ef165
Showing 45 changed files with 6,539 additions and 0 deletions.
diff --git a/Benchmark.md b/Benchmark.md
@@ -0,0 +1,164 @@
+# VideoRefer-Bench
+VideoRefer-Bench enables an in-depth evaluation of video-based referring conversational models through two types of assessments:
+
+1. Video-based object-level description generation
+2. Zero-shot object-level question answering
+
+---
+
+## VideoRefer-Bench-D
+
+This benchmark evaluates the description generation performance of video-based referring models. It comprises 400 curated entries. The test set was built from Panda-70M using the automatic pipeline, followed by a careful human check.
+
+This benchmark covers four key aspects:
+
+1. **Subject Correspondence (SC)**: This dimension evaluates whether the subject of the generated description accurately corresponds to that specified in the ground truth.
+2. **Appearance Description (AD)**: This criterion assesses the accuracy of appearance-related details, including color, shape, texture, and other relevant visual attributes.
+3. **Temporal Description (TD)**: This aspect analyzes whether the representation of the object’s motion is consistent with the actual movements.
+4. **Hallucination Detection (HD)**: This facet identifies discrepancies by determining if the generated description includes any facts, actions, or elements absent from reality, like imaginative interpretations or incorrect inferences.
+
+| Type                   | GPT-4o        | InternVL2-26B | Qwen2-VL-7B | Elysium    | Artemis | VideoRefer        |
+| ---------------------- | ------------- | ------------- | ----------- | ---------- | ------- | ----------------- |
+| Subject Correspondence | 3.34/4.15     | 3.55/4.08     | 2.97/3.30   | 2.35/-     | -/3.42  | **4.41**/**4.44** |
+| Appearance Description | 2.96/**3.31** | 2.99/3.35     | 2.24/2.54   | 0.30/-     | -/1.34  | **3.27**/3.27     |
+| Temporal Description   | 3.01/**3.11** | 2.57/3.08     | 2.03/2.22   | 0.02/-     | -/1.39  | **3.03**/3.10     |
+| Hallucination Detection | 2.50/2.43     | 2.25/2.28     | 2.31/2.12   | **3.59**/- | -/2.90  | 2.97/**3.04**     |
+| Average                | 2.95/3.25     | 2.84/3.20     | 2.39/2.55   | 1.57/-     | -/2.26  | **3.42**/**3.46** |
+
+### Data download
+The annotation of VideoRefer-Bench-D can be downloaded [here]().
+
+Given the vast size of the Panda-70M dataset, downloading it in full can be quite costly. Therefore, we provide the videos used in the benchmark [here]().
+
+Data structure:
+```bash
+VideoRefer
+└── eval
+    └── VideoRefer-Bench-D
+        ├── VideoRefer-Bench-D.json
+        └── Panda-70M-part 
+```
+
+### Data Format
+For each object, we uniformly sampled 32 frames to generate the corresponding mask.
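
The sampling step above can be sketched as follows. This is a minimal illustration, assuming evenly spaced indices that include the first and last frame. `uniform_frame_indices` is a hypothetical helper, not a function from the repository, and the benchmark's own sampling code may round differently.

```python
from typing import List


def uniform_frame_indices(num_frames: int, num_samples: int = 32) -> List[int]:
    """Pick `num_samples` evenly spaced frame indices from a clip.

    Assumption: both endpoints are included and positions are rounded to
    the nearest integer. Clips with no more than `num_samples` frames
    keep every frame.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if num_frames <= num_samples:
        return list(range(num_frames))
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]
```

For a 100-frame clip this yields 32 increasing indices that start at frame 0 and end at frame 99.
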
+
+The benchmark JSON file is organized as follows:
+
+```json
+[
+    {
+        "id": 0,
+        "video": "rLlzmcp3J6s_0:01:09.633_0:01:14.333.mp4",
+        "caption": "The cub is a smaller, light colored lion. It is lying down and resting its head against the other lion. The cub looks calm and relaxed. It is the lion on the far left side of the frame.",
+        "frame_idx": "36",
+        "annotation":[
+            {
+                "2":{
+                    "segmentation": {
+                    }
+                },
+                "6":{
+                    "segmentation": {
+                    }
+                },
+                ...
+            }
+        ]
+    }
+]
+```
+
+- `frame_idx`: In single-frame mask mode, only the mask at frame `frame_idx` is used.
+- All segmentations are in `RLE` format.
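
As a rough illustration of how the two mask modes relate to this layout, the sketch below selects the segmentation entries for one benchmark item. It assumes that each element of `annotation` describes one object and that its keys are frame indices stored as strings. That reading is inferred from the example above, not confirmed by the repository, and `select_masks` is a hypothetical name. The RLE payloads are passed through untouched.

```python
from typing import Any, Dict, List


def select_masks(entry: Dict[str, Any], mode: str = "single") -> List[Dict[str, Any]]:
    """Return one {frame_key: segmentation} dict per annotated object.

    Assumption: keys inside each annotation element are frame indices
    (as strings), and `frame_idx` names the frame used in single mode.
    """
    selected = []
    for obj in entry["annotation"]:
        if mode == "single":
            key = str(entry["frame_idx"])
            selected.append({key: obj[key]["segmentation"]})
        elif mode == "multi":
            selected.append({k: v["segmentation"] for k, v in obj.items()})
        else:
            raise ValueError(f"unknown mode: {mode!r}")
    return selected
```

A typical call would load the file with `json.load` and pass each element of the resulting list to `select_masks`.
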
+
+### Evaluation
+We use GPT-4o to evaluate this benchmark by assigning scores to the generated predictions on a scale from 0 to 5 across four dimensions.
+
+The evaluation code can be found in [videorefer/eval/videorefer_bench_d](videorefer/eval/videorefer_bench_d).
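
The "Average" row in the table above is consistent with a plain mean of the four dimension scores (for example, (4.41 + 3.27 + 3.03 + 2.97) / 4 = 3.42). A minimal aggregation sketch follows, assuming each judged prediction is stored as a dict with the keys `SC`, `AD`, `TD`, and `HD`. That record layout and the name `summarize_scores` are illustrative assumptions, not the format used by the evaluation script.

```python
from statistics import mean
from typing import Dict, Iterable, List

DIMENSIONS = ("SC", "AD", "TD", "HD")


def summarize_scores(records: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Average 0-5 scores per dimension, then average the four means."""
    rows: List[Dict[str, float]] = list(records)
    summary = {dim: mean(row[dim] for row in rows) for dim in DIMENSIONS}
    summary["Average"] = mean(summary[dim] for dim in DIMENSIONS)
    return summary
```
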
+
+
+
+## VideoRefer-Bench-Q
+This benchmark evaluates the proficiency of MLLMs in interpreting video objects. It includes 1,000 high-quality multiple-choice questions.
+
+The benchmark covers five types of questions:
+
+1. Basic Questions
+2. Sequential Questions
+3. Relationship Questions
+4. Reasoning Questions
+5. Future Predictions
+
+| Type                   | GPT-4o   | GPT-4o-mini | InternVL2-26B | Qwen2-VL-7B | VideoRefer |
+| ---------------------- | -------- | ----------- | ------------- | ----------- | ---------- |
+| Basic Questions        | 62.3     | 57.6        | 58.5          | 62.0        | **75.4**   |
+| Sequential Questions   | **74.5** | 67.1        | 63.5          | 69.6        | 68.6       |
+| Relationship Questions | **66.0** | 56.5        | 53.4          | 54.9        | 59.3       |
+| Reasoning Questions    | 88.0     | 85.9        | 88.0          | 87.3        | **89.4**   |
+| Future Predictions     | 73.7     | 75.4        | **78.9**      | 74.6        | 78.1       |
+| Average                | 71.3     | 65.8        | 65.0          | 66.0        | **71.9**   |
+
+### Data download
+The annotation of VideoRefer-Bench-Q can be downloaded [here]().
+
+The source videos for VideoRefer-Bench-Q come from MeViS and DAVIS.
+- MeViS
+    - Available at: https://codalab.lisn.upsaclay.fr/competitions/15094
+- DAVIS
+    - Available at: https://davischallenge.org/davis2017/code.html
+    - Please download and unzip `TrainVal`, `Test-Dev` and `Test-Challenge` to the JPEGImages directory.
+
+Data structure:
+```bash
+VideoRefer
+└── eval
+    └── VideoRefer-BenchQ
+        ├── VideoRefer-Bench-Q.json
+        ├── MeViS 
+        |   ├── valid_u/ 
+        |   |   └── JPEGImages/      
+        └── DAVIS 
+            └── JPEGImages/  
+                └── 480p/      
+
+```
+
+### Data Format
+For each object, we uniformly sampled 32 frames to generate the corresponding mask.
+
+The benchmark JSON file is organized as follows:
+
+```json
+[
+    {
+        "id": 0,
+        "video": "DAVIS/JPEGImages/480p/aerobatics",
+        "Question": "What is <object3><region> not wearing?",
+        "type": "Basic Questions",
+        "options": [
+            "(A) A helmet",
+            "(B) A hat",
+            "(C) Sunglasses",
+            "(D) A watch"
+        ],
+        "Answer": "(A) A helmet",
+        "frame_idx": "57",
+        "annotation":[
+            {
+                "0":{
+                    "segmentation": {
+                    }
+                },
+                "3":{
+                    "segmentation": {
+                    }
+                },
+                ...
+            }
+        ]
+    }
+]
+```
+
+- `frame_idx`: In single-frame mask mode, only the mask at frame `frame_idx` is used.
+- All segmentations are in `RLE` format.
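
Because every entry carries a `type` and a gold `Answer`, per-type accuracy like that in the results table can be computed with a few lines. The sketch below is an illustration only. It assumes predictions are free-form strings keyed by entry `id` and extracts the option letter with a simple regular expression. `extract_choice` and `accuracy_by_type` are hypothetical names, and the repository's scoring script may parse answers differently.

```python
import re
from collections import defaultdict
from typing import Any, Dict, List, Optional

_PAREN = re.compile(r"\(([A-D])\)")   # matches "(A)"
_BARE = re.compile(r"\b([A-D])\b")    # fallback: a standalone letter


def extract_choice(text: str) -> Optional[str]:
    """Return the first option letter found in `text`, or None."""
    match = _PAREN.search(text) or _BARE.search(text)
    return match.group(1) if match else None


def accuracy_by_type(entries: List[Dict[str, Any]],
                     predictions: Dict[int, str]) -> Dict[str, float]:
    """Percentage of correct answers for each question type."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for entry in entries:
        gold = extract_choice(entry["Answer"])
        pred = extract_choice(predictions.get(entry["id"], ""))
        total[entry["type"]] += 1
        if pred is not None and pred == gold:
            correct[entry["type"]] += 1
    return {t: 100.0 * correct[t] / total[t] for t in total}
```

The bare-letter fallback is deliberately naive: a prediction such as "A helmet" without parentheses would be read as option A, so stricter parsing may be preferable in practice.
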
diff --git a/README.md b/README.md
@@ -0,0 +1,162 @@
+<p align="center">
+    <img src="assets/videorefer.png" width="80%" style="margin-bottom: 0.2;"/>
+</p>
+
+<h3 align="center"><a href="https://arxiv.org/abs/2406.07476" style="color:#4D2B24">
+VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM</a></h3>
+
+<div align=center>
+
+![Static Badge](https://img.shields.io/badge/VideoRefer-v1-F7C97E) 
+[![arXiv preprint](https://img.shields.io/badge/arxiv-xxx-ECA8A7?logo=arxiv)]() 
+[![Dataset](https://img.shields.io/badge/Dataset-Coming_Soon-E59FB6)]() 
+[![Model](https://img.shields.io/badge/Model-Hugging_Face-CFAFD4)]() 
+[![Benchmark](https://img.shields.io/badge/Benchmark-Hugging_Face-96D03A)]() 
+[![Static Badge](https://img.shields.io/badge/Try_Demo-6B88E3?logo=youtubegaming&logoColor=DAE4EE)]() 
+[![Homepage](https://img.shields.io/badge/Homepage-visit-9DC3E6)]() 
+
+</div>
+
+<p align="center">
+    <img src="assets/demo.gif" width="100%" style="margin-bottom: 0.2;"/>
+</p>
+
+<p align="center" style="margin-bottom: 5px;">
+  VideoRefer can understand any object you're interested in within a video.
+</p>
+<p align="center" style="margin-top: 5px;">
+  This demo integrates SAM 2 for visualization.
+</p>
+
+## 📰 News
+* **[2024.12.xx]**  We release the code of VideoRefer and VideoRefer-Bench.
+
+
+<details open><summary>💡 Some other multimodal-LLM projects from our team may interest you ✨. </summary><p>
+<!--  may -->
+
+> [**Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding**](https://github.com/DAMO-NLP-SG/Video-LLaMA) <br>
+> Hang Zhang, Xin Li, Lidong Bing <br>
+[![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/DAMO-NLP-SG/Video-LLaMA)  [![github](https://img.shields.io/github/stars/DAMO-NLP-SG/Video-LLaMA.svg?style=social)](https://github.com/DAMO-NLP-SG/Video-LLaMA) [![arXiv](https://img.shields.io/badge/Arxiv-2306.02858-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2306.02858) <br>
+
+> [**VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs**](https://github.com/DAMO-NLP-SG/VideoLLaMA2) <br>
+> Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing <br>
+[![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/DAMO-NLP-SG/VideoLLaMA2)  [![github](https://img.shields.io/github/stars/DAMO-NLP-SG/VideoLLaMA2.svg?style=social)](https://github.com/DAMO-NLP-SG/VideoLLaMA2) [![arXiv](https://img.shields.io/badge/Arxiv-2406.07476-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2406.07476) <br>
+
+> [**Osprey: Pixel Understanding with Visual Instruction Tuning**](https://github.com/CircleRadon/Osprey) <br>
+> Yuqian Yuan, Wentong Li, Jian Liu, Dongqi Tang, Xinjie Luo, Chi Qin, Lei Zhang, Jianke Zhu <br>
+[![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/CircleRadon/Osprey)  [![github](https://img.shields.io/github/stars/CircleRadon/Osprey.svg?style=social)](https://github.com/CircleRadon/Osprey) [![arXiv](https://img.shields.io/badge/Arxiv-2312.10032-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2312.10032) <br>
+
+</p></details>
+
+
+## 🔍 About VideoRefer Suite 
+
+`VideoRefer Suite` is a suite designed to enhance the fine-grained spatial-temporal understanding capabilities of Video Large Language Models (Video LLMs). It consists of three primary components:
+
+* **Model (VideoRefer)**
+
+`VideoRefer` is an effective Video LLM that enables fine-grained perception, reasoning, and retrieval for user-defined regions at any specified timestamp. It supports both single-frame and multi-frame region inputs.
+
+<p align="center">
+    <img src="assets/model.png" width="90%" style="margin-bottom: 0.2;"/>
+</p>
+
+
+* **Dataset (VideoRefer-700K)**
+
+`VideoRefer-700K` is a large-scale, high-quality object-level video instruction dataset, curated with a sophisticated multi-agent data engine to fill the gap in high-quality object-level video instruction data.
+
+<p align="center">
+    <img src="assets/dataset.png" width="90%" style="margin-bottom: 0.2;"/>
+</p>
+
+
+* **Benchmark (VideoRefer-Bench)**
+
+`VideoRefer-Bench` is a comprehensive benchmark to evaluate the object-level video understanding capabilities of a model, which consists of two sub-benchmarks: **VideoRefer-Bench-D** and **VideoRefer-Bench-Q**.
+
+<p align="center">
+    <img src="assets/benchmark.png" width="70%" style="margin-bottom: 0.2;"/>
+</p>
+
+
+
+## 🛠️ Requirements and Installation
+Basic Dependencies:
+* Python >= 3.8
+* PyTorch >= 2.2.0
+* CUDA Version >= 11.8
+* transformers == 4.40.0 (for reproducing paper results)
+* tokenizers == 0.19.1
+
+Install required packages:
+```bash
+git clone https://github.com/DAMO-NLP-SG/VideoRefer
+cd VideoRefer
+pip install -r requirements.txt
+pip install flash-attn==2.5.8 --no-build-isolation
+```
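
After installation, a quick sanity check of the Python packages listed above can catch version drift before training or evaluation. The sketch below uses only the standard library. It assumes PyTorch is installed under the distribution name `torch`; `check_environment` and `parse_version` are hypothetical helpers. It does not check the Python or CUDA versions.

```python
import re
from importlib import metadata
from typing import Callable, List, Tuple

MINIMUM = {"torch": "2.2.0"}                                 # listed as ">="
PINNED = {"transformers": "4.40.0", "tokenizers": "0.19.1"}  # listed as "=="


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '2.2.0+cu118' into (2, 2, 0); local build tags are ignored."""
    public = version.split("+")[0]
    return tuple(int(part) for part in re.findall(r"\d+", public)[:3])


def check_environment(
    get_version: Callable[[str], str] = metadata.version,
) -> List[str]:
    """Return human-readable problems; an empty list means all checks pass."""
    problems = []
    for name, wanted in {**MINIMUM, **PINNED}.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        if name in MINIMUM and parse_version(found) < parse_version(wanted):
            problems.append(f"{name} {found} is older than {wanted}")
        elif name in PINNED and parse_version(found) != parse_version(wanted):
            problems.append(f"{name} {found} differs from pinned {wanted}")
    return problems
```

Calling `check_environment()` with no arguments inspects the active environment; the `get_version` parameter exists only so the check can be exercised without the packages installed.
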
+
+
+## 🗝️ Training & Evaluation
+### Training
+The training data and data structure can be found in [Dataset preparation](training.md).
+
+The training pipeline of our model is structured into four distinct stages.
+
+- **Stage1: Image-Text Alignment Pre-training**
+    - We use the same data as in [VideoLLaMA2.1](https://github.com/DAMO-NLP-SG/VideoLLaMA2).
+    - The pretrained projector weights can be found in [VideoLLaMA2.1-7B-16F-Base](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA2.1-7B-16F-Base).
+
+- **Stage2: Region-Text Alignment Pre-training**
+    - Prepare datasets used for stage2.
+    - Run `bash scripts/train/stage2.sh`.
+
+- **Stage2.5:  High-Quality Knowledge Learning**
+    - Prepare datasets used for stage2.5.
+    - Run `bash scripts/train/stage2.5.sh`.
+
+- **Stage3:  Visual Instruction Tuning**
+    - Prepare datasets used for stage3.
+    - Run `bash scripts/train/stage3.sh`.
+
+### Evaluation
+For model evaluation, please refer to [eval](./eval/eval.md).
+
+## 🌏 Model Zoo
+| Model Name     | Visual Encoder | Language Decoder | # Training Frames |
+|:----------------|:----------------|:------------------|:----------------:|
+| [VideoRefer-7B]() | [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct)  | 16 |
+| [VideoRefer-7B-stage2]()  | [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct)  | 16 |
+| [VideoRefer-7B-stage2.5]()  | [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct)  | 16 |
+
+
+## 🕹️ VideoRefer-Bench
+
+`VideoRefer-Bench` assesses models in two key areas: description generation (`VideoRefer-Bench-D`) and multiple-choice question answering (`VideoRefer-Bench-Q`).
+
+https://github.com/user-attachments/assets/33757d27-56bd-4523-92da-8f5a58fe5c85
+
+- The annotations of the benchmark can be found in [🤗benchmark]().
+
+- The usage of VideoRefer-Bench is detailed in [doc](./Benchmark.md).
+
+
+## 📑 Citation
+
+If you find VideoRefer Suite useful for your research and applications, please cite using this BibTeX:
+```bibtex
+@article{yuan2024videorefersuite,
+  title = {VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM},
+  author = {Yuqian Yuan and Hang Zhang and Wentong Li and Zesen Cheng and Boqiang Zhang and Long Li and Xin Li and Deli Zhao and Wenqiao Zhang and Yueting Zhuang and Jianke Zhu and Lidong Bing},
+  journal={arXiv},
+  year={2024},
+  url = {}
+}
+```
+
+## 👍 Acknowledgement
+The codebase of VideoRefer is adapted from [**VideoLLaMA 2**](https://github.com/DAMO-NLP-SG/VideoLLaMA2).
+The LLM we used is Qwen2.
+
diff --git a/assets/benchmark.png b/assets/benchmark.png
diff --git a/assets/dataset.png b/assets/dataset.png
diff --git a/assets/demo.gif b/assets/demo.gif
diff --git a/assets/model.png b/assets/model.png
diff --git a/assets/videorefer.png b/assets/videorefer.png
diff --git a/eval/eval.md b/eval/eval.md
@@ -0,0 +1,55 @@
+# Evaluation for VideoRefer 📊
+
+This document provides instructions on evaluating VideoRefer on video referring tasks and general video understanding tasks.
+
+## 1. VideoRefer-Bench
+Please prepare the datasets and annotations used for evaluation, as outlined in [VideoRefer-Bench](../Benchmark.md).
+1. VideoRefer-Bench-D
+```bash
+CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/eval/eval_videorefer-bench-d.sh
+```
+Note: Adjust the `--mode` parameter to switch between annotation modes: use `single` for single-frame mode and `multi` for multi-frame mode.
+
+2. VideoRefer-Bench-Q
+```bash
+CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/eval/eval_videorefer-bench-q.sh
+```
+Note: 
+- First fill in `AZURE_API_KEY`, `AZURE_API_ENDPOINT`, and `AZURE_API_DEPLOYNAME` in `eval_videorefer-bench-q.sh`.
+- Adjust the `--mode` parameter to switch between annotation modes: use `single` for single-frame mode and `multi` for multi-frame mode.
+
+
+## 2. General Video Understanding
+We evaluate on three benchmarks: MVBench, Video-MME, and Perception Test.
+
+The evaluation data structure is derived from [VideoLLaMA2](https://github.com/DAMO-NLP-SG/VideoLLaMA2).
+
+```
+VideoLLaMA2
+├── eval
+│   ├── mvbench # Official website: https://huggingface.co/datasets/OpenGVLab/MVBench
+|   |   ├── video/
+|   |   |   ├── clever/
+|   |   |   └── ...
+|   |   └── json/
+|   |   |   ├── action_antonym.json
+|   |   |   └── ...
+│   ├── perception_test_mcqa # Official website: https://github.com/google-deepmind/perception_test
+|   |   ├── videos/ # Available at: https://storage.googleapis.com/dm-perception-test/zip_data/test_videos.zip
+|   |   └── mc_question_test.json # Download from https://storage.googleapis.com/dm-perception-test/zip_data/mc_question_test_annotations.zip
+│   ├── videomme # Official website: https://video-mme.github.io/home_page.html#leaderboard
+|   |   ├── test-00000-of-00001.parquet
+|   |   ├── videos/
+|   |   └── subtitles/
+```
+
+Running command:
+
+```bash
+# mvbench evaluation
+CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/eval/eval_video_qa_mvbench.sh
+# videomme evaluation
+CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/eval/eval_video_mcqa_videomme.sh
+# perception test evaluation
+CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/eval/eval_video_mcqa_perception_test_mcqa.sh
+```