🌐 Homepage | 🤗 Paper | 📖 arXiv | 🏆 Leaderboard
Figure 1: The main tasks of VidEgoThink benchmark to comprehensively assess the egocentric video understanding capabilities in Embodied AI. There are four types of tasks, including video question answering, hierarchy planning, visual grounding, and reward modeling. These four tasks are complementary to each other to implement a complete goal for Embodied AI.
[2024-10]: Our paper VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI has been released.
[2024-09]: EgoThink and VidEgoThink is invited to be presented in ZhiDX.
Given that the utilization of foundation models in Embodied AI remains an open research question, we carefully design four types of interrelated tasks for comprehensive assessment: (i) video question-answering, (ii) hierarchy planning, (iii) visual grounding, (iv) reward modeling.
You can use Ego4D CLI to get the original egocentric videos of Ego4d GoalStep.
# download goalstep videos
ego4d --datasets full_scale --benchmark goalstep -o <out-dir>
Please directly clone our GitHub Repo.
git clone https://github.com/AdaCheng/VidEgoThink.git
cd data
The format of our annotations are as follows, where this video_path
indicates the clipped video from start_time
to end_time
of the original video_uid
in Ego4D GoalStep. The image_path
contains the uniformly sampled keyframes from our clipped videos.
[
{
"video_uid": "a13a145f-920a-44ec-8aef-b489c097f4a7",
"start_time": 294.21739,
"end_time": 341.15273,
"video_path": "151.mp4",
"image_path": [
"151/frame_0001.png",
"151/frame_0015.png",
"151/frame_0030.png",
"151/frame_0045.png",
"151/frame_0060.png",
"151/frame_0074.png",
"151/frame_0089.png",
"151/frame_0104.png"
],
"question": "How many times did I adjust a container in the cupboard with my right hand?",
"answer": "Twice."
},
]
Considering the license of Ego4D and the large file size, readers need to use our scripts to process the original egocentric videos. 😎 We will also try to share our videos and images to external cloud soon.
- Prepare clipped videos.
python video_clip.py \
--data_path /VidEgoThin/data/${annotation_file} \
--video_folder /goal_step/v2/full_scale/ \
--output_folder /data/${clipped_video_folder}
- Prepare sampled keyframes. (Optional, we use the same keyframes for multi-images MLLMs to ensure fairness. You can choose better strategy.)
python keyframe_extract.py \
--input_folder /data/${clipped_video_folder} \
--output_folder /data/${keyframe_folder}
🫰 Thank you very much if you would like to contribute the code of the new model you have deployed!
- create
test_{new_model}.py
in/models
. - Add the new model in
get_model()
in/models/__init__.py
.
# Qwen2-VL-7B-Instruct
if model_name == 'qwen2_vl':
from .test_qwen2vl import TestQwen2VL
return TestQwen2VL(device)
- API-based Model
Please update the API-based models' keys and base_urls between the line 23 to line 33 of file gpt_eval.py.
# dataset: Activity, Object/existence, etc.
# MODEL: GPT series models, such as gpt-4o
# INFERENCE_TYPE: {caption, frames, 32-frames, text}
# TASK: {vqa, hp_high2mid, hp_mid2low, rm_critique, rm_feedback}
python gpt_eval.py \
--model_name $MODEL \
--inference_type $INFERENCE_TYPE \
--annotation_path /${dataset}/annotations.json \
--video_folder /data/${clipped_video_folder} \
--image_folder /data/${keyframe_folder} \
--answer_path /answer/${dataset} \
--task $TASK
- Open-Source Model (@TODO: double check)
# dataset: Activity, Object/existence, etc.
# MODEL: models defined in the models file
# DEVICE: GPU id, 0/1/2..., currently only single card can run
python eval.py \
--model_name $MODEL \
--annotation_path /${dataset}/annotations.json \
--answer_path /answer/${dataset} \
--batch_size 1 \
--device $DEVICE
Please update the API-based models' key and base between the line 463 to line 546 of file common.py.
# data-folder: the folder name of answer.
# bench-name: Activity, Object/existence, etc.
# EVA_MODELS: a list of models to be evaluated (separated by spaces), for example "llava-13b-llama2 llava-1.5-13b llava-1.5-7b"
# $EVA_JUDGE_MODEL: gpt-4o (default), gpt-3.5-turbo, claude-2, etc.
python gen_judgment.py \
--data-folder /answer \
--bench-name $dataset \
--mode single \
--model-list $EVA_MODELS \
--judge-model $EVA_JUDGE_MODEL
--parallel 4
--judge-file judge_prompts.jsonl
# EVA_MODELS: a list of models to be evaluated (separated by spaces), for example "llava-13b-llama2 llava-1.5-13b llava-1.5-7b"
# $EVA_JUDGE_MODEL: gpt-4 (default), gpt-3.5-turbo, claude-2, etc.
python show_result.py \
--input-file {data_folder}/{bench-name}/model_judgment/{judge-model}_single.jsonl \
--judge-model $EVA_JUDGE_MODEL \
--model-list $EVA_MODELS \
--mode single
Table 1: Experimental results of video question answering. OE, OO, OI, OC, OS, OP denote object existence, object order, object interaction, object count, object state, object prediction. AE, AS, AC indicates action existence, action sequence, action count. SE, ST, SP denote scene existence, scene transition, scene prediction. The bold font denotes the best performance and the underline font denotes the second-best performance.
Table 2: Experimental results of video question answerng, hierarchy planning, visual grounding, and reward modeling tasks. The bold font denotes the best performance and the underline font denotes the second-best performance.
- Sijie Cheng: [email protected]
@article{cheng2024videgothink,
title={VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI},
author={Cheng, Sijie and Fang, Kechen and Yu, Yangyang and Zhou, Sicheng and Li, Bohao and Tian, Ye and Li, Tingguang and Han, Lei and Liu, Yang},
journal={arXiv preprint arXiv:2410.11623},
year={2024}
}
If you are intested in our VidEgoThink, we strongly recommend you to read our previous related work, EgoThink.🥰
@InProceedings{Cheng_2024_CVPR,
author = {Cheng, Sijie and Guo, Zhicheng and Wu, Jingwen and Fang, Kechen and Li, Peng and Liu, Huaping and Liu, Yang},
title = {EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
pages = {14291-14302}
}
Thanks to Yuyang You for his support in data collection and inference. Thanks to Xiang Yue, Yuanzhi Li, Jiangjie Chen for their early discussion.
Furthermore, we appreciate the developers behind the following projects for their significant contributions to our research: EgoThink, Ego4D, Multi-Modality-Arena, FastChat.