
Awesome-LLMs-for-Video-Understanding

Yunlong Tang1,*, Jing Bi1,*, Siting Xu2,*, Luchuan Song1, Susan Liang1, Teng Wang2,3, Daoan Zhang1, Jie An1, Jingyang Lin1, Rongyi Zhu1, Ali Vosoughi1, Chao Huang1, Zeliang Zhang1, Pinxin Liu1, Mingqian Feng1, Feng Zheng2, Jianguo Zhang2, Ping Luo3, Jiebo Luo1, Chenliang Xu1,†. (*Core Contributors, †Corresponding Authors)

1University of Rochester, 2Southern University of Science and Technology, 3The University of Hong Kong


📢 News

[07/23/2024]

📢 We've recently updated our survey: “Video Understanding with Large Language Models: A Survey”!

✨ This comprehensive survey covers video understanding techniques powered by large language models (Vid-LLMs), training strategies, relevant tasks, datasets, benchmarks, and evaluation methods, and discusses the applications of Vid-LLMs across various domains.

🚀 What's New in This Update:
✅ Updated to include around 100 additional Vid-LLMs and 15 new benchmarks as of June 2024.
✅ Introduced a novel taxonomy for Vid-LLMs based on video representation and LLM functionality.
✅ Added a Preliminary chapter, reclassifying video understanding tasks from the perspectives of granularity and language involvement, and enhanced the LLM Background section.
✅ Added a new Training Strategies chapter, removing adapters as a factor for model classification.
✅ All figures and tables have been redesigned.

Multiple minor updates will follow this major one, and the GitHub repository will be updated gradually. We welcome your reading and feedback ❤️


Why do we need Vid-LLMs?


😎 Vid-LLMs: Models


📑 Citation

If you find our survey useful for your research, please cite the following paper:

@article{vidllmsurvey,
      title={Video Understanding with Large Language Models: A Survey}, 
      author={Tang, Yunlong and Bi, Jing and Xu, Siting and Song, Luchuan and Liang, Susan and Wang, Teng and Zhang, Daoan and An, Jie and Lin, Jingyang and Zhu, Rongyi and Vosoughi, Ali and Huang, Chao and Zhang, Zeliang and Zheng, Feng and Zhang, Jianguo and Luo, Ping and Luo, Jiebo and Xu, Chenliang},
      journal={arXiv preprint arXiv:2312.17432},
      year={2023},
}

🗒️ Taxonomy 1

🕹️ Video Analyzer × LLM

LLM as Summarizer
Title Model Date Code Venue
Seeing the Unseen: Visual Metaphor Captioning for Videos GIT-LLaVA 06/2024 code arXiv
Zero-shot long-form video understanding through screenplay MM-Screenplayer 06/2024 project page CVPR
MoReVQA exploring modular reasoning models for video question answering MoReVQA 04/2024 project page CVPR
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM IG-VLM 03/2024 code arXiv
Language repository for long video understanding LangRepo 03/2024 code arXiv
Understanding long videos in one multimodal language model pass MVU 03/2024 code arXiv
Video ReCap recursive captioning of hour-long videos Video ReCap 02/2024 code CVPR
A Simple LLM Framework for Long-Range Video Question-Answering LLoVi 12/2023 code arXiv
Grounding-prompter prompting LLM with multimodal information for temporal sentence grounding in long videos Grounding-prompter 12/2023 code arXiv
Learning object state changes in videos an open-world perspective VIDOSC 12/2023 code CVPR
AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos? AntGPT 07/2023 code ICLR
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset VAST 05/2023 code NeurIPS
VLog: Video as a Long Document VLog 04/2023 code -
Learning Video Representations from Large Language Models LaViLa 12/2022 code CVPR
LLM as Manager
Title Model Date Code Venue
DrVideo: Document Retrieval Based Long Video Understanding DrVideo 06/2024 code arXiv
OmAgent a multi-modal agent framework for complex video understanding with task divide-and-conquer OmAgent 06/2024 code arXiv
Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA LVNet 06/2024 code arXiv
VideoTree adaptive tree-based video representation for LLM reasoning on long videos VideoTree 05/2024 code arXiv
Harnessing Large Language Models for Training-free Video Anomaly Detection LAVAD 04/2024 code CVPR
TraveLER a multi-LMM agent framework for video question-answering TraveLER 04/2024 code arXiv
GPTSee enhancing moment retrieval and highlight detection via description-based similarity features GPTSee 03/2024 code arXiv
Reframe anything LLM agent for open world video reframing RAVA 03/2024 code arXiv
SCHEMA state CHangEs MAtter for procedure planning in instructional videos SCHEMA 03/2024 code ICLR
TV-TREES multimodal entailment trees for neuro-symbolic video reasoning TV-TREES 02/2024 code arXiv
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding VideoAgent 03/2024 project page arXiv
VideoAgent long-form video understanding with large language model as agent VideoAgent 03/2024 code arXiv
VURF a general-purpose reasoning and self-refinement framework for video understanding VURF 03/2024 code arXiv
Why not use your textbook knowledge-enhanced procedure planning of instructional videos KEPP 03/2024 code CVPR
DoraemonGPT toward understanding dynamic scenes with large language models DoraemonGPT 01/2024 code arXiv
LifelongMemory: Leveraging LLMs for Answering Queries in Long-form Egocentric Videos LifelongMemory 12/2023 code arXiv
Zero-Shot Video Question Answering with Procedural Programs ProViQ 12/2023 code arXiv
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn AssistGPT 06/2023 code arXiv
ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System ChatVideo 04/2023 project page arXiv
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions Video ChatCaptioner 04/2023 code arXiv
ViperGPT: Visual Inference via Python Execution for Reasoning ViperGPT 03/2023 code arXiv
Hawk: Learning to Understand Open-World Video Anomalies Hawk 05/2024 code arXiv

👾 Video Embedder × LLM

LLM as Text Decoder
Title Model Date Code Venue
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark AuroraCap 10/2024 project page arXiv
Artemis towards referential understanding in complex videos Artemis 06/2024 code arXiv
EmoLLM multimodal emotional understanding meets large language models EmoLLM 06/2024 code arXiv
Fewer tokens and fewer videos extending video understanding abilities in large vision-language models FTFV-LLM 06/2024 - arXiv
Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams Flash-VStream 06/2024 code arXiv
LLAVIDAL benchmarking large language vision models for daily activities of living LLAVIDAL 06/2024 code arXiv
Long context transfer from language to vision LongVA 06/2024 code arXiv
ShareGPT4Video improving video understanding and generation with better captions ShareGPT4Video 06/2024 code arXiv
Towards event-oriented long video understanding VIM 06/2024 code arXiv
Video-SALMONN speech-enhanced audio-visual large language models Video-SALMONN 06/2024 code ICML
VideoGPT+ integrating image and video encoders for enhanced video understanding VideoGPT+ 06/2024 code arXiv
VideoLLaMA 2 advancing spatial-temporal modeling and audio understanding in video-LLMs VideoLLaMA 2 06/2024 code arXiv
MotionLLM: Understanding Human Behaviors from Human Motions and Videos MotionLLM 05/2024 project page arXiv
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark VideoChat2 11/2023 code CVPR
Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models For Video Captioning and Summarization Shotluck Holmes 05/2024 - arXiv
Streaming long video understanding with large language models VideoStreaming 05/2024 - arXiv
Synchronized Video Storytelling: Generating Video Narrations with Structured Storyline VideoNarrator 05/2024 - arXiv
TOPA extend large language models for video understanding via text-only pre-alignment TOPA 05/2024 code NeurIPS
MovieChat+: Question-aware Sparse Memory for Long Video Question Answering MovieChat+ 04/2024 code arXiv
AutoAD III: The Prequel – Back to the Pixels AutoAD III 04/2024 project page CVPR
Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward LLaVA-Hound-DPO 04/2024 code arXiv
From image to video, what do we need in multimodal LLMs RED-VILLM 04/2024 - arXiv
Koala key frame-conditioned long video-LLM Koala 04/2024 project page CVPR
LongVLM efficient long video understanding via large language models LongVLM 04/2024 code ECCV
MA-LMM memory-augmented large multimodal model for long-term video understanding MA-LMM 04/2024 code CVPR
MiniGPT4-video advancing multimodal LLMs for video understanding with interleaved visual-textual tokens MiniGPT4-Video 04/2024 code arXiv
Pegasus-v1 technical report Pegasus-v1 04/2024 code arXiv
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning PLLaVA 04/2024 code arXiv
ST-LLM: Large Language Models Are Effective Temporal Learners ST-LLM 04/2024 code arXiv
Tarsier recipes for training and evaluating large video description models Tarsier 07/2024 code arXiv
X-VARS introducing explainability in football refereeing with multi-modal large language model X-VARS 04/2024 code arXiv
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios CAT 03/2024 code arXiv
InternVideo2 scaling video foundation models for multimodal video understanding InternVideo2 03/2024 code ECCV
MovieLLM enhancing long video understanding with AI-generated movies MovieLLM 03/2024 code arXiv
LLMs meet long video advancing long video comprehension with an interactive visual adapter in LLMs IVAwithLLM 02/2024 code arXiv
LSTP language-guided spatial-temporal prompt learning for long-form video-text understanding LSTP 02/2024 code EMNLP
LVCHAT facilitating long video comprehension LVCHAT 02/2024 code arXiv
OSCaR: Object State Captioning and State Change Representation OSCaR 02/2024 code NAACL
Slot-VLM SlowFast slots for video-language modeling Slot-VLM 02/2024 code arXiv
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training COSMO 01/2024 code arXiv
Weakly supervised gaussian contrastive grounding with large multimodal models for video question answering GCG 01/2024 code ACMMM
Audio-Visual LLM for Video Understanding AV-LLM 12/2023 code arXiv
Generative Multimodal Models are In-Context Learners Emu2 12/2023 project page CVPR
MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples MMICT 12/2023 code TOMM
VaQuitA : Enhancing Alignment in LLM-Assisted Video Understanding VaQuitA 12/2023 code arXiv
VILA: On Pre-training for Visual Language Models VILA 12/2023 code CVPR
Vista-LLaMA reliable video narrator via equal distance to visual tokens Vista-LLaMA 12/2023 project page arXiv
Chat-UniVi unified visual representation empowers large language models with image and video understanding Chat-UniVi 11/2023 code CVPR
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models LLaMA-VID 11/2023 code arXiv
Video-LLaVA learning united visual representation by alignment before projection Video-LLaVA 11/2023 code arXiv
Large Language Models are Temporal and Causal Reasoners for Video Question Answering LLaMA-VQA 10/2023 code EMNLP
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding MovieChat 07/2023 code CVPR
LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning LLMVA-GEBC 06/2023 code CVPR
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration Macaw-LLM 06/2023 project page arXiv
Valley: Video Assistant with Large Language model Enhanced abilitY VALLEY 06/2023 code arXiv
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models Video-ChatGPT 06/2023 code ACL
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding Video-LLaMA 06/2023 code EMNLP
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks mPLUG-video 06/2023 code arXiv
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst ChatBridge 05/2023 code arXiv
Otter: A Multi-Modal Model with In-Context Instruction Tuning Otter 05/2023 code arXiv
VideoLLM: Modeling Video Sequence with Large Language Models VideoLLM 05/2023 code arXiv
LLM as Regressor
Title Model Date Code Venue
Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM Holmes-VAD 06/2024 code arXiv
VideoLLM-online online video large language model for streaming video VideoLLM-online 06/2024 code CVPR
HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision VLM4HOI 04/2024 project page arXiv
V2Xum-LLM cross-modal video summarization with temporal prompt instruction tuning V2Xum-LLaMA 04/2024 code arXiv
AVicuna: Audio-Visual LLM with Interleaver and Context-Boundary Alignment for Temporal Referential Dialogue AVicuna 03/2024 code arXiv
Elysium exploring object-level perception in videos via MLLM Elysium 03/2024 code arXiv
HawkEye training video-text LLMs for grounding text in videos HawkEye 03/2024 code arXiv
LITA language instructed temporal-localization assistant LITA 03/2024 code arXiv
OmniViD: A Generative Framework for Universal Video Understanding OmniViD 03/2024 code CVPR
GroundingGPT: Language Enhanced Multi-modal Grounding Model GroundingGPT 01/2024 code arXiv
TimeChat a time-sensitive multimodal large language model for long video understanding TimeChat 12/2023 code CVPR
Self-Chained Image-Language Model for Video Localization and Question Answering SeViLA 11/2023 code NeurIPS
VTimeLLM: Empower LLM to Grasp Video Moments VTimeLLM 11/2023 code arXiv
LLM as Hidden Layer
Title Model Date Code Venue
VTG-LLM integrating timestamp knowledge into video LLMs for enhanced video temporal grounding VTG-LLM 05/2024 code arXiv
VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing VITRON 04/2024 project page NeurIPS
VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT VTG-GPT 03/2024 code arXiv
Momentor advancing video large language model with fine-grained temporal reasoning Momentor 02/2024 code ICML
Detours for navigating instructional videos VidDetours 01/2024 code CVPR
OneLLM: One Framework to Align All Modalities with Language OneLLM 12/2023 code arXiv
GPT4Video a unified multimodal large language model for instruction-followed understanding and safety-aware generation GPT4Video 11/2023 code ACMMM

🧭 (Analyzer + Embedder) × LLM

LLM as Manager
Title Model Date Code Venue
MM-VID: Advancing Video Understanding with GPT-4V(ision) MM-VID 10/2023 - arXiv
LLM as Summarizer
Title Model Date Code Venue
Shot2Story20K a new benchmark for comprehensive understanding of multi-shot videos SUM-shot 12/2023 code arXiv
LLM as Regressor
Title Model Date Code Venue
Vript: A Video Is Worth Thousands of Words Vriptor 06/2024 code NeurIPS
Merlin:Empowering Multimodal LLMs with Foresight Minds Merlin 12/2023 project page ECCV
VideoChat: Chat-Centric Video Understanding VideoChat 05/2023 code arXiv
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning Vid2Seq 02/2023 code CVPR
LLM as Text Decoder
Title Model Date Code Venue
Contextual AD Narration with Interleaved Multimodal Sequence Uni-AD 03/2024 code arXiv
MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning MM-narrator 11/2023 project page arXiv
Vamos: Versatile Action Models for Video Understanding Vamos 11/2023 project page ECCV
AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description Auto-AD II 10/2023 project page ICCV
LLM as Hidden Layer
Title Model Date Code Venue
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models PG-Video-LLaVA 11/2023 code arXiv

🗒️ Taxonomy 2

🤖 LLM-based Video Agents

Title Model Date Code Venue
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language Socratic Models 04/2022 project page arXiv
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions Video ChatCaptioner 04/2023 code arXiv
VLog: Video as a Long Document VLog 04/2023 code -
ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System ChatVideo 04/2023 project page arXiv
MM-VID: Advancing Video Understanding with GPT-4V(ision) MM-VID 10/2023 - arXiv
MISAR: A Multimodal Instructional System with Augmented Reality MISAR 10/2023 project page ICCV
Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos Grounding-Prompter 12/2023 - arXiv
NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation NaVid 02/2024 project page RSS
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding VideoAgent 03/2024 project page arXiv
VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs VideoINSTA 09/2024 code EMNLP

🎥 Vid-LLM Pretraining

Title Model Date Code Venue
Learning Video Representations from Large Language Models LaViLa 12/2022 code CVPR
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning Vid2Seq 02/2023 code CVPR
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset VAST 05/2023 code NeurIPS
Merlin:Empowering Multimodal LLMs with Foresight Minds Merlin 12/2023 - arXiv

👀 Vid-LLM Instruction Tuning

Fine-tuning with Connective Adapters
Title Model Date Code Venue
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding Video-LLaMA 06/2023 code arXiv
VALLEY: Video Assistant with Large Language model Enhanced abilitY VALLEY 06/2023 code -
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models Video-ChatGPT 06/2023 code arXiv
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration Macaw-LLM 06/2023 code arXiv
LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning LLMVA-GEBC 06/2023 code CVPR
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks mPLUG-video 06/2023 code arXiv
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding MovieChat 07/2023 code arXiv
Large Language Models are Temporal and Causal Reasoners for Video Question Answering LLaMA-VQA 10/2023 code EMNLP
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection Video-LLaVA 11/2023 code arXiv
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding Chat-UniVi 11/2023 code arXiv
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models LLaMA-VID 11/2023 code arXiv
VISTA-LLAMA: Reliable Video Narrator via Equal Distance to Visual Tokens VISTA-LLAMA 12/2023 - arXiv
Audio-Visual LLM for Video Understanding - 12/2023 - arXiv
AutoAD: Movie Description in Context AutoAD 06/2023 code CVPR
AutoAD II: The Sequel - Who, When, and What in Movie Audio Description AutoAD II 10/2023 - ICCV
AutoAD III: The Prequel -- Back to the Pixels AutoAD III 04/2024 - CVPR
Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models FAVOR 10/2023 code arXiv
VideoLLaMA2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs VideoLLaMA2 06/2024 code arXiv
Fine-tuning with Insertive Adapters
Title Model Date Code Venue
Otter: A Multi-Modal Model with In-Context Instruction Tuning Otter 06/2023 code arXiv
VideoLLM: Modeling Video Sequence with Large Language Models VideoLLM 05/2023 code arXiv
Fine-tuning with Hybrid Adapters
Title Model Date Code Venue
VTimeLLM: Empower LLM to Grasp Video Moments VTimeLLM 11/2023 code arXiv
GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation GPT4Video 11/2023 - arXiv

🦾 Hybrid Methods

Title Model Date Code Venue
VideoChat: Chat-Centric Video Understanding VideoChat 05/2023 code demo arXiv
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models PG-Video-LLaVA 11/2023 code arXiv
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding TimeChat 12/2023 code CVPR
Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding Video-GroundingDINO 12/2023 code arXiv
A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot Video4096 05/2023 - EMNLP

💎 Training-free Methods

Title Model Date Code Venue
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models SlowFast-LLaVA 07/2024 - arXiv
TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models TS-LLaVA 11/2024 code arXiv

Tasks, Datasets, and Benchmarks

Recognition and Anticipation

Name Paper Date Link Venue
Charades Hollywood in homes: Crowdsourcing data collection for activity understanding 2016 Link ECCV
YouTube8M YouTube-8M: A Large-Scale Video Classification Benchmark 2016 Link -
ActivityNet ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding 2015 Link CVPR
Kinetics-GEBC GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval 2022 Link ECCV
Kinetics-400 The Kinetics Human Action Video Dataset 2017 Link -
VidChapters-7M VidChapters-7M: Video Chapters at Scale 2023 Link NeurIPS

Captioning and Description

Name Paper Date Link Venue
Microsoft Research Video Description Corpus (MSVD) Collecting Highly Parallel Data for Paraphrase Evaluation 2011 Link ACL
Microsoft Research Video-to-Text (MSR-VTT) MSR-VTT: A Large Video Description Dataset for Bridging Video and Language 2016 Link CVPR
Tumblr GIF (TGIF) TGIF: A New Dataset and Benchmark on Animated GIF Description 2016 Link CVPR
Charades Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding 2016 Link ECCV
Charades-Ego Actor and Observer: Joint Modeling of First and Third-Person Videos 2018 Link CVPR
ActivityNet Captions Dense-Captioning Events in Videos 2017 Link ICCV
HowTo100M HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips 2019 Link ICCV
Movie Audio Descriptions (MAD) MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions 2021 Link CVPR
YouCook2 Towards Automatic Learning of Procedures from Web Instructional Videos 2017 Link AAAI
MovieNet MovieNet: A Holistic Dataset for Movie Understanding 2020 Link ECCV
Youku-mPLUG Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks 2023 Link arXiv
Video Timeline Tags (ViTT) Multimodal Pretraining for Dense Video Captioning 2020 Link AACL-IJCNLP
TVSum TVSum: Summarizing web videos using titles 2015 Link CVPR
SumMe Creating Summaries from User Videos 2014 Link ECCV
VideoXum VideoXum: Cross-modal Visual and Textural Summarization of Videos 2023 Link IEEE Trans Multimedia
Multi-Source Video Captioning (MSVC) VideoLLaMA2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs 2024 Link arXiv

Grounding and Retrieval

Name Paper Date Link Venue
Epic-Kitchens-100 Rescaling Egocentric Vision 2021 Link IJCV
VCR (Visual Commonsense Reasoning) From Recognition to Cognition: Visual Commonsense Reasoning 2019 Link CVPR
Ego4D-MQ and Ego4D-NLQ Ego4D: Around the World in 3,000 Hours of Egocentric Video 2021 Link CVPR
Vid-STG Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences 2020 Link CVPR
Charades-STA TALL: Temporal Activity Localization via Language Query 2017 Link ICCV
DiDeMo Localizing Moments in Video with Natural Language 2017 Link ICCV

Question Answering

Name Paper Date Link Venue
MSVD-QA Video Question Answering via Gradually Refined Attention over Appearance and Motion 2017 Link ACM Multimedia
MSRVTT-QA Video Question Answering via Gradually Refined Attention over Appearance and Motion 2017 Link ACM Multimedia
TGIF-QA TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering 2017 Link CVPR
ActivityNet-QA ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering 2019 Link AAAI
Pororo-QA DeepStory: Video Story QA by Deep Embedded Memory Networks 2017 Link IJCAI
TVQA TVQA: Localized, Compositional Video Question Answering 2018 Link EMNLP
MAD-QA Encoding and Controlling Global Semantics for Long-form Video Question Answering 2024 Link EMNLP
Ego-QA Encoding and Controlling Global Semantics for Long-form Video Question Answering 2024 Link EMNLP

Video Instruction Tuning

Pretraining Dataset
Name Paper Date Link Venue
VidChapters-7M VidChapters-7M: Video Chapters at Scale 2023 Link NeurIPS
VALOR-1M VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset 2023 Link arXiv
Youku-mPLUG Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks 2023 Link arXiv
InternVid InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation 2023 Link arXiv
VAST-27M VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset 2023 Link NeurIPS
Fine-tuning Dataset
Name Paper Date Link Venue
MIMIC-IT MIMIC-IT: Multi-Modal In-Context Instruction Tuning 2023 Link arXiv
VideoInstruct100K Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models 2023 Link arXiv
TimeIT TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding 2023 Link CVPR

Video-based Large Language Models Benchmark

Title Date Code Venue
LVBench: An Extreme Long Video Understanding Benchmark 06/2024 code -
Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models 11/2023 code -
Perception Test: A Diagnostic Benchmark for Multimodal Video Models 05/2023 code NeurIPS 2023, ICCV 2023 Workshop
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks 07/2023 code -
FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation 11/2023 code NeurIPS 2023
MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding 12/2023 code -
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark 12/2023 code -
TempCompass: Do Video LLMs Really Understand Videos? 03/2024 code ACL 2024
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis 06/2024 code -
VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models 06/2024 code -

Contributing

We welcome everyone to contribute to this repository and help improve it. You can submit pull requests to add new papers, projects, and helpful materials, or to correct any errors you may find. Please make sure your pull requests follow the "Title|Model|Date|Code|Venue" format, as in the example below. Thank you for your valuable contributions!
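A minimal sketch of an entry in that format, written as a Markdown table row (the paper title, arXiv ID, model name, and repository URL below are placeholders, not a real entry):

```markdown
<!-- Placeholder values; replace with the actual paper's title, model name, date (MM/YYYY), code link, and venue. -->
| Title | Model | Date | Code | Venue |
|-------|-------|------|------|-------|
| [Paper Title](https://arxiv.org/abs/xxxx.xxxxx) | ModelName | 06/2024 | [code](https://github.com/user/repo) | CVPR |
```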

🌟 Star History

Star History Chart

♥️ Contributors

Our project wouldn't be possible without the contributions of these amazing people! Thank you all for making this project better.

Yunlong Tang @ University of Rochester
Jing Bi @ University of Rochester
Siting Xu @ Southern University of Science and Technology
Luchuan Song @ University of Rochester
Susan Liang @ University of Rochester
Teng Wang @ The University of Hong Kong
Daoan Zhang @ University of Rochester
Jie An @ University of Rochester
Jingyang Lin @ University of Rochester
Rongyi Zhu @ University of Rochester
Ali Vosoughi @ University of Rochester
Chao Huang @ University of Rochester
Zeliang Zhang @ University of Rochester
Pinxin Liu @ University of Rochester
Mingqian Feng @ University of Rochester
Feng Zheng @ Southern University of Science and Technology
Jianguo Zhang @ Southern University of Science and Technology
Ping Luo @ University of Hong Kong
Jiebo Luo @ University of Rochester
Chenliang Xu @ University of Rochester
