VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding

If our project helps you, please give us a star ⭐ and cite our paper!

News

10/12/2024, VTG-LLM has been accepted to AAAI 2025.
10/10/2024, We released a more powerful temporal grounding video LLM TRACE.
7/22/2024, Update evaluation results using various temperature.
5/28/2024, NPU checkpoints can be fine-tuned on V100 GPU.

Overview

We introduce

VTG-IT-120K, a high-quality and comprehensive instruction tuning dataset that covers VTG tasks such as moment retrieval (63.2K), dense video captioning (37.2K), video summarization (15.2K), and video highlight detection (3.9K).
VTG-LLM, which (1) effectively integrates timestamp knowledge into visual tokens; (2) incorporates absolute-time tokens that specifically handle timestamp knowledge, thereby avoiding concept shifts; and (3) introduces a lightweight, high-performance slot-based token compression method to facilitate the sampling of more video frames.

Overview of VTG-LLM.

Enviroments

We recommend utilizing NPU environments for training, evaluation, and fine-tuning. The environment we use can be found in environment-npu.yaml. Additionally, we have discovered that executing the script below is sufficient for most scenarios.

bash install_requirements.sh

If an NPU is not available, a V100 can also be employed for training and evaluation, but it cannot be used for fine-tuning checkpoints trained by an NPU. The necessary environments can be found in requirements-v100.txt.

Model Checkpoints

The model checkpoint (without finetuning) is avaliable at huggingface:

git lfs install

git clone https://huggingface.co/Yongxin-Guo/VTG-LLM

Data

See DATA.md for details. The data annotations are avaliable at huggingface:

git lfs install

git clone https://huggingface.co/datasets/Yongxin-Guo/VTG-IT

Requirments

Please download the following model checkpoints:

EVA-ViT-g: https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/eva_vit_g.pth
InstructBLIP: https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/InstructBLIP/instruct_blip_vicuna7b_trimmed.pth
Video-LLaMA: https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-2-7B-Finetuned/tree/main
Bert: https://huggingface.co/google-bert/bert-base-uncased

Script

Tuning

Config the checkpoint and dataset paths in pretrain-slot-sample-fmt-96.yaml. Config the bert checkpoint paths in blip2.py and vtgllm.py

torchrun --nproc_per_node=16 train.py --cfg-path  train_configs/videollama/pretrain-slot-sample-fmt-96.yaml

Evaluation

Config the checkpoint and dataset paths in videollama-slot-96.yaml.

Config the downstream task in eval.sh.

bash eval.sh

Results

Youcook2	CIDER	METEOR	SODA_c	F1
t=1.0 (paper)	5.0	1.9	1.5	17.5
t=0.1	5.4	1.8	1.6	18.4

Charades-STA	0.3	0.5	0.7
t=1.0 (paper)	52.0	33.8	15.7
t=0.1	53.9	36.3	16.6

QVHighlights	mAP	Hit@1
t=1.0 (paper)	16.5	33.5
t=0.1	16.2	30.7

ActivityNet	CIDER	METEOR	SODA_c	F1
t=1.0 (paper)	18.2	5.7	4.7	34.0
t=0.1	20.7	5.9	5.1	34.8

Demo

# cat_and_chicken.mp4

# Describe this video

A cute little kitten is sleeping on a couch. A little chicken is sitting on the cats chest and looking at the camera. The cat is purring and the chicken is moving its head.

# Please locate a series of events in the video, output the start and end timestamps of each event, and describe each event in sentences.

0000.0 - 0010.0 seconds, A cute kitten is sleeping on a couch. 0010.0 - 0020.0 seconds, A yellow bird lands on the couch and gently touches the kitten's head. 0020.0 - 0030.0 seconds, The bird picks up the kitten and starts to play with it. 0030.0 - 0040.0 seconds, The kitten tries to push the bird away, but the bird continues to play with it. 0040.0 - 0050.0 seconds, The kitten falls asleep on the couch.

Gradio Demo

You need to firstly change the path of videos and model checkpoints to your path.

python gradio_demo.py

Recommended GPUs

Instruction-tuning: 16xATN 910B
Inference: 1xV100

Acknowledgement

We are grateful for the following awesome projects:

TimeChat
Video-LLaMA
MiniGPT-4
FastChat
BLIP-2
EVA-CLIP
LLaMA
VideoChat
TESTA
VTimeLLM
Video-LLaVA
entropy_estimators

Bibliography

If you find this repository helpful for your project, please consider citing:

@article{guo2024vtg,
  title={VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding},
  author={Guo, Yongxin and Liu, Jingyu and Li, Mingda and Tang, Xiaoying and Chen, Xi and Zhao, Bo},
  journal={arXiv preprint arXiv:2405.13382},
  year={2024}
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding

If our project helps you, please give us a star ⭐ and cite our paper!

News

Overview

Enviroments

Model Checkpoints

Data

Requirments

Script

Tuning

Evaluation

Results

Demo

Gradio Demo

Recommended GPUs

Acknowledgement

Bibliography

Files

README.md

Latest commit

History

README.md

File metadata and controls

VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding

If our project helps you, please give us a star ⭐ and cite our paper!

News

Overview

Enviroments

Model Checkpoints

Data

Requirments

Script

Tuning

Evaluation

Results

Demo

Gradio Demo

Recommended GPUs

Acknowledgement

Bibliography