LVNet

Official Code for Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA

Accepted to the Workshop on Video-Language Models at NeurIPS 2024.

Abstract

Long-form videos that span across wide temporal intervals are highly information-redundant and contain multiple distinct events or entities that are often loosely related. Therefore, when performing long-form video question answering (LVQA), all information necessary to generate a correct response can often be contained within a small subset of frames. Recent literature explores the use of large language models (LLMs) in LVQA benchmarks, achieving exceptional performance, while relying on vision language models (VLMs) to convert all visual content within videos into natural language. Such VLMs often independently caption a large number of frames uniformly sampled from long videos, which is inefficient and largely redundant. Questioning these decision choices, we explore optimal strategies for key-frame selection and sequence-aware captioning that can significantly reduce these redundancies. We propose two novel approaches that improve each of these aspects, namely Hierarchical Keyframe Selector and Sequential Visual LLM. Our resulting framework, termed LVNet, achieves state-of-the-art performance across three benchmark LVQA datasets.

Accuracy vs Captions on the EgoSchema Subset

LVNet achieves a state-of-the-art accuracy of 68.2% with only 12 captions.
This result highlights the quality of the keyframes chosen by the hierarchical keyframe selector.

Hierarchical Keyframe Selector: Structural Overview

Overall strategy: Generate captions for the frames chosen by the hierarchical keyframe selector, then feed them to a separate LLM to answer the question.
Temporal Scene Clustering (TSC): Divides the long video into scenes, enabling per-scene subsampling.
Coarse Keyframe Detector (CKD): Selects the frames best aligned with keywords relevant to the query.
Fine Keyframe Detector (FKD): Selects frames by refining keyword alignments through templated visual prompting.
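The scene-then-subsample idea behind TSC can be illustrated with a minimal sketch. This is not the code in temporalSceneClustering.py: the function names, the cosine-distance threshold, and the per-scene budget are all assumptions made for illustration, and per-frame feature vectors are assumed to be non-zero and already extracted by some image encoder.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def cluster_scenes(features, threshold=0.5):
    """Split frames into contiguous scenes (hypothetical rule): start a new
    scene whenever consecutive frames differ by more than `threshold`
    in cosine distance."""
    if not features:
        return []
    scenes = [[0]]
    for i in range(1, len(features)):
        if 1.0 - cosine_similarity(features[i - 1], features[i]) > threshold:
            scenes.append([i])      # scene boundary: open a new scene
        else:
            scenes[-1].append(i)    # same scene: extend the current one
    return scenes

def subsample_scene(scene, per_scene):
    """Pick up to `per_scene` (>= 1) evenly spaced frame indices from a scene."""
    k = min(per_scene, len(scene))
    if k == 1:
        return [scene[len(scene) // 2]]
    # Integer arithmetic keeps the first and last frame and spaces the rest.
    return [scene[i * (len(scene) - 1) // (k - 1)] for i in range(k)]

def temporal_scene_subsample(features, threshold=0.5, per_scene=2):
    """Cluster frames into scenes, then subsample uniformly within each scene."""
    selected = []
    for scene in cluster_scenes(features, threshold):
        selected.extend(subsample_scene(scene, per_scene))
    return selected
```

Because the budget is applied per scene, a short scene still contributes frames even when a long, static scene dominates the video, which is the motivation for subsampling per scene rather than across the whole video.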

Hierarchical Keyframe Selector: Operational Visualization

Temporal Scene Clustering (TSC): The 900 input frames are clustered into scenes and uniformly subsampled within each scene, yielding around 280 frames.
Coarse Keyframe Detector (CKD): Selects only 32 of these frames, based on their alignment with keywords derived from the answer options.
Visual Templating: The coarsely selected keyframes are ordered by confidence score and temporal order, then arranged into 4 groups of 8 frames each.
Fine Keyframe Detector (FKD): Selects 12 frames by refining keyword alignments in visual templates.
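The coarse selection and templating steps above can be sketched as follows. The README does not spell out exactly how confidence and temporal order are combined, so this shows one plausible reading: rank by confidence, cut into groups, then sort each group temporally. The function names are hypothetical, and `scores` stands in for whatever keyword-alignment confidence the CLIP-based detector produces for each frame.

```python
def select_top_k(scores, k=32):
    """Return the indices of the k frames with the highest scores.
    `scores[i]` is an assumed keyword-alignment confidence for frame i."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

def build_templates(frame_ids, scores, group_size=8):
    """Group frames by confidence, then order each group temporally.
    This grouping rule is an assumption, not the repository's procedure."""
    by_confidence = sorted(frame_ids, key=lambda i: scores[i], reverse=True)
    groups = [
        by_confidence[start:start + group_size]
        for start in range(0, len(by_confidence), group_size)
    ]
    # Frame indices increase with time, so sorting restores temporal order.
    return [sorted(group) for group in groups]

# Toy example: 280 frames whose score happens to grow with the frame index.
scores = [i / 280 for i in range(280)]
coarse = select_top_k(scores, k=32)
templates = build_templates(coarse, scores, group_size=8)
```

With 32 coarse keyframes and a group size of 8, this yields the 4 groups of 8 frames described above; the fine detector would then choose its 12 frames from these templates.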

Experiments: EgoSchema, NExT-QA, and IntentQA

LVNet achieves state-of-the-art accuracies of 61.1%, 72.9%, and 71.7% on EgoSchema, NExT-QA, and IntentQA, respectively, using just 12 frames, compared with models that use a similar number of captions.
Models that use video-caption pretraining, or significantly more captions than the 12 frames used by LVNet, are de-emphasized in grey or light green, to ensure a fair comparison with image-level pretraining or to highlight caption efficiency.

Evaluation

Generate Answers Using LLM

You can easily run the LLM to generate answers for the questions using the pre-generated captions.

Download the Captions for Dataset

EgoSchema: bash scripts/get_ES_captions.sh

Run LLM: bash scripts/eval_ES.sh

Generate captions using our provided modules

Hierarchical Keyframe Selector (HKS)

Temporal Scene Clustering (TSC): temporalSceneClustering.py
Coarse Keyframe Detector (CKD): coarseKeyframeDetector.py
Fine Keyframe Detector (FKD): fineKeyframeDetector.py

EgoSchema keyframe selection from images: bash config/run.sh
Generate captions based on the keyframes: bash scripts/create_caption.sh

Data

Hierarchical Keyframe Selector hyper-parameters & paths

[LINK]

coarseKeyframeDetector.py CLIP model checkpoint

Perceptual Grouping in Contrastive Vision-Language Models (ICCV 2023)
Checkpoint: Download

Citation

@inproceedings{Park2024TooMF,
  title={Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA},
  author={Jongwoo Park and Kanchana Ranasinghe and Kumara Kahatapitiya and Wonjeong Ryoo and Donghyun Kim and Michael S. Ryoo},
  year={2024}
}
