Official implementation of paper ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding

ReTaKe is a novel approach for long video understanding that reduces temporal and knowledge redundancy, enabling MLLMs to process 8x longer video sequences (up to 2048 frames) under the same memory budget.


📢 Recent Updates

  • 2025/02/01: Added support for Transformers v4.48.
  • 2025/01/29: Added support for LLaVA-Video and LLaVA-OneVision.

🚀 Key Contributions

  • Training-Free Framework: ReTaKe is the first method to jointly model temporal and knowledge redundancy for long video understanding, reducing the model sequence length to 1/4 of the original with a relative performance loss within 1%.

  • Novel Techniques (an illustrative sketch of both ideas follows below):

    • DPSelect: a keyframe selection method that reduces low-level temporal redundancy.
    • PivotKV: a KV cache compression method that reduces high-level knowledge redundancy in long videos.
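
The sketch below is a rough, hypothetical illustration of the two ideas, not the algorithms from the paper or the code in this repository: keyframe selection keeps a frame only when its features differ enough from the last kept frame, and KV compression keeps only the cache entries of the tokens that received the most attention. All function and variable names are made up for this example.

# Hypothetical sketch only -- not the repository's implementation of DPSelect/PivotKV.
import torch
import torch.nn.functional as F

def select_keyframes(frame_feats: torch.Tensor, threshold: float = 0.2) -> list:
    # frame_feats: (num_frames, dim) per-frame features.
    # Keep a frame when its cosine dissimilarity to the last kept frame exceeds `threshold`.
    feats = F.normalize(frame_feats, dim=-1)
    kept = [0]
    for i in range(1, feats.size(0)):
        dissimilarity = 1.0 - torch.dot(feats[i], feats[kept[-1]]).item()
        if dissimilarity > threshold:
            kept.append(i)
    return kept

def compress_kv(keys: torch.Tensor, values: torch.Tensor,
                attn_scores: torch.Tensor, keep_ratio: float = 0.25):
    # keys/values: (seq_len, dim); attn_scores: (seq_len,) accumulated attention per token.
    # Keep only the top `keep_ratio` fraction of tokens ("pivots"), preserving their order.
    k = max(1, int(keys.size(0) * keep_ratio))
    pivot_idx = torch.topk(attn_scores, k).indices.sort().values
    return keys[pivot_idx], values[pivot_idx]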

Overview of ReTaKe


⚙️ Environment Setup

For GPU Users:

conda env create -f environment.yaml

For NPU Users:

conda env create -f environment_npu.yaml

Additional Dependencies:

apt-get install ffmpeg  # Required for full functionality; quick demo does not require ffmpeg.
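
After creating an environment, a quick sanity check such as the following (a hypothetical snippet, not part of the repository) confirms that PyTorch can see your accelerator:

import torch

if torch.cuda.is_available():
    print("CUDA device:", torch.cuda.get_device_name(0))
else:
    try:
        import torch_npu  # Ascend NPU plugin, only present in the NPU environment
        print("NPU available:", torch.npu.is_available())
    except ImportError:
        print("No GPU or NPU detected.")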

🖥️ Quick Demo

Step 1: Update Configuration

Modify hf_qwen2vl7b_path in ./demo.py to point to your local copy of Qwen2-VL-7B-Instruct.
NPU users should also update config_path to 'configs/retake_demo_npu.yaml'.
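
For example, the edited lines in ./demo.py would look roughly like this (the paths are placeholders, and the default GPU config name is assumed here):

# In ./demo.py -- illustrative values, adjust to your setup
hf_qwen2vl7b_path = "/path_to/Qwen2-VL-7B-Instruct"  # local Qwen2-VL-7B-Instruct checkpoint
config_path = "configs/retake_demo.yaml"             # assumed default; NPU users: 'configs/retake_demo_npu.yaml'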

Step 2 (Optional for LLaVA-Video): Convert Model

# Convert LLaVA-Video model into Hugging Face format
# Ensure the following models are downloaded: Qwen2-7B-Instruct, siglip-so400m-patch14-384, and LLaVAVideoQwen2_7B.
python scripts/utils/convert_llava_video_weights_to_hf.py \
  --text_model_id /path_to/Qwen2-7B-Instruct \
  --vision_model_id /path_to/siglip-so400m-patch14-384 \
  --output_hub_path /path_to/llava-video-qwen2-7b-hf \
  --old_state_dict_id /path_to/LLaVAVideoQwen2_7B

Step 3: Run the Demo

python demo.py
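
Under the hood, the demo presumably wraps a standard Qwen2-VL video chat call with ReTaKe's frame selection and KV compression applied. The snippet below is a generic sketch of plain Qwen2-VL video inference with Hugging Face Transformers and the qwen_vl_utils helper package, not this repository's demo.py:

# Generic Qwen2-VL video inference sketch (not the ReTaKe demo itself).
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # helper package from the Qwen2-VL ecosystem

model_path = "/path_to/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(model_path, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_path)

messages = [{"role": "user", "content": [
    {"type": "video", "video": "file:///path_to/your_video.mp4"},
    {"type": "text", "text": "Summarize the main events in this video."},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(generated_ids[:, inputs.input_ids.shape[1]:],
                                skip_special_tokens=True)[0]
print(answer)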

📊 Reproducing ReTaKe Results

Step 1: Prepare Datasets

Follow the documentation to prepare the required datasets: Video-MME, MLVU, and LVBench (matching the configs used below).

Step 2: Run Inference and Evaluation

Use the provided script to perform inference and evaluation:

bash scripts/infer_eval_retake.sh ${YOUR_PATH_TO_Qwen2-VL-7B-Instruct} configs/qwen2_vl/retake_qwen2-vl_videomme.yaml 8
bash scripts/infer_eval_retake.sh ${YOUR_PATH_TO_Qwen2-VL-7B-Instruct} configs/qwen2_vl/retake_qwen2-vl_mlvu.yaml 8
bash scripts/infer_eval_retake.sh ${YOUR_PATH_TO_Qwen2-VL-7B-Instruct} configs/qwen2_vl/retake_qwen2-vl_lvbench.yaml 8

  • Results will be saved in the ./results directory.

📚 Citation

If you find this work helpful, please consider citing:

@misc{xiao_retake_2024,
  author       = {Xiao Wang and
                  Qingyi Si and
                  Jianlong Wu and
                  Shiyu Zhu and
                  Li Cao and
                  Liqiang Nie},
  title        = {{ReTaKe}: {Reducing} {Temporal} and {Knowledge} {Redundancy} for {Long} {Video} {Understanding}},
  year         = {2024},
  note         = {arXiv:2412.20504 [cs]}
}
