GenjiB/ECLIPSE

ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound

License: MIT

This is the PyTorch implementation of our paper:
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
Yan-Bo Lin, Jie Lei, Mohit Bansal, and Gedas Bertasius
In European Conference on Computer Vision, 2022.

paper

📝 Preparation

  1. Install the dependencies: pip3 install -r requirements.txt
  2. Download the datasets: ActivityNet, QVHighlights, YouCook2, DiDeMo, and Charades.
  3. Extract video frames at 3 fps.
  4. Extract audio features.
  5. Download the pretrained CLIP weights.

The download links are taken from the official CLIP4Clip repository. Download the CLIP (ViT-B/32) weights:

wget -P ./modules https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt

Or download the CLIP (ViT-B/16) weights:

wget -P ./modules https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt
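A large download can be truncated silently, so it may be worth checking the files before training. The following is a minimal sketch, not part of this repository: it assumes the 64-character path component in each URL is the SHA-256 digest of the file (the convention the official CLIP package uses for its own downloads). The helper names `expected_sha256`, `file_sha256`, and `verify_checkpoint` are hypothetical.

```python
import hashlib
from pathlib import Path

# URLs copied from the wget commands above.
CLIP_URLS = {
    "ViT-B-32.pt": "https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt",
    "ViT-B-16.pt": "https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt",
}


def expected_sha256(url):
    """Return the path component just before the file name.

    Assumption: this component is the SHA-256 digest of the file.
    """
    return url.split("/")[-2]


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(path, url):
    """True if the file's digest matches the one embedded in the URL."""
    return file_sha256(path) == expected_sha256(url)


if __name__ == "__main__":
    # ./modules is the target directory used by the wget commands above.
    for name, url in CLIP_URLS.items():
        path = Path("modules") / name
        if not path.exists():
            print(name, "not downloaded")
        elif verify_checkpoint(path, url):
            print(name, "OK")
        else:
            print(name, "MISMATCH, consider downloading again")
```

Only one of the two checkpoints is needed, so a "not downloaded" line for the other is expected.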

💿 Extract images and audio features.

ActivityNet/
├── raw_frames/
│       └── VIDEO_NAME/
│           ├── 0001.jpg
│           ├── ...
│           └── 00...jpg
│
└── VGGSound_Audio_features_10s_aligned/
        └── VIDEO_NAME/
            ├── 0000.pt
            ├── ...
            └── 00...pt
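The frame half of this layout can be produced in several ways. Below is a minimal sketch that calls ffmpeg at 3 fps, as stated in the preparation steps. It is not the authors' extraction script: the use of ffmpeg, the four-digit file name pattern (inferred from `0001.jpg`), the JPEG quality setting, and the use of the video file's stem as `VIDEO_NAME` are all assumptions, and the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(video_path, frames_root, fps=3):
    """Build an ffmpeg command that samples a video at a fixed frame rate.

    Frames go to <frames_root>/<video stem>/0001.jpg, 0002.jpg, ...
    (ffmpeg numbers image sequences from 1 by default).
    """
    video_path = Path(video_path)
    out_dir = Path(frames_root) / video_path.stem
    return [
        "ffmpeg",
        "-i", str(video_path),
        "-vf", f"fps={fps}",   # resample to `fps` frames per second
        "-q:v", "2",           # JPEG quality (assumed; lower is better)
        str(out_dir / "%04d.jpg"),
    ]


def extract_frames(video_path, frames_root="ActivityNet/raw_frames", fps=3):
    """Create the output directory and run ffmpeg (must be on PATH)."""
    cmd = build_ffmpeg_cmd(video_path, frames_root, fps)
    Path(cmd[-1]).parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(cmd, check=True)
```

For example, `extract_frames("videos/v_abc.mp4")` would write frames under `ActivityNet/raw_frames/v_abc/`, where `videos/v_abc.mp4` is a placeholder path.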

💿 Extracted audio features.

VGGSound features on ActivityNet Captions: Google Drive
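Once frames and audio features are in place, a quick consistency check can catch videos that are missing one of the two modalities. This is a minimal sketch under the directory layout shown above. The function names are hypothetical, and the check only counts `*.jpg` and `*.pt` files; it does not load or validate them.

```python
from pathlib import Path


def count_files(directory, pattern):
    """Count files matching `pattern`, treating a missing directory as empty."""
    directory = Path(directory)
    if not directory.is_dir():
        return 0
    return sum(1 for _ in directory.glob(pattern))


def find_incomplete_videos(
    root,
    frames_dir="raw_frames",
    audio_dir="VGGSound_Audio_features_10s_aligned",
):
    """Map VIDEO_NAME -> (n_frames, n_audio) for videos lacking either modality.

    Only videos that have a directory under `frames_dir` are inspected.
    """
    root = Path(root)
    frames_root = root / frames_dir
    problems = {}
    if not frames_root.is_dir():
        return problems
    for video_dir in sorted(p for p in frames_root.iterdir() if p.is_dir()):
        n_frames = count_files(video_dir, "*.jpg")
        n_audio = count_files(root / audio_dir / video_dir.name, "*.pt")
        if n_frames == 0 or n_audio == 0:
            problems[video_dir.name] = (n_frames, n_audio)
    return problems
```

Calling `find_incomplete_videos("ActivityNet")` returns an empty dictionary when every video with a frame directory has at least one frame and one audio feature file.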

📚 Train and evaluate

ActivityNet Captions: bash run_act.sh
DiDeMo: bash run_didemo.sh
Charades: bash run_cha.sh
QVHighlights: bash run_qvh.sh
YouCook2: bash run_yc2.sh

🎓 Cite

If you use this code in your research, please cite:

@InProceedings{ECLIPSE_ECCV22,
author = {Yan-Bo Lin and Jie Lei and Mohit Bansal and Gedas Bertasius},
title = {ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
month = {October},
year = {2022}
}

👍 Acknowledgments

Our code is based on CLIP4Clip and VGGSound.

✏ Future work

  • Preprocessed video frames and audio features

License

This project is licensed under the MIT License, as found in the LICENSE file.
