# .gitignore

```
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/
```
# Code of Conduct

Facebook has adopted a Code of Conduct that we expect project participants to adhere to.
Please read the [full text](https://code.fb.com/codeofconduct/)
so that you can understand what actions will and will not be tolerated.

# Contributing to LaViLa
We want to make contributing to this project as easy and transparent as
possible.

## Our Development Process
Minor changes and improvements will be released on an ongoing basis. Larger changes (e.g., changesets implementing a new paper) will be released on a more periodic basis.

## Pull Requests
We actively welcome your pull requests. A typical workflow is sketched after the checklist below.

1. Fork the repo and create your branch from `main`.
2. If you've added code that should be tested, add tests.
3. If you've changed APIs, update the documentation.
4. Ensure the test suite passes.
5. Make sure your code lints.
6. If you haven't already, complete the Contributor License Agreement ("CLA").
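
A typical fork-and-branch workflow might look like the following. The repository URL and branch name are placeholders, and the test command assumes a pytest-based suite, which this guide does not specify:

```bash
# Sketch of one possible workflow; adapt the names to your fork.
git clone https://github.com/<your-username>/LaViLa.git
cd LaViLa
git checkout -b my-feature main

# ...make your changes and add tests...

python -m pytest            # run the test suite (assumes pytest)
git push origin my-feature  # then open a pull request against main
```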

## Contributor License Agreement ("CLA")
In order to accept your pull request, we need you to submit a CLA. You only need
to do this once to work on any of Facebook's open source projects.

Complete your CLA here: <https://code.facebook.com/cla>

## Issues
We use GitHub issues to track public bugs. Please make sure your description is
clear and includes sufficient instructions to reproduce the issue.

Facebook has a [bounty program](https://www.facebook.com/whitehat/) for the safe
disclosure of security bugs. In those cases, please go through the process
outlined on that page and do not file a public issue.

## Coding Style
* 4 spaces for indentation rather than tabs
* 80 character line length
* PEP8 formatting following [Black](https://black.readthedocs.io/en/stable/); one way to check these conventions locally is sketched below
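
The tool invocations below are suggestions rather than project requirements; the guide mandates only the style itself. Note that Black defaults to 88-character lines, so the 80-character limit is passed explicitly:

```bash
pip install black flake8

black --line-length 80 .        # format in place at 80 characters
flake8 --max-line-length 80 .   # check PEP8 compliance
```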

## License
By contributing to LaViLa, you agree that your contributions will be licensed
under the LICENSE file in the root directory of this source tree.

# LICENSE

MIT License

Copyright (c) Meta Platforms, Inc. and affiliates.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

# Learning Video Representations from Large Language Models

[**Learning Video Representations from Large Language Models**](http://arxiv.org/abs/2212.04501)
Yue Zhao, Ishan Misra, Philipp Krähenbühl, Rohit Girdhar
[arxiv](http://arxiv.org/abs/2212.04501) | [bibtex](#citing-lavila) | [colab](#narrator-demo)

LaViLa (**L**anguage **a**ugmented **Vi**deo **La**nguage Pretraining) is a new approach to learning video representations from Large Language Models (LLMs). We repurpose LLMs to be visually conditioned "Narrators" and use them to automatically generate video-language paired data. We then use this data to learn a video-language representation, outperforming prior work by large margins.

**Sample Generations:**

| Video | Generation 1 | Generation 2 |
| --------|-------------|--------------|
| <img src="assets/mixkit-pastry-chef-cutting-a-loaf-into-slices-43015-medium.gif" height=128> | so now we're going to slice the bread | now i'm going to do is just slice<br>this up into a nice chunk and<br>then we're going to place it<br>on the plate |

[Try out](#narrator-demo) our Narrator to generate text descriptions for your own videos!

The resulting video-language model sets a new **state-of-the-art** on a number of popular video tasks!

<img width="400" alt="image" src="https://user-images.githubusercontent.com/1893429/205997492-a6cbc7c1-1f8e-4fad-9d94-f9e22920272d.png">

## Introduction and installation

<span style="font-variant:small-caps;">LaViLa</span> leverages Large Language Models (LLMs) as "NARRATOR"s (and "REPHRASER"s) to densely narrate long videos, and uses these narrations to train strong dual-encoder models.

<img src="assets/lavila_ego4d.gif" height=384>

See [INSTALL.md](docs/INSTALL.md) to install this code.

## NARRATOR

NARRATOR is a *visually conditioned* LLM that takes video frames as input and pseudo-labels the clip with narrations. A conceptual sketch follows the illustration below.

<img src="assets/narrator.gif" height=384>
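
At a high level, the NARRATOR couples a video encoder with a text decoder so that generation is conditioned on the clip. Below is a conceptual sketch of one simple form of visual conditioning, in which clip features are projected to "prefix" token embeddings. All class and argument names here are hypothetical, and the actual NARRATOR architecture differs; see Sec. 4.1 of our paper for the real design.

```python
import torch
import torch.nn as nn


class PrefixNarrator(nn.Module):
    """Illustrative sketch of a visually conditioned narrator (not the
    repository's actual implementation)."""

    def __init__(self, video_encoder, lm, embed_tokens,
                 visual_dim, lm_dim, num_prefix_tokens=8):
        super().__init__()
        self.video_encoder = video_encoder  # e.g. a frozen video backbone
        self.lm = lm                        # a decoder accepting input embeddings
        self.embed_tokens = embed_tokens    # the decoder's token-embedding layer
        self.num_prefix_tokens = num_prefix_tokens
        self.lm_dim = lm_dim
        self.proj = nn.Linear(visual_dim, lm_dim * num_prefix_tokens)

    def forward(self, frames, input_ids):
        # frames: (B, T, C, H, W) video clip; input_ids: (B, L) narration tokens
        v = self.video_encoder(frames)                        # (B, visual_dim)
        prefix = self.proj(v).view(-1, self.num_prefix_tokens, self.lm_dim)
        tokens = self.embed_tokens(input_ids)                 # (B, L, lm_dim)
        inputs_embeds = torch.cat([prefix, tokens], dim=1)
        # Train with next-token prediction on the narration tokens only,
        # ignoring the loss at the visual prefix positions.
        return self.lm(inputs_embeds=inputs_embeds)
```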

### NARRATOR Demo

We provide some samples generated by our NARRATOR:

| | <img src="assets/06919917-76bc-4adc-b944-2a722f165513.gif" height=128> | <img src="assets/cf7c12db-1a9e-46d3-96d6-38174bbe373c.gif" height=128> | <img src="assets/ab865129-78fa-47d4-8a50-ff8c5533246f.gif" height=128> |
| :----------------: | :----------------------------------------: | :-------------------------------------: | :--------------------------------------: |
| Human<br>narration | C separates the yarn. | C lifts container. | C opterates the camera. |
| NARRATOR generation (a) | C stetches the thread with both hands. | C wipes the countertop with a sponge. | C takes a photo shot. |
| NARRATOR generation (b) | C pulls out the yarn with her right hand. | C moves the container. | A man X looks at the camera. |

Run the narrator demo using Colab (no GPU needed): [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1gHWiEWywIotRivYQTR-8NQ6GJC7sJUe4)

Since a free Colab account offers very limited RAM, please run [./demo_narrator.py](./demo_narrator.py) locally if you'd like to try the demo with a larger model. For more technical details, please refer to Sec. 4.1 of our paper.

```bash
# CPU mode
python demo_narrator.py [--video-path $TEST_VIDEO]

# GPU mode
python demo_narrator.py --cuda
```

Our narrator also works on third-person videos! Below are several examples generated by our NARRATOR, which is pre-trained on HowTo100M Auto-Aligned ([HTM-AA](https://www.robots.ox.ac.uk/~vgg/research/tan/index.html#htm-align)) and applied to some stock-footage video clips. Note that since the text corpus in HowTo100M consists of ASR transcriptions, the style of the narrations differs slightly from that of the ground-truth captions. However, the generated results are generally reasonable.

| | <img src="assets/mixkit-pastry-chef-cutting-a-loaf-into-slices-43015-medium.gif" height=128> | <img src="assets/mixkit-hands-of-a-baker-kneading-a-dough-42467-medium.gif" height=128> | <img src="assets/mixkit-chef-preparing-a-sauce-in-a-blender-43034-medium.gif" height=128> |
| :--------: | :-------------------------------------------------------------------------------------------------------------: | :---: | :---: |
| GT caption | Pastry chef cutting bread into<br>slices during the preparation<br>of a dessert, inside a kitchen. | Close-up shot of the hands<br>of an experienced baker<br>skillfully kneading bread dough. | Chef preparing a sauce in<br>a blender, adding different<br>ingredients while blending. |
| NARRATOR (a) | so now we're going to slice the bread | i'm gonna make a little hole<br>in the middle of the dough here | all right let's blend this up |
| NARRATOR (b) | now i'm going to do is just slice<br>this up into a nice chunk and<br>then we're going to place it<br>on the plate | you just keep kneading it | the last step to making this<br>is to blend the ingredients<br>in the food processor |

Below is a demo for third-person videos.

```bash
python demo_narrator_3rd_person.py [--video-path $TEST_VIDEO] [--cuda]
```

## Dual-Encoder

The dual-encoder model contains a video encoder and a text encoder. It learns a video-language representation from both human annotations and generated narrations using a contrastive loss like [CLIP](https://github.com/openai/CLIP). A minimal sketch of this objective is shown below.
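
For context, the sketch below shows a minimal CLIP-style symmetric contrastive (InfoNCE) objective, assuming each clip and its paired caption have already been pooled into single embedding vectors. The function name and the fixed temperature are illustrative, not the repository's API:

```python
import torch
import torch.nn.functional as F


def clip_style_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired clip/caption embeddings.

    video_emb, text_emb: (B, D) outputs of the video and text encoders.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                   # (B, B) similarity logits
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, targets)      # match each video to its caption
    loss_t2v = F.cross_entropy(logits.T, targets)    # and each caption to its video
    return 0.5 * (loss_v2t + loss_t2v)
```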

* LaViLa's dual-encoder achieves excellent **zero-shot** performance on a wide range of egocentric benchmarks, outperforming previous state-of-the-art video-language pretraining methods by a large margin.

<div class="table-wrapper" markdown="block">

|              | Backbone | EK-100 MIR<br>avg. mAP^ | EK-100 MIR<br>avg. nDCG^ | Charades-Ego<br>mAP | EGTEA<br>mean acc. | EgoMCQ<br>intra-video acc. |
| :----------: | :------: | :---------------------: | :----------------------: | :------------------: | :-----------------: | :------------------------: |
| Prev. SOTA^^ | TSF-B | 22.1/23.3 | 22.1/27.9 | 25.2 | 17.6 | 57.2 |
| LAVILA | TSF-B | 29.7/30.9 | 31.5/32.0 | 26.8 | 28.9 | 59.9 |
| LAVILA | TSF-L | 35.0/36.1 | 34.2/34.6 | 28.9 | 34.1 | 63.1 |

</div>

^ The two numbers are obtained by using different numbers of input frames (4 and 16, respectively).

^^ We use the checkpoints released by [EgoVLP](https://github.com/showlab/EgoVLP) and convert them to be compatible with this codebase. Also note that our reproduced numbers are better than the reported ones, especially on EK-100 MIR, since we evaluate on raw videos directly (for more details, check out Appendix F & Table 10 in our paper).

For details on how to get these numbers, please refer to [MODEL_ZOO.md](./docs/MODEL_ZOO.md#zero-shot).

* Once **fine-tuned** on a downstream dataset, LaViLa's dual-encoder can also achieve state-of-the-art results on it. We show some key results below.

<div class="table-wrapper" markdown="block">

|            | EK-100 MIR<br>avg. mAP | EK-100 MIR<br>avg. nDCG | EK-100 CLS<br>Action top-1 | Charades-Ego<br>mAP | EGTEA<br>mean acc. |
| :--------: | :--------------------: | :---------------------: | :------------------------: | :------------------: | :-----------------: |
| Prev. SOTA | 45.0 | 59.4 | 50.5 | 32.1 | 65.9 |
| LAVILA | 50.9 | 66.5 | 50.9 | 36.1 | 76.0 |

</div>

For details on how to fine-tune the pre-trained dual-encoder on downstream datasets, please refer to [MODEL_ZOO.md](./docs/MODEL_ZOO.md#fine-tuned).

## License
The majority of LaViLa is licensed under an [MIT License](./LICENSE); however, portions of the project are available under separate license terms:

* https://github.com/EGO4D/episodic-memory is licensed under the MIT license.

* The videos of [cutting a loaf](https://mixkit.co/free-stock-video/pastry-chef-cutting-a-loaf-into-slices-43015/), [kneading a dough](https://mixkit.co/free-stock-video/hands-of-a-baker-kneading-a-dough-42467/), and [preparing a sauce in a blender](https://mixkit.co/free-stock-video/chef-preparing-a-sauce-in-a-blender-43034/) are licensed under the [Mixkit Stock Video Free License](https://mixkit.co/license/#videoFree).

## Citing LaViLa

```bibtex
@article{zhao2022lavila,
  title={Learning Video Representations from Large Language Models},
  author={Zhao, Yue and Misra, Ishan and Kr{\"a}henb{\"u}hl, Philipp and Girdhar, Rohit},
  journal={arXiv preprint arXiv:2212.04501},
  year={2022}
}
```