# .gitignore

```
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/
```
# Code of Conduct

Facebook has adopted a Code of Conduct that we expect project participants to adhere to.
Please read the [full text](https://code.fb.com/codeofconduct/)
so that you can understand what actions will and will not be tolerated.

# Contributing to LaViLa
We want to make contributing to this project as easy and transparent as
possible.

## Our Development Process
Minor changes and improvements will be released on an ongoing basis. Larger changes (e.g., changesets implementing a new paper) will be released on a more periodic basis.

## Pull Requests
We actively welcome your pull requests. A typical workflow is sketched after the checklist below.

1. Fork the repo and create your branch from `main`.
2. If you've added code that should be tested, add tests.
3. If you've changed APIs, update the documentation.
4. Ensure the test suite passes.
5. Make sure your code lints.
6. If you haven't already, complete the Contributor License Agreement ("CLA").
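
A typical fork-and-branch workflow might look like the following. The repository URL and branch name are placeholders, and the test command assumes a pytest-based suite, which this guide does not specify:

```bash
# Sketch of one possible workflow; adapt the names to your fork.
git clone https://github.com/<your-username>/LaViLa.git
cd LaViLa
git checkout -b my-feature main

# ...make your changes and add tests...

python -m pytest            # run the test suite (assumes pytest)
git push origin my-feature  # then open a pull request against main
```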

## Contributor License Agreement ("CLA")
In order to accept your pull request, we need you to submit a CLA. You only need
to do this once to work on any of Facebook's open source projects.

Complete your CLA here: <https://code.facebook.com/cla>

## Issues
We use GitHub issues to track public bugs. Please make sure your description is
clear and includes sufficient instructions to reproduce the issue.

Facebook has a [bounty program](https://www.facebook.com/whitehat/) for the safe
disclosure of security bugs. In those cases, please go through the process
outlined on that page and do not file a public issue.

## Coding Style
* 4 spaces for indentation rather than tabs
* 80 character line length
* PEP8 formatting following [Black](https://black.readthedocs.io/en/stable/); one way to check these conventions locally is sketched below
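
The tool invocations below are suggestions rather than project requirements; the guide mandates only the style itself. Note that Black defaults to 88-character lines, so the 80-character limit is passed explicitly:

```bash
pip install black flake8

black --line-length 80 .        # format in place at 80 characters
flake8 --max-line-length 80 .   # check PEP8 compliance
```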

## License
By contributing to LaViLa, you agree that your contributions will be licensed
under the LICENSE file in the root directory of this source tree.

# LICENSE

MIT License

Copyright (c) Meta Platforms, Inc. and affiliates.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

# Learning Video Representations from Large Language Models

[**Learning Video Representations from Large Language Models**](http://arxiv.org/abs/2212.04501)
Yue Zhao, Ishan Misra, Philipp Krähenbühl, Rohit Girdhar
[arxiv](http://arxiv.org/abs/2212.04501) | [bibtex](#citing-lavila) | [colab](#narrator-demo)

LaViLa (**L**anguage **a**ugmented **Vi**deo **La**nguage Pretraining) is a new approach to learning video representations from Large Language Models (LLMs). We repurpose LLMs to be visually conditioned "Narrators" and use them to automatically generate video-language paired data. We then use this data to learn a video-language representation, outperforming prior work by large margins.

**Sample Generations:**

| Video | Generation 1 | Generation 2 |
| --------|-------------|--------------|
| <img src="assets/mixkit-pastry-chef-cutting-a-loaf-into-slices-43015-medium.gif" height=128> | so now we're going to slice the bread | now i'm going to do is just slice<br>this up into a nice chunk and<br>then we're going to place it<br>on the plate |

[Try out](#narrator-demo) our Narrator to generate text descriptions for your own videos!

The resulting video-language model sets a new **state-of-the-art** on a number of popular video tasks!

<img width="400" alt="image" src="https://user-images.githubusercontent.com/1893429/205997492-a6cbc7c1-1f8e-4fad-9d94-f9e22920272d.png">

## Introduction and installation

<span style="font-variant:small-caps;">LaViLa</span> leverages Large Language Models (LLMs) as "NARRATOR"s (and "REPHRASER"s) to densely narrate long videos, and uses these narrations to train strong dual-encoder models.

<img src="assets/lavila_ego4d.gif" height=384>

See [INSTALL.md](docs/INSTALL.md) to install this code.

## NARRATOR

NARRATOR is a *visually conditioned* LLM that takes video frames as input and pseudo-labels the clip with narrations. A conceptual sketch follows the illustration below.

<img src="assets/narrator.gif" height=384>
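
At a high level, the NARRATOR couples a video encoder with a text decoder so that generation is conditioned on the clip. Below is a conceptual sketch of one simple form of visual conditioning, in which clip features are projected to "prefix" token embeddings. All class and argument names here are hypothetical, and the actual NARRATOR architecture differs; see Sec. 4.1 of our paper for the real design.

```python
import torch
import torch.nn as nn


class PrefixNarrator(nn.Module):
    """Illustrative sketch of a visually conditioned narrator (not the
    repository's actual implementation)."""

    def __init__(self, video_encoder, lm, embed_tokens,
                 visual_dim, lm_dim, num_prefix_tokens=8):
        super().__init__()
        self.video_encoder = video_encoder  # e.g. a frozen video backbone
        self.lm = lm                        # a decoder accepting input embeddings
        self.embed_tokens = embed_tokens    # the decoder's token-embedding layer
        self.num_prefix_tokens = num_prefix_tokens
        self.lm_dim = lm_dim
        self.proj = nn.Linear(visual_dim, lm_dim * num_prefix_tokens)

    def forward(self, frames, input_ids):
        # frames: (B, T, C, H, W) video clip; input_ids: (B, L) narration tokens
        v = self.video_encoder(frames)                        # (B, visual_dim)
        prefix = self.proj(v).view(-1, self.num_prefix_tokens, self.lm_dim)
        tokens = self.embed_tokens(input_ids)                 # (B, L, lm_dim)
        inputs_embeds = torch.cat([prefix, tokens], dim=1)
        # Train with next-token prediction on the narration tokens only,
        # ignoring the loss at the visual prefix positions.
        return self.lm(inputs_embeds=inputs_embeds)
```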

### NARRATOR Demo

We provide some samples generated by our NARRATOR:

| | <img src="assets/06919917-76bc-4adc-b944-2a722f165513.gif" height=128> | <img src="assets/cf7c12db-1a9e-46d3-96d6-38174bbe373c.gif" height=128> | <img src="assets/ab865129-78fa-47d4-8a50-ff8c5533246f.gif" height=128> |
| :----------------: | :----------------------------------------: | :-------------------------------------: | :--------------------------------------: |
| Human<br>narration | C separates the yarn. | C lifts container. | C opterates the camera. |
| NARRATOR generation (a) | C stetches the thread with both hands. | C wipes the countertop with a sponge. | C takes a photo shot. |
| NARRATOR generation (b) | C pulls out the yarn with her right hand. | C moves the container. | A man X looks at the camera. |

Run the narrator demo using Colab (no GPU needed): [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1gHWiEWywIotRivYQTR-8NQ6GJC7sJUe4)

Since a free Colab account offers very limited RAM, please run [./demo_narrator.py](./demo_narrator.py) locally if you'd like to try the demo with a larger model. For more technical details, please refer to Sec. 4.1 of our paper.

```bash
# CPU mode
python demo_narrator.py [--video-path $TEST_VIDEO]

# GPU mode
python demo_narrator.py --cuda
```

Our narrator also works on third-person videos! Below are several examples generated by our NARRATOR, which is pre-trained on HowTo100M Auto-Aligned ([HTM-AA](https://www.robots.ox.ac.uk/~vgg/research/tan/index.html#htm-align)) and applied to some stock-footage video clips. Note that since the text corpus in HowTo100M consists of ASR transcriptions, the style of the narrations differs slightly from that of the ground-truth captions. However, the generated results are generally reasonable.

| | <img src="assets/mixkit-pastry-chef-cutting-a-loaf-into-slices-43015-medium.gif" height=128> | <img src="assets/mixkit-hands-of-a-baker-kneading-a-dough-42467-medium.gif" height=128> | <img src="assets/mixkit-chef-preparing-a-sauce-in-a-blender-43034-medium.gif" height=128> |
| :--------: | :-------------------------------------------------------------------------------------------------------------: | :---: | :---: |
| GT caption | Pastry chef cutting bread into<br>slices during the preparation<br>of a dessert, inside a kitchen. | Close-up shot of the hands<br>of an experienced baker<br>skillfully kneading bread dough. | Chef preparing a sauce in<br>a blender, adding different<br>ingredients while blending. |
| NARRATOR (a) | so now we're going to slice the bread | i'm gonna make a little hole<br>in the middle of the dough here | all right let's blend this up |
| NARRATOR (b) | now i'm going to do is just slice<br>this up into a nice chunk and<br>then we're going to place it<br>on the plate | you just keep kneading it | the last step to making this<br>is to blend the ingredients<br>in the food processor |

Below is a demo for third-person videos.

```bash
python demo_narrator_3rd_person.py [--video-path $TEST_VIDEO] [--cuda]
```

## Dual-Encoder

The dual-encoder model contains a video encoder and a text encoder. It learns a video-language representation from both human annotations and generated narrations using a contrastive loss like [CLIP](https://github.com/openai/CLIP). A minimal sketch of this objective is shown below.
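
For context, the sketch below shows a minimal CLIP-style symmetric contrastive (InfoNCE) objective, assuming each clip and its paired caption have already been pooled into single embedding vectors. The function name and the fixed temperature are illustrative, not the repository's API:

```python
import torch
import torch.nn.functional as F


def clip_style_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired clip/caption embeddings.

    video_emb, text_emb: (B, D) outputs of the video and text encoders.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                   # (B, B) similarity logits
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, targets)      # match each video to its caption
    loss_t2v = F.cross_entropy(logits.T, targets)    # and each caption to its video
    return 0.5 * (loss_v2t + loss_t2v)
```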

* LaViLa's dual-encoder achieves excellent **zero-shot** performance on a wide range of egocentric benchmarks, outperforming previous state-of-the-art video-language pretraining methods by a large margin.

<div class="table-wrapper" markdown="block">

|              | Backbone | EK-100 MIR<br>avg. mAP^ | EK-100 MIR<br>avg. nDCG^ | Charades-Ego<br>mAP | EGTEA<br>mean acc. | EgoMCQ<br>intra-video acc. |
| :----------: | :------: | :---------------------: | :----------------------: | :------------------: | :-----------------: | :------------------------: |
| Prev. SOTA^^ | TSF-B | 22.1/23.3 | 22.1/27.9 | 25.2 | 17.6 | 57.2 |
| LAVILA | TSF-B | 29.7/30.9 | 31.5/32.0 | 26.8 | 28.9 | 59.9 |
| LAVILA | TSF-L | 35.0/36.1 | 34.2/34.6 | 28.9 | 34.1 | 63.1 |

</div>

^ The two numbers are obtained by using different numbers of input frames (4 and 16, respectively).

^^ We use the checkpoints released by [EgoVLP](https://github.com/showlab/EgoVLP) and convert them to be compatible with this codebase. Also note that our reproduced numbers are better than the reported ones, especially on EK-100 MIR, since we evaluate on raw videos directly (for more details, check out Appendix F & Table 10 in our paper).

For details on how to get these numbers, please refer to [MODEL_ZOO.md](./docs/MODEL_ZOO.md#zero-shot).

* Once **fine-tuned** on a downstream dataset, LaViLa's dual-encoder can also achieve state-of-the-art results on it. We show some key results below.

<div class="table-wrapper" markdown="block">

|            | EK-100 MIR<br>avg. mAP | EK-100 MIR<br>avg. nDCG | EK-100 CLS<br>Action top-1 | Charades-Ego<br>mAP | EGTEA<br>mean acc. |
| :--------: | :--------------------: | :---------------------: | :------------------------: | :------------------: | :-----------------: |
| Prev. SOTA | 45.0 | 59.4 | 50.5 | 32.1 | 65.9 |
| LAVILA | 50.9 | 66.5 | 50.9 | 36.1 | 76.0 |

</div>

For details on how to fine-tune the pre-trained dual-encoder on downstream datasets, please refer to [MODEL_ZOO.md](./docs/MODEL_ZOO.md#fine-tuned).

## License
The majority of LaViLa is licensed under an [MIT License](./LICENSE); however, portions of the project are available under separate license terms:

* https://github.com/EGO4D/episodic-memory is licensed under the MIT license.

* The videos of [cutting a loaf](https://mixkit.co/free-stock-video/pastry-chef-cutting-a-loaf-into-slices-43015/), [kneading a dough](https://mixkit.co/free-stock-video/hands-of-a-baker-kneading-a-dough-42467/), and [preparing a sauce in a blender](https://mixkit.co/free-stock-video/chef-preparing-a-sauce-in-a-blender-43034/) are licensed under the [Mixkit Stock Video Free License](https://mixkit.co/license/#videoFree).

## Citing LaViLa

```bibtex
@article{zhao2022lavila,
  title={Learning Video Representations from Large Language Models},
  author={Zhao, Yue and Misra, Ishan and Kr{\"a}henb{\"u}hl, Philipp and Girdhar, Rohit},
  journal={arXiv preprint arXiv:2212.04501},
  year={2022}
}
```