Video Timeline Tags (ViTT)

This repo provides the Video Timeline Tags (ViTT) dataset introduced in Multimodal Pretraining for Dense Video Captioning (arXiv | presentation | slides).

If you find the data or paper useful for your own work, please consider citing:

@inproceedings{huang2020multimodal,
  title={Multimodal Pretraining for Dense Video Captioning},
  author={Huang, Gabriel and Pang, Bo and Zhu, Zhenhai and Rivera, Clara and Soricut, Radu},
  booktitle={AACL-IJCNLP 2020},
  year={2020}
}

Dataset Description

Data files for this dataset can be downloaded via the following links:

The ViTT dataset consists of human produced segment-level annotations for 8,169 videos. Of these, 5,840 videos have been annotated once, and the rest of the videos have been annotated twice or more. A total of 12,461 sets of annotations are released in ViTT-annotations.json. Below is an example set of annotations from the dataset:

{
  "id": "FmTp",
  "annotations": [
    {
      "timestamp": 260,
      "tag": "Opening"
    },
    {
      "timestamp": 16000,
      "tag": "Displaying technique"
    },
    {
      "timestamp": 23990,
      "tag": "Showing foot positioning"
    },
    {
      "timestamp": 55530,
      "tag": "Demonstrating crossover"
    },
    {
      "timestamp": 114100,
      "tag": "Closing"
    }
  ]
}

Data fields:

id is the id for the video from YouTube-8M release, which was a randomly-generated ID to protect the privacy of uploaders. The external YouTube ID can be looked up following the instructions on this page, as long as the video remains public on YouTube.
annotations contain a list of segment-level annotations. In this example, the annotator had identified 5 sections in the video. For each section,
- timestamp is the start time (in milliseconds) for that section, and
- tag is a free-text tag describing the content of that section concisely. There are 3 annotations (from 3 different annotators) included for this video in the json file.

For experiments described in the paper, we have additionally gone through the following steps:

Lower-case all tags
Standardize top tags according to Table 6 in the paper
Use videos with one set of raw annotations as the training split, and randomly split those with 2 or 3 sets of raw annotations into dev and test sets.
- Note that some of the raw annotations do not contain segment-level annotations (e.g., when the raters consider the video to be not instructional), and are excluded from this data release. We release the list of video ids in our training / dev / test sets ([train|dev|test]_id.txt) for reproducibility.
- Note also that there are 276 videos in the data release with more than 3 annotations. These videos were not included in either the training split or the dev / test splits in experiments reported in the paper. We are still including them in the json file for people who are interested in understanding inter-annotator agreement for this task.

Please refer to Appendix A.1 in the paper for details on the dataset construction and guidelines for human annotation.

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Video Timeline Tags (ViTT)

Dataset Description

About

Releases

Packages

Contributors 2

License

google-research-datasets/Video-Timeline-Tags-ViTT

Folders and files

Latest commit

History

Repository files navigation

Video Timeline Tags (ViTT)

Dataset Description

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Packages