From 4fbe42c311fd75f2e5e5c218b40fc8866212c37e Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Wed, 28 Aug 2024 14:25:37 +0300
Subject: [PATCH 01/12] video-text-to-text task guide

---
 docs/source/en/tasks/video_text_to_text.md | 153 +++++++++++++++++++++
 1 file changed, 153 insertions(+)
 create mode 100644 docs/source/en/tasks/video_text_to_text.md

diff --git a/docs/source/en/tasks/video_text_to_text.md b/docs/source/en/tasks/video_text_to_text.md
new file mode 100644
index 000000000000..4464687ffd90
--- /dev/null
+++ b/docs/source/en/tasks/video_text_to_text.md
@@ -0,0 +1,153 @@
+
+
+# Video-text-to-text
+
+[[open-in-colab]]
+
+Video-text-to-text models, also known as video language models or vision language models with video input, are language models that take a video as input. These models can tackle various tasks, from visual question answering to video captioning.
+
+These models have nearly the same architecture as [image-text-to-text](../image_text_to_text.md) models, with some adjustments and additions that let them accept video data, which is essentially a sequence of image frames with temporal dependencies. Some `image-text-to-text` models take in multiple images, but accepting multiple images alone is not enough for a model to handle videos. Moreover, `video-text-to-text` models are often trained on all vision modalities: a single training example might contain a video, multiple videos, an image, or multiple images. Some of these models can also take interleaved inputs, i.e. one can refer to a specific video inside the text by inserting a video token, as in "What is happening in this video? `<video>`".
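+Below is a minimal sketch of how such a model can be called, assuming the `llava-hf/LLaVA-NeXT-Video-7B-hf` checkpoint and `PyAV` for frame sampling; these are illustrative choices, and other video-text-to-text checkpoints follow the same pattern with their own prompt formats.
+
+```python
+import av
+import numpy as np
+from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration
+
+# Illustrative checkpoint; any video-text-to-text checkpoint with a processor works similarly
+model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"
+processor = LlavaNextVideoProcessor.from_pretrained(model_id)
+model = LlavaNextVideoForConditionalGeneration.from_pretrained(model_id, device_map="auto")
+
+def sample_frames(path, num_frames=8):
+    # Uniformly sample `num_frames` RGB frames from the video file
+    container = av.open(path)
+    total_frames = container.streams.video[0].frames
+    indices = np.linspace(0, total_frames - 1, num_frames).astype(int)
+    frames = [
+        frame.to_ndarray(format="rgb24")
+        for i, frame in enumerate(container.decode(video=0))
+        if i in indices
+    ]
+    return np.stack(frames)
+
+video = sample_frames("my_video.mp4")  # hypothetical local file
+# The `<video>` token marks where the sampled frames go in the prompt
+prompt = "USER: <video>\nWhat is happening in this video? ASSISTANT:"
+inputs = processor(text=prompt, videos=video, return_tensors="pt").to(model.device)
+output = model.generate(**inputs, max_new_tokens=60)
+print(processor.decode(output[0], skip_special_tokens=True))
+```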