From 4fbe42c311fd75f2e5e5c218b40fc8866212c37e Mon Sep 17 00:00:00 2001
From: Merve Noyan
Date: Wed, 28 Aug 2024 14:25:37 +0300
Subject: [PATCH 01/12] video-text-to-text task guide

---
 docs/source/en/tasks/video_text_to_text.md | 153 +++++++++++++++++++++
 1 file changed, 153 insertions(+)
 create mode 100644 docs/source/en/tasks/video_text_to_text.md

diff --git a/docs/source/en/tasks/video_text_to_text.md b/docs/source/en/tasks/video_text_to_text.md
new file mode 100644
index 000000000000..4464687ffd90
--- /dev/null
+++ b/docs/source/en/tasks/video_text_to_text.md
@@ -0,0 +1,153 @@
+
+
+# Video-text-to-text
+
+[[open-in-colab]]
+
+Video-text-to-text models, also known as video language models or vision language models with video input, are language models that take a video as input. These models can tackle various tasks, from visual question answering to video captioning.
+
+These models have nearly the same architecture as [image-text-to-text](../image_text_to_text.md) models, with some adjustments and additions that let them accept video data, which is essentially a sequence of image frames with temporal dependencies. Some `image-text-to-text` models take in multiple images, but accepting multiple images alone is not enough for a model to handle videos. Moreover, `video-text-to-text` models are often trained on all vision modalities: a single training example might contain a video, multiple videos, an image, or multiple images. Some of these models can also take interleaved inputs, i.e. one can refer to a specific video inside the text by inserting a video token, as in "What is happening in this video? `<video>`".
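+Below is a minimal sketch of how such a model can be called, assuming the `llava-hf/LLaVA-NeXT-Video-7B-hf` checkpoint and `PyAV` for frame sampling; these are illustrative choices, and other video-text-to-text checkpoints follow the same pattern with their own prompt formats.
+
+```python
+import av
+import numpy as np
+from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration
+
+# Illustrative checkpoint; any video-text-to-text checkpoint with a processor works similarly
+model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"
+processor = LlavaNextVideoProcessor.from_pretrained(model_id)
+model = LlavaNextVideoForConditionalGeneration.from_pretrained(model_id, device_map="auto")
+
+def sample_frames(path, num_frames=8):
+    # Uniformly sample `num_frames` RGB frames from the video file
+    container = av.open(path)
+    total_frames = container.streams.video[0].frames
+    indices = np.linspace(0, total_frames - 1, num_frames).astype(int)
+    frames = [
+        frame.to_ndarray(format="rgb24")
+        for i, frame in enumerate(container.decode(video=0))
+        if i in indices
+    ]
+    return np.stack(frames)
+
+video = sample_frames("my_video.mp4")  # hypothetical local file
+# The `<video>` token marks where the sampled frames go in the prompt
+prompt = "USER: <video>\nWhat is happening in this video? ASSISTANT:"
+inputs = processor(text=prompt, videos=video, return_tensors="pt").to(model.device)
+output = model.generate(**inputs, max_new_tokens=60)
+print(processor.decode(output[0], skip_special_tokens=True))
+```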