@mkorn1 commented Nov 26, 2025

What does this PR do?

This PR creates a placeholder pipeline in anticipation of dedicated video-to-text models arriving in the HF ecosystem; in the meantime, it takes a multimodal approach and builds on image-text-to-text models.

This PR adds a comprehensive VideoToTextPipeline implementation that enables video-to-text generation using image-text-to-text models. The pipeline supports intelligent frame sampling, system prompts for guided generation, and flexible generation parameters.

Key Features

  • Smart Frame Sampling: Automatically picks a frame count from the video duration (roughly 1 frame per second, clamped between 8 and 128 frames); see the sketch after this list
  • Explicit Frame Control: The num_frames parameter lets users specify the exact number of frames to sample
  • Time-based Sampling: Selects frames at evenly spaced timestamps rather than evenly spaced frame indices, for better temporal coverage
  • System Prompts: The system_prompt parameter enables guided generation with custom instructions
  • Flexible Generation: Full support for generate_kwargs to pass arbitrary generation parameters to the underlying model
  • Batch Processing: Accepts a single video or a batch of videos
  • Error Handling: Validates conflicting parameters (e.g., max_new_tokens passed both directly and inside generate_kwargs)

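As a rough illustration of the duration-based heuristic above, the default frame count could be computed as follows. This is a minimal sketch: the helper name and the exact rounding behavior are assumptions, not the PR's actual code.

```python
# Hypothetical sketch of the duration-based frame-count heuristic:
# roughly 1 frame per second, clamped to the stated [8, 128] range.
def default_num_frames(duration_seconds: float,
                       min_frames: int = 8,
                       max_frames: int = 128) -> int:
    return max(min_frames, min(max_frames, round(duration_seconds)))
```
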
Implementation Details

  • Implements VideoToTextPipeline class extending the base Pipeline class
  • Uses PyAV for video frame extraction with time-spaced temporal sampling; a sketch follows this list
  • Supports both local file paths and HTTP/HTTPS URLs for video inputs
  • Includes comprehensive logging for debugging frame selection and processing
  • Handles GIT models and other image-text-to-text models appropriately

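For reference, time-spaced extraction with PyAV might look roughly like the sketch below. This is an assumed implementation, not the PR's exact code: the function name is made up, it decodes the whole clip into memory (fine for illustration, wasteful for long videos), and it assumes every decoded frame carries a pts timestamp.

```python
# Hypothetical sketch of time-spaced frame sampling with PyAV.
import av
import numpy as np

def sample_frames(path: str, num_frames: int) -> list[np.ndarray]:
    with av.open(path) as container:
        stream = container.streams.video[0]
        frames, times = [], []
        for frame in container.decode(stream):
            # Convert pts (in stream.time_base units) to seconds.
            frames.append(frame.to_ndarray(format="rgb24"))
            times.append(float(frame.pts * stream.time_base))
        # Evenly spaced target timestamps across the clip, then keep
        # the decoded frame closest in time to each target.
        targets = np.linspace(times[0], times[-1], num_frames)
        times_arr = np.asarray(times)
        picks = [int(np.argmin(np.abs(times_arr - t))) for t in targets]
        return [frames[i] for i in picks]
```

Sampling by timestamp rather than by frame index is what makes this robust to variable-frame-rate video, where index spacing and time spacing diverge.
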
Example Usage

```python
from transformers import pipeline

# Basic usage
captioner = pipeline("video-to-text", model="microsoft/git-base")
result = captioner("path/to/video.mp4")

# With custom frame count
result = captioner("path/to/video.mp4", num_frames=32)

# With system prompt
result = captioner("path/to/video.mp4", system_prompt="Describe this video in detail.")

# With generation parameters
result = captioner("path/to/video.mp4", generate_kwargs={"temperature": 0.7, "max_new_tokens": 100})
```

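Given the batch-processing support listed above, passing a list of videos should also work. The calling convention below is assumed, mirroring other transformers pipelines:

```python
# Batch processing: a list in, a list of results out (assumed behavior).
results = captioner(["path/to/video1.mp4", "path/to/video2.mp4"], num_frames=16)
```
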
Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

This PR adds a new pipeline for video-to-text generation, which is a multimodal task. The implementation follows the existing pipeline patterns and integrates with the existing pipeline registry system.

@mkorn1 changed the title from "Feat/base v2t pipe" to "Add VideoToTextPipeline with smart frame sampling and system prompts" on Nov 26, 2025
@zucchini-nlp (Member) commented:

Thanks for the PR @mkorn1. I just merged the any-to-any pipeline (#40884), and I want us to use it for all modalities when possible. Having a separate pipeline for each modality-to-text is not easy to maintain in the long term.
