@mkorn1 commented Nov 26, 2025

What does this PR do?

This PR creates a placeholder pipeline in anticipation of dedicated video-to-text models arriving in the HF ecosystem; in the meantime, it takes a multimodal approach and builds on image-text-to-text models.

This PR adds a comprehensive VideoToTextPipeline implementation that enables video-to-text generation using image-text-to-text models. The pipeline supports intelligent frame sampling, system prompts for guided generation, and flexible generation parameters.

Key Features

  • Smart Frame Sampling: Automatically picks a frame count from the video duration (roughly 1 frame per second, clamped between 8 and 128 frames); see the sketch after this list
  • Explicit Frame Control: The num_frames parameter lets users specify the exact number of frames to sample
  • Time-based Sampling: Selects frames at evenly spaced timestamps rather than evenly spaced frame indices, for better temporal coverage
  • System Prompts: The system_prompt parameter enables guided generation with custom instructions
  • Flexible Generation: Full support for generate_kwargs to pass arbitrary generation parameters to the underlying model
  • Batch Processing: Accepts a single video or a batch of videos
  • Error Handling: Validates conflicting parameters (e.g., max_new_tokens passed both directly and inside generate_kwargs)

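As a rough illustration of the duration-based heuristic above, the default frame count could be computed as follows. This is a minimal sketch: the helper name and the exact rounding behavior are assumptions, not the PR's actual code.

```python
# Hypothetical sketch of the duration-based frame-count heuristic:
# roughly 1 frame per second, clamped to the stated [8, 128] range.
def default_num_frames(duration_seconds: float,
                       min_frames: int = 8,
                       max_frames: int = 128) -> int:
    return max(min_frames, min(max_frames, round(duration_seconds)))
```
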
Implementation Details

  • Implements VideoToTextPipeline class extending the base Pipeline class
  • Uses PyAV for video frame extraction with time-spaced temporal sampling; a sketch follows this list
  • Supports both local file paths and HTTP/HTTPS URLs for video inputs
  • Includes comprehensive logging for debugging frame selection and processing
  • Handles GIT models and other image-text-to-text models appropriately

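For reference, time-spaced extraction with PyAV might look roughly like the sketch below. This is an assumed implementation, not the PR's exact code: the function name is made up, it decodes the whole clip into memory (fine for illustration, wasteful for long videos), and it assumes every decoded frame carries a pts timestamp.

```python
# Hypothetical sketch of time-spaced frame sampling with PyAV.
import av
import numpy as np

def sample_frames(path: str, num_frames: int) -> list[np.ndarray]:
    with av.open(path) as container:
        stream = container.streams.video[0]
        frames, times = [], []
        for frame in container.decode(stream):
            # Convert pts (in stream.time_base units) to seconds.
            frames.append(frame.to_ndarray(format="rgb24"))
            times.append(float(frame.pts * stream.time_base))
        # Evenly spaced target timestamps across the clip, then keep
        # the decoded frame closest in time to each target.
        targets = np.linspace(times[0], times[-1], num_frames)
        times_arr = np.asarray(times)
        picks = [int(np.argmin(np.abs(times_arr - t))) for t in targets]
        return [frames[i] for i in picks]
```

Sampling by timestamp rather than by frame index is what makes this robust to variable-frame-rate video, where index spacing and time spacing diverge.
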
Example Usage

```python
from transformers import pipeline

# Basic usage
captioner = pipeline("video-to-text", model="microsoft/git-base")
result = captioner("path/to/video.mp4")

# With custom frame count
result = captioner("path/to/video.mp4", num_frames=32)

# With system prompt
result = captioner("path/to/video.mp4", system_prompt="Describe this video in detail.")

# With generation parameters
result = captioner("path/to/video.mp4", generate_kwargs={"temperature": 0.7, "max_new_tokens": 100})
```

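Given the batch-processing support listed above, passing a list of videos should also work. The calling convention below is assumed, mirroring other transformers pipelines:

```python
# Batch processing: a list in, a list of results out (assumed behavior).
results = captioner(["path/to/video1.mp4", "path/to/video2.mp4"], num_frames=16)
```
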
Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

This PR adds a new pipeline for video-to-text generation, which is a multimodal task. The implementation follows the existing pipeline patterns and integrates with the existing pipeline registry system.

@mkorn1 changed the title from "Feat/base v2t pipe" to "Add VideoToTextPipeline with smart frame sampling and system prompts" on Nov 26, 2025
@zucchini-nlp (Member) commented:

Thanks for the PR @mkorn1. I just merged the any-to-any pipeline (#40884), and I want us to use it for all modalities when possible. Having a separate pipeline for each modality-to-text is not easy to maintain in the long term.
