Add VideoToTextPipeline with smart frame sampling and system prompts #42432
What does this PR do?
This PR creates a placeholder pipeline in anticipation of video-to-text models being added to the HF ecosystem. In the meantime, it takes a multimodal approach: frames sampled from the video are passed to an image-text-to-text model.
This PR adds a comprehensive VideoToTextPipeline implementation that enables video-to-text generation using image-text-to-text models. The pipeline supports intelligent frame sampling, system prompts for guided generation, and flexible generation parameters.
Key Features
num_frames parameter allows users to specify the exact number of frames to sample
system_prompt parameter enables guided generation with custom instructions
generate_kwargs to pass any generation parameters to the underlying model
Handling of max_new_tokens when it is passed both as an argument and in generate_kwargs
Implementation Details
VideoToTextPipeline class extending the base Pipeline class
Example Usage
from transformers import pipeline

# Basic usage
captioner = pipeline("video-to-text", model="microsoft/git-base")
result = captioner("path/to/video.mp4")

# With custom frame count
result = captioner("path/to/video.mp4", num_frames=32)

# With system prompt
result = captioner("path/to/video.mp4", system_prompt="Describe this video in detail.")

# With generation parameters
result = captioner("path/to/video.mp4", generate_kwargs={"temperature": 0.7, "max_new_tokens": 100})
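The description does not spell out how num_frames maps to concrete frames, so the following is only a rough sketch of one plausible strategy: evenly spaced indices across the clip. The function name sample_frame_indices and the uniform-spacing rule are assumptions made for illustration, not code taken from this PR.

```python
def sample_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Pick `num_frames` evenly spaced frame indices from a clip.

    Hypothetical helper: uniform spacing is an assumption, not
    necessarily the sampling rule this PR implements.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")

    # Short clip: there is nothing to skip, so return every frame.
    if num_frames >= total_frames:
        return list(range(total_frames))

    # A single frame cannot span the clip, so take the middle one.
    if num_frames == 1:
        return [total_frames // 2]

    # Spread the indices so the first and last frames are both included.
    step = (total_frames - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]


print(sample_frame_indices(10, 4))  # [0, 3, 6, 9]
```

Including both endpoints keeps the opening and closing shots in the sample. A real implementation might instead weight frames by scene changes or motion, which this sketch does not attempt.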
Fixes # (issue)
Before submitting
This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?
Who can review?
This PR adds a new pipeline for video-to-text generation, which is a multimodal task. Please review:
The implementation follows the existing pipeline patterns and integrates with the existing pipeline registry system.
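For readers unfamiliar with those patterns: a pipeline typically splits call-time arguments into preprocess, forward, and postprocess parameters in a _sanitize_parameters step. The standalone sketch below illustrates that split for the parameters named in this description, including one possible way to handle max_new_tokens being supplied twice. It is a hypothetical illustration that does not import transformers, and raising an error on the conflict is an assumption; the PR may resolve it differently.

```python
from typing import Any, Optional


def sanitize_parameters(
    num_frames: Optional[int] = None,
    system_prompt: Optional[str] = None,
    max_new_tokens: Optional[int] = None,
    generate_kwargs: Optional[dict[str, Any]] = None,
) -> tuple[dict[str, Any], dict[str, Any], dict[str, Any]]:
    """Split call-time arguments into (preprocess, forward, postprocess) dicts.

    Hypothetical stand-in for a pipeline's `_sanitize_parameters` method.
    """
    # Parameters that shape the model inputs belong to preprocessing.
    preprocess_params: dict[str, Any] = {}
    if num_frames is not None:
        preprocess_params["num_frames"] = num_frames
    if system_prompt is not None:
        preprocess_params["system_prompt"] = system_prompt

    # Copy so the caller's dict is never mutated.
    forward_params: dict[str, Any] = dict(generate_kwargs or {})
    if max_new_tokens is not None:
        # Assumption: reject ambiguous input instead of silently picking one.
        if "max_new_tokens" in forward_params:
            raise ValueError(
                "max_new_tokens was given both as an argument and in "
                "generate_kwargs; please pass it only once"
            )
        forward_params["max_new_tokens"] = max_new_tokens

    # Nothing in the description calls for postprocess parameters.
    postprocess_params: dict[str, Any] = {}
    return preprocess_params, forward_params, postprocess_params
```

For example, sanitize_parameters(num_frames=32, max_new_tokens=100, generate_kwargs={"temperature": 0.7}) routes num_frames to preprocessing and merges max_new_tokens into the generation arguments.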