Chat template: return vectorized output in processors #34275
zucchini-nlp merged 42 commits into huggingface:main
Conversation
Rocketknight1
left a comment
This looks good to me! Just to clarify, the idea is that if you pass a chat to apply_chat_template and some of the content fields contain images or videos, and tokenize=True, then images and videos are loaded and processed, so that the output is ready to pass to the model?
Yep, exactly!
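The flow confirmed above can be sketched as a toy scan over chat messages. The message schema below (content blocks with `type`/`url` keys) mirrors the common chat-template format, and `collect_visuals` is a hypothetical stand-in for the extraction step that `apply_chat_template(tokenize=True)` would perform before handing visuals to the processor; it is not the PR's actual code.

```python
def collect_visuals(messages):
    """Walk chat messages and pull out image/video entries, roughly the way
    the templating path gathers visuals before calling the processor."""
    images, videos = [], []
    for message in messages:
        for block in message.get("content", []):
            if block.get("type") == "image":
                images.append(block["url"])
            elif block.get("type") == "video":
                videos.append(block["url"])
    return images, videos

chat = [
    {"role": "user", "content": [
        {"type": "image", "url": "https://example.com/cat.png"},
        {"type": "text", "text": "What is in this image?"},
    ]},
]
images, videos = collect_visuals(chat)
```

In the real pipeline the collected URLs would then be loaded and processed so the returned dict is ready to pass to the model.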
qubvel
left a comment
Thanks for working on this!
The main question that is not clear to me is: what is the backend selection strategy? Should we pass it explicitly? I see `load_video` is used without passing the `backend` and `num_frames` arguments.
Welcome back @qubvel! Okay, I'll add more type hints and better docs. The backend should be selectable by the user, but we default to the one that works in all cases and has no weird CUDA-related failures. We should probably document this somewhere, but I haven't found a good place for it yet.
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
```python
if video.startswith("https://www.youtube.com") or video.startswith("http://www.youtube.com"):
    if not is_yt_dlp_available():
        raise ImportError("To load a video from YouTube url you have to install `yt_dlp` first.")
    buffer = BytesIO()
    with redirect_stdout(buffer), YoutubeDL() as f:
        f.download([video])
    bytes_obj = buffer.getvalue()
    file_obj = BytesIO(bytes_obj)
elif video.startswith("http://") or video.startswith("https://"):
    file_obj = BytesIO(requests.get(video).content)
```
Some additional kwargs might be required here, e.g. timeout, but probably fine for now
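One hedged way to address the timeout concern: thread request options through an explicit parameter. `fetch_remote_video` below is a hypothetical wrapper (not the PR's code), and the `fetcher` argument stands in for `requests.get(...).content` so the sketch runs offline.

```python
from io import BytesIO

def fetch_remote_video(url, fetcher, timeout=10.0):
    # Forward `timeout` explicitly so a stalled server cannot hang video loading.
    payload = fetcher(url, timeout=timeout)
    return BytesIO(payload)

def fake_get(url, timeout):
    # Offline stand-in for `requests.get(url, timeout=...).content`.
    assert timeout > 0
    return b"video-bytes"

buf = fetch_remote_video("https://example.com/clip.mp4", fake_get)
```

With real `requests`, the same shape would be `BytesIO(requests.get(video, timeout=timeout).content)`.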
Huh, I don't know why it requested review from so many people, feel free to unsubscribe, sorry
ArthurZucker
left a comment
The API looks super good to me!
I am mostly wondering whether this poses security threats, since we now open links for the user, whereas before they had to open the link explicitly in their own code.
Hmm, good point about the security. We actually already have a few processors that open links for you, e.g. Idefics and Pixtral. I haven't seen anyone flag it as a security breach, so maybe it's not a big deal?
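A lightweight mitigation for the link-opening concern, sketched here as a hypothetical guard (not something this PR adds): validate URL schemes before fetching, so `file://` or `data:` URIs embedded in chat content are never opened automatically.

```python
from urllib.parse import urlparse

def is_fetchable(url, allowed_schemes=("http", "https")):
    # Only plain web URLs with a host are eligible for automatic download;
    # local paths, file:// and data: URIs are rejected.
    parsed = urlparse(url)
    return parsed.scheme in allowed_schemes and bool(parsed.netloc)
```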
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Let's remove it or put it in the benchmark file, but that's probably overkill!
Also this file seems unrelated
```python
if is_decord_available():
    from decord import VideoReader, cpu

if is_av_available():
    import av

if is_cv2_available():
    import cv2

if is_yt_dlp_available():
    from yt_dlp import YoutubeDL
```
This block breaks lazy importing of `cv2`, which vLLM strictly enforces. It happens when vLLM runs `from transformers.image_utils import ImageInput`. vLLM cannot upgrade to v4.49.0 because of it (vllm-project/vllm#13905).
Would it be possible to delay this import? That would be preferable to lazily importing `ImageInput` everywhere it's used in vLLM.
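The deferral requested above can be sketched with the standard lazy-import pattern: move the backend import from module scope into the function that needs it. The demo below uses the stdlib `csv` module as a runnable stand-in for `cv2`, and `read_frames_opencv` is a hypothetical function name, not the library's.

```python
import sys

sys.modules.pop("csv", None)  # reset so the demo can observe the lazy load

def read_frames_opencv(path):
    # The heavy backend is imported only when this backend is actually used,
    # so importing the module that defines this function stays cheap.
    import csv  # stand-in for `import cv2`
    return csv.__name__, path

loaded_before = "csv" in sys.modules
backend, _ = read_frames_opencv("clip.mp4")
loaded_after = "csv" in sys.modules
```

Nothing is imported until the first call, which is exactly the property vLLM's lazy-import check enforces.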
What does this PR do?
Part of #33948. This PR adds support for `return_tensors="pt"` when calling chat templates for processors. That way users can obtain inputs in tensor format and pass them directly to the model, instead of having to call the processor with a formatted prompt + visuals.

For images we use the existing `load_images` functionality, and for videos I added a few functions. We usually use `av` in all video-related model docs since `decord` had problems with CUDA in the past. Apart from that, we can use `opencv` or `torchvision` for video loading. I ran a small benchmark loading and uniformly sampling 32 frames from around ~100 videos: `av` was the slowest of them all, while `decord` was the fastest. Therefore I decided to add a helper with all possible backends and let users switch whenever they want to. By default we use `opencv`, as it is a more common CV framework than any of the others provided here. In the future we might start using `torchvision` when we add a `VideoProcessor` class and support `VideoProcessorFast` (see #33504).

These are the results of small benchmarking with ~100 videos:
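The backend-selection design described above (user-selectable backend, `opencv` as the default) can be sketched as a dispatch table. The decoder names and stub bodies here are illustrative only, not the PR's actual helpers.

```python
def _read_opencv(path, num_frames):
    return f"opencv:{path}:{num_frames}"

def _read_decord(path, num_frames):
    return f"decord:{path}:{num_frames}"

def _read_pyav(path, num_frames):
    return f"pyav:{path}:{num_frames}"

def _read_torchvision(path, num_frames):
    return f"torchvision:{path}:{num_frames}"

VIDEO_DECODERS = {
    "opencv": _read_opencv,          # default: common CV framework, no CUDA quirks
    "decord": _read_decord,          # fastest in the benchmark described above
    "pyav": _read_pyav,              # slowest in the benchmark described above
    "torchvision": _read_torchvision,
}

def load_video(path, num_frames=32, backend="opencv"):
    # Users can switch decoding backends; unknown names fail loudly.
    if backend not in VIDEO_DECODERS:
        raise ValueError(f"Unknown video backend {backend!r}; choose from {sorted(VIDEO_DECODERS)}")
    return VIDEO_DECODERS[backend](path, num_frames)
```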
Review from @Rocketknight1 for templates and @qubvel for general CV-related modifications.