[WIP]: Base multimodal model for VLLM's transformers backend
#36367
What does this PR do?
This is a draft PR to get your feedback on the ideas I have for enabling the `transformers` backend for vision LLMs in vLLM. A few things had to be changed, which are a bit breaking.

- We need a base model without an LM head for all models. We cannot make assumptions about a model's architecture and load the vision/LM parts separately, since some models have a multimodal projector, some have pooling, etc. We could maybe use `AutoModelForImageTextToText`, but that means we would have to infer the modality from the config and also map every future multimodal model through its own auto-class (`AutoVideoTextToText`?). So I decided a base class is the most generic approach (see the first sketch after this list).
- For processors, one new helper is added so that vLLM can get prompt replacements without having to add `_get_prompt_replacements` for each processor. On the vLLM side we'll set `return_mm_token_type_ids` to `True` by default and then infer placeholder positions by splitting `mm_token_type_ids` into chunks matching each item's multimodal token length. Each chunk will then be a separate placeholder.

Pseudocode for the placeholder inference is in the second sketch below; I'll add more in the draft PR on the vLLM repo.
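First, a rough sketch of the headless base-model idea: a vision encoder plus projector feeding a decoder that returns hidden states, with no LM head on top. The class and attribute names here (`BaseMultimodalModel`, `vision_tower`, `multi_modal_projector`, `language_model`) are illustrative assumptions, not the final transformers API:

```python
import torch
from torch import nn
from typing import Optional


class BaseMultimodalModel(nn.Module):
    """Illustrative headless multimodal base: no LM head, so the caller
    (a *ForConditionalGeneration wrapper, vLLM, a pooling model, ...)
    decides what goes on top of the hidden states."""

    def __init__(self, vision_tower: nn.Module, multi_modal_projector: nn.Module,
                 language_model: nn.Module, image_token_id: int):
        super().__init__()
        self.vision_tower = vision_tower                    # image encoder
        self.multi_modal_projector = multi_modal_projector  # vision features -> LM hidden size
        self.language_model = language_model                # decoder WITHOUT lm_head (e.g. a transformers *Model)
        self.image_token_id = image_token_id                # id of the <image> placeholder token

    def get_image_features(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # Encode images and project them into the LM embedding space.
        return self.multi_modal_projector(self.vision_tower(pixel_values))

    def forward(self, input_ids: torch.LongTensor,
                pixel_values: Optional[torch.Tensor] = None,
                attention_mask: Optional[torch.Tensor] = None):
        inputs_embeds = self.language_model.get_input_embeddings()(input_ids)
        if pixel_values is not None:
            image_features = self.get_image_features(pixel_values)
            # Scatter projected image features into the <image> placeholder positions.
            image_mask = (input_ids == self.image_token_id).unsqueeze(-1)
            inputs_embeds = inputs_embeds.masked_scatter(
                image_mask, image_features.to(inputs_embeds.dtype))
        # Return hidden states only; no logits are produced here.
        return self.language_model(inputs_embeds=inputs_embeds,
                                   attention_mask=attention_mask)
```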
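And a minimal sketch of the placeholder inference on the vLLM side, assuming a single sequence and that we already know how many multimodal tokens each item expands to. `infer_mm_placeholders`, `mm_token_lengths`, and the returned dict layout are made-up names for illustration, not vLLM's actual interface:

```python
import torch


def infer_mm_placeholders(mm_token_type_ids: torch.Tensor, mm_token_lengths: list[int]):
    """Split the positions marked in `mm_token_type_ids` (1 = multimodal token,
    0 = text token) into per-item (offset, length) placeholders, consuming
    `mm_token_lengths` in order."""
    mm_positions = (mm_token_type_ids[0] == 1).nonzero(as_tuple=True)[0].tolist()
    placeholders, cursor = [], 0
    for length in mm_token_lengths:
        chunk = mm_positions[cursor:cursor + length]
        placeholders.append({"offset": chunk[0], "length": length})
        cursor += length
    return placeholders


# Example: two images occupying 3 and 2 placeholder tokens respectively.
token_type_ids = torch.tensor([[0, 1, 1, 1, 0, 0, 1, 1, 0]])
print(infer_mm_placeholders(token_type_ids, [3, 2]))
# [{'offset': 1, 'length': 3}, {'offset': 6, 'length': 2}]
```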