[V1][TPU] TPU multimodal model support#13496
mgoin wants to merge 4 commits into vllm-project:main
Conversation
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Now that #13049 has landed, this is an updated version of #12133
Currently only focused on usability and correctness for Llava-style multimodal models, not performance.
When using a multimodal model, we will pre-compile the prefills using the `inputs_embeds` input rather than `input_ids`. We will still use `input_ids` for decode in this iteration, but this will change with the addition of proper chunked prefill.

This does not deal with pre-compiling the encoder forward pass, so if the model is passed an image/video/audio input with a new shape, it will force compilation at runtime.
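To illustrate the idea, here is a minimal sketch (not vLLM's actual code; all names are hypothetical) of how a TPU model runner might select which input the prefill graph is compiled against, depending on whether the model is multimodal:

```python
# Hypothetical sketch of the input-selection logic described above.
# Multimodal models precompile the prefill against inputs_embeds so that
# image/video/audio embeddings can be merged into the token embeddings;
# text-only models keep compiling against input_ids directly.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PrefillInput:
    """What the pre-compiled TPU prefill graph consumes."""
    input_ids: Optional[List[int]] = None              # text-only path
    inputs_embeds: Optional[List[List[float]]] = None  # multimodal path


def build_prefill_input(
    is_multimodal: bool,
    token_ids: List[int],
    embed_fn: Callable[[int], List[float]],
) -> PrefillInput:
    if is_multimodal:
        # Embed tokens up front; multimodal embeddings would be
        # scattered into this tensor before the prefill forward pass.
        return PrefillInput(inputs_embeds=[embed_fn(t) for t in token_ids])
    # Text-only models pass raw token IDs to the compiled graph.
    return PrefillInput(input_ids=token_ids)
```

Decode would still take the `input_ids` path in this iteration, per the description above.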
Tested Examples
- Image: Llava ✅
- Audio: Qwen2 Audio ✅