Support multimodal models with vLLM #3670
Comments
Agree.
I am very interested in support for vision models in LocalAI, particularly Llama-3.2-11B-Vision and Pixtral-12B.
Referenced commit: Related to #3670. Signed-off-by: Ettore Di Giacinto <[email protected]>
#3729 should cover most of the models and also adds video understanding. Model configuration files need to specify the placeholders the model uses for image/video tags in the text prompt; going to experiment with this once it's in master and update the model gallery with a few examples.
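As a rough sketch of what such a configuration could look like (the `image`/`video` placeholder keys below are assumptions for illustration, not a final schema):

```yaml
# Hypothetical LocalAI model config for a vLLM multimodal model.
# The image/video placeholder keys are assumptions, not the final schema.
name: pixtral
backend: vllm
parameters:
  model: mistralai/Pixtral-12B-2409
template:
  # Let vLLM apply the model's own chat template.
  use_tokenizer_template: true
  # Hypothetical placeholders marking where image/video data is
  # injected into the text prompt.
  image: "[IMG]"
  video: "[VID]"
```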
feat(vllm): add support for image-to-text and video-to-text

* feat(vllm): add support for image-to-text (related to #3670)
* feat(vllm): add support for video-to-text (closes #2318)
* feat(vllm): support CPU installations
* feat(vllm): add bnb
* chore: add docs reference
* Apply suggestions from code review

Signed-off-by: Ettore Di Giacinto <[email protected]>
Great news! However, I'm missing two Docker images.
Is your feature request related to a problem? Please describe.
Many models are now becoming multimodal, that is, they can accept images, videos, or audio during inference. The llama.cpp project currently provides multimodal support and we do as well by using it; however, there are models which aren't supported yet (for instance #3535 and #3669; see also ggerganov/llama.cpp#9455).
Describe the solution you'd like
LocalAI to support vLLM multimodal capabilities
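From the client side, this would presumably follow the standard OpenAI vision message format that LocalAI's OpenAI-compatible `/v1/chat/completions` endpoint mirrors; a minimal sketch, where the endpoint, model name, and image URL are placeholders:

```python
# Minimal sketch of a multimodal request against LocalAI's
# OpenAI-compatible API. Endpoint, model name, and image URL are
# placeholders, not confirmed values.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="pixtral",  # hypothetical model name from the local gallery
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/cat.png"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```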
Describe alternatives you've considered
Additional context
See #3535 and #3669; tangentially related to #2318 and #3602.
https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_pixtral.py
https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language_multi_image.py
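For reference, the first linked example boils down to passing image URLs inside chat messages; a condensed sketch of that upstream script, using its model and a sample image URL:

```python
# Condensed sketch of vLLM's offline multimodal inference, based on
# the linked offline_inference_pixtral.py example.
from vllm import LLM
from vllm.sampling_params import SamplingParams

llm = LLM(model="mistralai/Pixtral-12B-2409", tokenizer_mode="mistral")
sampling_params = SamplingParams(max_tokens=256)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {
                "type": "image_url",
                "image_url": {"url": "https://picsum.photos/id/237/200/300"},
            },
        ],
    }
]

outputs = llm.chat(messages, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```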