Support multimodal models with vLLM #3670
Comments
Agree.
I am very interested in support for vision models in LocalAI, particularly Llama-3.2-11B-Vision and Pixtral-12B.
Referenced commit: Related to #3670. Signed-off-by: Ettore Di Giacinto <[email protected]>
#3729 should cover most of the models and also adds video understanding. Model configuration files need to specify the placeholders the model uses for image/video tags in the text prompt; going to experiment with this once it's in master and update the model gallery with a few examples.
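As a rough sketch of what such a configuration could look like (the `image`/`video` placeholder keys below are assumptions for illustration, not a final schema):

```yaml
# Hypothetical LocalAI model config for a vLLM multimodal model.
# The image/video placeholder keys are assumptions, not the final schema.
name: pixtral
backend: vllm
parameters:
  model: mistralai/Pixtral-12B-2409
template:
  # Let vLLM apply the model's own chat template.
  use_tokenizer_template: true
  # Hypothetical placeholders marking where image/video data is
  # injected into the text prompt.
  image: "[IMG]"
  video: "[VID]"
```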
feat(vllm): add support for image-to-text and video-to-text

* feat(vllm): add support for image-to-text (related to #3670)
* feat(vllm): add support for video-to-text (closes #2318)
* feat(vllm): support CPU installations
* feat(vllm): add bnb
* chore: add docs reference
* Apply suggestions from code review

Signed-off-by: Ettore Di Giacinto <[email protected]>
Great news! However, I'm missing two Docker images.
Is your feature request related to a problem? Please describe.
Many models are now becoming multimodal, that is, they can accept images, videos, or audio during inference. The llama.cpp project currently provides multimodal support and we do as well by using it; however, there are models which aren't supported yet (for instance #3535 and #3669; see also ggerganov/llama.cpp#9455).
Describe the solution you'd like
LocalAI to support vLLM multimodal capabilities
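From the client side, this would presumably follow the standard OpenAI vision message format that LocalAI's OpenAI-compatible `/v1/chat/completions` endpoint mirrors; a minimal sketch, where the endpoint, model name, and image URL are placeholders:

```python
# Minimal sketch of a multimodal request against LocalAI's
# OpenAI-compatible API. Endpoint, model name, and image URL are
# placeholders, not confirmed values.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="pixtral",  # hypothetical model name from the local gallery
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/cat.png"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```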
Describe alternatives you've considered
Additional context
See #3535 and #3669; tangentially related to #2318 and #3602.
https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_pixtral.py
https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language_multi_image.py
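For reference, the first linked example boils down to passing image URLs inside chat messages; a condensed sketch of that upstream script, using its model and a sample image URL:

```python
# Condensed sketch of vLLM's offline multimodal inference, based on
# the linked offline_inference_pixtral.py example.
from vllm import LLM
from vllm.sampling_params import SamplingParams

llm = LLM(model="mistralai/Pixtral-12B-2409", tokenizer_mode="mistral")
sampling_params = SamplingParams(max_tokens=256)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {
                "type": "image_url",
                "image_url": {"url": "https://picsum.photos/id/237/200/300"},
            },
        ],
    }
]

outputs = llm.chat(messages, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```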