Description
Feature request
VL embeddings extend text embeddings by adding embeddings for vision documents (images at the moment, possibly videos in the future) and possibly other document types as well, such as audio.
Motivation
I understand from the name of this project ("Text Embeddings Inference") that this feature might be out of scope, but TGI ("Text Generation Inference") supports, for example, Qwen2.5-VL, which is a VLM (https://huggingface.co/docs/text-generation-inference/supported_models).
Embedding and re-ranking VL models based on Qwen2-VL are already available:
- https://huggingface.co/MrLight/dse-qwen2-2b-mrl-v1
- https://huggingface.co/Alibaba-NLP/gme-Qwen2-VL-2B-Instruct
Adding support for VL embeddings in TEI would be excellent for serverless inference.
Your contribution
I could add support for Qwen2.5 as a model, but Candle does not yet support Conv3d. There is already a PR for Qwen2.5-VL in Candle: huggingface/candle#2995