Support for VLM #669

@SorenDreano

Feature request

Vision-language (VL) embeddings extend text embeddings to visual documents (images at the moment, possibly videos in the future) and potentially to other modalities as well, such as audio.

Motivation

I understand from the name of this project ("Text Embeddings Inference") that this feature might be out of scope, but TGI ("Text Generation Inference") supports, for example, Qwen2.5-VL, which is a VLM (https://huggingface.co/docs/text-generation-inference/supported_models).

Embedding and re-ranking VL models based on Qwen2-VL are already available:

Adding support for VL-embeddings in TEI would be excellent for serverless inference.

Your contribution

I could add support for Qwen2.5 as a model, but Candle does not yet support Conv3d. There is already a PR for Qwen2.5-VL in Candle: huggingface/candle#2995.
