From 95388dc19745a3033390379f11c1cb4d2be0973b Mon Sep 17 00:00:00 2001
From: gcanlin
Date: Tue, 20 Jan 2026 05:15:42 +0000
Subject: [PATCH 1/5] [Docs][Model] Support Qwen3-VL-Embedding & Qwen3-VL-Reranker

Signed-off-by: gcanlin
---
 docs/source/tutorials/Qwen3-VL-Embedding.md   | 131 +++++++++++
 docs/source/tutorials/Qwen3-VL-Reranker.md    | 204 ++++++++++++++++++
 .../support_matrix/supported_models.md        |   2 +
 3 files changed, 337 insertions(+)
 create mode 100644 docs/source/tutorials/Qwen3-VL-Embedding.md
 create mode 100644 docs/source/tutorials/Qwen3-VL-Reranker.md

diff --git a/docs/source/tutorials/Qwen3-VL-Embedding.md b/docs/source/tutorials/Qwen3-VL-Embedding.md
new file mode 100644
index 00000000000..027094155ed
--- /dev/null
+++ b/docs/source/tutorials/Qwen3-VL-Embedding.md
@@ -0,0 +1,131 @@
+# Qwen3-VL-Embedding
+
+## Introduction
+
+The Qwen3-VL-Embedding model series are the latest additions to the Qwen family, built upon the recently open-sourced and powerful Qwen3-VL foundation model. Specifically designed for multimodal information retrieval and cross-modal understanding, this suite accepts diverse inputs including text, images, screenshots, and videos, as well as inputs containing a mixture of these modalities. This guide describes how to run the model with vLLM Ascend.
+
+The model supports three types of embeddings:
+- **Text-only**: Generate embeddings from text input alone
+- **Image-only**: Generate embeddings from image input alone
+- **Image+Text**: Generate combined embeddings from both image and text inputs
+
+## Supported Features
+
+Refer to [supported features](../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.
+
+## Environment Preparation
+
+### Model Weight
+
+- `Qwen3-VL-Embedding-8B` [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3-VL-Embedding-8B)
+- `Qwen3-VL-Embedding-2B` [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3-VL-Embedding-2B)
+
+It is recommended to download the model weight to a directory shared across nodes, such as `/root/.cache/`.
+
+### Installation
+
+You can use our official docker image to run `Qwen3-VL-Embedding` series models.
+
+- Start the docker image on your node, refer to [using docker](../installation.md#set-up-using-docker).
+
+If you don't want to use the docker image as above, you can also build all from source:
+
+- Install `vllm-ascend` from source, refer to [installation](../installation.md).
+
+## Deployment
+
+Using the Qwen3-VL-Embedding-8B model as an example, start the server with the following command:
+
+### Online Inference
+
+```bash
+vllm serve Qwen/Qwen3-VL-Embedding-8B --runner pooling
+```
+
+Once your server is started, you can query the model with input prompts.
+
+```bash
+curl http://127.0.0.1:8000/v1/embeddings -H "Content-Type: application/json" -d '{
+  "input": [
+    "The capital of China is Beijing.",
+    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
+  ]
+}'
+```
+
+### Offline Inference
+
+```python
+import torch
+from vllm import LLM
+
+def get_detailed_instruct(task_description: str, query: str) -> str:
+    return f'Instruct: {task_description}\nQuery:{query}'
+
+
+if __name__ == "__main__":
+    # Each query must come with a one-sentence instruction that describes the task
+    task = 'Given a web search query, retrieve relevant passages that answer the query'
+
+    queries = [
+        get_detailed_instruct(task, 'What is the capital of China?'),
+        get_detailed_instruct(task, 'Explain gravity')
+    ]
+    # No need to add instruction for retrieval documents
+    documents = [
+        "The capital of China is Beijing.",
+        "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
+    ]
+    input_texts = queries + documents
+
+    model = LLM(model="Qwen/Qwen3-VL-Embedding-8B",
+                runner="pooling",
+                distributed_executor_backend="mp")
+
+    outputs = model.embed(input_texts)
+    embeddings = torch.tensor([o.outputs.embedding for o in outputs])
+    scores = (embeddings[:2] @ embeddings[2:].T)
+    print(scores.tolist())
+```
+
+If you run this script successfully, you will see output like the following:
+
+```bash
+Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 192.47it/s]
+Processed prompts:   0%|          | 0/4 [00:00<?, ?it/s]
+```
+
+For more examples, refer to the vLLM official examples:
+- [Offline Vision Embedding Example](https://github.com/vllm-project/vllm/blob/main/examples/pooling/embed/vision_embedding_offline.py)
+- [Online Vision Embedding Example](https://github.com/vllm-project/vllm/blob/main/examples/pooling/embed/vision_embedding_online.py)
+
+## Performance
+
+Run performance of `Qwen3-VL-Embedding-8B` as an example.
+Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/) for more details.
+
+Take the `serve` as an example. Run the code as follows.
+
+```bash
+Mean E2EL (ms): 10360.53
+Median E2EL (ms): 10354.37
+P99 E2EL (ms): 19423.21
+==================================================
+```
\ No newline at end of file
diff --git a/docs/source/tutorials/Qwen3-VL-Reranker.md b/docs/source/tutorials/Qwen3-VL-Reranker.md
new file mode 100644
index 00000000000..aaa67f1cc55
--- /dev/null
+++ b/docs/source/tutorials/Qwen3-VL-Reranker.md
@@ -0,0 +1,204 @@
+# Qwen3-VL-Reranker
+
+## Introduction
+The Qwen3-VL-Embedding and Qwen3-VL-Reranker model series are the latest additions to the Qwen family, built upon the recently open-sourced and powerful Qwen3-VL foundation model. Specifically designed for multimodal information retrieval and cross-modal understanding, this suite accepts diverse inputs including text, images, screenshots, and videos, as well as inputs containing a mixture of these modalities. This guide describes how to run the model with vLLM Ascend.
+
+## Supported Features
+
+Refer to [supported features](../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.
+
+## Environment Preparation
+
+### Model Weight
+
+- `Qwen3-VL-Reranker-8B` [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3-VL-Reranker-8B)
+- `Qwen3-VL-Reranker-2B` [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3-VL-Reranker-2B)
+
+It is recommended to download the model weight to a directory shared across nodes, such as `/root/.cache/`.
+
+### Installation
+You can use our official docker image to run `Qwen3-VL-Reranker` series models.
+- Start the docker image on your node, refer to [using docker](../installation.md#set-up-using-docker).
+
+if you don't want to use the docker image as above, you can also build all from source:
+- Install `vllm-ascend` from source, refer to [installation](../installation.md).
+
+## Deployment
+
+Using the Qwen3-VL-Reranker-8B model as an example, first run the docker container with the following command:
+
+### Online Inference
+
+```bash
+vllm serve Qwen/Qwen3-VL-Reranker-8B \
+    --runner pooling \
+    --max-model-len 4096 \
+    --hf_overrides '{"architectures": ["Qwen3VLForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}' \
+    --chat-template /path/to/vllm/examples/pooling/score/template/qwen3_vl_reranker.jinja
+```
+
+Once your server is started, you can send requests with the following examples.
+
+```python
+import requests
+
+url = "http://127.0.0.1:8000/rerank"
+
+prefix = '<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>\n<|im_start|>user\n'
+suffix = "<|im_end|>\n<|im_start|>assistant\n"
+
+query_template = "{prefix}<Instruct>: {instruction}\n<Query>: {query}\n"
+document_template = "<Document>: {doc}{suffix}"
+
+instruction = (
+    "Given a search query, retrieve relevant candidates that answer the query."
+)
+
+query = "What is the capital of China?"
+
+documents = [
+    "The capital of China is Beijing.",
+    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
+]
+
+documents = [
+    document_template.format(doc=doc, suffix=suffix) for doc in documents
+]
+
+response = requests.post(url,
+                         json={
+                             "query": query_template.format(prefix=prefix, instruction=instruction, query=query),
+                             "documents": documents,
+                         }).json()
+
+print(response)
+```
+
+If you run this script successfully, you will see a rerank response with relevance scores printed to the console, similar to this:
+
+```bash
+{'id': 'rerank-ac3495afa8e12404', 'model': 'Qwen/Qwen3-VL-Reranker-8B', 'usage': {'prompt_tokens': 315, 'total_tokens': 315}, 'results': [{'index': 0, 'document': {'text': '<Document>: The capital of China is Beijing.<|im_end|>\n<|im_start|>assistant\n', 'multi_modal': None}, 'relevance_score': 0.6368980407714844}, {'index': 1, 'document': {'text': '<Document>: Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.<|im_end|>\n<|im_start|>assistant\n', 'multi_modal': None}, 'relevance_score': 0.20816077291965485}]}
+```
+
+### Offline Inference
+
+```python
+from vllm import LLM
+
+model_name = "Qwen/Qwen3-VL-Reranker-8B"
+
+# What is the difference between the official original version and one
+# that has been converted into a sequence classification model?
+# Qwen3-Reranker is a language model that performs reranking using the
+# logits of the "no" and "yes" tokens.
+# It needs to compute the logits of 151669 tokens, making this method extremely
+# inefficient, not to mention incompatible with the vllm score API.
+# A method for converting the original model into a sequence classification
+# model was proposed. See:https://huggingface.co/Qwen/Qwen3-Reranker-0.6B/discussions/3
+# Models converted offline using this method can not only be more efficient
+# and support the vllm score API, but also make the init parameters more
+# concise, for example:
+# model = LLM(model="Qwen/Qwen3-VL-Reranker-8B", runner="pooling")
+
+# If you want to load the official original version, the init parameters are
+# as follows.
+
+model = LLM(
+    model=model_name,
+    runner="pooling",
+    hf_overrides={
+        # Manually route to the sequence classification architecture.
+        # This tells vLLM to use Qwen3VLForSequenceClassification instead of
+        # the default Qwen3VLForConditionalGeneration.
+        "architectures": ["Qwen3VLForSequenceClassification"],
+        # Specify which token logits to extract from the language model head.
+        # The original reranker uses the "no" and "yes" token logits for scoring.
+        "classifier_from_token": ["no", "yes"],
+        # Enable special handling for original Qwen3-Reranker models.
+        # This flag triggers conversion logic that transforms the two token
+        # vectors into a single classification vector.
+        "is_original_qwen3_reranker": True,
+    },
+)

+# Why do we need hf_overrides for the official original version:
+# vllm converts it to Qwen3VLForSequenceClassification when loaded for
+# better performance.
+# - First, we use `"architectures": ["Qwen3VLForSequenceClassification"]`
+# to manually route to Qwen3VLForSequenceClassification.
+# - Then, we extract the vectors corresponding to classifier_from_token
+# from lm_head using `"classifier_from_token": ["no", "yes"]`.
+# - Third, we convert these two vectors into one vector. The conversion
+# logic is enabled by `"is_original_qwen3_reranker": True`.

+# Please use the query_template and document_template to format the query and
+# document for better reranking results.

+prefix = '<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>\n<|im_start|>user\n'
+suffix = "<|im_end|>\n<|im_start|>assistant\n"

+query_template = "{prefix}<Instruct>: {instruction}\n<Query>: {query}\n"
+document_template = "<Document>: {doc}{suffix}"

+if __name__ == "__main__":
+    instruction = (
+        "Given a search query, retrieve relevant candidates that answer the query."
+    )
+
+    query = "What is the capital of China?"
+
+    documents = [
+        "The capital of China is Beijing.",
+        "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
+    ]
+
+    documents = [document_template.format(doc=doc, suffix=suffix) for doc in documents]
+
+    outputs = model.score(query_template.format(prefix=prefix, instruction=instruction, query=query), documents)
+
+    print([output.outputs.score for output in outputs])
+```
+
+If you run this script successfully, you will see a list of scores printed to the console, similar to this:
+
+```bash
+Adding requests: 100%|████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2409.83it/s]
+Processed prompts:   0%|          | 0/2 [00:00<?, ?it/s]
+```
+
+For more examples, refer to the vLLM official examples:
+- [Offline Vision Reranker Example](https://github.com/vllm-project/vllm/blob/main/examples/pooling/score/vision_reranker_offline.py)
+- [Online Vision Reranker Example](https://github.com/vllm-project/vllm/blob/main/examples/pooling/score/vision_reranker_online.py)
+
+## Performance
+
+Run performance of `Qwen3-VL-Reranker-8B` as an example.
+Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/) for more details.
+
+Take the `serve` as an example. Run the code as follows.
+
+```bash
+Mean E2EL (ms): 7474.64
+Median E2EL (ms): 7528.72
+P99 E2EL (ms): 13523.32
+==================================================
+```
\ No newline at end of file

From: gcanlin
Date: Tue, 20 Jan 2026 05:22:15 +0000
Subject: [PATCH 2/5] fix

Signed-off-by: gcanlin
---
 docs/source/tutorials/Qwen3-VL-Embedding.md | 4 ++--
 docs/source/tutorials/Qwen3-VL-Reranker.md  | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/source/tutorials/Qwen3-VL-Embedding.md b/docs/source/tutorials/Qwen3-VL-Embedding.md
index 027094155ed..de85acb04f0 100644
--- a/docs/source/tutorials/Qwen3-VL-Embedding.md
+++ b/docs/source/tutorials/Qwen3-VL-Embedding.md
@@ -60,7 +60,7 @@ import torch
 from vllm import LLM
 
 def get_detailed_instruct(task_description: str, query: str) -> str:
-    return f'Instruct: {task_description}\nQuery:{query}'
+    return f'Instruct: {task_description}\nQuery: {query}'
 
 
 if __name__ == "__main__":
@@ -105,7 +105,7 @@ For more examples, refer to the vLLM official examples:
 ## Performance
 
 Run performance of `Qwen3-VL-Embedding-8B` as an example.
-Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/) for more details.
+Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/cli/) for more details.
 
 Take the `serve` as an example. Run the code as follows.
 
diff --git a/docs/source/tutorials/Qwen3-VL-Reranker.md b/docs/source/tutorials/Qwen3-VL-Reranker.md
index aaa67f1cc55..8ed1040e159 100644
--- a/docs/source/tutorials/Qwen3-VL-Reranker.md
+++ b/docs/source/tutorials/Qwen3-VL-Reranker.md
@@ -97,7 +97,7 @@ model_name = "Qwen/Qwen3-VL-Reranker-8B"
 # It needs to compute the logits of 151669 tokens, making this method extremely
 # inefficient, not to mention incompatible with the vllm score API.
 # A method for converting the original model into a sequence classification
-# model was proposed. See:https://huggingface.co/Qwen/Qwen3-Reranker-0.6B/discussions/3
+# model was proposed. See: https://huggingface.co/Qwen/Qwen3-Reranker-0.6B/discussions/3
 # Models converted offline using this method can not only be more efficient
 # and support the vllm score API, but also make the init parameters more
 # concise, for example:
@@ -178,7 +178,7 @@ For more examples, refer to the vLLM official examples:
 ## Performance
 
 Run performance of `Qwen3-VL-Reranker-8B` as an example.
-Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/) for more details.
+Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/cli/) for more details.
 
 Take the `serve` as an example. Run the code as follows.
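
The comment block that PATCH 1 introduces (and PATCH 2 touches up) states that the original reranker scores documents with the logits of the "no" and "yes" tokens, and that vLLM collapses those two lm_head rows into a single classification vector. The arithmetic behind that collapse is small enough to check directly. The sketch below uses made-up shapes and placeholder token ids — stand-ins, not vLLM internals or the real Qwen vocabulary ids — to show why the converted model reproduces the original scores:

```python
# Sketch of the "two token vectors -> one classification vector" conversion.
# All tensors and token ids here are hypothetical stand-ins, not vLLM code.
import torch

torch.manual_seed(0)
hidden_size, vocab_size = 8, 151669
lm_head = torch.randn(vocab_size, hidden_size)  # stand-in LM head weights
no_id, yes_id = 14, 9693                        # hypothetical "no"/"yes" token ids

h = torch.randn(hidden_size)  # final hidden state for one (query, document) pair

# Original reranker: compute every vocabulary logit, then take P("yes")
# from a softmax over just the "no"/"yes" pair.
full_logits = lm_head @ h
p_yes_original = torch.softmax(full_logits[[no_id, yes_id]], dim=-1)[1]

# Converted model: a single classification vector and a sigmoid.
classifier = lm_head[yes_id] - lm_head[no_id]
p_yes_converted = torch.sigmoid(classifier @ h)

# softmax([a, b])[1] == sigmoid(b - a), so the two scores match exactly.
assert torch.allclose(p_yes_original, p_yes_converted, atol=1e-6)
print(float(p_yes_original), float(p_yes_converted))
```

Because a softmax over exactly two logits is the sigmoid of their difference, storing the single difference vector is lossless, and scoring drops from a full-vocabulary matmul to one dot product per sequence — which is also what makes the converted model compatible with the score API.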
From 632011a309569074149b27208524d1ca542a925b Mon Sep 17 00:00:00 2001
From: gcanlin
Date: Tue, 20 Jan 2026 05:26:17 +0000
Subject: [PATCH 3/5] add template for reranker

Signed-off-by: gcanlin
---
 docs/source/tutorials/Qwen3-VL-Reranker.md | 38 ++++++++++++++++++++--
 docs/source/tutorials/index.md             |  2 ++
 2 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/docs/source/tutorials/Qwen3-VL-Reranker.md b/docs/source/tutorials/Qwen3-VL-Reranker.md
index 8ed1040e159..fcc8f82f9a5 100644
--- a/docs/source/tutorials/Qwen3-VL-Reranker.md
+++ b/docs/source/tutorials/Qwen3-VL-Reranker.md
@@ -25,16 +25,50 @@
 
 ## Deployment
 
-Using the Qwen3-VL-Reranker-8B model as an example, first run the docker container with the following command:
+Using the Qwen3-VL-Reranker-8B model as an example:
+
+### Chat Template
+
+The Qwen3-VL-Reranker model requires a specific chat template for proper formatting. Create a file named `qwen3_vl_reranker.jinja` with the following content:
+
+```jinja
+<|im_start|>system
+Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>
+<|im_start|>user
+<Instruct>: {{
+    messages
+    | selectattr("role", "eq", "system")
+    | map(attribute="content")
+    | first
+    | default("Given a search query, retrieve relevant candidates that answer the query.")
+}}
+<Query>: {{
+    messages
+    | selectattr("role", "eq", "query")
+    | map(attribute="content")
+    | first
+}}
+<Document>: {{
+    messages
+    | selectattr("role", "eq", "document")
+    | map(attribute="content")
+    | first
+}}<|im_end|>
+<|im_start|>assistant
+
+```
+
+Save this file to a location of your choice (e.g., `./qwen3_vl_reranker.jinja`).
 
 ### Online Inference
 
+Start the server with the following command:
+
 ```bash
 vllm serve Qwen/Qwen3-VL-Reranker-8B \
     --runner pooling \
     --max-model-len 4096 \
     --hf_overrides '{"architectures": ["Qwen3VLForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}' \
-    --chat-template /path/to/vllm/examples/pooling/score/template/qwen3_vl_reranker.jinja
+    --chat-template ./qwen3_vl_reranker.jinja
 ```
 
 Once your server is started, you can send requests with the following examples.
diff --git a/docs/source/tutorials/index.md b/docs/source/tutorials/index.md
index dc78743dd02..10cf4604bf3 100644
--- a/docs/source/tutorials/index.md
+++ b/docs/source/tutorials/index.md
@@ -13,7 +13,9 @@ Qwen3-VL-30B-A3B-Instruct.md
 Qwen3-VL-235B-A22B-Instruct.md
 Qwen3-Coder-30B-A3B.md
 Qwen3_embedding.md
+Qwen3-VL-Embedding.md
 Qwen3_reranker.md
+Qwen3-VL-Reranker.md
 Qwen3-8B-W4A8.md
 Qwen3-32B-W4A4.md
 Qwen3-Next.md

From 49ee97b1ed0d4abb7891fa7bd48531bd76f25874 Mon Sep 17 00:00:00 2001
From: gcanlin
Date: Tue, 20 Jan 2026 06:42:27 +0000
Subject: [PATCH 4/5] fix lint

Signed-off-by: gcanlin
---
 docs/source/tutorials/Qwen3-VL-Embedding.md | 4 +++-
 docs/source/tutorials/Qwen3-VL-Reranker.md  | 9 +++++++--
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/docs/source/tutorials/Qwen3-VL-Embedding.md b/docs/source/tutorials/Qwen3-VL-Embedding.md
index de85acb04f0..d47c36ec0a0 100644
--- a/docs/source/tutorials/Qwen3-VL-Embedding.md
+++ b/docs/source/tutorials/Qwen3-VL-Embedding.md
@@ -5,6 +5,7 @@
 The Qwen3-VL-Embedding model series are the latest additions to the Qwen family, built upon the recently open-sourced and powerful Qwen3-VL foundation model. Specifically designed for multimodal information retrieval and cross-modal understanding, this suite accepts diverse inputs including text, images, screenshots, and videos, as well as inputs containing a mixture of these modalities. This guide describes how to run the model with vLLM Ascend.
 
 The model supports three types of embeddings:
+
 - **Text-only**: Generate embeddings from text input alone
 - **Image-only**: Generate embeddings from image input alone
 - **Image+Text**: Generate combined embeddings from both image and text inputs
@@ -99,6 +100,7 @@
 ```
 
 For more examples, refer to the vLLM official examples:
+
 - [Offline Vision Embedding Example](https://github.com/vllm-project/vllm/blob/main/examples/pooling/embed/vision_embedding_offline.py)
 - [Online Vision Embedding Example](https://github.com/vllm-project/vllm/blob/main/examples/pooling/embed/vision_embedding_online.py)
 
@@ -128,4 +130,4 @@ Mean E2EL (ms): 10360.53
 Median E2EL (ms): 10354.37
 P99 E2EL (ms): 19423.21
 ==================================================
-```
\ No newline at end of file
+```
diff --git a/docs/source/tutorials/Qwen3-VL-Reranker.md b/docs/source/tutorials/Qwen3-VL-Reranker.md
index fcc8f82f9a5..740e1a1ca6e 100644
--- a/docs/source/tutorials/Qwen3-VL-Reranker.md
+++ b/docs/source/tutorials/Qwen3-VL-Reranker.md
@@ -1,6 +1,7 @@
 # Qwen3-VL-Reranker
 
 ## Introduction
+
 The Qwen3-VL-Embedding and Qwen3-VL-Reranker model series are the latest additions to the Qwen family, built upon the recently open-sourced and powerful Qwen3-VL foundation model. Specifically designed for multimodal information retrieval and cross-modal understanding, this suite accepts diverse inputs including text, images, screenshots, and videos, as well as inputs containing a mixture of these modalities. This guide describes how to run the model with vLLM Ascend.
 
 ## Supported Features
@@ -17,10 +18,13 @@ Refer to [supported features](../user_guide/support_matrix/supported_models.md)
 It is recommended to download the model weight to a directory shared across nodes, such as `/root/.cache/`.
 
 ### Installation
+
 You can use our official docker image to run `Qwen3-VL-Reranker` series models.
+
 - Start the docker image on your node, refer to [using docker](../installation.md#set-up-using-docker).
 
-if you don't want to use the docker image as above, you can also build all from source:
+If you don't want to use the docker image as above, you can also build all from source:
+
 - Install `vllm-ascend` from source, refer to [installation](../installation.md).
 
 ## Deployment
@@ -206,6 +210,7 @@ Processed prompts: 100%|██████████████████
 ```
 
 For more examples, refer to the vLLM official examples:
+
 - [Offline Vision Reranker Example](https://github.com/vllm-project/vllm/blob/main/examples/pooling/score/vision_reranker_offline.py)
 - [Online Vision Reranker Example](https://github.com/vllm-project/vllm/blob/main/examples/pooling/score/vision_reranker_online.py)
@@ -235,4 +240,4 @@ Mean E2EL (ms): 7474.64
 Median E2EL (ms): 7528.72
 P99 E2EL (ms): 13523.32
 ==================================================
-```
\ No newline at end of file
+```

From b8f99d4b95f48260f1b9a472f2109e2542c31a7e Mon Sep 17 00:00:00 2001
From: gcanlin
Date: Tue, 20 Jan 2026 06:43:44 +0000
Subject: [PATCH 5/5] update

Signed-off-by: gcanlin
---
 docs/source/tutorials/Qwen3-VL-Embedding.md | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/docs/source/tutorials/Qwen3-VL-Embedding.md b/docs/source/tutorials/Qwen3-VL-Embedding.md
index d47c36ec0a0..d39aed9ce5e 100644
--- a/docs/source/tutorials/Qwen3-VL-Embedding.md
+++ b/docs/source/tutorials/Qwen3-VL-Embedding.md
@@ -2,13 +2,7 @@
 
 ## Introduction
 
-The Qwen3-VL-Embedding model series are the latest additions to the Qwen family, built upon the recently open-sourced and powerful Qwen3-VL foundation model. Specifically designed for multimodal information retrieval and cross-modal understanding, this suite accepts diverse inputs including text, images, screenshots, and videos, as well as inputs containing a mixture of these modalities. This guide describes how to run the model with vLLM Ascend.
-
-The model supports three types of embeddings:
-
-- **Text-only**: Generate embeddings from text input alone
-- **Image-only**: Generate embeddings from image input alone
-- **Image+Text**: Generate combined embeddings from both image and text inputs
+The Qwen3-VL-Embedding and Qwen3-VL-Reranker model series are the latest additions to the Qwen family, built upon the recently open-sourced and powerful Qwen3-VL foundation model. Specifically designed for multimodal information retrieval and cross-modal understanding, this suite accepts diverse inputs including text, images, screenshots, and videos, as well as inputs containing a mixture of these modalities. This guide describes how to run the model with vLLM Ascend.
 
 ## Supported Features
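
As a closing companion to the embedding tutorial above, the online and offline paths can be tied together with a small client. The sketch below assumes the server started with the tutorial's `vllm serve Qwen/Qwen3-VL-Embedding-8B --runner pooling` command is reachable on the default host and port, and that `/v1/embeddings` returns the OpenAI-compatible shape (`{"data": [{"embedding": [...]}, ...]}`) implied by the curl example; adjust both if your deployment differs:

```python
# Minimal client sketch for the /v1/embeddings endpoint from the embedding
# tutorial. Host/port and response shape are assumptions, not pinned down by
# the patches above.
import requests
import torch

url = "http://127.0.0.1:8000/v1/embeddings"

task = "Given a web search query, retrieve relevant passages that answer the query"
queries = [f"Instruct: {task}\nQuery: What is the capital of China?"]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]

data = requests.post(url, json={"input": queries + documents}).json()["data"]
embeddings = torch.tensor([item["embedding"] for item in data])

# As in the offline example, a plain matrix product of the returned
# embeddings serves as the query-document similarity score.
scores = embeddings[: len(queries)] @ embeddings[len(queries):].T
print(scores.tolist())
```

The printed matrix mirrors the `scores` tensor from the offline example: one row per query, one column per document.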