From 95388dc19745a3033390379f11c1cb4d2be0973b Mon Sep 17 00:00:00 2001
From: gcanlin
Date: Tue, 20 Jan 2026 05:15:42 +0000
Subject: [PATCH 1/5] [Docs][Model] Support Qwen3-VL-Embedding & Qwen3-VL-Reranker

Signed-off-by: gcanlin
---
 docs/source/tutorials/Qwen3-VL-Embedding.md   | 131 +++++++++++
 docs/source/tutorials/Qwen3-VL-Reranker.md    | 204 ++++++++++++++++++
 .../support_matrix/supported_models.md        |   2 +
 3 files changed, 337 insertions(+)
 create mode 100644 docs/source/tutorials/Qwen3-VL-Embedding.md
 create mode 100644 docs/source/tutorials/Qwen3-VL-Reranker.md

diff --git a/docs/source/tutorials/Qwen3-VL-Embedding.md b/docs/source/tutorials/Qwen3-VL-Embedding.md
new file mode 100644
index 00000000000..027094155ed
--- /dev/null
+++ b/docs/source/tutorials/Qwen3-VL-Embedding.md
@@ -0,0 +1,131 @@
+# Qwen3-VL-Embedding
+
+## Introduction
+
+The Qwen3-VL-Embedding model series are the latest additions to the Qwen family, built upon the recently open-sourced and powerful Qwen3-VL foundation model. Specifically designed for multimodal information retrieval and cross-modal understanding, this suite accepts diverse inputs including text, images, screenshots, and videos, as well as inputs containing a mixture of these modalities. This guide describes how to run the model with vLLM Ascend.
+
+The model supports three types of embeddings:
+- **Text-only**: Generate embeddings from text input alone
+- **Image-only**: Generate embeddings from image input alone
+- **Image+Text**: Generate combined embeddings from both image and text inputs
+
+## Supported Features
+
+Refer to [supported features](../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.
+
+## Environment Preparation
+
+### Model Weight
+
+- `Qwen3-VL-Embedding-8B` [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3-VL-Embedding-8B)
+- `Qwen3-VL-Embedding-2B` [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3-VL-Embedding-2B)
+
+It is recommended to download the model weight to a directory shared across nodes, such as `/root/.cache/`.
+
+### Installation
+
+You can use our official docker image to run `Qwen3-VL-Embedding` series models.
+
+- Start the docker image on your node, refer to [using docker](../installation.md#set-up-using-docker).
+
+If you don't want to use the docker image as above, you can also build all from source:
+
+- Install `vllm-ascend` from source, refer to [installation](../installation.md).
+
+## Deployment
+
+Using the Qwen3-VL-Embedding-8B model as an example, start the server with the following command:
+
+### Online Inference
+
+```bash
+vllm serve Qwen/Qwen3-VL-Embedding-8B --runner pooling
+```
+
+Once your server is started, you can query the model with input prompts.
+
+```bash
+curl http://127.0.0.1:8000/v1/embeddings -H "Content-Type: application/json" -d '{
+  "input": [
+    "The capital of China is Beijing.",
+    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
+  ]
+}'
+```
+
+### Offline Inference
+
+```python
+import torch
+from vllm import LLM
+
+def get_detailed_instruct(task_description: str, query: str) -> str:
+    return f'Instruct: {task_description}\nQuery:{query}'
+
+
+if __name__ == "__main__":
+    # Each query must come with a one-sentence instruction that describes the task
+    task = 'Given a web search query, retrieve relevant passages that answer the query'
+
+    queries = [
+        get_detailed_instruct(task, 'What is the capital of China?'),
+        get_detailed_instruct(task, 'Explain gravity')
+    ]
+    # No need to add instruction for retrieval documents
+    documents = [
+        "The capital of China is Beijing.",
+        "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
+    ]
+    input_texts = queries + documents
+
+    model = LLM(model="Qwen/Qwen3-VL-Embedding-8B",
+                runner="pooling",
+                distributed_executor_backend="mp")
+
+    outputs = model.embed(input_texts)
+    embeddings = torch.tensor([o.outputs.embedding for o in outputs])
+    scores = (embeddings[:2] @ embeddings[2:].T)
+    print(scores.tolist())
+```
+
+If you run this script successfully, you will see output like the following:
+
+```bash
+Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 192.47it/s]
+Processed prompts:   0%|          | 0/4 [00:00<?, ?it/s]
+```
+
+For more examples, refer to the vLLM official examples:
+- [Offline Vision Embedding Example](https://github.com/vllm-project/vllm/blob/main/examples/pooling/embed/vision_embedding_offline.py)
+- [Online Vision Embedding Example](https://github.com/vllm-project/vllm/blob/main/examples/pooling/embed/vision_embedding_online.py)
+
+## Performance
+
+Run performance of `Qwen3-VL-Embedding-8B` as an example.
+Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/) for more details.
+
+Take the `serve` as an example. Run the code as follows.
+
+```bash
+Mean E2EL (ms): 10360.53
+Median E2EL (ms): 10354.37
+P99 E2EL (ms): 19423.21
+==================================================
+```
\ No newline at end of file
diff --git a/docs/source/tutorials/Qwen3-VL-Reranker.md b/docs/source/tutorials/Qwen3-VL-Reranker.md
new file mode 100644
index 00000000000..aaa67f1cc55
--- /dev/null
+++ b/docs/source/tutorials/Qwen3-VL-Reranker.md
@@ -0,0 +1,204 @@
+# Qwen3-VL-Reranker
+
+## Introduction
+The Qwen3-VL-Embedding and Qwen3-VL-Reranker model series are the latest additions to the Qwen family, built upon the recently open-sourced and powerful Qwen3-VL foundation model. Specifically designed for multimodal information retrieval and cross-modal understanding, this suite accepts diverse inputs including text, images, screenshots, and videos, as well as inputs containing a mixture of these modalities. This guide describes how to run the model with vLLM Ascend.
+
+## Supported Features
+
+Refer to [supported features](../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.
+
+## Environment Preparation
+
+### Model Weight
+
+- `Qwen3-VL-Reranker-8B` [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3-VL-Reranker-8B)
+- `Qwen3-VL-Reranker-2B` [Download model weight](https://www.modelscope.cn/models/Qwen/Qwen3-VL-Reranker-2B)
+
+It is recommended to download the model weight to a directory shared across nodes, such as `/root/.cache/`.
+
+### Installation
+You can use our official docker image to run `Qwen3-VL-Reranker` series models.
+- Start the docker image on your node, refer to [using docker](../installation.md#set-up-using-docker).
+
+if you don't want to use the docker image as above, you can also build all from source:
+- Install `vllm-ascend` from source, refer to [installation](../installation.md).
+
+## Deployment
+
+Using the Qwen3-VL-Reranker-8B model as an example, first run the docker container with the following command:
+
+### Online Inference
+
+```bash
+vllm serve Qwen/Qwen3-VL-Reranker-8B \
+    --runner pooling \
+    --max-model-len 4096 \
+    --hf_overrides '{"architectures": ["Qwen3VLForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}' \
+    --chat-template /path/to/vllm/examples/pooling/score/template/qwen3_vl_reranker.jinja
+```
+
+Once your server is started, you can send requests with the following examples.
+
+```python
+import requests
+
+url = "http://127.0.0.1:8000/rerank"
+
+prefix = '<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>\n<|im_start|>user\n'
+suffix = "<|im_end|>\n<|im_start|>assistant\n"
+
+query_template = "{prefix}<Instruct>: {instruction}\n<Query>: {query}\n"
+document_template = "<Document>: {doc}{suffix}"
+
+instruction = (
+    "Given a search query, retrieve relevant candidates that answer the query."
+)
+
+query = "What is the capital of China?"
+
+documents = [
+    "The capital of China is Beijing.",
+    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
+]
+
+documents = [
+    document_template.format(doc=doc, suffix=suffix) for doc in documents
+]
+
+response = requests.post(url,
+                         json={
+                             "query": query_template.format(prefix=prefix, instruction=instruction, query=query),
+                             "documents": documents,
+                         }).json()
+
+print(response)
+```
+
+If you run this script successfully, you will see a rerank response with relevance scores printed to the console, similar to this:
+
+```bash
+{'id': 'rerank-ac3495afa8e12404', 'model': 'Qwen/Qwen3-VL-Reranker-8B', 'usage': {'prompt_tokens': 315, 'total_tokens': 315}, 'results': [{'index': 0, 'document': {'text': '<Document>: The capital of China is Beijing.<|im_end|>\n<|im_start|>assistant\n', 'multi_modal': None}, 'relevance_score': 0.6368980407714844}, {'index': 1, 'document': {'text': '<Document>: Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.<|im_end|>\n<|im_start|>assistant\n', 'multi_modal': None}, 'relevance_score': 0.20816077291965485}]}
+```
+
+### Offline Inference
+
+```python
+from vllm import LLM
+
+model_name = "Qwen/Qwen3-VL-Reranker-8B"
+
+# What is the difference between the official original version and one
+# that has been converted into a sequence classification model?
+# Qwen3-Reranker is a language model that performs reranking using the
+# logits of the "no" and "yes" tokens.
+# It needs to compute the logits of 151669 tokens, making this method extremely
+# inefficient, not to mention incompatible with the vllm score API.
+# A method for converting the original model into a sequence classification
+# model was proposed. See:https://huggingface.co/Qwen/Qwen3-Reranker-0.6B/discussions/3
+# Models converted offline using this method can not only be more efficient
+# and support the vllm score API, but also make the init parameters more
+# concise, for example:
+# model = LLM(model="Qwen/Qwen3-VL-Reranker-8B", runner="pooling")
+
+# If you want to load the official original version, the init parameters are
+# as follows.
+
+model = LLM(
+    model=model_name,
+    runner="pooling",
+    hf_overrides={
+        # Manually route to the sequence classification architecture.
+        # This tells vLLM to use Qwen3VLForSequenceClassification instead of
+        # the default Qwen3VLForConditionalGeneration.
+        "architectures": ["Qwen3VLForSequenceClassification"],
+        # Specify which token logits to extract from the language model head.
+        # The original reranker uses the "no" and "yes" token logits for scoring.
+        "classifier_from_token": ["no", "yes"],
+        # Enable special handling for original Qwen3-Reranker models.
+        # This flag triggers conversion logic that transforms the two token
+        # vectors into a single classification vector.
+        "is_original_qwen3_reranker": True,
+    },
+)

+# Why do we need hf_overrides for the official original version:
+# vllm converts it to Qwen3VLForSequenceClassification when loaded for
+# better performance.
+# - First, we use `"architectures": ["Qwen3VLForSequenceClassification"]`
+# to manually route to Qwen3VLForSequenceClassification.
+# - Then, we extract the vectors corresponding to classifier_from_token
+# from lm_head using `"classifier_from_token": ["no", "yes"]`.
+# - Third, we convert these two vectors into one vector. The conversion
+# logic is enabled by `"is_original_qwen3_reranker": True`.

+# Please use the query_template and document_template to format the query and
+# document for better reranking results.

+prefix = '<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>\n<|im_start|>user\n'
+suffix = "<|im_end|>\n<|im_start|>assistant\n"

+query_template = "{prefix}<Instruct>: {instruction}\n<Query>: {query}\n"
+document_template = "<Document>: {doc}{suffix}"

+if __name__ == "__main__":
+    instruction = (
+        "Given a search query, retrieve relevant candidates that answer the query."
+    )
+
+    query = "What is the capital of China?"
+
+    documents = [
+        "The capital of China is Beijing.",
+        "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
+    ]
+
+    documents = [document_template.format(doc=doc, suffix=suffix) for doc in documents]
+
+    outputs = model.score(query_template.format(prefix=prefix, instruction=instruction, query=query), documents)
+
+    print([output.outputs.score for output in outputs])
+```
+
+If you run this script successfully, you will see a list of scores printed to the console, similar to this:
+
+```bash
+Adding requests: 100%|████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2409.83it/s]
+Processed prompts:   0%|          | 0/2 [00:00<?, ?it/s]
+```
+
+For more examples, refer to the vLLM official examples:
+- [Offline Vision Reranker Example](https://github.com/vllm-project/vllm/blob/main/examples/pooling/score/vision_reranker_offline.py)
+- [Online Vision Reranker Example](https://github.com/vllm-project/vllm/blob/main/examples/pooling/score/vision_reranker_online.py)
+
+## Performance
+
+Run performance of `Qwen3-VL-Reranker-8B` as an example.
+Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/) for more details.
+
+Take the `serve` as an example. Run the code as follows.
+
+```bash
+Mean E2EL (ms): 7474.64
+Median E2EL (ms): 7528.72
+P99 E2EL (ms): 13523.32
+==================================================
+```
\ No newline at end of file

From: gcanlin
Date: Tue, 20 Jan 2026 05:22:15 +0000
Subject: [PATCH 2/5] fix

Signed-off-by: gcanlin
---
 docs/source/tutorials/Qwen3-VL-Embedding.md | 4 ++--
 docs/source/tutorials/Qwen3-VL-Reranker.md  | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/source/tutorials/Qwen3-VL-Embedding.md b/docs/source/tutorials/Qwen3-VL-Embedding.md
index 027094155ed..de85acb04f0 100644
--- a/docs/source/tutorials/Qwen3-VL-Embedding.md
+++ b/docs/source/tutorials/Qwen3-VL-Embedding.md
@@ -60,7 +60,7 @@ import torch
 from vllm import LLM
 
 def get_detailed_instruct(task_description: str, query: str) -> str:
-    return f'Instruct: {task_description}\nQuery:{query}'
+    return f'Instruct: {task_description}\nQuery: {query}'
 
 
 if __name__ == "__main__":
@@ -105,7 +105,7 @@ For more examples, refer to the vLLM official examples:
 ## Performance
 
 Run performance of `Qwen3-VL-Embedding-8B` as an example.
-Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/) for more details.
+Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/cli/) for more details.
 
 Take the `serve` as an example. Run the code as follows.
 
diff --git a/docs/source/tutorials/Qwen3-VL-Reranker.md b/docs/source/tutorials/Qwen3-VL-Reranker.md
index aaa67f1cc55..8ed1040e159 100644
--- a/docs/source/tutorials/Qwen3-VL-Reranker.md
+++ b/docs/source/tutorials/Qwen3-VL-Reranker.md
@@ -97,7 +97,7 @@ model_name = "Qwen/Qwen3-VL-Reranker-8B"
 # It needs to compute the logits of 151669 tokens, making this method extremely
 # inefficient, not to mention incompatible with the vllm score API.
 # A method for converting the original model into a sequence classification
-# model was proposed. See:https://huggingface.co/Qwen/Qwen3-Reranker-0.6B/discussions/3
+# model was proposed. See: https://huggingface.co/Qwen/Qwen3-Reranker-0.6B/discussions/3
 # Models converted offline using this method can not only be more efficient
 # and support the vllm score API, but also make the init parameters more
 # concise, for example:
@@ -178,7 +178,7 @@ For more examples, refer to the vLLM official examples:
 ## Performance
 
 Run performance of `Qwen3-VL-Reranker-8B` as an example.
-Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/) for more details.
+Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/benchmarking/cli/) for more details.
 
 Take the `serve` as an example. Run the code as follows.
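
The comment block that PATCH 1 introduces (and PATCH 2 touches up) states that the original reranker scores documents with the logits of the "no" and "yes" tokens, and that vLLM collapses those two lm_head rows into a single classification vector. The arithmetic behind that collapse is small enough to check directly. The sketch below uses made-up shapes and placeholder token ids — stand-ins, not vLLM internals or the real Qwen vocabulary ids — to show why the converted model reproduces the original scores:

```python
# Sketch of the "two token vectors -> one classification vector" conversion.
# All tensors and token ids here are hypothetical stand-ins, not vLLM code.
import torch

torch.manual_seed(0)
hidden_size, vocab_size = 8, 151669
lm_head = torch.randn(vocab_size, hidden_size)  # stand-in LM head weights
no_id, yes_id = 14, 9693                        # hypothetical "no"/"yes" token ids

h = torch.randn(hidden_size)  # final hidden state for one (query, document) pair

# Original reranker: compute every vocabulary logit, then take P("yes")
# from a softmax over just the "no"/"yes" pair.
full_logits = lm_head @ h
p_yes_original = torch.softmax(full_logits[[no_id, yes_id]], dim=-1)[1]

# Converted model: a single classification vector and a sigmoid.
classifier = lm_head[yes_id] - lm_head[no_id]
p_yes_converted = torch.sigmoid(classifier @ h)

# softmax([a, b])[1] == sigmoid(b - a), so the two scores match exactly.
assert torch.allclose(p_yes_original, p_yes_converted, atol=1e-6)
print(float(p_yes_original), float(p_yes_converted))
```

Because a softmax over exactly two logits is the sigmoid of their difference, storing the single difference vector is lossless, and scoring drops from a full-vocabulary matmul to one dot product per sequence — which is also what makes the converted model compatible with the score API.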
From 632011a309569074149b27208524d1ca542a925b Mon Sep 17 00:00:00 2001
From: gcanlin
Date: Tue, 20 Jan 2026 05:26:17 +0000
Subject: [PATCH 3/5] add template for reranker

Signed-off-by: gcanlin
---
 docs/source/tutorials/Qwen3-VL-Reranker.md | 38 ++++++++++++++++++++--
 docs/source/tutorials/index.md             |  2 ++
 2 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/docs/source/tutorials/Qwen3-VL-Reranker.md b/docs/source/tutorials/Qwen3-VL-Reranker.md
index 8ed1040e159..fcc8f82f9a5 100644
--- a/docs/source/tutorials/Qwen3-VL-Reranker.md
+++ b/docs/source/tutorials/Qwen3-VL-Reranker.md
@@ -25,16 +25,50 @@
 
 ## Deployment
 
-Using the Qwen3-VL-Reranker-8B model as an example, first run the docker container with the following command:
+Using the Qwen3-VL-Reranker-8B model as an example:
+
+### Chat Template
+
+The Qwen3-VL-Reranker model requires a specific chat template for proper formatting. Create a file named `qwen3_vl_reranker.jinja` with the following content:
+
+```jinja
+<|im_start|>system
+Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>
+<|im_start|>user
+<Instruct>: {{
+    messages
+    | selectattr("role", "eq", "system")
+    | map(attribute="content")
+    | first
+    | default("Given a search query, retrieve relevant candidates that answer the query.")
+}}
+<Query>: {{
+    messages
+    | selectattr("role", "eq", "query")
+    | map(attribute="content")
+    | first
+}}
+<Document>: {{
+    messages
+    | selectattr("role", "eq", "document")
+    | map(attribute="content")
+    | first
+}}<|im_end|>
+<|im_start|>assistant
+
+```
+
+Save this file to a location of your choice (e.g., `./qwen3_vl_reranker.jinja`).
 
 ### Online Inference
 
+Start the server with the following command:
+
 ```bash
 vllm serve Qwen/Qwen3-VL-Reranker-8B \
     --runner pooling \
     --max-model-len 4096 \
     --hf_overrides '{"architectures": ["Qwen3VLForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}' \
-    --chat-template /path/to/vllm/examples/pooling/score/template/qwen3_vl_reranker.jinja
+    --chat-template ./qwen3_vl_reranker.jinja
 ```
 
 Once your server is started, you can send requests with the following examples.
diff --git a/docs/source/tutorials/index.md b/docs/source/tutorials/index.md
index dc78743dd02..10cf4604bf3 100644
--- a/docs/source/tutorials/index.md
+++ b/docs/source/tutorials/index.md
@@ -13,7 +13,9 @@ Qwen3-VL-30B-A3B-Instruct.md
 Qwen3-VL-235B-A22B-Instruct.md
 Qwen3-Coder-30B-A3B.md
 Qwen3_embedding.md
+Qwen3-VL-Embedding.md
 Qwen3_reranker.md
+Qwen3-VL-Reranker.md
 Qwen3-8B-W4A8.md
 Qwen3-32B-W4A4.md
 Qwen3-Next.md

From 49ee97b1ed0d4abb7891fa7bd48531bd76f25874 Mon Sep 17 00:00:00 2001
From: gcanlin
Date: Tue, 20 Jan 2026 06:42:27 +0000
Subject: [PATCH 4/5] fix lint

Signed-off-by: gcanlin
---
 docs/source/tutorials/Qwen3-VL-Embedding.md | 4 +++-
 docs/source/tutorials/Qwen3-VL-Reranker.md  | 9 +++++++--
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/docs/source/tutorials/Qwen3-VL-Embedding.md b/docs/source/tutorials/Qwen3-VL-Embedding.md
index de85acb04f0..d47c36ec0a0 100644
--- a/docs/source/tutorials/Qwen3-VL-Embedding.md
+++ b/docs/source/tutorials/Qwen3-VL-Embedding.md
@@ -5,6 +5,7 @@
 The Qwen3-VL-Embedding model series are the latest additions to the Qwen family, built upon the recently open-sourced and powerful Qwen3-VL foundation model. Specifically designed for multimodal information retrieval and cross-modal understanding, this suite accepts diverse inputs including text, images, screenshots, and videos, as well as inputs containing a mixture of these modalities. This guide describes how to run the model with vLLM Ascend.
 
 The model supports three types of embeddings:
+
 - **Text-only**: Generate embeddings from text input alone
 - **Image-only**: Generate embeddings from image input alone
 - **Image+Text**: Generate combined embeddings from both image and text inputs
@@ -99,6 +100,7 @@
 ```
 
 For more examples, refer to the vLLM official examples:
+
 - [Offline Vision Embedding Example](https://github.com/vllm-project/vllm/blob/main/examples/pooling/embed/vision_embedding_offline.py)
 - [Online Vision Embedding Example](https://github.com/vllm-project/vllm/blob/main/examples/pooling/embed/vision_embedding_online.py)
 
@@ -128,4 +130,4 @@ Mean E2EL (ms): 10360.53
 Median E2EL (ms): 10354.37
 P99 E2EL (ms): 19423.21
 ==================================================
-```
\ No newline at end of file
+```
diff --git a/docs/source/tutorials/Qwen3-VL-Reranker.md b/docs/source/tutorials/Qwen3-VL-Reranker.md
index fcc8f82f9a5..740e1a1ca6e 100644
--- a/docs/source/tutorials/Qwen3-VL-Reranker.md
+++ b/docs/source/tutorials/Qwen3-VL-Reranker.md
@@ -1,6 +1,7 @@
 # Qwen3-VL-Reranker
 
 ## Introduction
+
 The Qwen3-VL-Embedding and Qwen3-VL-Reranker model series are the latest additions to the Qwen family, built upon the recently open-sourced and powerful Qwen3-VL foundation model. Specifically designed for multimodal information retrieval and cross-modal understanding, this suite accepts diverse inputs including text, images, screenshots, and videos, as well as inputs containing a mixture of these modalities. This guide describes how to run the model with vLLM Ascend.
 
 ## Supported Features
@@ -17,10 +18,13 @@ Refer to [supported features](../user_guide/support_matrix/supported_models.md)
 It is recommended to download the model weight to a directory shared across nodes, such as `/root/.cache/`.
 
 ### Installation
+
 You can use our official docker image to run `Qwen3-VL-Reranker` series models.
+
 - Start the docker image on your node, refer to [using docker](../installation.md#set-up-using-docker).
 
-if you don't want to use the docker image as above, you can also build all from source:
+If you don't want to use the docker image as above, you can also build all from source:
+
 - Install `vllm-ascend` from source, refer to [installation](../installation.md).
 
 ## Deployment
@@ -206,6 +210,7 @@ Processed prompts: 100%|██████████████████
 ```
 
 For more examples, refer to the vLLM official examples:
+
 - [Offline Vision Reranker Example](https://github.com/vllm-project/vllm/blob/main/examples/pooling/score/vision_reranker_offline.py)
 - [Online Vision Reranker Example](https://github.com/vllm-project/vllm/blob/main/examples/pooling/score/vision_reranker_online.py)
@@ -235,4 +240,4 @@ Mean E2EL (ms): 7474.64
 Median E2EL (ms): 7528.72
 P99 E2EL (ms): 13523.32
 ==================================================
-```
\ No newline at end of file
+```

From b8f99d4b95f48260f1b9a472f2109e2542c31a7e Mon Sep 17 00:00:00 2001
From: gcanlin
Date: Tue, 20 Jan 2026 06:43:44 +0000
Subject: [PATCH 5/5] update

Signed-off-by: gcanlin
---
 docs/source/tutorials/Qwen3-VL-Embedding.md | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/docs/source/tutorials/Qwen3-VL-Embedding.md b/docs/source/tutorials/Qwen3-VL-Embedding.md
index d47c36ec0a0..d39aed9ce5e 100644
--- a/docs/source/tutorials/Qwen3-VL-Embedding.md
+++ b/docs/source/tutorials/Qwen3-VL-Embedding.md
@@ -2,13 +2,7 @@
 
 ## Introduction
 
-The Qwen3-VL-Embedding model series are the latest additions to the Qwen family, built upon the recently open-sourced and powerful Qwen3-VL foundation model. Specifically designed for multimodal information retrieval and cross-modal understanding, this suite accepts diverse inputs including text, images, screenshots, and videos, as well as inputs containing a mixture of these modalities. This guide describes how to run the model with vLLM Ascend.
-
-The model supports three types of embeddings:
-
-- **Text-only**: Generate embeddings from text input alone
-- **Image-only**: Generate embeddings from image input alone
-- **Image+Text**: Generate combined embeddings from both image and text inputs
+The Qwen3-VL-Embedding and Qwen3-VL-Reranker model series are the latest additions to the Qwen family, built upon the recently open-sourced and powerful Qwen3-VL foundation model. Specifically designed for multimodal information retrieval and cross-modal understanding, this suite accepts diverse inputs including text, images, screenshots, and videos, as well as inputs containing a mixture of these modalities. This guide describes how to run the model with vLLM Ascend.
 
 ## Supported Features
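
As a closing companion to the embedding tutorial above, the online and offline paths can be tied together with a small client. The sketch below assumes the server started with the tutorial's `vllm serve Qwen/Qwen3-VL-Embedding-8B --runner pooling` command is reachable on the default host and port, and that `/v1/embeddings` returns the OpenAI-compatible shape (`{"data": [{"embedding": [...]}, ...]}`) implied by the curl example; adjust both if your deployment differs:

```python
# Minimal client sketch for the /v1/embeddings endpoint from the embedding
# tutorial. Host/port and response shape are assumptions, not pinned down by
# the patches above.
import requests
import torch

url = "http://127.0.0.1:8000/v1/embeddings"

task = "Given a web search query, retrieve relevant passages that answer the query"
queries = [f"Instruct: {task}\nQuery: What is the capital of China?"]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]

data = requests.post(url, json={"input": queries + documents}).json()["data"]
embeddings = torch.tensor([item["embedding"] for item in data])

# As in the offline example, a plain matrix product of the returned
# embeddings serves as the query-document similarity score.
scores = embeddings[: len(queries)] @ embeddings[len(queries):].T
print(scores.tolist())
```

The printed matrix mirrors the `scores` tensor from the offline example: one row per query, one column per document.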