Merged
45 changes: 35 additions & 10 deletions docs/models/pooling_models/README.md
@@ -1,7 +1,8 @@
# Pooling Models

!!! note
We currently support pooling models primarily for convenience. This is not guaranteed to provide any performance improvements over using Hugging Face Transformers or Sentence Transformers directly.
We currently support pooling models primarily for convenience. This is not guaranteed to provide any performance
improvements over using Hugging Face Transformers or Sentence Transformers directly.

We plan to optimize pooling models in vLLM. Please comment on <https://github.com/vllm-project/vllm/issues/21796> if you have any suggestions!

@@ -12,22 +13,38 @@ Natural Language Processing (NLP) can be primarily divided into the following tw
- Natural Language Understanding (NLU)
- Natural Language Generation (NLG)

The generative models supported by vLLM cover a variety of task types, such as the large language models (LLMs) we are familiar with, multimodal models (VLM) that handle multimodal inputs like images, videos, and audio, speech-to-text transcription models, and real-time models that support streaming input. Their common feature is the ability to generate text. Taking it a step further, vLLM-Omni supports the generation of multimodal content, including images, videos, and audio.
The generative models supported by vLLM cover a variety of task types: the large language models (LLMs) we are
familiar with, multimodal models (VLMs) that handle inputs such as images, videos, and audio, speech-to-text
transcription models, and real-time models that support streaming input. Their common feature is the ability to generate
text. Going a step further, vLLM-Omni supports generating multimodal content, including images, videos, and audio.

As the capabilities of generative models continue to improve, the boundaries of these models are also constantly expanding. However, certain application scenarios still require specialized small language models to efficiently complete specific tasks. These models typically have the following characteristics:
As the capabilities of generative models continue to improve, the boundaries of these models are also constantly expanding.
However, certain application scenarios still require specialized small language models to efficiently complete specific tasks.
These models typically have the following characteristics:

- They do not require content generation.
- They only need to perform very limited functions, without requiring strong generalization, creativity, or high intelligence.
- They demand extremely low latency and may operate on cost-constrained hardware.
- Text-only models typically have fewer than 1 billion parameters, while multimodal models generally have fewer than 10 billion parameters.

Although these models are relatively small in scale, they are still based on the Transformer architecture, similar or even identical to the most advanced large language models today. Many recently released pooling models are also fine-tuned from large language models, allowing them to benefit from the continuous improvements in large models. This architecture similarity enables them to reuse much of vLLM’s infrastructure. If compatible, we would be happy to help them leverage the latest features of vLLM as well.
Although these models are relatively small in scale, they are still based on the Transformer architecture, similar or
even identical to the most advanced large language models today. Many recently released pooling models are also fine-tuned
from large language models, allowing them to benefit from the continuous improvements in large models. This architectural
similarity enables them to reuse much of vLLM’s infrastructure. If compatible, we would be happy to help them leverage
the latest features of vLLM as well.

### Sequence-wise Task and Token-wise Task

The key distinction between sequence-wise task and token-wise task lies in their output granularity: sequence-wise task produces a single result for an entire input sequence, whereas token-wise task yields a result for each individual token within the sequence.
The key distinction between sequence-wise and token-wise tasks lies in their output granularity: a sequence-wise task
produces a single result for an entire input sequence, whereas a token-wise task yields a result for each individual token
within the sequence.

Of course, we also have "plugin" tasks that allow users to customize input and output processors. For more information, please refer to [IO Processor Plugins](../../design/io_processor_plugins.md).
Many pooling models support both sequence-wise and token-wise tasks. When the default pooling task (e.g. a sequence-wise
task) is not what you want, you need to specify the desired task (e.g. a token-wise task) manually via
`PoolerConfig(task=<task>)` offline or `--pooler-config.task <task>` online.
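
A minimal offline sketch of this override, assuming a working vLLM install with a supported accelerator (the checkpoint name is illustrative, taken from this PR's tests):

```python
# Sketch: pin a token-wise pooling task instead of the model's default.
# Assumes vLLM is installed; "jason9693/Qwen2.5-1.5B-apeach" is a
# classification checkpoint whose default task would be "classify".
from vllm import LLM
from vllm.config import PoolerConfig

llm = LLM(
    model="jason9693/Qwen2.5-1.5B-apeach",
    pooler_config=PoolerConfig(task="token_classify"),  # override the default task
)
outputs = llm.encode(
    "The chef prepared a delicious meal.", pooling_task="token_classify"
)
```

The online equivalent is the CLI flag, e.g. `vllm serve jason9693/Qwen2.5-1.5B-apeach --pooler-config.task token_classify`.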

Of course, we also have "plugin" tasks that allow users to customize input and output processors. For more information,
please refer to [IO Processor Plugins](../../design/io_processor_plugins.md).

### Pooling Tasks

@@ -39,11 +56,13 @@ Of course, we also have "plugin" tasks that allow users to customize input and o
| `token_embed` | Token-wise | vector representations for each token |

!!! note
Within classification tasks, there is a specialized subcategory: Cross-encoder (aka reranker) models. These models are a subset of classification models that accept two prompts as input and output num_labels equal to 1.
Within classification tasks, there is a specialized subcategory: cross-encoder (aka reranker) models. These models
are a subset of classification models that accept two prompts as input and have `num_labels` equal to 1.

### Score Types

The scoring models is designed to compute similarity scores between two input prompts. It supports three model types (aka `score_type`): `cross-encoder`, `late-interaction`, and `bi-encoder`.
Scoring models are designed to compute similarity scores between two input prompts. They support three model types
(aka `score_type`): `cross-encoder`, `late-interaction`, and `bi-encoder`.
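
For intuition, the `bi-encoder` and `late-interaction` scoring functions can be sketched in NumPy. This is an illustrative sketch with random arrays standing in for real model outputs; a `cross-encoder` produces its score directly from the joint input, so it needs no separate post-hoc scoring function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins for model outputs:
# bi-encoder yields one embedding per prompt; late-interaction yields
# one embedding per token.
q_vec, d_vec = rng.standard_normal(8), rng.standard_normal(8)
q_tokens = rng.standard_normal((3, 8))  # 3 query tokens, hidden size 8
d_tokens = rng.standard_normal((5, 8))  # 5 document tokens


def normalize(x: np.ndarray) -> np.ndarray:
    """L2-normalize along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def bi_encoder_score(q: np.ndarray, d: np.ndarray) -> float:
    """Cosine similarity between the two sequence embeddings."""
    return float(normalize(q) @ normalize(d))


def late_interaction_score(q: np.ndarray, d: np.ndarray) -> float:
    """ColBERT-style MaxSim: each query token matches its most similar
    document token; the per-token maxima are summed."""
    sim = normalize(q) @ normalize(d).T  # shape (num_q_tokens, num_d_tokens)
    return float(sim.max(axis=1).sum())
```

Both functions return larger values for more similar inputs; only their granularity (sequence vs. token) differs.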

| Pooling Tasks | Granularity | Outputs | Score Types | Scoring Function |
|-----------------------|---------------|----------------------------------------------|--------------------|--------------------------|
@@ -250,11 +269,17 @@ We have split the `encode` task into two more specific token-wise tasks: `token_
- `token_embed` is the same as `embed`, but uses normalization as the activation.
- `token_classify` is the same as `classify`, but uses softmax as the activation by default.
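
The two activations can be sketched in plain NumPy (illustrative random arrays, not real model outputs):

```python
import numpy as np

# Illustrative per-token model outputs:
hidden = np.random.default_rng(0).standard_normal((4, 6))  # 4 tokens, hidden size 6
logits = np.random.default_rng(1).standard_normal((4, 3))  # 4 tokens, 3 labels


def token_embed_activation(states: np.ndarray) -> np.ndarray:
    """L2-normalize each token's vector (the `token_embed` activation)."""
    return states / np.linalg.norm(states, axis=-1, keepdims=True)


def token_classify_activation(scores: np.ndarray) -> np.ndarray:
    """Numerically stable row-wise softmax over label logits
    (the default `token_classify` activation)."""
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

After the activation, every token embedding has unit norm, and every token's label scores form a probability distribution.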

Pooling models now default support all pooling, you can use it without any settings.
Pooling models now support token-wise tasks.

- To extract hidden states, prefer the `token_embed` task.
- For Named Entity Recognition (NER) and reward models, prefer the `token_classify` task.

### Score task

`score` task is deprecated and will be removed in v0.20. Please use `classify` instead. Only when a classification model outputs num_labels equal to 1 can it be used as a scoring model and have its scoring API enabled.
The `score` task is deprecated and will be removed in v0.20. Please use `classify` instead. Only when a
classification model outputs `num_labels` equal to 1 can it be used as a scoring model and have its scoring API enabled.

### Pooling multitask support

Pooling multitask support is deprecated and will be removed in v0.20. When the default pooling task is not what you want,
you need to manually specify it via `PoolerConfig(task=<task>)` offline or `--pooler-config.task <task>` online.
6 changes: 6 additions & 0 deletions docs/models/pooling_models/token_classify.md
@@ -13,6 +13,12 @@ The key distinction between (sequence) classification and token classification l

Many classification models support both (sequence) classification and token classification. For further details on (sequence) classification, please refer to [this page](classify.md).

!!! note

Pooling multitask support is deprecated and will be removed in v0.20. When the default pooling task (classify) is not
what you want, you need to manually specify it via `PoolerConfig(task="token_classify")` offline or
`--pooler-config.task token_classify` online.

## Typical Use Cases

### Named Entity Recognition (NER)
6 changes: 6 additions & 0 deletions docs/models/pooling_models/token_embed.md
@@ -13,6 +13,12 @@ The difference between the (sequence) embedding task and the token embedding tas

Many embedding models support both (sequence) embedding and token embedding. For further details on (sequence) embedding, please refer to [this page](embed.md).

!!! note

Pooling multitask support is deprecated and will be removed in v0.20. When the default pooling task (embed) is not
what you want, you need to manually specify it via `PoolerConfig(task="token_embed")` offline or
`--pooler-config.task token_embed` online.

## Typical Use Cases

### Multi-Vector Retrieval
13 changes: 8 additions & 5 deletions tests/entrypoints/pooling/classify/test_offline.py
@@ -1,6 +1,6 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

import logging
import weakref

import pytest
@@ -67,8 +67,11 @@ def test_list_prompts(llm: LLM):


@pytest.mark.skip_global_cleanup
def test_token_classify(llm: LLM):
outputs = llm.encode(prompt, pooling_task="token_classify", use_tqdm=False)
def test_token_classify(llm: LLM, caplog_vllm):
with caplog_vllm.at_level(level=logging.WARNING, logger="vllm"):
outputs = llm.encode(prompt, pooling_task="token_classify", use_tqdm=False)
assert "deprecated" in caplog_vllm.text

assert len(outputs) == 1
assert isinstance(outputs[0], PoolingRequestOutput)
assert outputs[0].prompt_token_ids == prompt_token_ids
@@ -107,8 +110,8 @@ def test_score_api(llm: LLM):
llm.score("ping", "pong", use_tqdm=False)


@pytest.mark.parametrize("task", ["embed", "token_embed", "plugin"])
@pytest.mark.parametrize("task", ["embed", "token_embed"])
def test_unsupported_tasks(llm: LLM, task: PoolingTask):
err_msg = f"Unsupported task: '{task}' Supported tasks.+"
err_msg = "Embedding API is not supported by this model.+"
with pytest.raises(ValueError, match=err_msg):
llm.encode(prompt, pooling_task=task, use_tqdm=False)
54 changes: 48 additions & 6 deletions tests/entrypoints/pooling/embed/test_offline.py
Original file line number Diff line number Diff line change
@@ -1,19 +1,22 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

import logging
import weakref

import pytest
import torch
import torch.nn.functional as F

from vllm import LLM, PoolingParams
from vllm import LLM, EmbeddingRequestOutput, PoolingParams
from vllm.distributed import cleanup_dist_env_and_memory
from vllm.platforms import current_platform
from vllm.tasks import PoolingTask

MODEL_NAME = "intfloat/multilingual-e5-small"

prompts = ["The chef prepared a delicious meal."]
prompt = "The chef prepared a delicious meal."
prompt_token_ids = [0, 581, 21861, 133888, 10, 8, 150, 60744, 109911, 5, 2]
embedding_size = 384


@pytest.fixture(scope="module")
@@ -44,16 +47,48 @@ def llm():


@pytest.mark.skip_global_cleanup
def test_token_embed(llm: LLM):
outputs = llm.encode(prompts, pooling_task="token_embed", use_tqdm=False)
def test_str_prompts(llm: LLM):
outputs = llm.embed(prompt, use_tqdm=False)
assert len(outputs) == 1
assert isinstance(outputs[0], EmbeddingRequestOutput)
assert outputs[0].prompt_token_ids == prompt_token_ids
assert len(outputs[0].outputs.embedding) == embedding_size


@pytest.mark.skip_global_cleanup
def test_token_ids_prompts(llm: LLM):
outputs = llm.embed([prompt_token_ids], use_tqdm=False)
assert len(outputs) == 1
assert isinstance(outputs[0], EmbeddingRequestOutput)
assert outputs[0].prompt_token_ids == prompt_token_ids
assert len(outputs[0].outputs.embedding) == embedding_size


@pytest.mark.skip_global_cleanup
def test_list_prompts(llm: LLM):
outputs = llm.embed([prompt, prompt_token_ids], use_tqdm=False)
assert len(outputs) == 2
for i in range(len(outputs)):
assert isinstance(outputs[i], EmbeddingRequestOutput)
assert outputs[i].prompt_token_ids == prompt_token_ids
assert len(outputs[i].outputs.embedding) == embedding_size


@pytest.mark.skip_global_cleanup
def test_token_embed(llm: LLM, caplog_vllm):
with caplog_vllm.at_level(level=logging.WARNING, logger="vllm"):
outputs = llm.encode(prompt, pooling_task="token_embed", use_tqdm=False)
assert "deprecated" in caplog_vllm.text

multi_vector = outputs[0].outputs.data
assert multi_vector.shape == (11, 384)


@pytest.mark.skip_global_cleanup
def test_pooling_params(llm: LLM):
def get_outputs(normalize):
outputs = llm.embed(
prompts,
[prompt],
pooling_params=PoolingParams(use_activation=normalize),
use_tqdm=False,
)
@@ -70,3 +105,10 @@ def get_outputs(normalize):
assert torch.allclose(w_normal, F.normalize(wo_normal, p=2, dim=-1), atol=1e-2), (
"w_normal should be close to normal(wo_normal)."
)


@pytest.mark.parametrize("task", ["token_classify", "classify"])
def test_unsupported_tasks(llm: LLM, task: PoolingTask):
err_msg = "Classification API is not supported by this model.+"
with pytest.raises(ValueError, match=err_msg):
llm.encode(prompt, pooling_task=task, use_tqdm=False)
7 changes: 6 additions & 1 deletion tests/entrypoints/pooling/score/test_online_rerank.py
@@ -206,7 +206,12 @@ async def test_pooling_classify(server: RemoteOpenAIServer, model_name: str):
async def test_pooling_token_classify(server: RemoteOpenAIServer, model_name: str):
response = requests.post(
server.url_for("pooling"),
json={"model": model_name, "input": input_text, "encoding_format": "float"},
json={
"model": model_name,
"task": "token_classify",
"input": input_text,
"encoding_format": "float",
},
)

poolings = PoolingResponse.model_validate(response.json())
Empty file.
78 changes: 78 additions & 0 deletions tests/entrypoints/pooling/token_classify/test_offline.py
@@ -0,0 +1,78 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import logging
import weakref

import pytest

from vllm import LLM, PoolingRequestOutput
from vllm.config import PoolerConfig
from vllm.distributed import cleanup_dist_env_and_memory
from vllm.tasks import PoolingTask

MODEL_NAME = "jason9693/Qwen2.5-1.5B-apeach"

prompt = "The chef prepared a delicious meal."
prompt_token_ids = [785, 29706, 10030, 264, 17923, 15145, 13]
num_labels = 2


@pytest.fixture(scope="module")
def llm():
# pytest caches the fixture so we use weakref.proxy to
# enable garbage collection
llm = LLM(
model=MODEL_NAME,
pooler_config=PoolerConfig(task="token_classify"),
max_num_batched_tokens=32768,
tensor_parallel_size=1,
gpu_memory_utilization=0.75,
enforce_eager=True,
seed=0,
)

yield weakref.proxy(llm)

del llm

cleanup_dist_env_and_memory()


@pytest.mark.skip_global_cleanup
def test_str_prompts(llm: LLM):
outputs = llm.encode(prompt, pooling_task="token_classify", use_tqdm=False)
assert len(outputs) == 1
assert isinstance(outputs[0], PoolingRequestOutput)
assert outputs[0].prompt_token_ids == prompt_token_ids
assert outputs[0].outputs.data.shape == (len(prompt_token_ids), num_labels)


@pytest.mark.skip_global_cleanup
def test_token_ids_prompts(llm: LLM):
outputs = llm.encode(
[prompt_token_ids], pooling_task="token_classify", use_tqdm=False
)
assert len(outputs) == 1
assert isinstance(outputs[0], PoolingRequestOutput)
assert outputs[0].prompt_token_ids == prompt_token_ids
assert outputs[0].outputs.data.shape == (len(prompt_token_ids), num_labels)


@pytest.mark.skip_global_cleanup
def test_score_api(llm: LLM):
err_msg = "Score API is only enabled for num_labels == 1."
with pytest.raises(ValueError, match=err_msg):
llm.score("ping", "pong", use_tqdm=False)


@pytest.mark.parametrize("task", ["classify", "embed", "token_embed"])
def test_unsupported_tasks(llm: LLM, task: PoolingTask, caplog_vllm):
if task == "classify":
with caplog_vllm.at_level(level=logging.WARNING, logger="vllm"):
llm.encode(prompt, pooling_task=task, use_tqdm=False)
assert "deprecated" in caplog_vllm.text
else:
err_msg = "Embedding API is not supported by this model.+"

with pytest.raises(ValueError, match=err_msg):
llm.encode(prompt, pooling_task=task, use_tqdm=False)