Merged
45 changes: 35 additions & 10 deletions docs/models/pooling_models/README.md
@@ -1,7 +1,8 @@
# Pooling Models

!!! note
We currently support pooling models primarily for convenience. This is not guaranteed to provide any performance improvements over using Hugging Face Transformers or Sentence Transformers directly.
We currently support pooling models primarily for convenience. This is not guaranteed to provide any performance
improvements over using Hugging Face Transformers or Sentence Transformers directly.

We plan to optimize pooling models in vLLM. Please comment on <https://github.com/vllm-project/vllm/issues/21796> if you have any suggestions!

@@ -12,22 +13,38 @@ Natural Language Processing (NLP) can be primarily divided into the following tw
- Natural Language Understanding (NLU)
- Natural Language Generation (NLG)

The generative models supported by vLLM cover a variety of task types, such as the large language models (LLMs) we are familiar with, multimodal models (VLM) that handle multimodal inputs like images, videos, and audio, speech-to-text transcription models, and real-time models that support streaming input. Their common feature is the ability to generate text. Taking it a step further, vLLM-Omni supports the generation of multimodal content, including images, videos, and audio.
The generative models supported by vLLM cover a variety of task types: the large language models (LLMs) we are
familiar with, multimodal models (VLMs) that handle inputs such as images, videos, and audio, speech-to-text
transcription models, and real-time models that support streaming input. Their common feature is the ability to generate
text. Going a step further, vLLM-Omni supports generating multimodal content, including images, videos, and audio.

As the capabilities of generative models continue to improve, the boundaries of these models are also constantly expanding. However, certain application scenarios still require specialized small language models to efficiently complete specific tasks. These models typically have the following characteristics:
As the capabilities of generative models continue to improve, the boundaries of these models are also constantly expanding.
However, certain application scenarios still require specialized small language models to efficiently complete specific tasks.
These models typically have the following characteristics:

- They do not require content generation.
- They only need to perform very limited functions, without requiring strong generalization, creativity, or high intelligence.
- They demand extremely low latency and may operate on cost-constrained hardware.
- Text-only models typically have fewer than 1 billion parameters, while multimodal models generally have fewer than 10 billion parameters.

Although these models are relatively small in scale, they are still based on the Transformer architecture, similar or even identical to the most advanced large language models today. Many recently released pooling models are also fine-tuned from large language models, allowing them to benefit from the continuous improvements in large models. This architecture similarity enables them to reuse much of vLLM’s infrastructure. If compatible, we would be happy to help them leverage the latest features of vLLM as well.
Although these models are relatively small in scale, they are still based on the Transformer architecture, similar or
even identical to the most advanced large language models today. Many recently released pooling models are also fine-tuned
from large language models, allowing them to benefit from the continuous improvements in large models. This architectural
similarity enables them to reuse much of vLLM’s infrastructure. If compatible, we would be happy to help them leverage
the latest features of vLLM as well.

### Sequence-wise Task and Token-wise Task

The key distinction between sequence-wise task and token-wise task lies in their output granularity: sequence-wise task produces a single result for an entire input sequence, whereas token-wise task yields a result for each individual token within the sequence.
The key distinction between sequence-wise and token-wise tasks lies in their output granularity: a sequence-wise task
produces a single result for an entire input sequence, whereas a token-wise task yields a result for each individual token
within the sequence.

Of course, we also have "plugin" tasks that allow users to customize input and output processors. For more information, please refer to [IO Processor Plugins](../../design/io_processor_plugins.md).
Many pooling models support both sequence-wise and token-wise tasks. When the default pooling task (e.g. a sequence-wise
task) is not what you want, you need to specify the desired task (e.g. a token-wise task) manually via
`PoolerConfig(task=<task>)` offline or `--pooler-config.task <task>` online.
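
A minimal offline sketch of this override, assuming a working vLLM install with a supported accelerator (the checkpoint name is illustrative, taken from this PR's tests):

```python
# Sketch: pin a token-wise pooling task instead of the model's default.
# Assumes vLLM is installed; "jason9693/Qwen2.5-1.5B-apeach" is a
# classification checkpoint whose default task would be "classify".
from vllm import LLM
from vllm.config import PoolerConfig

llm = LLM(
    model="jason9693/Qwen2.5-1.5B-apeach",
    pooler_config=PoolerConfig(task="token_classify"),  # override the default task
)
outputs = llm.encode(
    "The chef prepared a delicious meal.", pooling_task="token_classify"
)
```

The online equivalent is the CLI flag, e.g. `vllm serve jason9693/Qwen2.5-1.5B-apeach --pooler-config.task token_classify`.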

Of course, we also have "plugin" tasks that allow users to customize input and output processors. For more information,
please refer to [IO Processor Plugins](../../design/io_processor_plugins.md).

### Pooling Tasks

@@ -39,11 +56,13 @@ Of course, we also have "plugin" tasks that allow users to customize input and o
| `token_embed` | Token-wise | vector representations for each token |

!!! note
Within classification tasks, there is a specialized subcategory: Cross-encoder (aka reranker) models. These models are a subset of classification models that accept two prompts as input and output num_labels equal to 1.
Within classification tasks, there is a specialized subcategory: cross-encoder (aka reranker) models. These models
are a subset of classification models that accept two prompts as input and have `num_labels` equal to 1.

### Score Types

The scoring models is designed to compute similarity scores between two input prompts. It supports three model types (aka `score_type`): `cross-encoder`, `late-interaction`, and `bi-encoder`.
Scoring models are designed to compute similarity scores between two input prompts. They support three model types
(aka `score_type`): `cross-encoder`, `late-interaction`, and `bi-encoder`.
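
For intuition, the `bi-encoder` and `late-interaction` scoring functions can be sketched in NumPy. This is an illustrative sketch with random arrays standing in for real model outputs; a `cross-encoder` produces its score directly from the joint input, so it needs no separate post-hoc scoring function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins for model outputs:
# bi-encoder yields one embedding per prompt; late-interaction yields
# one embedding per token.
q_vec, d_vec = rng.standard_normal(8), rng.standard_normal(8)
q_tokens = rng.standard_normal((3, 8))  # 3 query tokens, hidden size 8
d_tokens = rng.standard_normal((5, 8))  # 5 document tokens


def normalize(x: np.ndarray) -> np.ndarray:
    """L2-normalize along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def bi_encoder_score(q: np.ndarray, d: np.ndarray) -> float:
    """Cosine similarity between the two sequence embeddings."""
    return float(normalize(q) @ normalize(d))


def late_interaction_score(q: np.ndarray, d: np.ndarray) -> float:
    """ColBERT-style MaxSim: each query token matches its most similar
    document token; the per-token maxima are summed."""
    sim = normalize(q) @ normalize(d).T  # shape (num_q_tokens, num_d_tokens)
    return float(sim.max(axis=1).sum())
```

Both functions return larger values for more similar inputs; only their granularity (sequence vs. token) differs.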

| Pooling Tasks | Granularity | Outputs | Score Types | Scoring Function |
|-----------------------|---------------|----------------------------------------------|--------------------|--------------------------|
@@ -250,11 +269,17 @@ We have split the `encode` task into two more specific token-wise tasks: `token_
- `token_embed` is the same as `embed`, but uses normalization as the activation.
- `token_classify` is the same as `classify`, but uses softmax as the activation by default.
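
The two activations can be sketched in plain NumPy (illustrative random arrays, not real model outputs):

```python
import numpy as np

# Illustrative per-token model outputs:
hidden = np.random.default_rng(0).standard_normal((4, 6))  # 4 tokens, hidden size 6
logits = np.random.default_rng(1).standard_normal((4, 3))  # 4 tokens, 3 labels


def token_embed_activation(states: np.ndarray) -> np.ndarray:
    """L2-normalize each token's vector (the `token_embed` activation)."""
    return states / np.linalg.norm(states, axis=-1, keepdims=True)


def token_classify_activation(scores: np.ndarray) -> np.ndarray:
    """Numerically stable row-wise softmax over label logits
    (the default `token_classify` activation)."""
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

After the activation, every token embedding has unit norm, and every token's label scores form a probability distribution.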

Pooling models now default support all pooling, you can use it without any settings.
Pooling models now support token-wise tasks.

- To extract hidden states, prefer the `token_embed` task.
- For Named Entity Recognition (NER) and reward models, prefer the `token_classify` task.

### Score task

`score` task is deprecated and will be removed in v0.20. Please use `classify` instead. Only when a classification model outputs num_labels equal to 1 can it be used as a scoring model and have its scoring API enabled.
The `score` task is deprecated and will be removed in v0.20. Please use `classify` instead. Only when a
classification model outputs `num_labels` equal to 1 can it be used as a scoring model and have its scoring API enabled.

### Pooling multitask support

Pooling multitask support is deprecated and will be removed in v0.20. When the default pooling task is not what you want,
you need to manually specify it via `PoolerConfig(task=<task>)` offline or `--pooler-config.task <task>` online.
6 changes: 6 additions & 0 deletions docs/models/pooling_models/token_classify.md
@@ -13,6 +13,12 @@ The key distinction between (sequence) classification and token classification l

Many classification models support both (sequence) classification and token classification. For further details on (sequence) classification, please refer to [this page](classify.md).

!!! note

Pooling multitask support is deprecated and will be removed in v0.20. When the default pooling task (classify) is not
what you want, you need to manually specify it via `PoolerConfig(task="token_classify")` offline or
`--pooler-config.task token_classify` online.

## Typical Use Cases

### Named Entity Recognition (NER)
6 changes: 6 additions & 0 deletions docs/models/pooling_models/token_embed.md
@@ -13,6 +13,12 @@ The difference between the (sequence) embedding task and the token embedding tas

Many embedding models support both (sequence) embedding and token embedding. For further details on (sequence) embedding, please refer to [this page](embed.md).

!!! note

Pooling multitask support is deprecated and will be removed in v0.20. When the default pooling task (embed) is not
what you want, you need to manually specify it via `PoolerConfig(task="token_embed")` offline or
`--pooler-config.task token_embed` online.

## Typical Use Cases

### Multi-Vector Retrieval
13 changes: 8 additions & 5 deletions tests/entrypoints/pooling/classify/test_offline.py
@@ -1,6 +1,6 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

import logging
import weakref

import pytest
@@ -67,8 +67,11 @@ def test_list_prompts(llm: LLM):


@pytest.mark.skip_global_cleanup
def test_token_classify(llm: LLM):
outputs = llm.encode(prompt, pooling_task="token_classify", use_tqdm=False)
def test_token_classify(llm: LLM, caplog_vllm):
with caplog_vllm.at_level(level=logging.WARNING, logger="vllm"):
outputs = llm.encode(prompt, pooling_task="token_classify", use_tqdm=False)
assert "deprecated" in caplog_vllm.text

assert len(outputs) == 1
assert isinstance(outputs[0], PoolingRequestOutput)
assert outputs[0].prompt_token_ids == prompt_token_ids
@@ -107,8 +110,8 @@ def test_score_api(llm: LLM):
llm.score("ping", "pong", use_tqdm=False)


@pytest.mark.parametrize("task", ["embed", "token_embed", "plugin"])
@pytest.mark.parametrize("task", ["embed", "token_embed"])
def test_unsupported_tasks(llm: LLM, task: PoolingTask):
err_msg = f"Unsupported task: '{task}' Supported tasks.+"
err_msg = "Embedding API is not supported by this model.+"
with pytest.raises(ValueError, match=err_msg):
llm.encode(prompt, pooling_task=task, use_tqdm=False)
54 changes: 48 additions & 6 deletions tests/entrypoints/pooling/embed/test_offline.py
Original file line number Diff line number Diff line change
@@ -1,19 +1,22 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

import logging
import weakref

import pytest
import torch
import torch.nn.functional as F

from vllm import LLM, PoolingParams
from vllm import LLM, EmbeddingRequestOutput, PoolingParams
from vllm.distributed import cleanup_dist_env_and_memory
from vllm.platforms import current_platform
from vllm.tasks import PoolingTask

MODEL_NAME = "intfloat/multilingual-e5-small"

prompts = ["The chef prepared a delicious meal."]
prompt = "The chef prepared a delicious meal."
prompt_token_ids = [0, 581, 21861, 133888, 10, 8, 150, 60744, 109911, 5, 2]
embedding_size = 384


@pytest.fixture(scope="module")
@@ -44,16 +47,48 @@ def llm():


@pytest.mark.skip_global_cleanup
def test_token_embed(llm: LLM):
outputs = llm.encode(prompts, pooling_task="token_embed", use_tqdm=False)
def test_str_prompts(llm: LLM):
outputs = llm.embed(prompt, use_tqdm=False)
assert len(outputs) == 1
assert isinstance(outputs[0], EmbeddingRequestOutput)
assert outputs[0].prompt_token_ids == prompt_token_ids
assert len(outputs[0].outputs.embedding) == embedding_size


@pytest.mark.skip_global_cleanup
def test_token_ids_prompts(llm: LLM):
outputs = llm.embed([prompt_token_ids], use_tqdm=False)
assert len(outputs) == 1
assert isinstance(outputs[0], EmbeddingRequestOutput)
assert outputs[0].prompt_token_ids == prompt_token_ids
assert len(outputs[0].outputs.embedding) == embedding_size


@pytest.mark.skip_global_cleanup
def test_list_prompts(llm: LLM):
outputs = llm.embed([prompt, prompt_token_ids], use_tqdm=False)
assert len(outputs) == 2
for i in range(len(outputs)):
assert isinstance(outputs[i], EmbeddingRequestOutput)
assert outputs[i].prompt_token_ids == prompt_token_ids
assert len(outputs[i].outputs.embedding) == embedding_size


@pytest.mark.skip_global_cleanup
def test_token_embed(llm: LLM, caplog_vllm):
with caplog_vllm.at_level(level=logging.WARNING, logger="vllm"):
outputs = llm.encode(prompt, pooling_task="token_embed", use_tqdm=False)
assert "deprecated" in caplog_vllm.text

multi_vector = outputs[0].outputs.data
assert multi_vector.shape == (11, 384)


@pytest.mark.skip_global_cleanup
def test_pooling_params(llm: LLM):
def get_outputs(normalize):
outputs = llm.embed(
prompts,
[prompt],
pooling_params=PoolingParams(use_activation=normalize),
use_tqdm=False,
)
@@ -70,3 +105,10 @@ def get_outputs(normalize):
assert torch.allclose(w_normal, F.normalize(wo_normal, p=2, dim=-1), atol=1e-2), (
"w_normal should be close to normal(wo_normal)."
)


@pytest.mark.parametrize("task", ["token_classify", "classify"])
def test_unsupported_tasks(llm: LLM, task: PoolingTask):
err_msg = "Classification API is not supported by this model.+"
with pytest.raises(ValueError, match=err_msg):
llm.encode(prompt, pooling_task=task, use_tqdm=False)
7 changes: 6 additions & 1 deletion tests/entrypoints/pooling/score/test_online_rerank.py
@@ -206,7 +206,12 @@ async def test_pooling_classify(server: RemoteOpenAIServer, model_name: str):
async def test_pooling_token_classify(server: RemoteOpenAIServer, model_name: str):
response = requests.post(
server.url_for("pooling"),
json={"model": model_name, "input": input_text, "encoding_format": "float"},
json={
"model": model_name,
"task": "token_classify",
"input": input_text,
"encoding_format": "float",
},
)

poolings = PoolingResponse.model_validate(response.json())
Empty file.
78 changes: 78 additions & 0 deletions tests/entrypoints/pooling/token_classify/test_offline.py
@@ -0,0 +1,78 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import logging
import weakref

import pytest

from vllm import LLM, PoolingRequestOutput
from vllm.config import PoolerConfig
from vllm.distributed import cleanup_dist_env_and_memory
from vllm.tasks import PoolingTask

MODEL_NAME = "jason9693/Qwen2.5-1.5B-apeach"

prompt = "The chef prepared a delicious meal."
prompt_token_ids = [785, 29706, 10030, 264, 17923, 15145, 13]
num_labels = 2


@pytest.fixture(scope="module")
def llm():
# pytest caches the fixture so we use weakref.proxy to
# enable garbage collection
llm = LLM(
model=MODEL_NAME,
pooler_config=PoolerConfig(task="token_classify"),
max_num_batched_tokens=32768,
tensor_parallel_size=1,
gpu_memory_utilization=0.75,
enforce_eager=True,
seed=0,
)

yield weakref.proxy(llm)

del llm

cleanup_dist_env_and_memory()


@pytest.mark.skip_global_cleanup
def test_str_prompts(llm: LLM):
outputs = llm.encode(prompt, pooling_task="token_classify", use_tqdm=False)
assert len(outputs) == 1
assert isinstance(outputs[0], PoolingRequestOutput)
assert outputs[0].prompt_token_ids == prompt_token_ids
assert outputs[0].outputs.data.shape == (len(prompt_token_ids), num_labels)


@pytest.mark.skip_global_cleanup
def test_token_ids_prompts(llm: LLM):
outputs = llm.encode(
[prompt_token_ids], pooling_task="token_classify", use_tqdm=False
)
assert len(outputs) == 1
assert isinstance(outputs[0], PoolingRequestOutput)
assert outputs[0].prompt_token_ids == prompt_token_ids
assert outputs[0].outputs.data.shape == (len(prompt_token_ids), num_labels)


@pytest.mark.skip_global_cleanup
def test_score_api(llm: LLM):
err_msg = "Score API is only enabled for num_labels == 1."
with pytest.raises(ValueError, match=err_msg):
llm.score("ping", "pong", use_tqdm=False)


@pytest.mark.parametrize("task", ["classify", "embed", "token_embed"])
def test_unsupported_tasks(llm: LLM, task: PoolingTask, caplog_vllm):
if task == "classify":
with caplog_vllm.at_level(level=logging.WARNING, logger="vllm"):
llm.encode(prompt, pooling_task=task, use_tqdm=False)
assert "deprecated" in caplog_vllm.text
else:
err_msg = "Embedding API is not supported by this model.+"

with pytest.raises(ValueError, match=err_msg):
llm.encode(prompt, pooling_task=task, use_tqdm=False)