63 changes: 28 additions & 35 deletions docs/models/pooling_models/README.md
Original file line number Diff line number Diff line change
@@ -31,29 +31,28 @@ Of course, we also have "plugin" tasks that allow users to customize input and o

### Pooling Tasks

| Pooling Tasks | Granularity | Outputs |
|-----------------------|---------------|-------------------------------------------------|
| `classify` (see note) | Sequence-wise | probability vector of classes for each sequence |
| `embed` | Sequence-wise | vector representations for each sequence |
| `token_classify` | Token-wise | probability vector of classes for each token |
| `token_embed` | Token-wise | vector representations for each token |
| Pooling Tasks | Granularity | Outputs |
|--------------------|---------------|-------------------------------------------------|
| `classify` | Sequence-wise | probability vector of classes for each sequence |
| `score` (see note) | Sequence-wise | reranker score for each sequence |
| `embed` | Sequence-wise | vector representations for each sequence |
| `token_classify` | Token-wise | probability vector of classes for each token |
| `token_embed` | Token-wise | vector representations for each token |

!!! note
Within classification tasks, there is a specialized subcategory: cross-encoder (aka reranker) models. These are classification models that accept two prompts as input and have `num_labels` equal to 1.

### Score Types

The scoring model is designed to compute similarity scores between two input prompts. It supports three model types (aka `score_type`): `cross-encoder`, `late-interaction`, and `bi-encoder`.

| Pooling Tasks      | Granularity   | Outputs                                         | Score Types        | Scoring Function          |
|--------------------|---------------|-------------------------------------------------|--------------------|---------------------------|
| `classify`         | Sequence-wise | probability vector of classes for each sequence | N/A                | N/A                       |
| `score` (see note) | Sequence-wise | reranker score for each sequence                | `cross-encoder`    | linear classifier         |
| `embed`            | Sequence-wise | vector representations for each sequence        | `bi-encoder`       | cosine similarity         |
| `token_classify`   | Token-wise    | probability vector of classes for each token    | N/A                | N/A                       |
| `token_embed`      | Token-wise    | vector representations for each token           | `late-interaction` | late interaction (MaxSim) |

| Pooling Tasks         | Granularity   | Outputs                                      | Score Types        | Scoring Function          |
|-----------------------|---------------|----------------------------------------------|--------------------|---------------------------|
| `classify` (see note) | Sequence-wise | reranker score for each sequence             | `cross-encoder`    | linear classifier         |
| `embed`               | Sequence-wise | vector representations for each sequence     | `bi-encoder`       | cosine similarity         |
| `token_classify`      | Token-wise    | probability vector of classes for each token | N/A                | N/A                       |
| `token_embed`         | Token-wise    | vector representations for each token        | `late-interaction` | late interaction (MaxSim) |

!!! note
Only when a classification model outputs num_labels equal to 1 can it be used as a scoring model and have its scoring API enabled.
The score model is designed to compute similarity scores between two input prompts. It supports three model types (aka `score_type`): `cross-encoder`, `late-interaction`, and `bi-encoder`.
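The `bi-encoder` and `late-interaction` scoring functions named in the table above can be sketched in plain Python. This is an illustration of the math only, not vLLM's vectorized implementation; the `cross-encoder` case instead feeds both prompts through the model jointly and reads a linear classifier head, so it has no standalone formula to show here.

```python
import math


def cosine_similarity(q: list[float], d: list[float]) -> float:
    """Bi-encoder scoring: cosine similarity between two sequence embeddings."""
    dot = sum(a * b for a, b in zip(q, d))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_d = math.sqrt(sum(a * a for a in d))
    return dot / (norm_q * norm_d)


def late_interaction_maxsim(
    q_tokens: list[list[float]], d_tokens: list[list[float]]
) -> float:
    """Late-interaction scoring (MaxSim): each query token embedding is matched
    against its most similar document token embedding, and the maxima are summed."""
    return sum(
        max(sum(a * b for a, b in zip(qt, dt)) for dt in d_tokens)
        for qt in q_tokens
    )
```
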

### Pooling Usages

@@ -86,16 +85,14 @@ enabling the corresponding APIs.

### Offline APIs corresponding to pooling tasks

| Task | APIs |
|------------------|---------------------------------------------------------------------------------------|
| `embed` | `LLM.embed(...)`, `LLM.encode(..., pooling_task="embed")`, `LLM.score(...)`(see note) |
| `classify` | `LLM.classify(...)`, `LLM.encode(..., pooling_task="classify")`, `LLM.score(...)` |
| `token_classify` | `LLM.reward(...)`, `LLM.encode(..., pooling_task="token_classify")` |
| `token_embed` | `LLM.encode(..., pooling_task="token_embed")`, `LLM.score(...)` |
| `plugin` | `LLM.encode(..., pooling_task="plugin")` |

!!! note
Only when a classification model outputs num_labels equal to 1 can it be used as a scoring model and have its scoring API enabled.
| Task             | APIs                                                                        |
|------------------|-----------------------------------------------------------------------------|
| `embed`          | `LLM.embed(...)`, `LLM.encode(..., pooling_task="embed")`, `LLM.score(...)` |
| `classify`       | `LLM.classify(...)`, `LLM.encode(..., pooling_task="classify")`             |
| `score`          | `LLM.score(...)`                                                            |
| `token_classify` | `LLM.reward(...)`, `LLM.encode(..., pooling_task="token_classify")`         |
| `token_embed`    | `LLM.encode(..., pooling_task="token_embed")`, `LLM.score(...)`             |
| `plugin`         | `LLM.encode(..., pooling_task="plugin")`                                    |

### `LLM.classify`

@@ -209,11 +206,11 @@ If `--runner pooling` has been set (manually or automatically) but the model doe
vLLM will attempt to automatically convert the model according to the architecture names
shown in the table below.

| Architecture | `--convert` | Supported pooling tasks |
|-------------------------------------------------|-------------|------------------------------|
| `*ForTextEncoding`, `*EmbeddingModel`, `*Model` | `embed` | `token_embed`, `embed` |
| `*ForRewardModeling`, `*RewardModel` | `embed` | `token_embed`, `embed` |
| `*For*Classification`, `*ClassificationModel` | `classify` | `token_classify`, `classify` |
| Architecture | `--convert` | Supported pooling tasks |
| ----------------------------------------------- | ----------- | ------------------------------------- |
| `*ForTextEncoding`, `*EmbeddingModel`, `*Model` | `embed` | `token_embed`, `embed` |
| `*ForRewardModeling`, `*RewardModel` | `embed` | `token_embed`, `embed` |
| `*For*Classification`, `*ClassificationModel` | `classify` | `token_classify`, `classify`, `score` |

!!! tip
You can explicitly set `--convert <type>` to specify how to convert the model.
@@ -254,7 +251,3 @@ Pooling models now default support all pooling, you can use it without any setti

- To extract hidden states, prefer the `token_embed` task.
- For Named Entity Recognition (NER) and reward models, prefer the `token_classify` task.
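For instance, the per-token probability vectors returned by the `token_classify` task can be decoded into NER tags with a simple per-token argmax. This is a sketch: the label names in the test are hypothetical, and real NER pipelines typically add BIO-span merging on top.

```python
def decode_token_labels(
    token_probs: list[list[float]], id2label: dict[int, str]
) -> list[str]:
    """Pick the most likely label for each token's class-probability vector."""
    return [
        id2label[max(range(len(probs)), key=probs.__getitem__)]
        for probs in token_probs
    ]
```
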

### Score task

`score` task is deprecated and will be removed in v0.20. Please use `classify` instead. Only when a classification model outputs num_labels equal to 1 can it be used as a scoring model and have its scoring API enabled.
4 changes: 1 addition & 3 deletions docs/models/pooling_models/classify.md
@@ -17,8 +17,6 @@ The key distinction between (sequence) classification and token classification l

Many classification models support both (sequence) classification and token classification. For further details on token classification, please refer to [this page](token_classify.md).

Only when a classification model outputs num_labels equal to 1 can it be used as a scoring model and have its scoring API enabled, please refer to [this page](scoring.md).

## Typical Use Cases

### Classification
@@ -56,7 +54,7 @@ If your model is not in the above list, we will try to automatically convert the

Cross-encoder (aka reranker) models are a subset of classification models that accept two prompts as input and have `num_labels` equal to 1. Most classification models can also be used as [cross-encoder models](scoring.md#cross-encoder-models). For more information on cross-encoder models, please refer to [this page](scoring.md).

--8<-- "docs/models/pooling_models/scoring.md:supported-cross-encoder-models"
--8<-- "docs/models/pooling_models/scoring.md:supported-score-models"

### Reward Models

17 changes: 7 additions & 10 deletions docs/models/pooling_models/scoring.md
@@ -10,28 +10,25 @@ The score model is designed to compute similarity scores between two input prom
- Model Usage: Scoring
- Pooling Task:

| Score Types        | Pooling Tasks         | Scoring Function          |
|--------------------|-----------------------|---------------------------|
| `cross-encoder`    | `classify` (see note) | linear classifier         |
| `late-interaction` | `token_embed`         | late interaction (MaxSim) |
| `bi-encoder`       | `embed`               | cosine similarity         |
| Score Types        | Pooling Tasks | Scoring Function          |
|--------------------|---------------|---------------------------|
| `cross-encoder`    | `score`       | linear classifier         |
| `late-interaction` | `token_embed` | late interaction (MaxSim) |
| `bi-encoder`       | `embed`       | cosine similarity         |

- Offline APIs:
- `LLM.score`
- Online APIs:
- [Score API](scoring.md#score-api) (`/score`)
- [Rerank API](scoring.md#rerank-api) (`/rerank`, `/v1/rerank`, `/v2/rerank`)
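Conceptually, these APIs dispatch from `score_type` to a pooling task exactly as the table above lays out. A minimal sketch of that mapping (an illustration, not vLLM's actual routing code):

```python
# Assumed mapping, taken from the Score Types table above.
SCORE_TYPE_TO_POOLING_TASK = {
    "cross-encoder": "score",
    "bi-encoder": "embed",
    "late-interaction": "token_embed",
}


def resolve_pooling_task(score_type: str) -> str:
    """Return the pooling task a score/rerank request would run under."""
    try:
        return SCORE_TYPE_TO_POOLING_TASK[score_type]
    except KeyError:
        raise ValueError(f"unsupported score_type: {score_type!r}")
```
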

!!! note
Only when a classification model outputs num_labels equal to 1 can it be used as a scoring model and have its scoring API enabled.

## Supported Models

### Cross-encoder models

[Cross-encoder](https://www.sbert.net/examples/applications/cross-encoder/README.html) (aka reranker) models are a subset of classification models that accept two prompts as input and have `num_labels` equal to 1.

--8<-- [start:supported-cross-encoder-models]
--8<-- [start:supported-score-models]

#### Text-only Models

@@ -102,7 +99,7 @@ The score model is designed to compute similarity scores between two input prom
vllm serve Qwen/Qwen3-VL-Reranker-2B --hf_overrides '{"architectures": ["Qwen3VLForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}'
```

--8<-- [end:supported-cross-encoder-models]
--8<-- [end:supported-score-models]

### Late-interaction models

2 changes: 1 addition & 1 deletion tests/test_pooling_params.py
@@ -74,7 +74,7 @@ def test_embed_dimensions(model_info: EmbedModelInfo):
pooling_params.verify(model_config)


@pytest.mark.parametrize("task", ["classify"])
@pytest.mark.parametrize("task", ["score", "classify"])
def test_classify(task):
model_config = MockModelConfig(pooler_config=PoolerConfig(seq_pooling_type="CLS"))

8 changes: 4 additions & 4 deletions vllm/config/model.py
@@ -1435,10 +1435,10 @@ def requires_raw_input_tokens(self) -> bool:
@property
def score_type(self) -> ScoreType:
"""
Scoring API handles score/rerank for:\n
- "classify" task (score_type: cross-encoder models)\n
- "embed" task (score_type: bi-encoder models)\n
- "token_embed" task (score_type: late interaction models)\n
Score API handles score/rerank for:
- "score" task (score_type: cross-encoder models)
- "embed" task (score_type: bi-encoder models)
- "token_embed" task (score_type: late interaction models)
"""
# fixme: self._model_info.score_type is the score type before
# as_seq_cls_model, which is "bi-encoder", rather than the
4 changes: 2 additions & 2 deletions vllm/entrypoints/llm.py
@@ -1477,9 +1477,9 @@ def _cross_encoding_score(
data_1 = data_1 * len(data_2)

if pooling_params is None:
pooling_params = PoolingParams(task="classify")
pooling_params = PoolingParams(task="score")
elif pooling_params.task is None:
pooling_params.task = "classify"
pooling_params.task = "score"

pooling_params_list = list[PoolingParams]()

14 changes: 5 additions & 9 deletions vllm/entrypoints/openai/api_server.py
Expand Up @@ -22,7 +22,7 @@
from starlette.datastructures import State

import vllm.envs as envs
from vllm.config import ModelConfig, VllmConfig
from vllm.config import VllmConfig
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.protocol import EngineClient
from vllm.entrypoints.chat_utils import load_chat_template
@@ -155,9 +155,7 @@


def build_app(
args: Namespace,
supported_tasks: tuple["SupportedTask", ...] | None = None,
model_config: ModelConfig | None = None,
args: Namespace, supported_tasks: tuple["SupportedTask", ...] | None = None
) -> FastAPI:
if supported_tasks is None:
warnings.warn(
@@ -193,7 +191,7 @@ def build_app(
attach_router as register_sagemaker_api_router,
)

register_sagemaker_api_router(app, supported_tasks, model_config)
register_sagemaker_api_router(app, supported_tasks)

if "generate" in supported_tasks:
from vllm.entrypoints.openai.generate.api_router import (
@@ -244,7 +242,7 @@
if any(task in POOLING_TASKS for task in supported_tasks):
from vllm.entrypoints.pooling import register_pooling_api_routers

register_pooling_api_routers(app, supported_tasks, model_config)
register_pooling_api_routers(app, supported_tasks)

app.root_path = args.root_path
app.add_middleware(
@@ -585,10 +583,8 @@ async def build_and_serve(
uvicorn_kwargs["log_config"] = log_config

supported_tasks = await engine_client.get_supported_tasks()
model_config = engine_client.model_config

logger.info("Supported tasks: %s", supported_tasks)
app = build_app(args, supported_tasks, model_config)
app = build_app(args, supported_tasks)
await init_app_state(engine_client, app.state, args, supported_tasks)

logger.info("Starting vLLM server on %s", listen_address)
40 changes: 11 additions & 29 deletions vllm/entrypoints/pooling/__init__.py
@@ -5,9 +5,6 @@

from fastapi import FastAPI

from vllm.config import ModelConfig
from vllm.logger import init_logger

if TYPE_CHECKING:
from argparse import Namespace

@@ -20,30 +17,9 @@
RequestLogger = object
SupportedTask = object

logger = init_logger(__name__)


def enable_scoring_api(
supported_tasks: tuple["SupportedTask", ...],
model_config: ModelConfig | None = None,
) -> bool:
if any(t in supported_tasks for t in ("embed", "token_embed")):
return True

if model_config is not None and "classify" in supported_tasks:
num_labels = getattr(model_config.hf_config, "num_labels", 0)
if num_labels != 1:
logger.debug_once("Score API is only enabled for num_labels == 1.")
return False
return True

return False


def register_pooling_api_routers(
app: FastAPI,
supported_tasks: tuple["SupportedTask", ...],
model_config: ModelConfig | None = None,
app: FastAPI, supported_tasks: tuple["SupportedTask", ...]
):
from vllm.entrypoints.pooling.pooling.api_router import router as pooling_router

@@ -61,7 +37,11 @@ def register_pooling_api_routers(

app.include_router(embed_router)

if enable_scoring_api(supported_tasks, model_config):
# Score API handles score/rerank for:
# - "score" task (score_type: cross-encoder models)
# - "embed" task (score_type: bi-encoder models)
# - "token_embed" task (score_type: late interaction models)
if any(t in supported_tasks for t in ("score", "embed", "token_embed")):
from vllm.entrypoints.pooling.score.api_router import router as score_router

app.include_router(score_router)
@@ -81,8 +61,6 @@ def init_pooling_state(
from vllm.entrypoints.pooling.score.serving import ServingScores
from vllm.tasks import POOLING_TASKS

model_config = engine_client.model_config

resolved_chat_template = load_chat_template(args.chat_template)

state.serving_pooling = (
@@ -124,6 +102,10 @@
if "classify" in supported_tasks
else None
)
# Score API handles score/rerank for:
# - "score" task (score_type: cross-encoder models)
# - "embed" task (score_type: bi-encoder models)
# - "token_embed" task (score_type: late interaction models)
state.serving_scores = (
ServingScores(
engine_client,
@@ -132,6 +114,6 @@
score_template=resolved_chat_template,
log_error_stack=args.log_error_stack,
)
if enable_scoring_api(supported_tasks, model_config)
if any(t in supported_tasks for t in ("embed", "score", "token_embed"))
else None
)
4 changes: 2 additions & 2 deletions vllm/entrypoints/pooling/score/protocol.py
@@ -35,7 +35,7 @@ def build_tok_params(self, model_config: ModelConfig) -> TokenizeParams:
max_total_tokens_param="max_model_len",
)

def to_pooling_params(self, task: PoolingTask = "classify"):
def to_pooling_params(self, task: PoolingTask = "score"):
return PoolingParams(
task=task,
use_activation=self.use_activation,
Expand Down Expand Up @@ -111,7 +111,7 @@ def build_tok_params(self, model_config: ModelConfig) -> TokenizeParams:
max_total_tokens_param="max_model_len",
)

def to_pooling_params(self, task: PoolingTask = "classify"):
def to_pooling_params(self, task: PoolingTask = "score"):
return PoolingParams(
task=task,
use_activation=self.use_activation,
2 changes: 1 addition & 1 deletion vllm/entrypoints/pooling/score/serving.py
@@ -413,7 +413,7 @@ async def _cross_encoding_score(
# Schedule the request and get the result generator.
generators: list[AsyncGenerator[PoolingRequestOutput, None]] = []

default_pooling_params = request.to_pooling_params("classify")
default_pooling_params = request.to_pooling_params("score")

for i, engine_prompt in enumerate(engine_prompts):
request_id_item = f"{request_id}-{i}"