 # Pooling Models

-vLLM also supports pooling models, including embedding, reranking and reward models.
+vLLM also supports pooling models, such as embedding, classification and reward models.

 In vLLM, pooling models implement the [VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface.
-These models use a [Pooler][vllm.model_executor.layers.Pooler] to extract the final hidden states of the input
+These models use a [Pooler][vllm.model_executor.layers.pooler.Pooler] to extract the final hidden states of the input
 before returning them.

 !!! note
     We currently support pooling models primarily as a matter of convenience.
     As shown in the [Compatibility Matrix](../features/compatibility_matrix.md), most vLLM features are not applicable to
     pooling models as they only work on the generation or decode stage, so performance may not improve as much.

-If the model doesn't implement this interface, you can set `--task` which tells vLLM
-to convert the model into a pooling model.
+## Configuration

-| `--task`   | Model type           | Supported pooling tasks       |
-|------------|----------------------|-------------------------------|
-| `embed`    | Embedding model      | `encode`, `embed`             |
-| `classify` | Classification model | `encode`, `classify`, `score` |
-| `reward`   | Reward model         | `encode`                      |
+### Model Runner

-## Pooling Tasks
+Run a model in pooling mode via the option `--runner pooling`.

-In vLLM, we define the following pooling tasks and corresponding APIs:
+!!! tip
+    There is no need to set this option in the vast majority of cases as vLLM can automatically
+    detect the model runner to use via `--runner auto`.
+
+### Model Conversion
+
+vLLM can adapt models for various pooling tasks via the option `--convert <type>`.
+
+If `--runner pooling` has been set (manually or automatically) but the model does not implement the
+[VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface,
+vLLM will attempt to automatically convert the model according to the architecture names
+shown in the table below.
+
+| Architecture                                    | `--convert` | Supported pooling tasks       |
+|-------------------------------------------------|-------------|-------------------------------|
+| `*ForTextEncoding`, `*EmbeddingModel`, `*Model` | `embed`     | `encode`, `embed`             |
+| `*For*Classification`, `*ClassificationModel`   | `classify`  | `encode`, `classify`, `score` |
+| `*ForRewardModeling`, `*RewardModel`            | `reward`    | `encode`                      |
+
+!!! tip
+    You can explicitly set `--convert <type>` to specify how to convert the model.
+
+### Pooling Tasks
+
+Each pooling model in vLLM supports one or more of these tasks according to
+[Pooler.get_supported_tasks][vllm.model_executor.layers.pooler.Pooler.get_supported_tasks],
+enabling the corresponding APIs:

 | Task       | APIs               |
 |------------|--------------------|
@@ -31,32 +52,32 @@ In vLLM, we define the following pooling tasks and corresponding APIs:
 | `classify` | `classify`         |
 | `score`    | `score`            |

-\* The `score` API falls back to `embed` task if the model does not support `score` task.
+\* The `score` API falls back to `embed` task if the model does not support `score` task.

-Each pooling model in vLLM supports one or more of these tasks according to [Pooler.get_supported_tasks][vllm.model_executor.layers.Pooler.get_supported_tasks].
+### Pooler Configuration

-By default, the pooler assigned to each task has the following attributes:
+#### Predefined models
+
+If the [Pooler][vllm.model_executor.layers.pooler.Pooler] defined by the model accepts `pooler_config`,
+you can override some of its attributes via the `--override-pooler-config` option.
+
+#### Converted models
+
+If the model has been converted via `--convert` (see above),
+the pooler assigned to each task has the following attributes by default:

 | Task       | Pooling Type   | Normalization | Softmax |
 |------------|----------------|---------------|---------|
 | `encode`   | `ALL`          | ❌            | ❌      |
 | `embed`    | `LAST`         | ✅︎            | ❌      |
 | `classify` | `LAST`         | ❌            | ✅︎      |
-These defaults may be overridden by the model's implementation in vLLM.
-
 When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
-we attempt to override the defaults based on its Sentence Transformers configuration file (`modules.json`),
-which takes priority over the model's defaults.
+its Sentence Transformers configuration file (`modules.json`) takes priority over the model's defaults.

 You can further customize this via the `--override-pooler-config` option,
 which takes priority over both the model's and Sentence Transformers's defaults.

-!!! note
-
-    The above configuration may be disregarded if the model's implementation in vLLM defines its own pooler
-    that is not based on [PoolerConfig][vllm.config.PoolerConfig].
-
 ## Offline Inference

 The [LLM][vllm.LLM] class provides various methods for offline inference.
@@ -70,7 +91,7 @@ It returns the extracted hidden states directly, which is useful for reward mode
 ```python
 from vllm import LLM

-llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", task="reward")
+llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", runner="pooling")
 (output,) = llm.encode("Hello, my name is")

 data = output.outputs.data
@@ -85,7 +106,7 @@ It is primarily designed for embedding models.
 ```python
 from vllm import LLM

-llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")
+llm = LLM(model="intfloat/e5-mistral-7b-instruct", runner="pooling")
 (output,) = llm.embed("Hello, my name is")

 embeds = output.outputs.embedding
@@ -102,7 +123,7 @@ It is primarily designed for classification models.
 ```python
 from vllm import LLM

-llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", task="classify")
+llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", runner="pooling")
 (output,) = llm.classify("Hello, my name is")

 probs = output.outputs.probs
@@ -123,7 +144,7 @@ It is designed for embedding models and cross encoder models. Embedding models u
 ```python
 from vllm import LLM

-llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")
+llm = LLM(model="BAAI/bge-reranker-v2-m3", runner="pooling")
 (output,) = llm.score("What is the capital of France?",
                       "The capital of Brazil is Brasilia.")

@@ -175,7 +196,7 @@ You can change the output dimensions of embedding models that support Matryoshka
 from vllm import LLM, PoolingParams

 llm = LLM(model="jinaai/jina-embeddings-v3",
-          task="embed",
+          runner="pooling",
           trust_remote_code=True)
 outputs = llm.embed(["Follow the white rabbit."],
                     pooling_params=PoolingParams(dimensions=32))
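
As a reading aid for the architecture table in the "Model Conversion" hunk above, here is a minimal sketch of how architecture-name patterns could map to a `--convert` type. It is not vLLM's implementation: the helper `guess_convert_type`, the `CONVERT_PATTERNS` list, and the choice to test the specific patterns before the catch-all `*Model` are all assumptions made for illustration. Only the patterns and convert types themselves come from the table.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Patterns copied from the "Model Conversion" table in the diff.
# Assumption of this sketch: the specific classification and reward
# patterns are tried before `*Model`, which would otherwise also match
# names such as `*RewardModel` or `*ClassificationModel`.
CONVERT_PATTERNS = [
    ("classify", ["*For*Classification", "*ClassificationModel"]),
    ("reward", ["*ForRewardModeling", "*RewardModel"]),
    ("embed", ["*ForTextEncoding", "*EmbeddingModel", "*Model"]),
]


def guess_convert_type(architecture: str) -> Optional[str]:
    """Return the convert type whose pattern matches, or None (hypothetical helper)."""
    for convert_type, patterns in CONVERT_PATTERNS:
        # fnmatchcase performs case-sensitive shell-style wildcard matching.
        if any(fnmatchcase(architecture, pattern) for pattern in patterns):
            return convert_type
    return None


if __name__ == "__main__":
    for name in ["Qwen2ForSequenceClassification", "Qwen2ForRewardModel",
                 "MistralModel", "LlamaForCausalLM"]:
        print(name, "->", guess_convert_type(name))
```

A name that matches none of the patterns returns `None` in this sketch; what vLLM actually does in that case is not shown in the diff.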