vllm-project · ywang96 · Mar 31, 2026 · Jan 30, 2026 · Jan 30, 2026 · Jan 31, 2026
diff --git a/docs/serving/openai_compatible_server.md b/docs/serving/openai_compatible_server.md
@@ -70,8 +70,9 @@ In addition, we have the following custom APIs:
     - Applicable to all [pooling models](../models/pooling_models.md).
 - [Classification API](#classification-api) (`/classify`)
     - Only applicable to [classification models](../models/pooling_models.md).
-- [Score API](#score-api) (`/score`)
-    - Applicable to [embedding models and cross-encoder models](../models/pooling_models.md).
+- [Score API](#score-api) (`/score`, `/v1/score`)
+    - Applicable to [embedding models and cross-encoder models](../models/pooling_models.md), and to [CausalLM models](../models/generative_models.md).
+    - For CausalLM models, scores each pair from the next-token probabilities of the two specified `label_token_ids`.
 - [Re-rank API](#re-rank-api) (`/rerank`, `/v1/rerank`, `/v2/rerank`)
     - Implements [Jina AI's v1 re-rank API](https://jina.ai/reranker/)
     - Also compatible with [Cohere's v1 & v2 re-rank APIs](https://docs.cohere.com/v2/reference/rerank)
@@ -826,8 +827,13 @@ these extra parameters are supported instead:
 
 ### Score API
 
-Our Score API can apply a cross-encoder model or an embedding model to predict scores for sentence or multimodal pairs. When using an embedding model the score corresponds to the cosine similarity between each embedding pair.
-Usually, the score for a sentence pair refers to the similarity between two sentences, on a scale of 0 to 1.
+Our Score API provides a unified interface for computing similarity or relevance scores:
+
+- **Embedding models**: Computes cosine similarity between embeddings.
+- **Cross-encoder models**: Predicts relevance scores for sentence pairs.
+- **CausalLM models**: Scores each pair from the next-token probabilities of two label tokens (requires the `label_token_ids` parameter).
+
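+For embedding and cross-encoder models, the score typically falls between 0 and 1, with higher values indicating greater similarity. Cosine similarity itself ranges from -1 to 1, so embedding scores can occasionally be negative.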
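
As an illustration of the embedding case, cosine similarity between two vectors can be sketched in plain Python. The helper name `cosine_similarity` and the toy vectors are assumptions made for this example and are not part of vLLM's API; real embeddings come from the served model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings.
parallel = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

Vectors pointing in the same direction score close to 1, orthogonal vectors score 0, and opposite vectors score -1.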
 
 You can find the documentation for cross encoder models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).
 
@@ -1056,6 +1062,66 @@ The following extra parameters are supported:
 --8<-- "vllm/entrypoints/pooling/score/protocol.py:score-extra-params"
 ```
 
+#### CausalLM Models (Generative Scoring)
+
+When using a CausalLM model (e.g., Llama, Qwen, Mistral) with the Score API, the endpoint computes the probability of specified token IDs appearing as the next token. This is useful for generative scoring tasks, sentiment analysis, or any scenario where you want to score the likelihood of specific tokens.
+
+**Requirements for CausalLM models:**
+
+- The `label_token_ids` parameter is **required** and must contain **exactly 2 token IDs**.
+- The score is the probability of the first label token, normalized over the two label tokens: `P(label_token_ids[0]) / (P(label_token_ids[0]) + P(label_token_ids[1]))`
+
+##### Example: Score with CausalLM
+
+```bash
+curl -X POST http://localhost:8000/v1/score \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen/Qwen3-0.6B",
+    "queries": "Is this city the capital of France?",
+    "documents": ["Paris", "London", "Berlin"],
+    "label_token_ids": [9454, 2753]
+  }'
+```
+
+??? console "Response"
+
+    ```json
+    {
+      "id": "score-abc123",
+      "object": "list",
+      "created": 1234567890,
+      "model": "Qwen/Qwen3-0.6B",
+      "data": [
+        {"index": 0, "object": "score", "score": 0.95},
+        {"index": 1, "object": "score", "score": 0.12},
+        {"index": 2, "object": "score", "score": 0.08}
+      ],
+      "usage": {"prompt_tokens": 45, "total_tokens": 48, "completion_tokens": 3}
+    }
+    ```
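
The request and response above can also be handled from Python. The sketch below only builds the request body and ranks documents from a response; it does not contact a server. The helper names `build_score_request` and `rank_documents` are hypothetical, and the code assumes that each `index` in `data` refers to the position of the document in the `documents` list.

```python
import json


def build_score_request(model, query, documents, label_token_ids):
    """Build the JSON body shown in the curl example (hypothetical helper)."""
    return {
        "model": model,
        "queries": query,
        "documents": list(documents),
        "label_token_ids": list(label_token_ids),
    }


def rank_documents(response, documents):
    """Pair each document with its score and sort from highest to lowest.

    Assumes `index` is the position of the document in `documents`.
    """
    scored = [(documents[item["index"]], item["score"]) for item in response["data"]]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


documents = ["Paris", "London", "Berlin"]
payload = build_score_request(
    "Qwen/Qwen3-0.6B",
    "Is this city the capital of France?",
    documents,
    [9454, 2753],
)
body = json.dumps(payload)  # string to send as the POST body

# Sample response copied from the example above; the scores are illustrative.
sample_response = {
    "data": [
        {"index": 0, "object": "score", "score": 0.95},
        {"index": 1, "object": "score", "score": 0.12},
        {"index": 2, "object": "score", "score": 0.08},
    ]
}
ranking = rank_documents(sample_response, documents)
```

To send the request, pass `body` to any HTTP client as a POST to `/v1/score` with the `Content-Type: application/json` header, as in the curl example.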
+
+##### How it works
+
+1. **Prompt Construction**: For each document, builds the prompt by concatenating the query and the document (`prompt = query + document`)
+2. **Forward Pass**: Runs the model to get next-token logits
+3. **Probability Extraction**: Extracts the logprobs of the 2 specified `label_token_ids`
+4. **Softmax Normalization**: Applies softmax over only the 2 label tokens, so their probabilities sum to 1
+5. **Score Computation**: Returns the normalized probability of the first label token, `P(token[0]) / (P(token[0]) + P(token[1]))`, as the score
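
Steps 3 to 5 above can be sketched in plain Python. This is a minimal illustration of the arithmetic, not vLLM's implementation; the function name `generative_score` and the example logprob values are assumptions.

```python
import math


def generative_score(logprob_first, logprob_second):
    """Softmax over two label-token logprobs; returns the share of the first.

    Hypothetical helper: subtracting the maximum before exponentiating keeps
    the computation numerically stable for very negative logprobs.
    """
    peak = max(logprob_first, logprob_second)
    p_first = math.exp(logprob_first - peak)
    p_second = math.exp(logprob_second - peak)
    return p_first / (p_first + p_second)


# Example: the model assigns P = 0.6 to the first label and P = 0.2 to the second.
score = generative_score(math.log(0.6), math.log(0.2))
tie = generative_score(-1.0, -1.0)
```

With probabilities 0.6 and 0.2, the score is 0.6 / (0.6 + 0.2) = 0.75, and equal logprobs give 0.5. The result depends only on the difference between the two logprobs: it equals the logistic sigmoid of `logprob_first - logprob_second`.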
+
+##### Finding Token IDs
+
+To find the token IDs for your labels, use the tokenizer:
+
+```python
+from transformers import AutoTokenizer
+
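+tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
+# Encode without special tokens so only the label's own token IDs are returned.
+yes_ids = tokenizer.encode("Yes", add_special_tokens=False)
+no_ids = tokenizer.encode("No", add_special_tokens=False)
+# Taking only the first ID of a multi-token label would score just its first token.
+assert len(yes_ids) == 1 and len(no_ids) == 1, "each label must be a single token"
+print(f"Yes: {yes_ids[0]}, No: {no_ids[0]}")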
+```
+
 ### Re-rank API
 
 Our Re-rank API can apply an embedding model or a cross-encoder model to predict relevant scores between a single query, and