Merged
33 commits
f6ef430
add scoring API
vedantjh2 Jan 30, 2026
c4d13da
only compute necessary tokens for scoring rather than all the tokens …
vedantjh2 Jan 30, 2026
8288f48
clean up code
vedantjh2 Jan 31, 2026
2dc6250
combine generative score API in v1/score to unify the scoring endpoin…
vedantjh2 Feb 4, 2026
9777317
update docs
vedantjh2 Feb 4, 2026
b2b5223
change test end point to v1/score
vedantjh2 Feb 5, 2026
21b35e1
remove sampling params that we do not need for scoring
vedantjh2 Feb 5, 2026
16c2f21
update tests
vedantjh2 Feb 5, 2026
c169f6f
remove circular import safety net
vedantjh2 Feb 5, 2026
1a259d2
move files into the correct test folder
vedantjh2 Feb 5, 2026
f4b81b5
allow for heterogenous token_id batching to occur in a batch and limi…
vedantjh2 Feb 6, 2026
d973f81
require exactly 2 token ids for generative scoring
vedantjh2 Feb 6, 2026
8a9316f
require exactly 2 token ids for generative scoring
vedantjh2 Feb 6, 2026
396fb35
move imports to top and solve vircular import
vedantjh2 Feb 7, 2026
933be09
add truncation for tokens
vedantjh2 Feb 7, 2026
8e5b19c
consolidate tests and include changes from recent updates
vedantjh2 Feb 7, 2026
55444b5
optimize batch processing
Feb 13, 2026
9858dea
Merge main into vjhaveri/scoring
Mar 17, 2026
a97b472
Move generative scoring out of pooling into standalone /generative_sc…
Mar 23, 2026
3b30047
refactor generative score api
Mar 23, 2026
090485f
Merge remote-tracking branch 'upstream/main' into vjhaveri/scoring
Mar 23, 2026
0ce2afe
Remove unrelated protocol.py changes
Mar 24, 2026
9de4626
Code review fixes: remove dead code, fix docs, fix test paths
Mar 24, 2026
98490d2
change name to generative_scoring
Mar 26, 2026
6f74962
update docs to separate out pooling and gen scoring
Mar 26, 2026
3ec2d50
Merge branch 'main' into vjhaveri/scoring
DarkLight1337 Mar 27, 2026
24f1337
update engine input to use renderer and lint fixes
Mar 28, 2026
a781350
Merge remote-tracking branch 'upstream/main' into vjhaveri/scoring
Mar 30, 2026
6fce10b
fix changes after integrating upstream for failing ci and testing loc…
Mar 30, 2026
264dd94
fix failing metadata CI
Mar 30, 2026
814db14
Fix test mocks: add renderer to mock engine, fix expected status code
Mar 30, 2026
5dc5be3
Merge branch 'main' into vjhaveri/scoring
DarkLight1337 Mar 31, 2026
c4c0c1d
Add type annotation for tokenizer parameter to fix docs build
Mar 31, 2026
72 changes: 70 additions & 2 deletions docs/serving/openai_compatible_server.md
@@ -73,8 +73,11 @@ In addition, we have the following custom APIs:
- [Cohere Embed API](../models/pooling_models/embed.md#cohere-embed-api) (`/v2/embed`)
- Compatible with [Cohere's Embed API](https://docs.cohere.com/reference/embed)
- Works with any [embedding model](../models/pooling_models/embed.md#supported-models), including multimodal models.
- [Score API](../models/pooling_models/scoring.md#score-api) (`/score`, `/v1/score`)
- Applicable to [score models](../models/pooling_models/scoring.md) (cross-encoder, bi-encoder, late-interaction).
- [Generative Scoring API](#generative-scoring-api) (`/generative_scoring`)
- Applicable to [CausalLM models](../models/generative_models.md) (task `"generate"`).
- Computes next-token probabilities for specified `label_token_ids`.
- [Rerank API](../models/pooling_models/scoring.md#rerank-api) (`/rerank`, `/v1/rerank`, `/v2/rerank`)
- Implements [Jina AI's v1 rerank API](https://jina.ai/reranker/)
- Also compatible with [Cohere's v1 & v2 rerank APIs](https://docs.cohere.com/v2/reference/rerank)
@@ -481,6 +484,71 @@ This approach is more robust than index-based access (`messages[0]`, `messages[1]` …

Example template file: [examples/pooling/score/template/nemotron-rerank.jinja](../../examples/pooling/score/template/nemotron-rerank.jinja)

### Generative Scoring API

The `/generative_scoring` endpoint uses a CausalLM model (e.g., Llama, Qwen, Mistral) to compute the probability of specified token IDs appearing as the next token. Each item (document) is concatenated with the query to form a prompt, and the model predicts how likely each label token is as the next token after that prompt. This lets you score items against a query: for example, ask "Is this the capital of France?" and score each city by how likely the model is to answer "Yes".

This endpoint is automatically available when the server is started with a generative model (task `"generate"`). It is separate from the pooling-based [Score API](#score-api), which uses cross-encoder, bi-encoder, or late-interaction models.

**Requirements:**

- The `label_token_ids` parameter is **required** and must contain **at least 1 token ID**.
- The returned score is the softmax-normalized probability of the first label token, where the softmax is taken over only the listed label tokens.
- With exactly 2 labels, this reduces to `P(label_token_ids[0]) / (P(label_token_ids[0]) + P(label_token_ids[1]))`.
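The normalization can be checked by hand. The sketch below is not vLLM's implementation, and the logprob values are made up for illustration; only the token IDs come from the example request further down:

```python
import math

def label_score(logprobs: dict[int, float], label_token_ids: list[int]) -> float:
    # Softmax over only the label tokens; return the first label's share.
    exps = [math.exp(logprobs[t]) for t in label_token_ids]
    return exps[0] / sum(exps)

# Hypothetical next-token logprobs for "Yes" (9454) and "No" (2753).
logprobs = {9454: -0.05, 2753: -3.0}
print(round(label_score(logprobs, [9454, 2753]), 3))  # ~0.95
```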

#### Example

```bash
curl -X POST http://localhost:8000/generative_scoring \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"query": "Is this city the capital of France?",
"items": ["Paris", "London", "Berlin"],
"label_token_ids": [9454, 2753]
}'
```

Here, each item is appended to the query to form prompts like `"Is this city the capital of France? Paris"`, `"... London"`, etc. The model then predicts the next token, and the score reflects the probability of "Yes" (token 9454) vs "No" (token 2753).

??? console "Response"

```json
{
"id": "generative-scoring-abc123",
"object": "list",
"created": 1234567890,
"model": "Qwen/Qwen3-0.6B",
"data": [
{"index": 0, "object": "score", "score": 0.95},
{"index": 1, "object": "score", "score": 0.12},
{"index": 2, "object": "score", "score": 0.08}
],
"usage": {"prompt_tokens": 45, "total_tokens": 48, "completion_tokens": 3}
}
```
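The same request can be sent from Python. This is a minimal standard-library sketch, with the endpoint and field names taken from the curl example above; call `post_score()` against a running server to execute it:

```python
import json
import urllib.request

payload = {
    "model": "Qwen/Qwen3-0.6B",
    "query": "Is this city the capital of France?",
    "items": ["Paris", "London", "Berlin"],
    "label_token_ids": [9454, 2753],
}

def post_score(url: str = "http://localhost:8000/generative_scoring") -> dict:
    # POST the JSON payload and return the parsed response body.
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Against a live server:
# result = post_score()
# for item, entry in zip(payload["items"], result["data"]):
#     print(item, entry["score"])
```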

#### How it works

1. **Prompt Construction**: For each item, builds `prompt = query + item` (or `item + query` if `item_first=true`)
2. **Forward Pass**: Runs the model on each prompt to get next-token logits
3. **Probability Extraction**: Extracts logprobs for the specified `label_token_ids`
4. **Softmax Normalization**: Applies softmax over only the label tokens (when `apply_softmax=true`)
5. **Score**: Returns the normalized probability of the first label token
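The five steps above can be condensed into a self-contained sketch. Here `next_token_logprobs` is a hypothetical stand-in for the model forward pass, and the token IDs and logprob values are made up; this illustrates the flow, not the server's actual code:

```python
import math

def generative_scores(query, items, label_token_ids, next_token_logprobs,
                      item_first=False, apply_softmax=True):
    # next_token_logprobs(prompt) stands in for a forward pass and
    # returns {token_id: logprob} for the next-token distribution.
    scores = []
    for item in items:
        prompt = item + query if item_first else query + item  # step 1
        logprobs = next_token_logprobs(prompt)                 # steps 2-3
        vals = [logprobs[t] for t in label_token_ids]
        if apply_softmax:                                      # step 4
            exps = [math.exp(v) for v in vals]
            scores.append(exps[0] / sum(exps))                 # step 5
        else:
            scores.append(math.exp(vals[0]))
    return scores

# Toy stand-in model: prompts mentioning "Paris" favor the "Yes" token.
def toy_model(prompt):
    return {9454: -0.1 if "Paris" in prompt else -2.0,
            2753: -2.0 if "Paris" in prompt else -0.1}

print(generative_scores("Is this the capital of France? ",
                        ["Paris", "London"], [9454, 2753], toy_model))
```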

#### Finding Token IDs

To find the token IDs for your labels, use the tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

# Each label should encode to exactly one token; leading spaces can change
# tokenization, so check that encode() returns a single ID if unsure.
yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
no_id = tokenizer.encode("No", add_special_tokens=False)[0]
print(f"Yes: {yes_id}, No: {no_id}")
```

## Ray Serve LLM

Ray Serve LLM enables scalable, production-grade serving of the vLLM engine. It integrates tightly with vLLM and extends it with features such as auto-scaling, load balancing, and back-pressure.
Expand Down
2 changes: 2 additions & 0 deletions tests/entrypoints/openai/generative_scoring/__init__.py
@@ -0,0 +1,2 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project