Merged
Changes from 16 commits
Commits
33 commits
f6ef430
add scoring API
vedantjh2 Jan 30, 2026
c4d13da
only compute necessary tokens for scoring rather than all the tokens …
vedantjh2 Jan 30, 2026
8288f48
clean up code
vedantjh2 Jan 31, 2026
2dc6250
combine generative score API in v1/score to unify the scoring endpoin…
vedantjh2 Feb 4, 2026
9777317
update docs
vedantjh2 Feb 4, 2026
b2b5223
change test end point to v1/score
vedantjh2 Feb 5, 2026
21b35e1
remove sampling params that we do not need for scoring
vedantjh2 Feb 5, 2026
16c2f21
update tests
vedantjh2 Feb 5, 2026
c169f6f
remove circular import safety net
vedantjh2 Feb 5, 2026
1a259d2
move files into the correct test folder
vedantjh2 Feb 5, 2026
f4b81b5
allow for heterogenous token_id batching to occur in a batch and limi…
vedantjh2 Feb 6, 2026
d973f81
require exactly 2 token ids for generative scoring
vedantjh2 Feb 6, 2026
8a9316f
require exactly 2 token ids for generative scoring
vedantjh2 Feb 6, 2026
396fb35
move imports to top and solve vircular import
vedantjh2 Feb 7, 2026
933be09
add truncation for tokens
vedantjh2 Feb 7, 2026
8e5b19c
consolidate tests and include changes from recent updates
vedantjh2 Feb 7, 2026
55444b5
optimize batch processing
Feb 13, 2026
9858dea
Merge main into vjhaveri/scoring
Mar 17, 2026
a97b472
Move generative scoring out of pooling into standalone /generative_sc…
Mar 23, 2026
3b30047
refactor generative score api
Mar 23, 2026
090485f
Merge remote-tracking branch 'upstream/main' into vjhaveri/scoring
Mar 23, 2026
0ce2afe
Remove unrelated protocol.py changes
Mar 24, 2026
9de4626
Code review fixes: remove dead code, fix docs, fix test paths
Mar 24, 2026
98490d2
change name to generative_scoring
Mar 26, 2026
6f74962
update docs to separate out pooling and gen scoring
Mar 26, 2026
3ec2d50
Merge branch 'main' into vjhaveri/scoring
DarkLight1337 Mar 27, 2026
24f1337
update engine input to use renderer and lint fixes
Mar 28, 2026
a781350
Merge remote-tracking branch 'upstream/main' into vjhaveri/scoring
Mar 30, 2026
6fce10b
fix changes after integrating upstream for failing ci and testing loc…
Mar 30, 2026
264dd94
fix failing metadata CI
Mar 30, 2026
814db14
Fix test mocks: add renderer to mock engine, fix expected status code
Mar 30, 2026
5dc5be3
Merge branch 'main' into vjhaveri/scoring
DarkLight1337 Mar 31, 2026
c4c0c1d
Add type annotation for tokenizer parameter to fix docs build
Mar 31, 2026
74 changes: 70 additions & 4 deletions docs/serving/openai_compatible_server.md
@@ -70,8 +70,9 @@ In addition, we have the following custom APIs:
- Applicable to all [pooling models](../models/pooling_models.md).
- [Classification API](#classification-api) (`/classify`)
- Only applicable to [classification models](../models/pooling_models.md).
- [Score API](#score-api) (`/score`)
- Applicable to [embedding models and cross-encoder models](../models/pooling_models.md).
- [Score API](#score-api) (`/score`, `/v1/score`)
- Applicable to [embedding models, cross-encoder models](../models/pooling_models.md), and [CausalLM models](../models/generative_models.md).
- For CausalLM models, computes next-token probabilities for specified `label_token_ids`.
- [Re-rank API](#re-rank-api) (`/rerank`, `/v1/rerank`, `/v2/rerank`)
- Implements [Jina AI's v1 re-rank API](https://jina.ai/reranker/)
- Also compatible with [Cohere's v1 & v2 re-rank APIs](https://docs.cohere.com/v2/reference/rerank)
@@ -826,8 +827,13 @@ these extra parameters are supported instead:

### Score API

Our Score API can apply a cross-encoder model or an embedding model to predict scores for sentence or multimodal pairs. When using an embedding model the score corresponds to the cosine similarity between each embedding pair.
Usually, the score for a sentence pair refers to the similarity between two sentences, on a scale of 0 to 1.
Our Score API provides a unified interface for computing similarity or relevance scores:

- **Embedding models**: Computes cosine similarity between embeddings.
- **Cross-encoder models**: Predicts relevance scores for sentence pairs.
- **CausalLM models**: Computes next-token probabilities for specified `label_token_ids` (requires the `label_token_ids` parameter).

For embedding and cross-encoder models, the score typically represents similarity on a scale of 0 to 1.

You can find the documentation for cross encoder models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).

@@ -1056,6 +1062,66 @@ The following extra parameters are supported:
--8<-- "vllm/entrypoints/pooling/score/protocol.py:score-extra-params"
```

#### CausalLM Models (Generative Scoring)

When the Score API is used with a CausalLM model (e.g., Llama, Qwen, Mistral), the endpoint scores each query-document pair by the probability that specified token IDs appear as the next token. This suits tasks that reduce to a choice between two labels, such as yes/no relevance judgments or positive/negative sentiment classification.

**Requirements for CausalLM models:**

- The `label_token_ids` parameter is **required** and must contain **exactly 2 token IDs**.
- The score is computed as `P(label_token_ids[0]) / (P(label_token_ids[0]) + P(label_token_ids[1]))`, so it always falls between 0 and 1, and higher values favor the first label.
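
Writing the formula in terms of raw logits shows that only the gap between the two label logits matters. The derivation below is a sketch: $z_0$ and $z_1$ denote the next-token logits of `label_token_ids[0]` and `label_token_ids[1]`, $t_0$ and $t_1$ the corresponding tokens, and $Z$ the normalizer over the vocabulary $V$. This notation is introduced here for illustration and is not taken from the vLLM source.

```latex
\begin{aligned}
P(t_i) &= \frac{e^{z_i}}{Z}, \qquad Z = \sum_{v \in V} e^{z_v} \\
\text{score} &= \frac{P(t_0)}{P(t_0) + P(t_1)}
  = \frac{e^{z_0}/Z}{e^{z_0}/Z + e^{z_1}/Z}
  = \frac{e^{z_0}}{e^{z_0} + e^{z_1}}
  = \frac{1}{1 + e^{-(z_0 - z_1)}}
  = \sigma(z_0 - z_1)
\end{aligned}
```

Because $Z$ cancels, the score is the logistic sigmoid of the logit difference, and a score of 0.5 means the model rates the two labels as equally likely.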

##### Example: Score with CausalLM

```bash
curl -X POST http://localhost:8000/v1/score \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"queries": "Is this city the capital of France?",
"documents": ["Paris", "London", "Berlin"],
"label_token_ids": [9454, 2753]
}'
```
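
The same request can be sent from Python. The sketch below uses only the standard library and assumes the server accepts the request body shown in the `curl` example; `build_score_request` and `post_score` are hypothetical helper names, not part of vLLM.

```python
import json
import urllib.request


def build_score_request(model, query, documents, label_token_ids):
    """Build the JSON body for a generative scoring request."""
    # Mirror the documented requirement of exactly two label token IDs.
    if len(label_token_ids) != 2:
        raise ValueError("label_token_ids must contain exactly 2 token IDs")
    return {
        "model": model,
        "queries": query,
        "documents": list(documents),
        "label_token_ids": [int(t) for t in label_token_ids],
    }


def post_score(payload, base_url="http://localhost:8000"):
    """POST the payload to /v1/score and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/v1/score",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_score_request(
    "Qwen/Qwen3-0.6B",
    "Is this city the capital of France?",
    ["Paris", "London", "Berlin"],
    [9454, 2753],
)
# With a server running, post_score(payload) would send the request.
```

Checking the length of `label_token_ids` on the client side surfaces the mistake before a request is made.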

??? console "Response"

```json
{
"id": "score-abc123",
"object": "list",
"created": 1234567890,
"model": "Qwen/Qwen3-0.6B",
"data": [
{"index": 0, "object": "score", "score": 0.95},
{"index": 1, "object": "score", "score": 0.12},
{"index": 2, "object": "score", "score": 0.08}
],
"usage": {"prompt_tokens": 45, "total_tokens": 48, "completion_tokens": 3}
}
```
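
A response of this shape can be turned into a ranking with a few lines of Python. The sketch below assumes that each entry's `index` refers to the position of the document in the request's `documents` list; `rank_documents` is a hypothetical helper name.

```python
import json

documents = ["Paris", "London", "Berlin"]

# The "data" portion of the example response above.
response = json.loads("""
{
  "data": [
    {"index": 0, "object": "score", "score": 0.95},
    {"index": 1, "object": "score", "score": 0.12},
    {"index": 2, "object": "score", "score": 0.08}
  ]
}
""")


def rank_documents(documents, response):
    """Return (document, score) pairs sorted from highest to lowest score."""
    pairs = [(documents[item["index"]], item["score"]) for item in response["data"]]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


ranked = rank_documents(documents, response)
for document, score in ranked:
    print(f"{score:.2f}  {document}")
```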

##### How it works

1. **Prompt Construction**: For each document, builds `prompt = query + document`
2. **Forward Pass**: Runs the model on each prompt to get the next-token logits
3. **Probability Extraction**: Extracts logprobs for the 2 specified `label_token_ids`
4. **Softmax Normalization**: Applies softmax over only the 2 label tokens, ignoring the rest of the vocabulary
5. **Score Computation**: Returns the resulting probability of the first label, `P(token[0]) / (P(token[0]) + P(token[1]))`, as the score
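
Steps 3 to 5 can be sketched in plain Python on a toy logits vector. This is an illustration of the arithmetic only, not the vLLM implementation; `log_softmax`, `generative_score`, and the 5-token vocabulary are made up for the example.

```python
import math


def log_softmax(logits):
    """Log-probabilities over the full vocabulary (numerically stable)."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def generative_score(logits, label_token_ids):
    """Softmax restricted to two label tokens; returns the share of the first."""
    first, second = label_token_ids
    logprobs = log_softmax(logits)            # step 3: logprobs for the labels
    lp0, lp1 = logprobs[first], logprobs[second]
    peak = max(lp0, lp1)                      # step 4: softmax over the 2 labels
    e0, e1 = math.exp(lp0 - peak), math.exp(lp1 - peak)
    return e0 / (e0 + e1)                     # step 5: probability of label 0


# Toy 5-token vocabulary; token 1 plays "Yes" and token 3 plays "No".
toy_logits = [0.5, 2.0, -1.0, 0.0, 1.5]
score = generative_score(toy_logits, [1, 3])
print(round(score, 4))  # 0.8808
```

The other three tokens do not influence the result: the score depends only on the difference between the two label logits (here 2.0 - 0.0).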

##### Finding Token IDs

To find the token IDs for your labels, use the tokenizer:

```python
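from transformers import AutoTokenizer

# Load the tokenizer that matches the served model.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

# add_special_tokens=False keeps special tokens (e.g., BOS/EOS) out of the result.
# [0] takes the first token, so each label should encode to a single token.
yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
no_id = tokenizer.encode("No", add_special_tokens=False)[0]
print(f"Yes: {yes_id}, No: {no_id}")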
```
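
Taking element `[0]` silently drops the remaining tokens when a label encodes to more than one token. A small guard can catch this; the sketch below is a hypothetical helper that takes any `encode` callable, and it uses a stand-in vocabulary so that it runs without downloading a tokenizer.

```python
def single_token_id(encode, label):
    """Return the token ID for `label`, or raise if it spans several tokens."""
    token_ids = encode(label)
    if len(token_ids) != 1:
        raise ValueError(f"{label!r} encodes to {len(token_ids)} tokens: {token_ids}")
    return token_ids[0]


# Stand-in encoder for illustration only. The IDs for "Yes" and "No" are the
# ones used in the request example; the IDs for "Maybe so" are invented.
toy_vocab = {"Yes": [9454], "No": [2753], "Maybe so": [101, 102]}
encode = toy_vocab.__getitem__

label_token_ids = [single_token_id(encode, "Yes"), single_token_id(encode, "No")]
print(label_token_ids)
```

With a real tokenizer, `encode` could be `lambda text: tokenizer.encode(text, add_special_tokens=False)`.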

### Re-rank API

Our Re-rank API can apply an embedding model or a cross-encoder model to predict relevant scores between a single query, and