Generative Scoring #34539
Merged
ywang96 merged 33 commits into vllm-project:main on Mar 31, 2026
Conversation
…which made the API super slow for large batch sizes
…t. generative used when we launch server with a *ForCausalLM* model
…t number of token ids in request to 2
Contributor
Hi @vedantjh2, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
Signed-off-by: Vedant Jhaveri <vjhaveri@linkedin.com>
added 4 commits on March 29, 2026 at 21:47
…ally Signed-off-by: Vedant Jhaveri <vjhaveri@linkedin.com>
Signed-off-by: Vedant Jhaveri <vjhaveri@linkedin.com>
Signed-off-by: Vedant Jhaveri <vjhaveri@linkedin.com>
Contributor
Author
@DarkLight1337 @noooop Looks like the CI passed all cases. Can you all help me merge, please? Thank you!
Member
Can you fix the docs failure?
Signed-off-by: Vedant Jhaveri <vjhaveri@linkedin.com>
auto-merge was automatically disabled
March 31, 2026 17:35
Head branch was pushed to by a user without write access
Contributor
Author
@DarkLight1337 Fixed! Thank you!
puririshi98 pushed a commit to puririshi98/vllm that referenced this pull request on Apr 7, 2026
Signed-off-by: Vedant Jhaveri <vjhaveri@linkedin.com> Co-authored-by: Vedant Jhaveri <vjhaveri@linkedin.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Signed-off-by: Rishi Puri <riship@nvidia.com>
mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request on Apr 9, 2026
Signed-off-by: Vedant Jhaveri <vjhaveri@linkedin.com> Co-authored-by: Vedant Jhaveri <vjhaveri@linkedin.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request on Apr 10, 2026
### What this PR does / why we need it?
For the fusedmoe: vllm-project/vllm#33049 vllm-project/vllm#35949 (FusedMoe refactor)
For the qwen3_vl: vllm-project/vllm#34539: a new Triton kernel has been added for fast rope position encoding. I've added a patch to fall back to native. We'll consider registering custom operators and implementing the Ascend version later. vllm-project/vllm#38361

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
- vLLM version:
- vLLM main: vllm-project/vllm@29e4870

Signed-off-by: wangli <wangli858794774@gmail.com>
paulyu12 (paulyu12/vllm-ascend, Apr 14, 2026), guxin108 (guxin108/vllm-ascend, Apr 24, 2026), and zouyida2052 (zouyida2052/vllm-ascend, Apr 28, 2026) also pushed commits referencing this pull request, each carrying the same commit message as above with their own Signed-off-by trailers.
Purpose

This PR adds a standalone /generative_scoring endpoint for computing next-token probability scores using CausalLM models (e.g., Qwen3-Reranker-0.6B). This enables serving reranker models in their native CausalLM/generative architecture without requiring --hf_overrides to force a SequenceClassification wrapper.

The endpoint computes:

score = P(label_token_ids[0]) / (P(label_token_ids[0]) + P(label_token_ids[1]))

i.e., the softmax-normalized probability of the first label token over both label tokens.

Key changes:

- New /generative_scoring endpoint (vllm/entrypoints/openai/generative_scoring/): standalone API for generative scoring, registered for generate-task models via api_server.py
- New logprob_token_ids field in SamplingParams: allows requesting logprobs for specific token IDs without materializing the full vocabulary distribution
- Added gather_specific_token_logprobs() to the V1 sampler, using the fused Triton kernel (compute_token_logprobs) for log_softmax + gather, avoiding full-vocabulary materialization
- Requests with different logprob_token_ids values can be batched together (padded to max length)
- Prompts longer than max_model_len are truncated to max_model_len - 1 (reserving 1 token for output)
- New ModelConfig.is_causal_lm property: helper to detect CausalLM architectures via regex on hf_config.architectures

Test Plan
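As background for the tests below, the score defined in Purpose reduces to a two-way softmax over the two label-token logprobs. The following is a minimal plain-Python sketch of that arithmetic, not the actual vLLM kernel or sampler code:

```python
import math

def gather_token_logprobs(logits, token_ids):
    """Log-softmax over the full logits, gathered only at token_ids
    (the same idea as gather_specific_token_logprobs, but in pure Python)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [logits[t] - log_z for t in token_ids]

def generative_score(logits, label_token_ids):
    """score = P(label[0]) / (P(label[0]) + P(label[1]))."""
    lp0, lp1 = gather_token_logprobs(logits, label_token_ids)
    return math.exp(lp0) / (math.exp(lp0) + math.exp(lp1))

# Toy vocabulary of 5 tokens; pretend the label tokens are ids 2 and 4.
logits = [0.1, -1.0, 2.0, 0.5, 1.0]
print(round(generative_score(logits, [2, 4]), 4))  # ≈ 0.7311
```

Note that the full-vocabulary normalizer cancels in the final ratio, so the score depends only on the two label-token logits; the gather step matters for returning well-defined per-token logprobs without materializing the whole distribution.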
Unit tests

Tests cover:

- Score computation: P(token[0]) / (P(token[0]) + P(token[1]))
- Prompt formatting (item_first flag)

End-to-end tests

Tests cover (requires GPU with Qwen/Qwen3-0.6B):

- label_token_ids validation errors return 422

Manual testing
Launch server:
Generative score request:
Pre-tokenized inputs:
Error case — empty items:
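The exact commands for the manual tests above are omitted here. As an illustrative sketch only: the "query" field name below is an assumption, while "items" and "label_token_ids" are named in this PR's schema and error messages, and the label token ids are model-specific placeholders:

```python
import json

# Hypothetical request body for POST /generative_scoring.
payload = {
    "model": "Qwen/Qwen3-Reranker-0.6B",
    "query": "What is the capital of France?",  # assumed field name
    "items": [
        "Paris is the capital of France.",
        "Berlin is the capital of Germany.",
    ],
    "label_token_ids": [9454, 2753],  # e.g. ids for "yes" / "no" (model-specific)
}
print(json.dumps(payload, indent=2))
```

Sending the same body with "items": [] should reproduce the 400 "items must contain at least one item." error shown below.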
Test Results
Generative score response (Qwen3-Reranker-0.6B):

{
  "id": "generative-score-bbd65dabc2dd9206",
  "object": "list",
  "created": 1774308947,
  "model": "/shared/public/elr-models/Qwen/Qwen3-Reranker-0.6B/",
  "data": [
    {"index": 0, "object": "score", "score": 0.5621765008857981},
    {"index": 1, "object": "score", "score": 0.8267117940706734},
    {"index": 2, "object": "score", "score": 0.18242552380635632},
    {"index": 3, "object": "score", "score": 0.23651623644570763},
    {"index": 4, "object": "score", "score": 0.4610167793123159}
  ],
  "usage": {
    "prompt_tokens": 47,
    "total_tokens": 52,
    "completion_tokens": 5,
    "prompt_tokens_details": null
  }
}

Empty items error:

{
  "error": {
    "message": "items must contain at least one item.",
    "type": "BadRequestError",
    "param": null,
    "code": 400
  }
}

Pre-tokenized inputs work correctly:

{
  "id": "generative-score-8305257f1f114b7b",
  "object": "list",
  "created": 1774308973,
  "model": "/shared/public/elr-models/Qwen/Qwen3-Reranker-0.6B/",
  "data": [
    {"index": 0, "object": "score", "score": 0.2751297238231752},
    {"index": 1, "object": "score", "score": 0.05921025074128593}
  ],
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 10,
    "completion_tokens": 2,
    "prompt_tokens_details": null
  }
}

Future Optimizations
Currently we need to actually generate a token to get the logprobs (max_tokens=1). In the future we would like to support max_tokens=0 and compute the logprobs without going through a decode phase, making this workload prefill-only.

Essential Elements of an Effective PR Description Checklist
- Update supported_models.md and examples for a new model.
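For completeness, the list-shaped responses shown in Test Results can be consumed by sorting the data entries by score. A small sketch, using a trimmed copy of the sample response above:

```python
import json

# Trimmed copy of the sample response from Test Results.
response_text = """
{
  "object": "list",
  "data": [
    {"index": 0, "object": "score", "score": 0.5621765008857981},
    {"index": 1, "object": "score", "score": 0.8267117940706734},
    {"index": 2, "object": "score", "score": 0.18242552380635632}
  ]
}
"""
resp = json.loads(response_text)
# Rank items by score, highest first.
ranked = [d["index"] for d in
          sorted(resp["data"], key=lambda d: d["score"], reverse=True)]
print(ranked)  # [1, 0, 2]
```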