
Generative Scoring #34539

Merged — ywang96 merged 33 commits into vllm-project:main from vedantjh2:vjhaveri/scoring on Mar 31, 2026.

Conversation

vedantjh2 (Contributor) commented Feb 13, 2026

Purpose

This PR adds a standalone /generative_scoring endpoint for computing next-token probability scores using CausalLM models (e.g., Qwen3-Reranker-0.6B). This enables serving reranker models in their native CausalLM/generative architecture without requiring --hf_overrides to force a SequenceClassification wrapper.

The endpoint computes: score = P(label_token_ids[0]) / (P(label_token_ids[0]) + P(label_token_ids[1])) — i.e., softmax-normalized probability of the first label token over both label tokens.
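The score formula above can be sketched in a few lines of plain Python. This is an illustrative re-implementation (the function name is mine, not vLLM's API); the point is that normalizing two log-probabilities in log space avoids the underflow a naive `exp()` of large-magnitude logprobs could hit:

```python
import math

def generative_score(logprob_a: float, logprob_b: float) -> float:
    """Softmax-normalize two label-token log-probabilities into a score.

    Equivalent to P(a) / (P(a) + P(b)), computed in log space for
    numerical stability: subtracting the max before exponentiating
    keeps both exp() arguments <= 0.
    """
    m = max(logprob_a, logprob_b)
    pa = math.exp(logprob_a - m)
    pb = math.exp(logprob_b - m)
    return pa / (pa + pb)
```

Note this is algebraically the same as `sigmoid(logprob_a - logprob_b)`, which is why the score lands in (0, 1) and the two labels' scores sum to 1.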

Key changes:

  • New /generative_scoring endpoint (vllm/entrypoints/openai/generative_scoring/): Standalone API for generative scoring, registered for generate-task models via api_server.py
  • New logprob_token_ids field in SamplingParams: Allows requesting logprobs for specific token IDs without materializing the full vocabulary distribution
  • Efficient logprob computation: Adds gather_specific_token_logprobs() to the V1 sampler using the fused Triton kernel (compute_token_logprobs) for log_softmax + gather, avoiding full vocabulary materialization
  • Batching support: Requests with different logprob_token_ids values can be batched together (padded to max length)
  • Automatic prompt truncation: Prompts exceeding max_model_len are truncated to max_model_len - 1 (reserving 1 token for output)
  • ModelConfig.is_causal_lm property: New helper to detect CausalLM architectures via regex on hf_config.architectures
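The sampler-side idea behind `gather_specific_token_logprobs()` and the padded batching can be sketched in plain Python. The real code fuses this into a Triton kernel over GPU tensors; the functions below are an illustrative scalar re-implementation under that assumption, not vLLM's API:

```python
import math

def gather_token_logprobs(logits: list[float], token_ids: list[int]) -> list[float]:
    """Return log-softmax values for the selected token ids only.

    log_softmax(x)[i] = x[i] - logsumexp(x), so one pass over the logits
    yields the normalizer and the full |vocab|-sized logprob vector is
    never materialized.
    """
    m = max(logits)  # stabilize logsumexp
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [logits[t] - lse for t in token_ids]

def pad_token_ids(batch: list[list[int]], pad_id: int = 0) -> list[list[int]]:
    """Right-pad per-request token-id lists to the batch max length so
    requests with different logprob_token_ids can share one rectangular
    gather."""
    width = max(len(ids) for ids in batch)
    return [ids + [pad_id] * (width - len(ids)) for ids in batch]
```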

Test Plan

Unit tests

pytest tests/entrypoints/openai/generative_scoring/test_generative_scoring.py -v

Tests cover:

  • Protocol model construction (request/response fields, defaults)
  • Probability computation (softmax normalization, numerical stability, true probs mode)
  • Score formula: P(token[0]) / (P(token[0]) + P(token[1]))
  • Input validation (out-of-vocab token IDs, empty items)
  • Prompt building and item ordering (item_first flag)
  • Full generation flow with mocked engine

End-to-end tests

pytest tests/entrypoints/openai/generative_scoring/test_generative_scoring_e2e.py -v

Tests cover (requires GPU with Qwen/Qwen3-0.6B):

  • Basic score request and response structure validation
  • Multiple documents scoring
  • Missing label_token_ids returns 422
  • Empty items returns 400
  • Out-of-vocab token IDs return 400
  • Score determinism across identical requests

Manual testing

Launch server:

vllm serve /shared/public/elr-models/Qwen/Qwen3-Reranker-0.6B/ --max-model-len 2048

Generative score request:

curl -X POST http://localhost:8000/generative_score \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Is this city the capital of France?",
    "items": ["Paris", "London", "New York", "Dublin", "Berlin"],
    "label_token_ids": [9454, 2753]
  }'

Pre-tokenized inputs:

curl -X POST http://localhost:8000/generative_score \
  -H "Content-Type: application/json" \
  -d '{
    "query": [100, 200],
    "items": [[300, 400], [500, 600]],
    "label_token_ids": [9454, 2753]
  }'

Error case — empty items:

curl -X POST http://localhost:8000/generative_score \
  -H "Content-Type: application/json" \
  -d '{
    "query": "test",
    "items": [],
    "label_token_ids": [9454, 2753]
  }'
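The three curl requests above can also be driven from Python. The helper below is a hypothetical client-side sketch (its name and the "exactly two label token ids" check are inferred from the examples and score formula above, not taken from vLLM's code); it builds the JSON body and mirrors the empty-items rejection the server answers with a 400:

```python
import json

def build_generative_score_request(query, items, label_token_ids):
    """Build the JSON body for a /generative_score request.

    `query` and each entry of `items` may be a string or a pre-tokenized
    list of token ids, matching the curl examples.
    """
    if not items:
        raise ValueError("items must contain at least one item.")
    if len(label_token_ids) != 2:
        raise ValueError("label_token_ids must contain exactly two token ids.")
    return json.dumps({
        "query": query,
        "items": items,
        "label_token_ids": label_token_ids,
    })
```

The returned string can be POSTed to `http://localhost:8000/generative_score` with any HTTP client, with `Content-Type: application/json`.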

Test Results

Generative score response (Qwen3-Reranker-0.6B):

{
    "id": "generative-score-bbd65dabc2dd9206",
    "object": "list",
    "created": 1774308947,
    "model": "/shared/public/elr-models/Qwen/Qwen3-Reranker-0.6B/",
    "data": [
        {"index": 0, "object": "score", "score": 0.5621765008857981},
        {"index": 1, "object": "score", "score": 0.8267117940706734},
        {"index": 2, "object": "score", "score": 0.18242552380635632},
        {"index": 3, "object": "score", "score": 0.23651623644570763},
        {"index": 4, "object": "score", "score": 0.4610167793123159}
    ],
    "usage": {
        "prompt_tokens": 47,
        "total_tokens": 52,
        "completion_tokens": 5,
        "prompt_tokens_details": null
    }
}

Empty items error:

{
    "error": {
        "message": "items must contain at least one item.",
        "type": "BadRequestError",
        "param": null,
        "code": 400
    }
}

Pre-tokenized inputs work correctly:

{
    "id": "generative-score-8305257f1f114b7b",
    "object": "list",
    "created": 1774308973,
    "model": "/shared/public/elr-models/Qwen/Qwen3-Reranker-0.6B/",
    "data": [
        {"index": 0, "object": "score", "score": 0.2751297238231752},
        {"index": 1, "object": "score", "score": 0.05921025074128593}
    ],
    "usage": {
        "prompt_tokens": 8,
        "total_tokens": 10,
        "completion_tokens": 2,
        "prompt_tokens_details": null
    }
}

Future Optimizations

Currently we must generate one token (max_tokens=1) to obtain the logprobs. A future improvement is to allow max_tokens=0 and compute the logprobs without going through a decode phase, making this workload prefill-only.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update.

DarkLight1337 added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) on Mar 27, 2026
mergify bot commented Mar 27, 2026

Hi @vedantjh2, the pre-commit checks have failed. Please run:

uv pip install 'pre-commit>=4.5.1'
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

(Outdated review thread on vllm/entrypoints/openai/generative_scoring/serving.py)
mergify bot commented Mar 28, 2026 (same pre-commit failure notice as above)

Vedant Jhaveri added 4 commits March 29, 2026 21:47
vedantjh2 (Contributor, Author):

@DarkLight1337 @noooop Looks like the CI passed all cases. Can y'all help me merge, please? Thank you!

DarkLight1337 enabled auto-merge (squash) March 31, 2026 07:38
DarkLight1337 (Member) commented Mar 31, 2026

Can you fix the docs failure?

WARNING - griffe: vllm/entrypoints/openai/generative_scoring/serving.py:393: No type or annotation for parameter 'tokenizer'

auto-merge was automatically disabled March 31, 2026 17:35

Head branch was pushed to by a user without write access

vedantjh2 (Contributor, Author):

@DarkLight1337 fixed! Thank you!

ywang96 merged commit 2e56975 into vllm-project:main on Mar 31, 2026
55 of 57 checks passed
puririshi98 pushed a commit to puririshi98/vllm that referenced this pull request Apr 7, 2026
mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Apr 10, 2026
### What this PR does / why we need it?
For the fusedmoe:
vllm-project/vllm#33049
vllm-project/vllm#35949
FusedMoe refactor

For the qwen3_vl:
vllm-project/vllm#34539
A new Triton kernel has been added for fast RoPE position encoding. I've added a patch to fall back to the native implementation. We'll consider registering custom operators and implementing an Ascend version later.

vllm-project/vllm#38361


### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: 
- vLLM main:
vllm-project/vllm@29e4870

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
paulyu12 pushed a commit to paulyu12/vllm-ascend that referenced this pull request Apr 14, 2026
(Same commit message as above.)
guxin108 pushed a commit to guxin108/vllm-ascend that referenced this pull request Apr 24, 2026
(Same commit message as above.)
zouyida2052 pushed a commit to zouyida2052/vllm-ascend that referenced this pull request Apr 28, 2026
(Same commit message as above.)

Labels

documentation, frontend, ready (ONLY add when PR is ready to merge/full CI is needed), v1

5 participants