Generative Scoring #34539
Merged
ywang96 merged 33 commits into vllm-project:main on Mar 31, 2026
Conversation
…which made the API super slow for large batch sizes
…t. generative used when we launch server with a *ForCausalLM* model
…t number of token ids in request to 2
Contributor
Hi @vedantjh2, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
Signed-off-by: Vedant Jhaveri <vjhaveri@linkedin.com>
added 4 commits on March 29, 2026 at 21:47
…ally Signed-off-by: Vedant Jhaveri <vjhaveri@linkedin.com>
Signed-off-by: Vedant Jhaveri <vjhaveri@linkedin.com>
Signed-off-by: Vedant Jhaveri <vjhaveri@linkedin.com>
Contributor
Author
@DarkLight1337 @noooop Looks like the CI passed all cases. Can you all help me merge, please? Thank you!
Member
Can you fix the docs failure?
Signed-off-by: Vedant Jhaveri <vjhaveri@linkedin.com>
auto-merge was automatically disabled
March 31, 2026 17:35
Head branch was pushed to by a user without write access
Contributor
Author
@DarkLight1337 Fixed! Thank you!
puririshi98 pushed a commit to puririshi98/vllm that referenced this pull request on Apr 7, 2026
Signed-off-by: Vedant Jhaveri <vjhaveri@linkedin.com> Co-authored-by: Vedant Jhaveri <vjhaveri@linkedin.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Signed-off-by: Rishi Puri <riship@nvidia.com>
mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request on Apr 9, 2026
Signed-off-by: Vedant Jhaveri <vjhaveri@linkedin.com> Co-authored-by: Vedant Jhaveri <vjhaveri@linkedin.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request on Apr 10, 2026
### What this PR does / why we need it?
For the fusedmoe: vllm-project/vllm#33049 vllm-project/vllm#35949 (FusedMoe refactor)
For the qwen3_vl: vllm-project/vllm#34539: a new Triton kernel has been added for fast rope position encoding. I've added a patch to fall back to native. We'll consider registering custom operators and implementing the Ascend version later. vllm-project/vllm#38361

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
- vLLM version:
- vLLM main: vllm-project/vllm@29e4870

Signed-off-by: wangli <wangli858794774@gmail.com>
paulyu12 (paulyu12/vllm-ascend, Apr 14, 2026), guxin108 (guxin108/vllm-ascend, Apr 24, 2026), and zouyida2052 (zouyida2052/vllm-ascend, Apr 28, 2026) also pushed commits referencing this pull request, each carrying the same commit message as above with their own Signed-off-by trailers.
Purpose

This PR adds a standalone /generative_scoring endpoint for computing next-token probability scores using CausalLM models (e.g., Qwen3-Reranker-0.6B). This enables serving reranker models in their native CausalLM/generative architecture without requiring --hf_overrides to force a SequenceClassification wrapper.

The endpoint computes:

score = P(label_token_ids[0]) / (P(label_token_ids[0]) + P(label_token_ids[1]))

i.e., the softmax-normalized probability of the first label token over both label tokens.

Key changes:

- New /generative_scoring endpoint (vllm/entrypoints/openai/generative_scoring/): standalone API for generative scoring, registered for generate-task models via api_server.py
- New logprob_token_ids field in SamplingParams: allows requesting logprobs for specific token IDs without materializing the full vocabulary distribution
- Added gather_specific_token_logprobs() to the V1 sampler, using the fused Triton kernel (compute_token_logprobs) for log_softmax + gather, avoiding full-vocabulary materialization
- Requests with different logprob_token_ids values can be batched together (padded to max length)
- Prompts longer than max_model_len are truncated to max_model_len - 1 (reserving 1 token for output)
- New ModelConfig.is_causal_lm property: helper to detect CausalLM architectures via regex on hf_config.architectures

Test Plan
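As background for the tests below, the score defined in Purpose reduces to a two-way softmax over the two label-token logprobs. The following is a minimal plain-Python sketch of that arithmetic, not the actual vLLM kernel or sampler code:

```python
import math

def gather_token_logprobs(logits, token_ids):
    """Log-softmax over the full logits, gathered only at token_ids
    (the same idea as gather_specific_token_logprobs, but in pure Python)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [logits[t] - log_z for t in token_ids]

def generative_score(logits, label_token_ids):
    """score = P(label[0]) / (P(label[0]) + P(label[1]))."""
    lp0, lp1 = gather_token_logprobs(logits, label_token_ids)
    return math.exp(lp0) / (math.exp(lp0) + math.exp(lp1))

# Toy vocabulary of 5 tokens; pretend the label tokens are ids 2 and 4.
logits = [0.1, -1.0, 2.0, 0.5, 1.0]
print(round(generative_score(logits, [2, 4]), 4))  # ≈ 0.7311
```

Note that the full-vocabulary normalizer cancels in the final ratio, so the score depends only on the two label-token logits; the gather step matters for returning well-defined per-token logprobs without materializing the whole distribution.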
Unit tests

Tests cover:

- Score computation: P(token[0]) / (P(token[0]) + P(token[1]))
- Prompt formatting (item_first flag)

End-to-end tests

Tests cover (requires GPU with Qwen/Qwen3-0.6B):

- label_token_ids validation errors return 422

Manual testing
Launch server:
Generative score request:
Pre-tokenized inputs:
Error case — empty items:
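The exact commands for the manual tests above are omitted here. As an illustrative sketch only: the "query" field name below is an assumption, while "items" and "label_token_ids" are named in this PR's schema and error messages, and the label token ids are model-specific placeholders:

```python
import json

# Hypothetical request body for POST /generative_scoring.
payload = {
    "model": "Qwen/Qwen3-Reranker-0.6B",
    "query": "What is the capital of France?",  # assumed field name
    "items": [
        "Paris is the capital of France.",
        "Berlin is the capital of Germany.",
    ],
    "label_token_ids": [9454, 2753],  # e.g. ids for "yes" / "no" (model-specific)
}
print(json.dumps(payload, indent=2))
```

Sending the same body with "items": [] should reproduce the 400 "items must contain at least one item." error shown below.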
Test Results
Generative score response (Qwen3-Reranker-0.6B):

{
  "id": "generative-score-bbd65dabc2dd9206",
  "object": "list",
  "created": 1774308947,
  "model": "/shared/public/elr-models/Qwen/Qwen3-Reranker-0.6B/",
  "data": [
    {"index": 0, "object": "score", "score": 0.5621765008857981},
    {"index": 1, "object": "score", "score": 0.8267117940706734},
    {"index": 2, "object": "score", "score": 0.18242552380635632},
    {"index": 3, "object": "score", "score": 0.23651623644570763},
    {"index": 4, "object": "score", "score": 0.4610167793123159}
  ],
  "usage": {
    "prompt_tokens": 47,
    "total_tokens": 52,
    "completion_tokens": 5,
    "prompt_tokens_details": null
  }
}

Empty items error:

{
  "error": {
    "message": "items must contain at least one item.",
    "type": "BadRequestError",
    "param": null,
    "code": 400
  }
}

Pre-tokenized inputs work correctly:

{
  "id": "generative-score-8305257f1f114b7b",
  "object": "list",
  "created": 1774308973,
  "model": "/shared/public/elr-models/Qwen/Qwen3-Reranker-0.6B/",
  "data": [
    {"index": 0, "object": "score", "score": 0.2751297238231752},
    {"index": 1, "object": "score", "score": 0.05921025074128593}
  ],
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 10,
    "completion_tokens": 2,
    "prompt_tokens_details": null
  }
}

Future Optimizations
Currently we need to actually generate a token to get the logprobs (max_tokens=1). In the future we would like to support max_tokens=0 and compute the logprobs without going through a decode phase, making this workload prefill-only.

Essential Elements of an Effective PR Description Checklist
- Update supported_models.md and examples for a new model.
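For completeness, the list-shaped responses shown in Test Results can be consumed by sorting the data entries by score. A small sketch, using a trimmed copy of the sample response above:

```python
import json

# Trimmed copy of the sample response from Test Results.
response_text = """
{
  "object": "list",
  "data": [
    {"index": 0, "object": "score", "score": 0.5621765008857981},
    {"index": 1, "object": "score", "score": 0.8267117940706734},
    {"index": 2, "object": "score", "score": 0.18242552380635632}
  ]
}
"""
resp = json.loads(response_text)
# Rank items by score, highest first.
ranked = [d["index"] for d in
          sorted(resp["data"], key=lambda d: d["score"], reverse=True)]
print(ranked)  # [1, 0, 2]
```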