[Generative Score API] Scoring (Prefill-only) optimizations. #9748

Merged
hnyls2002 merged 9 commits into sgl-project:main from sundar24295s:suramach/overlap
Sep 13, 2025
Conversation


@sundar24295s sundar24295s commented Aug 28, 2025

🚀 Motivation

  • Follow-up to PR #8840 to reduce latency and increase throughput for the generative score API.
  • Scoring only needs the next-token distribution after the full prompt, not per-token logprobs inside the prompt or any sampling.

Performance Impact:

  • On Qwen3-0.6B with 300 input tokens, at QPS 100 and 10 items per request, P99 latency improved from 6220 ms to 454 ms (~13.7× faster, ~92.7% reduction) with this PR.
  • With a P99 latency threshold of 500 ms, throughput increased from 800 to 1000 items/s per H100 GPU (~25% increase).

🔧 Modifications

⚡ Optimization 1: Skip Input Token Logprobs Computation

For scoring requests like:

# Item 1:
full_prompt = "What is the capital of California? Answer Yes or No for each of the following options: Sacramento"
# Item 2: 
full_prompt = "What is the capital of California? Answer Yes or No for each of the following options: San Jose"
  • We do not need per-token input logprobs such as P(California | What is the capital of); we only need the next-token distribution after the full prompt, namely P(Yes | full_prompt) and P(No | full_prompt).
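As a toy illustration of what the score API computes (a pure-Python stand-in; the real implementation operates on batched GPU logits inside sglang), the label scores are just softmax entries of the next-token distribution after the full prompt:

```python
import math

def score_next_token(logits, label_token_ids):
    """Given next-token logits after the full prompt, return the
    probability of each label token (softmax over the vocab)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [exps[t] / z for t in label_token_ids]

# Toy 4-token vocab; ids 0 ("Yes") and 1 ("No") are the label tokens.
probs = score_next_token([2.0, 1.0, 0.5, 0.1], [0, 1])
```

Nothing inside the prompt is scored, which is why the per-token input logprob pass can be skipped entirely.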

⚡ Optimization 2: Skip Sampling Step

  • Added Sampler.compute_logprobs_only(), which computes logprobs without performing any sampling for prefill-only scoring requests.
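A minimal sketch of the idea behind the logprobs-only path (illustrative pure Python; the actual Sampler.compute_logprobs_only works on batched GPU tensors): log-softmax the final logits and skip the sampling kernels entirely, since no token is ever generated.

```python
import math

def compute_logprobs_only(logits):
    """Log-softmax over the vocab. No multinomial/argmax sampling step,
    so no sampled token ids and no sampling-related GPU->CPU sync."""
    m = max(logits)  # log-sum-exp with max subtraction for stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]
```

Exponentiating the returned values recovers a proper probability distribution, which is all the score API needs.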

⚡ Optimization 3: Delayed GPU→CPU Copy with Overlap

  • Added get_token_ids_logprobs_batch_optimized(logprobs, token_ids, delay_cpu_copy=True), which performs a single vectorized gather on the GPU for the entire batch and optionally defers the .tolist() GPU→CPU copy until result processing, improving overlap with the next batch's compute.
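A hedged sketch of the batched gather with a delayed host copy (function name, shapes, and CPU tensors are illustrative; the real helper runs on GPU tensors inside sglang):

```python
import torch

def get_token_ids_logprobs_batch(logprobs, token_ids, delay_cpu_copy=True):
    """One vectorized gather for the whole batch instead of a per-item loop.

    logprobs:  [batch, vocab] tensor of log-probabilities.
    token_ids: [batch, num_labels] label token ids per item.

    With delay_cpu_copy=True the result stays a tensor, so the blocking
    device->host .tolist() sync can be deferred until results are actually
    processed, letting the scheduler overlap it with the next batch.
    """
    idx = torch.as_tensor(token_ids, device=logprobs.device)
    gathered = torch.gather(logprobs, 1, idx)  # [batch, num_labels]
    return gathered if delay_cpu_copy else gathered.tolist()
```

The key point is that `.tolist()` forces a device synchronization; keeping the gathered values as a tensor until the response is assembled avoids stalling kernel launches for the following batch.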

Accuracy Tests

  • Scores before this PR
$ curl -X POST "http://localhost:30000/v1/score"   -H "Content-Type: application/json"   -d '{
    "query": "What is the capital of California? Answer Yes or No for each of the following options:",
    "items": ["Scaramento", "San Jose", "San Francisco"],
    "label_token_ids": [9454, 2753],
    "model": "/shared/public/elr-models/Qwen/Qwen3-0.6B/c1899de289a04d12100db370d81485cdf75e47ca"           
  }'
{"scores":[[4.234364670752685e-06,1.2348638303110892e-05],[7.162677269222586e-05,0.0003160321422383422],[0.0001203001937164321,0.00030480667807191645]],"model":"/shared/public/elr-models/Qwen/Qwen3-0.6B/c1899de289a04d12100db370d81485cdf75e47ca","usage":null,"object":"scoring"}
  • Scores after this PR
$ curl -X POST "http://localhost:30000/v1/score"   -H "Content-Type: application/json"   -d '{
    "query": "What is the capital of California? Answer Yes or No for each of the following options:",
    "items": ["Scaramento", "San Jose", "San Francisco"],
    "label_token_ids": [9454, 2753],
    "model": "/shared/public/elr-models/Qwen/Qwen3-0.6B/c1899de289a04d12100db370d81485cdf75e47ca"           
  }'
{"scores":[[4.234364670752685e-06,1.2348638303110892e-05],[7.162677269222586e-05,0.0003160321422383422],[0.0001203001937164321,0.00030480667807191645]],"model":"/shared/public/elr-models/Qwen/Qwen3-0.6B/c1899de289a04d12100db370d81485cdf75e47ca","usage":null,"object":"scoring"}

Benchmarking and Profiling

🧪 Benchmark Comparison: Qwen3-0.6B on H100 (CUDA 12.8)

Setup:

  • Model: Qwen3-0.6B
  • Prompt length: 300 tokens
  • Hardware: H100 GPU
  • Duration: 120s
  • Target RPS: 70, 80, 90, 100
  • Item Count: 10 per request
  • Distribution: Poisson

Server Start:

(sglang) jobuser [ /shared/user/repos3/sglang ]$ python -m sglang.launch_server --model-path Qwen/Qwen3-0.6B --port 30000 --host 0.0.0.0 --chunked-prefill-size -1 --enable-torch-compile --dtype float16 --max-prefill-tokens 31000 --mem-fraction-static 0.5 --enable-tokenizer-batch-encode --disable-radix-cache --disable-cuda-graph

Benchmark Script:

python3.10 sglang/benchmark/score/bench_score.py

🔍 Summary of Improvement

| Items Per Second | Baseline P99 Latency (ms) | This PR P99 Latency (ms) |
| --- | --- | --- |
| 600 | 226.00 | 139.16 |
| 700 | 282.21 | 193.78 |
| 800 | 413.14 | 227.20 |
| 900 | 1200.72 | 302.39 |
| 1000 | 6220.00 | 454.20 |
| 1100 | 8694.97 | 1459.81 |
| 1200 | 11606.46 | 6406.18 |

Profiling

  • Baseline profile showing that sampling and logits extraction cause memsyncs and delay the next batch's scheduling:
[Profiler screenshot: baseline trace]
  • Profile after the optimizations showing almost zero gap in GPU kernel launches between batches under high load:
[Profiler screenshot: optimized trace]

Checklist


@fortunecookiee fortunecookiee left a comment


LGTM!

@sundar24295s
Collaborator Author

Moved the benchmarking scripts to a separate PR; will rebase it on this.

@hnyls2002 hnyls2002 merged commit a360511 into sgl-project:main Sep 13, 2025
128 of 140 checks passed