[Generative Score API] Scoring (Prefill-only) optimizations. #9748

Merged
hnyls2002 merged 9 commits into sgl-project:main from sundar24295s:suramach/overlap
Sep 13, 2025
Conversation


@sundar24295s sundar24295s commented Aug 28, 2025

🚀 Motivation

  • Follow-up to PR #8840 to reduce latency and increase throughput for the generative score API.
  • Scoring only needs the next-token distribution after the full prompt, not per-token logprobs inside the prompt or any sampling.

Performance Impact:

  • On Qwen3-0.6B with 300 input tokens, at QPS 100 and 10 items per request, P99 latency improved from 6220 ms to 454 ms (~13.7× faster, ~92.7% reduction) with this PR.
  • With a P99 latency threshold of 500 ms, throughput increased from 800 to 1000 items/s per H100 GPU (~25% increase).

🔧 Modifications

⚡ Optimization 1: Skip Input Token Logprobs Computation

For scoring requests like:

# Item 1:
full_prompt = "What is the capital of California? Answer Yes or No for each of the following options: Sacramento"
# Item 2: 
full_prompt = "What is the capital of California? Answer Yes or No for each of the following options: San Jose"
  • We do not need per-token input logprobs such as P(California | What is the capital of); we only need the next-token distribution after the full prompt, namely P(Yes | full_prompt) and P(No | full_prompt).
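As a toy illustration of what the score API computes (a pure-Python stand-in; the real implementation operates on batched GPU logits inside sglang), the label scores are just softmax entries of the next-token distribution after the full prompt:

```python
import math

def score_next_token(logits, label_token_ids):
    """Given next-token logits after the full prompt, return the
    probability of each label token (softmax over the vocab)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [exps[t] / z for t in label_token_ids]

# Toy 4-token vocab; ids 0 ("Yes") and 1 ("No") are the label tokens.
probs = score_next_token([2.0, 1.0, 0.5, 0.1], [0, 1])
```

Nothing inside the prompt is scored, which is why the per-token input logprob pass can be skipped entirely.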

⚡ Optimization 2: Skip Sampling Step

  • Added Sampler.compute_logprobs_only(), which computes logprobs without performing any sampling for prefill-only scoring requests.
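A minimal sketch of the idea behind the logprobs-only path (illustrative pure Python; the actual Sampler.compute_logprobs_only works on batched GPU tensors): log-softmax the final logits and skip the sampling kernels entirely, since no token is ever generated.

```python
import math

def compute_logprobs_only(logits):
    """Log-softmax over the vocab. No multinomial/argmax sampling step,
    so no sampled token ids and no sampling-related GPU->CPU sync."""
    m = max(logits)  # log-sum-exp with max subtraction for stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]
```

Exponentiating the returned values recovers a proper probability distribution, which is all the score API needs.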

⚡ Optimization 3: Delayed GPU→CPU Copy with Overlap

  • Added get_token_ids_logprobs_batch_optimized(logprobs, token_ids, delay_cpu_copy=True), which performs a single vectorized gather on the GPU for the entire batch and optionally defers the .tolist() GPU→CPU copy until result processing, improving overlap with the next batch's compute.
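A hedged sketch of the batched gather with a delayed host copy (function name, shapes, and CPU tensors are illustrative; the real helper runs on GPU tensors inside sglang):

```python
import torch

def get_token_ids_logprobs_batch(logprobs, token_ids, delay_cpu_copy=True):
    """One vectorized gather for the whole batch instead of a per-item loop.

    logprobs:  [batch, vocab] tensor of log-probabilities.
    token_ids: [batch, num_labels] label token ids per item.

    With delay_cpu_copy=True the result stays a tensor, so the blocking
    device->host .tolist() sync can be deferred until results are actually
    processed, letting the scheduler overlap it with the next batch.
    """
    idx = torch.as_tensor(token_ids, device=logprobs.device)
    gathered = torch.gather(logprobs, 1, idx)  # [batch, num_labels]
    return gathered if delay_cpu_copy else gathered.tolist()
```

The key point is that `.tolist()` forces a device synchronization; keeping the gathered values as a tensor until the response is assembled avoids stalling kernel launches for the following batch.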

Accuracy Tests

  • Scores before this PR
$ curl -X POST "http://localhost:30000/v1/score"   -H "Content-Type: application/json"   -d '{
    "query": "What is the capital of California? Answer Yes or No for each of the following options:",
    "items": ["Scaramento", "San Jose", "San Francisco"],
    "label_token_ids": [9454, 2753],
    "model": "/shared/public/elr-models/Qwen/Qwen3-0.6B/c1899de289a04d12100db370d81485cdf75e47ca"           
  }'
{"scores":[[4.234364670752685e-06,1.2348638303110892e-05],[7.162677269222586e-05,0.0003160321422383422],[0.0001203001937164321,0.00030480667807191645]],"model":"/shared/public/elr-models/Qwen/Qwen3-0.6B/c1899de289a04d12100db370d81485cdf75e47ca","usage":null,"object":"scoring"}
  • Scores after this PR
$ curl -X POST "http://localhost:30000/v1/score"   -H "Content-Type: application/json"   -d '{
    "query": "What is the capital of California? Answer Yes or No for each of the following options:",
    "items": ["Scaramento", "San Jose", "San Francisco"],
    "label_token_ids": [9454, 2753],
    "model": "/shared/public/elr-models/Qwen/Qwen3-0.6B/c1899de289a04d12100db370d81485cdf75e47ca"           
  }'
{"scores":[[4.234364670752685e-06,1.2348638303110892e-05],[7.162677269222586e-05,0.0003160321422383422],[0.0001203001937164321,0.00030480667807191645]],"model":"/shared/public/elr-models/Qwen/Qwen3-0.6B/c1899de289a04d12100db370d81485cdf75e47ca","usage":null,"object":"scoring"}

Benchmarking and Profiling

🧪 Benchmark Comparison: Qwen3-0.6B on H100 (CUDA 12.8)

Setup:

  • Model: Qwen3-0.6B
  • Prompt length: 300 tokens
  • Hardware: H100 GPU
  • Duration: 120s
  • Target RPS: 70, 80, 90, 100
  • Item Count: 10 per request
  • Distribution: Poisson

Server Start:

(sglang) jobuser [ /shared/user/repos3/sglang ]$ python -m sglang.launch_server --model-path Qwen/Qwen3-0.6B --port 30000 --host 0.0.0.0 --chunked-prefill-size -1 --enable-torch-compile --dtype float16 --max-prefill-tokens 31000 --mem-fraction-static 0.5 --enable-tokenizer-batch-encode --disable-radix-cache --disable-cuda-graph

Benchmark Script:

python3.10 sglang/benchmark/score/bench_score.py

🔍 Summary of Improvement

| Items Per Second | Baseline P99 Latency (ms) | This PR P99 Latency (ms) |
| --- | --- | --- |
| 600 | 226.00 | 139.16 |
| 700 | 282.21 | 193.78 |
| 800 | 413.14 | 227.20 |
| 900 | 1200.72 | 302.39 |
| 1000 | 6220.00 | 454.20 |
| 1100 | 8694.97 | 1459.81 |
| 1200 | 11606.46 | 6406.18 |

Profiling

  • Baseline profile showing that sampling and logits extraction cause memsyncs and delay the next batch's scheduling:
[Profiler screenshot: baseline trace]
  • Profile after the optimizations showing almost zero gap in GPU kernel launches between batches under high load:
[Profiler screenshot: optimized trace]

Checklist


@fortunecookiee fortunecookiee left a comment


LGTM!

@sundar24295s
Collaborator Author

Moved the benchmarking scripts to a separate PR; will rebase it on this.

@hnyls2002 hnyls2002 merged commit a360511 into sgl-project:main Sep 13, 2025
128 of 140 checks passed