[Generative Score API] Fix on prefill-only scheduler running batch loss track problem by haNa-meister · Pull Request #14320 · sgl-project/sglang

haNa-meister · 2025-12-02T22:56:17Z

Motivation

Metric missing problem

The previous prefill-only change (PR) skips the decode scheduling stage to get higher throughput. However, its implementation also skips merging last_batch into running_batch, which leaves running_batch always empty.

Affects load monitor on sgl model gateway

The sgl model gateway uses the get_load API to fetch load information from the sglang server. During a prefill benchmark, however, it reports 0 requests and 0 tokens in use. Below is an example of polling get_load during that period:

01:18:50.658    10621.4ms [{"rid":null,"http_worker_ipc":null,"dp_rank":null,"num_reqs":0,"num_waiting_reqs":0,"num_tokens":0,"ts_tic":2345401.308139087}]
01:18:50.708    10671.5ms [{"rid":null,"http_worker_ipc":null,"dp_rank":null,"num_reqs":0,"num_waiting_reqs":0,"num_tokens":0,"ts_tic":2345401.358289817}]
01:18:50.758    10721.8ms [{"rid":null,"http_worker_ipc":null,"dp_rank":null,"num_reqs":0,"num_waiting_reqs":0,"num_tokens":0,"ts_tic":2345401.408551827}]
01:18:50.808    10771.7ms

Same polling after this fix:

05:10:54.087      472.4ms [{"rid":null,"http_worker_ipc":null,"dp_rank":null,"num_reqs":0,"num_waiting_reqs":0,"num_tokens":0,"ts_tic":2359324.756161275}]
05:10:54.137      509.3ms [{"rid":null,"http_worker_ipc":null,"dp_rank":null,"num_reqs":12,"num_waiting_reqs":12,"num_tokens":6625,"ts_tic":2359324.790857924}]
05:10:54.187      618.7ms [{"rid":null,"http_worker_ipc":null,"dp_rank":null,"num_reqs":14,"num_waiting_reqs":1,"num_tokens":1530,"ts_tic":2359324.902086325}]
05:10:54.255      657.7ms [{"rid":null,"http_worker_ipc":null,"dp_rank":null,"num_reqs":2,"num_waiting_reqs":0,"num_tokens":507,"ts_tic":2359324.924529118}]
05:10:54.305      697.1ms [{"rid":null,"http_worker_ipc":null,"dp_rank":null,"num_reqs":7,"num_waiting_reqs":7,"num_tokens":4102,"ts_tic":2359324.980895537}]
05:10:54.355      799.7ms [{"rid":null,"http_worker_ipc":null,"dp_rank":null,"num_reqs":15,"num_waiting_reqs":0,"num_tokens":0,"ts_tic":2359325.083107827}]
05:10:54.436      844.2ms [{"rid":null,"http_worker_ipc":null,"dp_rank":null,"num_reqs":0,"num_waiting_reqs":0,"num_tokens":511,"ts_tic":2359325.099850724}]
05:10:54.486      851.3ms [{"rid":null,"http_worker_ipc":null,"dp_rank":null,"num_reqs":0,"num_waiting_reqs":0,"num_tokens":0,"ts_tic":2359325.135359237}]
05:10:54.536      976.0ms
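
As a rough illustration of how a client might read these polls, the sketch below sums the counters in one get_load response body. The field names are taken from the log lines above; the helper name `summarize_load` and the `idle` flag are assumptions made for this example and are not part of sglang or the gateway.

```python
import json


def summarize_load(payload: str) -> dict:
    """Sum the load counters over all entries of one get_load response body."""
    entries = json.loads(payload)
    totals = {"num_reqs": 0, "num_waiting_reqs": 0, "num_tokens": 0}
    for entry in entries:
        for key in totals:
            # Treat a missing or null counter as 0.
            totals[key] += entry.get(key) or 0
    # An all-zero snapshot in the middle of a benchmark is the symptom described above.
    totals["idle"] = all(value == 0 for value in totals.values())
    return totals


# Payloads copied from the polls above (first: before the fix, second: after the fix).
before = '[{"rid":null,"http_worker_ipc":null,"dp_rank":null,"num_reqs":0,"num_waiting_reqs":0,"num_tokens":0,"ts_tic":2345401.308139087}]'
after = '[{"rid":null,"http_worker_ipc":null,"dp_rank":null,"num_reqs":12,"num_waiting_reqs":12,"num_tokens":6625,"ts_tic":2359324.790857924}]'

before_summary = summarize_load(before)
after_summary = summarize_load(after)
```

With the pre-fix payload the summary is flagged as idle even though the server is under load; with the post-fix payload the request and token counts are visible.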

Logging on model gateway:

PowerOfTwo policies: {"http://localhost:8080": 0, "http://localhost:8082": 0, "http://localhost:8083": 0, "http://localhost:8081": 0}
2026-02-26 05:23:47 DEBUG smg::core::worker_manager: /home/jobuser/learn/sglang/sgl-model-gateway/src/core/worker_manager.rs:408: Fetched loads from 4 workers, updating 1 PowerOfTwo policies: {"http://localhost:8080": 510, "http://localhost:8083": 1535, "http://localhost:8082": 512, "http://localhost:8081": 1032}
2026-02-26 05:23:48 DEBUG smg::core::worker_manager: /home/jobuser/learn/sglang/sgl-model-gateway/src/core/worker_manager.rs:408: Fetched loads from 4 workers, updating 1 PowerOfTwo policies: {"http://localhost:8082": 510, "http://localhost:8083": 0, "http://localhost:8080": 2562, "http://localhost:8081": 0}
2026-02-26 05:23:49 DEBUG smg::core::worker_manager: /home/jobuser/learn/sglang/sgl-model-gateway/src/core/worker_manager.rs:408: Fetched loads from 4 workers, updating 1 PowerOfTwo policies: {"http://localhost:8083": 2040, "http://localhost:8082": 511, "http://localhost:8080": 0, "http://localhost:8081": 3603}
2026-02-26 05:23:50 DEBUG smg::core::worker_manager: /home/jobuser/learn/sglang/sgl-model-gateway/src/core/worker_manager.rs:408: Fetched loads from 4 workers, updating 1 PowerOfTwo policies: {"http://localhost:8081": 0, "http://localhost:8082": 1029, "http://localhost:8083": 0, "http://localhost:8080": 1528}
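
To show why all-zero loads matter for a PowerOfTwo policy, here is a minimal Python sketch of the general power-of-two-choices technique. It is not the gateway's Rust implementation; the function name `pick_worker` and the tie-breaking rule are assumptions for illustration only.

```python
import random


def pick_worker(loads: dict[str, int], rng: random.Random) -> str:
    """Power-of-two-choices: sample two distinct workers and return the less loaded one."""
    first, second = rng.sample(sorted(loads), 2)
    # On a tie the first sampled worker wins, so equal loads give a purely random pick.
    return first if loads[first] <= loads[second] else second


rng = random.Random(0)

# Loads taken from one of the gateway log lines above (after the fix).
loads = {
    "http://localhost:8080": 510,
    "http://localhost:8081": 1032,
    "http://localhost:8082": 512,
    "http://localhost:8083": 1535,
}
picks = [pick_worker(loads, rng) for _ in range(200)]

# Loads as reported before the fix: every worker looks empty.
zero_loads = {url: 0 for url in loads}
zero_picks = [pick_worker(zero_loads, rng) for _ in range(200)]
```

With real loads, the most heavily loaded worker never wins a comparison. With all loads at 0, every comparison is a tie and the policy can no longer steer traffic away from busy workers.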

Metrics

Fixes the problem that sglang:num_running_reqs is always 0.0.

Safe mechanism

Because running_batch is always empty, running_lens is always 0. As a result, many safety mechanisms in the scheduler are never enabled, for example: link, link

Modifications

Move the skip-decode logic for prefill-only batches into the run-decode branch.
Filter running_batch in each scheduling loop so that finished requests are no longer tracked.
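
The two modifications above can be pictured with a toy scheduling loop. This is a simplified sketch, not the sglang scheduler: the classes `Req`, `Batch` and `ToyScheduler`, and all of their attributes, are hypothetical stand-ins chosen to mirror the names used in this description.

```python
from dataclasses import dataclass, field


@dataclass
class Req:
    rid: str
    finished: bool = False


@dataclass
class Batch:
    reqs: list = field(default_factory=list)

    def filter_finished(self) -> None:
        """Drop requests that have already finished."""
        self.reqs = [r for r in self.reqs if not r.finished]


class ToyScheduler:
    def __init__(self, prefill_only: bool):
        self.prefill_only = prefill_only
        self.running_batch = Batch()
        self.last_batch = None
        self.decode_steps = 0

    def step(self, new_batch=None) -> int:
        # Always merge the previous prefill batch, so running requests stay
        # visible to metrics and load reporting.
        if self.last_batch is not None:
            self.running_batch.reqs.extend(self.last_batch.reqs)
        self.last_batch = new_batch

        if self.prefill_only:
            # Prefill-only requests never go through decode, so finished
            # ones have to be filtered out here and decode is skipped.
            self.running_batch.filter_finished()
        elif self.running_batch.reqs:
            self.decode_steps += 1  # stand-in for a real decode step

        # What a num_running_reqs-style gauge would report.
        return len(self.running_batch.reqs)


sched = ToyScheduler(prefill_only=True)
a, b = Req("a"), Req("b")
first = sched.step(Batch([a, b]))  # prefill is scheduled, nothing merged yet
second = sched.step()              # previous batch is merged into running_batch
a.finished = True
third = sched.step()               # finished request is filtered out
```

In this toy model the gauge reads 0, then 2, then 1, and no decode step is ever counted for the prefill-only scheduler.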

Accuracy Tests

Test env:
GPU: H100.
Model: Qwen3-0.6B.

Running on this PR

curl -X POST "http://localhost:8080/v1/score"   -H "Content-Type: application/json"   -d '{
    "query": "What is the capital of California? Answer Yes or No for each of the following options:",
    "items": ["Scaramento", "San Jose", "San Francisco"],
    "label_token_ids": [9454, 2753],
  }'
{"scores":[[0.00014663469283989953,6.92653243690209e-05],[0.00016726256415974512,5.4302204151145536e-05],[0.0002321074899909774,5.179018141333686e-05]], ...

Running on last version

curl -X POST "http://localhost:8080/v1/score"   -H "Content-Type: application/json"   -d '{
    "query": "What is the capital of California? Answer Yes or No for each of the following options:",
    "items": ["Scaramento", "San Jose", "San Francisco"],
    "label_token_ids": [9454, 2753],
  }'
{"scores":[[0.00014663469283989953,6.92653243690209e-05],[0.00016726256415974512,5.4302204151145536e-05],[0.0002321074899909774,5.179018141333686e-05]],"model":"...,"usage":null,"object":"scoring"}

Metrics

For prefill-only requests, sglang:num_running_reqs is always 0.0, because it tracks the length of running_batch. Below are example metrics API responses during the benchmark period.

Metrics before this PR

# TYPE sglang:num_running_reqs gauge
sglang:num_running_reqs{engine_type="unified",model_name="qwen",pp_rank="0",tp_rank="0"} 0.0
...
# TYPE sglang:num_running_reqs gauge
sglang:num_running_reqs{engine_type="unified",model_name="qwen",pp_rank="0",tp_rank="0"} 0.0

Metrics with this PR

# TYPE sglang:num_running_reqs gauge
sglang:num_running_reqs{engine_type="unified",model_name="qwen",pp_rank="0",tp_rank="0"} 30.0

# TYPE sglang:num_running_reqs gauge
sglang:num_running_reqs{engine_type="unified",model_name="qwen",pp_rank="0",tp_rank="0"} 0.0
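
If you want to check the gauge programmatically rather than by eye, a small parser for these text lines is enough. The sketch below handles only simple `name{labels} value` lines like the ones shown here; the helper name `read_gauge` is an assumption, and a real setup would more likely use a Prometheus client or server.

```python
import re

# Matches lines of the form: name{label="x",...} value
SAMPLE_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)


def read_gauge(text: str, name: str) -> list:
    """Return every value reported for the metric `name` in a metrics response."""
    values = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and "# TYPE" / "# HELP" comments
        match = SAMPLE_LINE.match(line)
        if match and match.group("name") == name:
            values.append(float(match.group("value")))
    return values


# Sample copied from the "Metrics with this PR" output above.
sample = '''# TYPE sglang:num_running_reqs gauge
sglang:num_running_reqs{engine_type="unified",model_name="qwen",pp_rank="0",tp_rank="0"} 30.0
'''
running = read_gauge(sample, "sglang:num_running_reqs")
```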

Benchmarking and Profiling

Benchmark env:
GPU: H100.
Model: Qwen3-0.6B.
QPS: 160.
Items per request: 10.
Tokens per query: 120.
Tokens per item: 180.

Running on this PR

Overall Summary for RPS 160, Duration 60s, Item Count 10:
  Test duration:         60 seconds
  Server type:           HTTP
  HTTP mode:             SCORE
  Target RPS:            160
  Item count:            10
  Distribution:          POISSON
  Unique requests generated: 100
  Total requests sent:   9600
  Successful requests:   7162
  Failed requests:       2438
  Overall successful items/sec: 1197.76
  Time to send all requests: 59.73 seconds
  Time for all requests to complete: 59.80 seconds
  Average response time: 120.51 ms
  P50 response time:     118.85 ms
  P90 response time:     161.84 ms
  P95 response time:     175.62 ms
  P99 response time:     255.14 ms

Running on last version

Overall Summary for RPS 160, Duration 60s, Item Count 10:
  Test duration:         60 seconds
  Server type:           HTTP
  HTTP mode:             SCORE
  Target RPS:            160
  Item count:            10
  Distribution:          POISSON
  Unique requests generated: 100
  Total requests sent:   9600
  Successful requests:   6760
  Failed requests:       2840
  Overall successful items/sec: 1133.12
  Time to send all requests: 59.58 seconds
  Time for all requests to complete: 59.66 seconds
  Average response time: 130.39 ms
  P50 response time:     131.42 ms
  P90 response time:     174.63 ms
  P95 response time:     186.60 ms
  P99 response time:     249.31 ms
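
To make the comparison between the two runs explicit, the following sketch computes success rates and relative changes from the figures reported above. The helper `relative_change` is introduced for this example only, and each configuration was run once here, so small differences may be noise.

```python
def relative_change(new: float, old: float) -> float:
    """Relative change from `old` to `new`, in percent."""
    return (new - old) / old * 100.0


total_sent = 9600  # same for both runs

# Figures copied from the two summaries above.
this_pr = {"ok": 7162, "items_per_sec": 1197.76, "avg_ms": 120.51, "p99_ms": 255.14}
last_version = {"ok": 6760, "items_per_sec": 1133.12, "avg_ms": 130.39, "p99_ms": 249.31}

success_rate_pr = this_pr["ok"] / total_sent
success_rate_last = last_version["ok"] / total_sent
throughput_change = relative_change(this_pr["items_per_sec"], last_version["items_per_sec"])
avg_latency_change = relative_change(this_pr["avg_ms"], last_version["avg_ms"])
p99_latency_change = relative_change(this_pr["p99_ms"], last_version["p99_ms"])

print(f"success rate: {success_rate_pr:.1%} vs {success_rate_last:.1%}")
print(f"items/sec change: {throughput_change:+.1f}%")
print(f"average latency change: {avg_latency_change:+.1f}%")
print(f"P99 latency change: {p99_latency_change:+.1f}%")
```

On these numbers, this PR completes roughly 74.6% of requests against 70.4% for the last version, with about 5.7% more successful items per second and about 7.6% lower average response time, while P99 response time is slightly higher.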

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.
Work with maintainers to merge your PR. See the PR Merge Process

gemini-code-assist · 2025-12-02T22:56:33Z

Summary of Changes

Hello @haNa-meister, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical issue in the generative score API's scheduler where prefill-only requests were not being properly tracked. This led to inaccurate sglang:num_running_reqs metrics and disabled important safety mechanisms. The changes refactor the batch handling and decode skipping logic for prefill-only requests, ensuring correct metric reporting, re-enabling safety checks, and demonstrating improved throughput and success rates in benchmarks.

Highlights

Corrected Metric Tracking: Addressed an issue where sglang:num_running_reqs was consistently reporting 0.0 for prefill-only requests, now accurately reflecting the number of active requests.
Re-enabled Safety Mechanisms: Fixed a problem where critical safety mechanisms in the scheduler were not being activated due to running_batch being empty for prefill-only requests.
Refined Prefill-Only Logic: Moved the logic for skipping decode steps for prefill-only batches into the run decode branch, ensuring these batches are correctly processed without unnecessary decode operations.
Improved Batch Filtering: Implemented a mechanism to filter out finished requests from running_batch in each scheduling loop specifically for prefill-only batches, maintaining an accurate state of active requests.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request correctly fixes an issue where prefill-only requests were not being tracked in the running_batch, leading to incorrect metrics and disabled safety mechanisms. The changes are well-reasoned and implemented cleanly. By allowing prefill-only batches to be merged into the running_batch and then explicitly skipping the decode step for them, the core issue is resolved. The addition of a manual filtering step for prefill-only running batches is a necessary and correct adjustment to ensure finished requests are properly cleaned up. The provided benchmarks also indicate a slight performance improvement, which is a great result. The code is clear and the changes are solid.

sundar24295s

LGTM! Thanks for fixing this

Scheduler (python):
- Allow prefill-only batches to merge into running_batch so num_running_reqs reports correctly (was always 0)
- Add filter_batch() for prefill-only running batches to clean up finished requests that won't go through decode
- Guard decode step to skip prefill-only batches
(Backport of PR sgl-project#14320)

Router (rust):
- Set load monitor polling interval to 1s (decoupled from worker_startup_check_interval_secs which stays at 5s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

sundar24295s · 2026-03-04T00:59:34Z

/tag-and-rerun-ci

sundar24295s · 2026-03-04T19:25:28Z

/tag-and-rerun-ci

hnyls2002 · 2026-03-11T20:15:39Z

All prefill-only tests passed.

…ss track problem (sgl-project#14320) Co-authored-by: Wenyan Yao <wenyao@linkedin.com> Co-authored-by: Sundara Raman Ramachandran <sundar24295@gmail.com>

haNa-meister requested review from Ying1123, hnyls2002, merrymercy, xiezhq-hermann and zhyncs as code owners December 2, 2025 22:56

gemini-code-assist bot reviewed Dec 2, 2025

haNa-meister force-pushed the main branch from 801baeb to b10c608 Compare December 9, 2025 07:23

sundar24295s added the run-ci label Dec 10, 2025

haNa-meister force-pushed the main branch from b10c608 to 79460b6 Compare December 11, 2025 01:14

sundar24295s approved these changes Dec 11, 2025

merrymercy approved these changes Dec 17, 2025

sundar24295s enabled auto-merge (squash) December 17, 2025 23:42

auto-merge was automatically disabled December 19, 2025 21:23
Head branch was pushed to by a user without write access

haNa-meister force-pushed the main branch from 140c6ac to 5eb5c20 Compare December 19, 2025 21:23

hnyls2002 enabled auto-merge (squash) December 26, 2025 12:10

fix prefill skip tracking running batch problem

a6ed979

auto-merge was automatically disabled February 26, 2026 22:03
Head branch was pushed to by a user without write access

haNa-meister force-pushed the main branch from 68b1b97 to a6ed979 Compare February 26, 2026 22:03

Merge branch 'main' into main

8cc6203

Merge branch 'main' into main

ddb8461

sundar24295s enabled auto-merge (squash) March 4, 2026 21:31

sundar24295s and others added 4 commits March 4, 2026 13:33

Merge branch 'main' into main

4b82520

Merge branch 'main' into main

2a77bb8

Merge branch 'main' into main

697e6b1

Merge branch 'sgl-project:main' into main

427bfeb

hnyls2002 approved these changes Mar 11, 2026

hnyls2002 disabled auto-merge March 11, 2026 20:15

hnyls2002 merged commit 252ef90 into sgl-project:main Mar 11, 2026
186 of 246 checks passed

happierpig mentioned this pull request Apr 1, 2026

scheduler: add prefill-only update in merge batch #21840

Merged

5 tasks

hnyls2002 merged 7 commits into sgl-project:main from haNa-meister:main

4 participants
