
Gateway supports dp rank scheduling and scheduling with the minimum number of tokens #16742

Closed
jiashaokun-1 wants to merge 1 commit into sgl-project:main from jiashaokun-1:main

Conversation

Contributor

@jiashaokun-1 jiashaokun-1 commented Jan 8, 2026

Motivation

#19268 added routed_dp_rank and disagg_prefill_dp_rank to support external DP dispatch in PD-disaggregation mode. This modification brings that capability to the gateway and schedules each request to the dp_rank with the minimum number of tokens.

In the current PD-disaggregation mode, the prefill instances only support the round_robin scheduling policy. Under this policy, the DP group a request is routed to is determined by bootstrap_room mod dp_size. This can lead to load imbalance among DP groups, and the imbalance becomes more pronounced when the input requests are variable-length sequences.
To address this, I have added a new feature that selects the DP group with the lightest load to process each request, achieving DP load balance in PD-disaggregation mode. The current load is measured by the number of tokens, which can be adjusted as needed in the future. The feature is enabled or disabled via the parameter dp_minimum_tokens_scheduler.

Key Changes

1. Load Collection

The /metrics interface is used instead of /get_load to obtain the instance load. A new WorkerLoadManager class has been added to manage the engine's load, storing the number of used tokens (num_used_tokens) for each DP group as the load.
The load query interface is modified, and an extract_gauge_metrics method is added to parse the Prometheus-format response.

        let load_url = format!("{}/metrics", url);
        let mut req = client.get(&load_url).timeout(REQUEST_TIMEOUT);
        if let Some(key) = api_key {
            req = req.bearer_auth(key);
        }

        match req.send().await {
            Ok(r) if r.status().is_success() => {
                if let Ok(text) = r.text().await {
                    return crate::core::metrics_manager::extract_gauge_metrics(text, "sglang_num_used_tokens");
                }
                HashMap::new()
            },
            _ => HashMap::new(),
        }

2. Load Management

A new WorkerLoadManager struct has been added to manage loads, providing three methods: update_dp_loads, get_lowest_dp_load, and load_increment.

#[derive(Debug, Default)]
pub struct WorkerLoadManager {
    // <worker, <dp_rank, loads>>
    dp_cached_loads: RwLock<HashMap<String, HashMap<isize, isize>>>,
}
  1. The update_dp_loads method updates the load. After periodically collecting load data, the LoadMonitor calls this method to perform the update.
  2. The get_lowest_dp_load method takes a worker as input and returns the dp_rank with the lowest load among that worker's DP groups.
  3. The load_increment method adds an increment to the load of a specific dp_rank of a worker. This prevents all requests from being scheduled to the same DP group during the interval between two load reports.
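The three methods can be sketched as follows. This is a simplified illustration, not the PR's code: it keys workers by URL string rather than the router's Worker type, and tie-breaking between equally loaded ranks is left to HashMap iteration order.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

#[derive(Debug, Default)]
pub struct WorkerLoadManager {
    // <worker url, <dp_rank, load in tokens>>
    dp_cached_loads: RwLock<HashMap<String, HashMap<isize, isize>>>,
}

impl WorkerLoadManager {
    // Replace a worker's per-rank loads after a periodic metrics poll.
    pub fn update_dp_loads(&self, worker: &str, loads: HashMap<isize, isize>) {
        self.dp_cached_loads
            .write()
            .unwrap()
            .insert(worker.to_string(), loads);
    }

    // Return the dp_rank with the smallest cached load for this worker.
    pub fn get_lowest_dp_load(&self, worker: &str) -> Option<isize> {
        self.dp_cached_loads
            .read()
            .unwrap()
            .get(worker)?
            .iter()
            .min_by_key(|(_, load)| **load)
            .map(|(rank, _)| *rank)
    }

    // Bump a rank's cached load so requests arriving between two load
    // reports do not all pile onto the same DP group.
    pub fn load_increment(&self, worker: &str, dp_rank: isize, delta: isize) {
        if let Some(loads) = self.dp_cached_loads.write().unwrap().get_mut(worker) {
            *loads.entry(dp_rank).or_insert(0) += delta;
        }
    }
}
```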

3. New Interfaces

A DPRankLoadPolicy trait and a MinimumTokensPolicy implementation are added. Given a worker, they select the dp_rank with the minimum number of tokens.

#[async_trait]
pub trait DPRankLoadPolicy: Send + Sync + Debug {
    async fn select_dp_rank(&self, worker: &dyn Worker, text_str: isize) -> Option<isize>;
}

#[async_trait]
impl DPRankLoadPolicy for MinimumTokensPolicy {
    async fn select_dp_rank(&self, worker: &dyn Worker, text_str: isize) -> Option<isize> {
        if let Some(worker_load) = self.worker_load_manager.as_ref() {
            let lowest_tokens_dp_rank = worker_load.get_lowest_dp_load(worker);
            if let Some(dp_rank) = lowest_tokens_dp_rank {
                worker_load.load_increment(worker, dp_rank, text_str);
            }
            return lowest_tokens_dp_rank;
        }
        None
    }
}

4. Configuration Parameters

  1. A new configuration parameter dp-minimum-tokens-scheduler has been added; when the flag is set, requests are scheduled to the minimum-load DP group.
  2. A new configuration parameter worker-load-check-interval has been added to specify the interval for load collection. Previously, the load collection interval reused the configuration of worker-startup-check-interval. Now, this new configuration item is separate from the startup check.

Benchmarking and Profiling

Comparing several scheduling strategies on variable-length datasets before and after enabling dp-minimum-tokens-scheduler, Mean TPOT improved by approximately 9%, and Mean TTFT (with max_out_len=1 set to eliminate the impact of decoding on prefill) improved by about 15%. The specific data is as follows:

| Policy | Mean TTFT (ms), baseline | Mean TTFT (ms), dp minimum tokens | TTFT gain | Mean TPOT (ms), baseline | Mean TPOT (ms), dp minimum tokens | TPOT gain |
|---|---|---|---|---|---|---|
| Random | 2631 | 2249 | +14.52% | 53.41 | 48.15 | +9.85% |
| Round_robin | 2646 | 2164 | +18.22% | 55.42 | 49.02 | +11.55% |
| Cache_aware | 2601 | 2216 | +14.80% | 52.67 | 48.48 | +7.96% |

TTFT was measured with max_out_len=1.

router

python -m sglang_router.launch_router   --pd-disaggregation   --prefill-policy cache_aware --dp-minimum-tokens-scheduler --worker-load-check-interval 1   --prefill grpc://141.61.29.204:6699   --decode grpc://127.0.0.1:7699   --model-path /home/weights/DeepSeek-R1_w8a8   --tokenizer-path /home/weights/DeepSeek-R1_w8a8   --host 0.0.0.0 --port 4567

prefill

python3 -m sglang.launch_server --model-path ${MODEL_PATH} --tp-size 1 --dp-size 2 --base-gpu-id 8 --disaggregation-mode prefill --trust-remote-code --attention-backend ascend --disaggregation-transfer-backend ascend --device npu --quantization modelslim --watchdog-timeout 9000 --host 141.61.29.204 --grpc-mode --metrics-port 10000 --port 6699 --cuda-graph-bs 8 16 24 28 32 36 --mem-fraction-static 0.71 --max-running-requests 144 --chunked-prefill-size -1  --dtype bfloat16 --load-balance-method round_robin --disable-overlap-schedule --enable-metrics

decode

python3 -m sglang.launch_server --model-path ${MODEL_PATH} --tp-size 1 --dp-size 2 --base-gpu-id 12 --disaggregation-mode decode --trust-remote-code --attention-backend ascend --disaggregation-transfer-backend ascend --device npu --quantization modelslim --watchdog-timeout 9000 --host 141.61.29.204 --grpc-mode --metrics-port 10001 --port 7699 --cuda-graph-bs 8 16 24 28 32 36 --mem-fraction-static 0.71 --max-running-requests 144 --chunked-prefill-size -1  --dtype bfloat16 --load-balance-method round_robin --disable-overlap-schedule --load-balance-method round_robin --prefill-round-robin-balance --enable-metrics

benchmark
The model evaluation tool I used is AISBench. Variable-length datasets are prone to load-imbalance scenarios, so I constructed a variable-length dataset for testing. When the benchmark's --dataset-name random option was used to specify the dataset, the input prompts were split; I therefore used AISBench to substitute a variable-length dataset during the test runs.

ais_bench --models vllm_api_stream_chat --datasets gsm8k_gen_0_shot_cot_str_perf  --debug --summarizer default_perf --mode perf

╒══════════════════════════╤═════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤══════╕
│ Performance Parameters   │ Stage   │ Average         │ Min             │ Max             │ Median          │ P75             │ P90             │ P99             │  N   │
╞══════════════════════════╪═════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪══════╡
│ E2EL                     │ total   │ 55044.4573 ms   │ 11143.2514 ms   │ 195472.4726 ms  │ 35057.9889 ms   │ 69464.8019 ms   │ 132295.1279 ms  │ 185509.4833 ms  │ 1000 │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼──────┤
│ TTFT                     │ total   │ 893.5725 ms     │ 348.2403 ms     │ 1549.0416 ms    │ 901.9457 ms     │ 1084.3782 ms    │ 1215.7292 ms    │ 1469.2704 ms    │ 1000 │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼──────┤
│ TPOT                     │ total   │ 48.4854 ms      │ 33.7401 ms      │ 63.8094 ms      │ 48.7747 ms      │ 52.4151 ms      │ 55.4097 ms      │ 59.9007 ms      │ 1000 │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼──────┤
│ ITL                      │ total   │ 114.1014 ms     │ 0.0068 ms       │ 594.2452 ms     │ 115.7806 ms     │ 119.678 ms      │ 123.1585 ms     │ 251.8668 ms     │ 1000 │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼──────┤
│ InputTokens              │ total   │ 2560.5243       │ 1481.0          │ 4120.0          │ 2216.0          │ 3402.25         │ 3943.3          │ 4107.49         │ 1000 │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼──────┤
│ OutputTokens             │ total   │ 1159.8058       │ 187.0           │ 4068.0          │ 715.5           │ 1358.0          │ 2769.4          │ 4068.0          │ 1000 │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼──────┤
│ OutputTokenThroughput    │ total   │ 20.3425 token/s │ 14.7171 token/s │ 28.1505 token/s │ 19.9678 token/s │ 22.0362 token/s │ 23.6102 token/s │ 26.6782 token/s │ 1000 │
╘══════════════════════════╧═════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧══════╛
╒══════════════════════════╤═════════╤═══════════════════╕
│ Common Metric            │ Stage   │ Value             │
╞══════════════════════════╪═════════╪═══════════════════╡
│ Benchmark Duration       │ total   │ 392829.9277 ms    │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Requests           │ total   │ 1000              │
├──────────────────────────┼─────────┼───────────────────┤
│ Failed Requests          │ total   │ 0                 │
├──────────────────────────┼─────────┼───────────────────┤
│ Success Requests         │ total   │ 1000              │
├──────────────────────────┼─────────┼───────────────────┤
│ Concurrency              │ total   │ 86.5959           │
├──────────────────────────┼─────────┼───────────────────┤
│ Max Concurrency          │ total   │ 270               │
├──────────────────────────┼─────────┼───────────────────┤
│ Request Throughput       │ total   │ 2.5456 req/s      │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Input Tokens       │ total   │ 1582404           │
├──────────────────────────┼─────────┼───────────────────┤
│ Prefill Token Throughput │ total   │ 2865.4913 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total generated tokens   │ total   │ 716760            │
├──────────────────────────┼─────────┼───────────────────┤
│ Input Token Throughput   │ total   │ 4028.2165 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Output Token Throughput  │ total   │ 1824.6064 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Token Throughput   │ total   │ 5852.8229 token/s │
╘══════════════════════════╧═════════╧═══════════════════╛

Checklist

@jovany-wang

@Misaka9468 CC

@ping1jing2 ping1jing2 self-assigned this Jan 12, 2026
@ping1jing2
Collaborator

/tag-and-rerun-ci

@iforgetmyname
Collaborator

/tag-and-rerun-ci

1 similar comment
@JustinTong0323
Collaborator

/tag-and-rerun-ci

@jiashaokun-1
Contributor Author

Merged based on dependency #14378

@jiashaokun-1 jiashaokun-1 force-pushed the main branch 2 times, most recently from ea8d2ce to c53a2d7 Compare March 12, 2026 07:54
@jiashaokun-1 jiashaokun-1 changed the title router select dp group with the minimum number of tokens [grpc] Gateway supports dp rank scheduling and scheduling with the minimum number of tokens Mar 12, 2026