
Gateway supports dp rank scheduling and scheduling with the minimum number of tokens #16742

Closed
jiashaokun-1 wants to merge 1 commit into sgl-project:main from jiashaokun-1:main

Conversation

Contributor

@jiashaokun-1 jiashaokun-1 commented Jan 8, 2026

Motivation

#19268 added routed_dp_rank and disagg_prefill_dp_rank to support external DP dispatch in PD-disaggregation mode. This modification brings that capability to the gateway and schedules each request to the dp_rank with the minimum number of tokens.

In the current PD-disaggregation mode, the prefill instances only support the round_robin scheduling policy. Under this policy, the DP group a request is routed to is determined by bootstrap_room mod dp_size. This can lead to load imbalance among DP groups, and the imbalance becomes more pronounced when the input requests are variable-length sequences.
To address this, I have added a new feature that selects the DP group with the lightest load to process each request, achieving DP load balance in PD-disaggregation mode. The current load is measured by the number of tokens, which can be adjusted as needed in the future. The feature is enabled or disabled via the parameter dp_minimum_tokens_scheduler.

Key Changes

1. Load Collection

The /metrics interface is used instead of /get_load to obtain the instance load. A new WorkerLoadManager class has been added to manage the engine's load, storing the number of used tokens (num_used_tokens) for each DP group as the load.
The load query interface is modified, and an extract_gauge_metrics method is added to parse the Prometheus-format response.

        let load_url = format!("{}/metrics", url);
        let mut req = client.get(&load_url).timeout(REQUEST_TIMEOUT);
        if let Some(key) = api_key {
            req = req.bearer_auth(key);
        }

        match req.send().await {
            Ok(r) if r.status().is_success() => {
                if let Ok(text) = r.text().await {
                    return crate::core::metrics_manager::extract_gauge_metrics(text, "sglang_num_used_tokens");
                }
                HashMap::new()
            },
            _ => HashMap::new(),
        }

2. Load Management

A new WorkerLoadManager struct has been added to manage loads, providing three methods: update_dp_loads, get_lowest_dp_load, and load_increment.

#[derive(Debug, Default)]
pub struct WorkerLoadManager {
    // <worker, <dp_rank, loads>>
    dp_cached_loads: RwLock<HashMap<String, HashMap<isize, isize>>>,
}
  1. The update_dp_loads method updates the load. After periodically collecting load data, the LoadMonitor calls this method to perform the update.
  2. The get_lowest_dp_load method takes a worker as input and returns the dp_rank with the lowest load among that worker's DP groups.
  3. The load_increment method adds an increment to the load of a specific dp_rank of a worker. This prevents all requests from being scheduled to the same DP group during the interval between two load reports.
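The three methods can be sketched as follows. This is a simplified illustration, not the PR's code: it keys workers by URL string rather than the router's Worker type, and tie-breaking between equally loaded ranks is left to HashMap iteration order.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

#[derive(Debug, Default)]
pub struct WorkerLoadManager {
    // <worker url, <dp_rank, load in tokens>>
    dp_cached_loads: RwLock<HashMap<String, HashMap<isize, isize>>>,
}

impl WorkerLoadManager {
    // Replace a worker's per-rank loads after a periodic metrics poll.
    pub fn update_dp_loads(&self, worker: &str, loads: HashMap<isize, isize>) {
        self.dp_cached_loads
            .write()
            .unwrap()
            .insert(worker.to_string(), loads);
    }

    // Return the dp_rank with the smallest cached load for this worker.
    pub fn get_lowest_dp_load(&self, worker: &str) -> Option<isize> {
        self.dp_cached_loads
            .read()
            .unwrap()
            .get(worker)?
            .iter()
            .min_by_key(|(_, load)| **load)
            .map(|(rank, _)| *rank)
    }

    // Bump a rank's cached load so requests arriving between two load
    // reports do not all pile onto the same DP group.
    pub fn load_increment(&self, worker: &str, dp_rank: isize, delta: isize) {
        if let Some(loads) = self.dp_cached_loads.write().unwrap().get_mut(worker) {
            *loads.entry(dp_rank).or_insert(0) += delta;
        }
    }
}
```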

3. New Interfaces

A DPRankLoadPolicy trait and a MinimumTokensPolicy implementation are added. Given a worker, they select the dp_rank with the minimum number of tokens.

#[async_trait]
pub trait DPRankLoadPolicy: Send + Sync + Debug {
    async fn select_dp_rank(&self, worker: &dyn Worker, text_str: isize) -> Option<isize>;
}

#[async_trait]
impl DPRankLoadPolicy for MinimumTokensPolicy {
    async fn select_dp_rank(&self, worker: &dyn Worker, text_str: isize) -> Option<isize> {
        if let Some(worker_load) = self.worker_load_manager.as_ref() {
            let lowest_tokens_dp_rank = worker_load.get_lowest_dp_load(worker);
            if let Some(dp_rank) = lowest_tokens_dp_rank {
                worker_load.load_increment(worker, dp_rank, text_str);
            }
            return lowest_tokens_dp_rank;
        }
        None
    }
}

4. Configuration Parameters

  1. A new configuration parameter dp-minimum-tokens-scheduler has been added; when the flag is set, requests are scheduled to the minimum-load DP group.
  2. A new configuration parameter worker-load-check-interval has been added to specify the interval for load collection. Previously, the load collection interval reused the configuration of worker-startup-check-interval. Now, this new configuration item is separate from the startup check.

Benchmarking and Profiling

Comparing several scheduling strategies on variable-length datasets before and after enabling dp-minimum-tokens-scheduler, Mean TPOT improved by approximately 9%, and Mean TTFT (with max_out_len=1 set to eliminate the impact of decoding on prefill) improved by about 15%. The specific data is as follows:

| Policy | Mean TTFT (ms), baseline | Mean TTFT (ms), dp minimum tokens | TTFT gain | Mean TPOT (ms), baseline | Mean TPOT (ms), dp minimum tokens | TPOT gain |
|---|---|---|---|---|---|---|
| Random | 2631 | 2249 | +14.52% | 53.41 | 48.15 | +9.85% |
| Round_robin | 2646 | 2164 | +18.22% | 55.42 | 49.02 | +11.55% |
| Cache_aware | 2601 | 2216 | +14.80% | 52.67 | 48.48 | +7.96% |

TTFT was measured with max_out_len=1.

router

python -m sglang_router.launch_router   --pd-disaggregation   --prefill-policy cache_aware --dp-minimum-tokens-scheduler --worker-load-check-interval 1   --prefill grpc://141.61.29.204:6699   --decode grpc://127.0.0.1:7699   --model-path /home/weights/DeepSeek-R1_w8a8   --tokenizer-path /home/weights/DeepSeek-R1_w8a8   --host 0.0.0.0 --port 4567

prefill

python3 -m sglang.launch_server --model-path ${MODEL_PATH} --tp-size 1 --dp-size 2 --base-gpu-id 8 --disaggregation-mode prefill --trust-remote-code --attention-backend ascend --disaggregation-transfer-backend ascend --device npu --quantization modelslim --watchdog-timeout 9000 --host 141.61.29.204 --grpc-mode --metrics-port 10000 --port 6699 --cuda-graph-bs 8 16 24 28 32 36 --mem-fraction-static 0.71 --max-running-requests 144 --chunked-prefill-size -1  --dtype bfloat16 --load-balance-method round_robin --disable-overlap-schedule --enable-metrics

decode

python3 -m sglang.launch_server --model-path ${MODEL_PATH} --tp-size 1 --dp-size 2 --base-gpu-id 12 --disaggregation-mode decode --trust-remote-code --attention-backend ascend --disaggregation-transfer-backend ascend --device npu --quantization modelslim --watchdog-timeout 9000 --host 141.61.29.204 --grpc-mode --metrics-port 10001 --port 7699 --cuda-graph-bs 8 16 24 28 32 36 --mem-fraction-static 0.71 --max-running-requests 144 --chunked-prefill-size -1  --dtype bfloat16 --load-balance-method round_robin --disable-overlap-schedule --load-balance-method round_robin --prefill-round-robin-balance --enable-metrics

benchmark
The model evaluation tool I used is AISBench. Variable-length datasets are prone to load-imbalance scenarios, so I constructed a variable-length dataset for testing. When the benchmark's --dataset-name random option was used to specify the dataset, the input prompts were split; I therefore used AISBench to substitute a variable-length dataset during the test runs.

ais_bench --models vllm_api_stream_chat --datasets gsm8k_gen_0_shot_cot_str_perf  --debug --summarizer default_perf --mode perf

╒══════════════════════════╤═════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤══════╕
│ Performance Parameters   │ Stage   │ Average         │ Min             │ Max             │ Median          │ P75             │ P90             │ P99             │  N   │
╞══════════════════════════╪═════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪══════╡
│ E2EL                     │ total   │ 55044.4573 ms   │ 11143.2514 ms   │ 195472.4726 ms  │ 35057.9889 ms   │ 69464.8019 ms   │ 132295.1279 ms  │ 185509.4833 ms  │ 1000 │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼──────┤
│ TTFT                     │ total   │ 893.5725 ms     │ 348.2403 ms     │ 1549.0416 ms    │ 901.9457 ms     │ 1084.3782 ms    │ 1215.7292 ms    │ 1469.2704 ms    │ 1000 │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼──────┤
│ TPOT                     │ total   │ 48.4854 ms      │ 33.7401 ms      │ 63.8094 ms      │ 48.7747 ms      │ 52.4151 ms      │ 55.4097 ms      │ 59.9007 ms      │ 1000 │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼──────┤
│ ITL                      │ total   │ 114.1014 ms     │ 0.0068 ms       │ 594.2452 ms     │ 115.7806 ms     │ 119.678 ms      │ 123.1585 ms     │ 251.8668 ms     │ 1000 │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼──────┤
│ InputTokens              │ total   │ 2560.5243       │ 1481.0          │ 4120.0          │ 2216.0          │ 3402.25         │ 3943.3          │ 4107.49         │ 1000 │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼──────┤
│ OutputTokens             │ total   │ 1159.8058       │ 187.0           │ 4068.0          │ 715.5           │ 1358.0          │ 2769.4          │ 4068.0          │ 1000 │
├──────────────────────────┼─────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼─────────────────┼──────┤
│ OutputTokenThroughput    │ total   │ 20.3425 token/s │ 14.7171 token/s │ 28.1505 token/s │ 19.9678 token/s │ 22.0362 token/s │ 23.6102 token/s │ 26.6782 token/s │ 1000 │
╘══════════════════════════╧═════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧═════════════════╧══════╛
╒══════════════════════════╤═════════╤═══════════════════╕
│ Common Metric            │ Stage   │ Value             │
╞══════════════════════════╪═════════╪═══════════════════╡
│ Benchmark Duration       │ total   │ 392829.9277 ms    │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Requests           │ total   │ 1000              │
├──────────────────────────┼─────────┼───────────────────┤
│ Failed Requests          │ total   │ 0                 │
├──────────────────────────┼─────────┼───────────────────┤
│ Success Requests         │ total   │ 1000              │
├──────────────────────────┼─────────┼───────────────────┤
│ Concurrency              │ total   │ 86.5959           │
├──────────────────────────┼─────────┼───────────────────┤
│ Max Concurrency          │ total   │ 270               │
├──────────────────────────┼─────────┼───────────────────┤
│ Request Throughput       │ total   │ 2.5456 req/s      │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Input Tokens       │ total   │ 1582404           │
├──────────────────────────┼─────────┼───────────────────┤
│ Prefill Token Throughput │ total   │ 2865.4913 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total generated tokens   │ total   │ 716760            │
├──────────────────────────┼─────────┼───────────────────┤
│ Input Token Throughput   │ total   │ 4028.2165 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Output Token Throughput  │ total   │ 1824.6064 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Token Throughput   │ total   │ 5852.8229 token/s │
╘══════════════════════════╧═════════╧═══════════════════╛

Checklist

@jovany-wang

@Misaka9468 CC

@ping1jing2 ping1jing2 self-assigned this Jan 12, 2026
@ping1jing2
Collaborator

/tag-and-rerun-ci

@iforgetmyname
Collaborator

/tag-and-rerun-ci

1 similar comment
@JustinTong0323
Collaborator

/tag-and-rerun-ci

@jiashaokun-1
Contributor Author

Merged based on dependency #14378

@jiashaokun-1 jiashaokun-1 force-pushed the main branch 2 times, most recently from ea8d2ce to c53a2d7 Compare March 12, 2026 07:54
@jiashaokun-1 jiashaokun-1 changed the title router select dp group with the minimum number of tokens [grpc] Gateway supports dp rank scheduling and scheduling with the minimum number of tokens Mar 12, 2026