Gateway supports DP rank scheduling and scheduling with the minimum number of tokens #20435
jiashaokun-1 wants to merge 1 commit into sgl-project:main
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the gateway's ability to manage load distribution in a distributed inference setup, particularly for models operating in PD-Disaggregation mode. By introducing a token-based load balancing strategy, it aims to mitigate load imbalances that arise with variable-length input sequences, leading to more efficient resource utilization and improved overall performance. The changes involve a new load monitoring and management system, integration with Prometheus metrics, and new configuration options to activate and fine-tune this feature.
Activity
Code Review
This pull request introduces a new scheduling policy for DP-enabled workers to balance load based on the minimum number of tokens, which is a great addition for handling variable-length inputs more efficiently. The implementation is well-structured, including a new WorkerLoadManager and MinimumTokensPolicy. My review includes suggestions to improve a docstring, a log message, and a variable name for better clarity. I also identified a potential feature gap where the new scheduling policy is not applied to the gRPC PD router, which should be addressed for consistency.
```rust
ctx.policy_registry.set_prefill_policy(prefill_policy);
ctx.policy_registry.set_decode_policy(decode_policy);
```
The new DP scheduling policy (MinimumTokensPolicy) is configured for the HTTP PD router (PDRouter) but not for the gRPC PD router (GrpcPDRouter). This seems like an omission that makes the feature available only for HTTP requests. For consistency, this should probably be added to create_grpc_pd_router as well.
```rust
ctx.policy_registry.set_prefill_policy(prefill_policy);
ctx.policy_registry.set_decode_policy(decode_policy);

let config = ctx.router_config.clone();
if config.dp_minimum_tokens_scheduler {
    let mini_tokens_policy = MinimumTokensPolicy::new(
        ctx.load_monitor
            .as_ref()
            .map(|load_monitor_arc| load_monitor_arc.worker_load_manager.clone()),
    );
    let dp_rank_policy: Arc<dyn DPRankLoadPolicy> = Arc::new(mini_tokens_policy);
    ctx.policy_registry.set_dp_rank_policy(dp_rank_policy);
}
```

```
port: Port number to bind the router server. Default: 3001
worker_startup_timeout_secs: Timeout in seconds for worker startup and registration. Large models can take significant time to load into GPU memory. Default: 1800 (30 minutes)
worker_startup_check_interval: Interval in seconds between checks for worker initialization. Default: 10
worker_load_check_interval: Interval in seconds between get loads for worker initialization. Default: 10
```
The docstring for worker_load_check_interval is a bit misleading. It says "...for worker initialization", which suggests it's only used during startup. This interval is for periodic load checking during the router's lifetime. A clearer description would be better.
```diff
- worker_load_check_interval: Interval in seconds between get loads for worker initialization. Default: 10
+ worker_load_check_interval: Interval in seconds between getting loads for workers. Default: 10
```
```rust
Err(e) => {
    warn!("The metric is missing the dp_rank label{}, skipping.", e);
    continue;
}
```
The warning message here seems to be incorrect. The error e comes from sample.get_labelset(), which can fail if there's no label set at all, not specifically because dp_rank is missing. The check for dp_rank happens later. A more generic error message would be more accurate.
```diff
- Err(e) => {
-     warn!("The metric is missing the dp_rank label{}, skipping.", e);
-     continue;
- }
+ Err(e) => {
+     warn!("Failed to get label set for metric sample: {}, skipping.", e);
+     continue;
+ }
```
```rust
return loads
    .iter()
    .min_by_key(|&(_, load)| load)
    .map(|(&rand_id, _)| rand_id);
```
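As a standalone illustration, the minimum-load selection in the snippet above behaves like the following small sketch; the `(dp_rank, load)` pairs are made up, and `lowest_dp_rank` is a hypothetical helper, not code from this PR:

```rust
// Sketch of minimum-load DP rank selection, mirroring the `min_by_key`
// call in the snippet above. Loads are hypothetical token counts.
fn lowest_dp_rank(loads: &[(usize, u64)]) -> Option<usize> {
    loads
        .iter()
        .min_by_key(|&&(_, load)| load)
        .map(|&(rank, _)| rank)
}

fn main() {
    // dp_rank 1 holds the fewest in-flight tokens, so it is selected.
    let loads = vec![(0, 1200), (1, 300), (2, 700)];
    assert_eq!(lowest_dp_rank(&loads), Some(1));
    // An empty load table yields no candidate.
    assert_eq!(lowest_dp_rank(&[]), None);
}
```

Note that `min_by_key` returns the first minimum on ties, so two equally loaded ranks resolve deterministically by iteration order.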
Motivation
#19268 introduced `routed_dp_rank` and `disagg_prefill_dp_rank` to support external DP dispatch with PD-disaggregation mode. This modification supports that function in the gateway and schedules requests to the dp_rank with the minimum number of tokens.

In the current PD-Disaggregation mode, the prefill instances only support the round_robin scheduling policy. Under this policy, which DP group a request is routed to is determined by the result of `bootstrap_room` mod `dp_size`. This can lead to load imbalance among DP groups, and the imbalance becomes more pronounced when the input requests are variable-length sequences.

Based on the above, I have added a new feature that selects the DP group with the lightest load to process each request, achieving DP load balance for PD-Disaggregation mode. The current load is measured by the number of tokens, which can be adjusted as needed in the future. The feature is enabled or disabled by the parameter `dp_minimum_tokens_scheduler`.

Key Changes
1. Load Collection

The `/metrics` interface is used instead of `/get_load` to obtain the instance load. A new `WorkerLoadManager` class has been added to manage the engine's load, storing the number of used tokens (`num_used_tokens`) for each DP group as the load.
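For illustration, parsing per-`dp_rank` gauges out of a Prometheus `/metrics` text response could look like the sketch below; the metric name `sglang:num_used_tokens` and the hand-rolled parsing logic are assumptions, not the PR's actual implementation:

```rust
// Illustrative parser for Prometheus text-format gauge lines such as:
//   sglang:num_used_tokens{dp_rank="0"} 1200
// Metric name and format details are assumed for this sketch.
fn extract_dp_rank_gauges(body: &str, metric: &str) -> Vec<(u32, f64)> {
    let mut out = Vec::new();
    for line in body.lines() {
        let line = line.trim();
        // Skip comment lines (# HELP / # TYPE) and unrelated metrics.
        if line.starts_with('#') || !line.starts_with(metric) {
            continue;
        }
        // Split into "name{labels}" and the sample value.
        let Some((head, value)) = line.rsplit_once(' ') else { continue };
        // Pull out the dp_rank="N" label, skipping samples without it.
        let Some(idx) = head.find("dp_rank=\"") else { continue };
        let rest = &head[idx + "dp_rank=\"".len()..];
        let Some(end) = rest.find('"') else { continue };
        if let (Ok(rank), Ok(val)) = (rest[..end].parse::<u32>(), value.parse::<f64>()) {
            out.push((rank, val));
        }
    }
    out
}

fn main() {
    let body = "\
# TYPE sglang:num_used_tokens gauge
sglang:num_used_tokens{dp_rank=\"0\"} 1200
sglang:num_used_tokens{dp_rank=\"1\"} 300";
    let loads = extract_dp_rank_gauges(body, "sglang:num_used_tokens");
    assert_eq!(loads, vec![(0, 1200.0), (1, 300.0)]);
}
```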
Modified the load query interface, adding the `extract_gauge_metrics` method to parse the response returned by Prometheus.

2. Load Management
A new `DPLoadManager` struct has been added to manage loads, providing three methods: `update_dp_loads`, `get_lowest_dp_load`, and `load_increment`. The `update_dp_loads` method updates the load; after periodically collecting load data, the LoadMonitor calls this method to perform the update. The `get_lowest_dp_load` method takes a worker as input and returns the dp_rank with the lowest load among that worker's DP groups. The `load_increment` method adds an increment to the load of a specific dp_rank for a worker, to prevent all requests from being scheduled to the same DP group during the interval between two load reports.

3. New Interfaces
The `DPRankLoadPolicy` and `MinimumTokensPolicy` APIs are added. The worker is passed in, and the dp_rank with the minimum number of tokens is selected.
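The load-management flow described above can be sketched as a self-contained example; the struct layout, method signatures, worker names, and values below are assumptions for illustration, not the PR's actual `DPLoadManager`:

```rust
use std::collections::HashMap;

// Hedged sketch of a DPLoadManager-style structure: per-worker,
// per-dp_rank token loads with the three operations described above.
#[derive(Default)]
struct DpLoadManager {
    // (worker_url, dp_rank) -> current token load
    loads: HashMap<(String, u32), u64>,
}

impl DpLoadManager {
    // Overwrite the given ranks' loads with freshly collected values.
    fn update_dp_loads(&mut self, worker: &str, ranks: &[(u32, u64)]) {
        for &(rank, load) in ranks {
            self.loads.insert((worker.to_string(), rank), load);
        }
    }

    // Return the dp_rank with the lowest load for the given worker.
    fn get_lowest_dp_load(&self, worker: &str) -> Option<u32> {
        self.loads
            .iter()
            .filter(|((w, _), _)| w.as_str() == worker)
            .min_by_key(|(_, load)| **load)
            .map(|((_, rank), _)| *rank)
    }

    // Bump a rank's load between reports so consecutive requests
    // do not all pile onto the same DP group.
    fn load_increment(&mut self, worker: &str, rank: u32, delta: u64) {
        *self.loads.entry((worker.to_string(), rank)).or_insert(0) += delta;
    }
}

fn main() {
    let mut mgr = DpLoadManager::default();
    mgr.update_dp_loads("prefill-0", &[(0, 1200), (1, 300)]);
    assert_eq!(mgr.get_lowest_dp_load("prefill-0"), Some(1));
    // Simulate routing a 950-token request to rank 1 before the next report.
    mgr.load_increment("prefill-0", 1, 950);
    assert_eq!(mgr.get_lowest_dp_load("prefill-0"), Some(0));
}
```

The `load_increment` step is what keeps the policy from herding every request onto the same rank while waiting for the next `worker-load-check-interval` tick.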
4. Configuration Parameters

`dp-minimum-tokens-scheduler` has been added to enable scheduling to the minimum-load DP group when the parameter is declared. `worker-load-check-interval` has been added to specify the interval for load collection. Previously, the load collection interval reused the configuration of `worker-startup-check-interval`; now this configuration item is separate from the startup check.

Benchmarking and Profiling
Comparing the performance of several scheduling strategies on variable-length datasets before and after enabling `dp-minimum-tokens-scheduler`: Mean TPOT improved by approximately 9%, and Mean TTFT (with max_out_len=1 set to eliminate the impact of decoding on prefill) improved by about 15%. The specific data is as follows:
router
prefill

```shell
python3 -m sglang.launch_server --model-path ${MODEL_PATH} --tp-size 1 --dp-size 2 --base-gpu-id 8 --disaggregation-mode prefill --trust-remote-code --attention-backend ascend --disaggregation-transfer-backend ascend --device npu --quantization modelslim --watchdog-timeout 9000 --host 141.61.29.204 --grpc-mode --metrics-port 10000 --port 6699 --cuda-graph-bs 8 16 24 28 32 36 --mem-fraction-static 0.71 --max-running-requests 144 --chunked-prefill-size -1 --dtype bfloat16 --load-balance-method round_robin --disable-overlap-schedule --enable-metrics
```

decode

```shell
python3 -m sglang.launch_server --model-path ${MODEL_PATH} --tp-size 1 --dp-size 2 --base-gpu-id 12 --disaggregation-mode decode --trust-remote-code --attention-backend ascend --disaggregation-transfer-backend ascend --device npu --quantization modelslim --watchdog-timeout 9000 --host 141.61.29.204 --grpc-mode --metrics-port 10001 --port 7699 --cuda-graph-bs 8 16 24 28 32 36 --mem-fraction-static 0.71 --max-running-requests 144 --chunked-prefill-size -1 --dtype bfloat16 --load-balance-method round_robin --disable-overlap-schedule --load-balance-method round_robin --prefill-round-robin-balance --enable-metrics
```

benchmark
The model evaluation tool I used is AISBench. Click the link below to learn more: AISBench. Variable-length datasets are prone to load imbalance scenarios, so during testing I constructed a variable-length dataset. When using the benchmark's `--dataset-name random` option to specify the dataset, the input prompts were being split. Therefore, I used AISBench to replace the original dataset with a variable-length dataset during the test runs.

Checklist