Gateway supports dp rank scheduling and scheduling with the minimum number of tokens #16742
Closed
jiashaokun-1 wants to merge 1 commit into sgl-project:main
Conversation
@Misaka9468 CC

Collaborator
/tag-and-rerun-ci

Collaborator
/tag-and-rerun-ci

1 similar comment

Collaborator
/tag-and-rerun-ci

Contributor (Author)
Merged based on dependency #14378
Motivation
#19268 added routed_dp_rank and disagg_prefill_dp_rank to support external DP dispatch with PD-disaggregation mode. This change supports that function in the gateway and schedules requests to the dp_rank with the minimum number of tokens.

In the current PD-disaggregation mode, the prefill instances only support the round_robin scheduling policy. Under this policy, which DP group a request is routed to is determined by bootstrap_room mod dp_size. This can lead to load imbalance among DP groups, and the imbalance becomes more pronounced when the input requests are variable-length sequences.

Based on the above, I have added a new feature that selects the DP group with the lightest load to process each request, achieving DP load balance for PD-disaggregation mode. The current load is measured by the number of tokens, which can be adjusted as needed in the future. Enabling or disabling this feature is controlled by the dp_minimum_tokens_scheduler parameter.

Key Changes
1. Load Collection
The /metrics interface is used instead of /get_load to obtain the instance load. A new WorkerLoadManager class has been added to manage the engine's load, storing the number of used tokens (num_used_tokens) for each DP group as the load.
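As a rough illustration of what collecting per-DP-group loads from /metrics involves, here is a small Python sketch of parsing Prometheus exposition-format text into a {dp_rank: tokens} map. The metric name `sglang:num_used_tokens` and the `dp_rank` label are assumptions for illustration only; the actual names come from the engine's /metrics output, and the real gateway code is not reproduced here.

```python
import re

# Matches one Prometheus sample line with labels, e.g.
#   sglang:num_used_tokens{dp_rank="0",engine="prefill"} 128
GAUGE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>[0-9.eE+-]+)\s*$'
)

def extract_gauge_metrics(metrics_text: str, metric_name: str) -> dict[int, float]:
    """Return {dp_rank: value} for every sample of `metric_name`."""
    loads: dict[int, float] = {}
    for line in metrics_text.splitlines():
        if line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        m = GAUGE_LINE.match(line)
        if not m or m.group("name") != metric_name:
            continue
        labels = dict(
            kv.split("=", 1) for kv in m.group("labels").split(",") if "=" in kv
        )
        rank = labels.get("dp_rank")
        if rank is not None:
            loads[int(rank.strip('"'))] = float(m.group("value"))
    return loads
```

A production parser would also handle samples without labels and escaped characters inside label values; this sketch only covers the simple gauge-with-labels case described above.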
The load query interface is modified, and an extract_gauge_metrics method is added to parse the response returned by Prometheus.

2. Load Management
A new DPLoadManager struct has been added to manage loads, providing three methods: update_dp_loads, get_lowest_dp_load, and load_increment. The update_dp_loads method updates the load: after periodically collecting load data, the LoadMonitor calls it to perform the update. The get_lowest_dp_load method takes a worker as input and returns the dp_rank with the lowest load among that worker's DP groups. The load_increment method adds an increment to the load of a specific dp_rank for a worker; this prevents all requests from being scheduled to the same DP group during the interval between two load reports.

3. New Interfaces
The DPRankLoadPolicy and MinimumTokensPolicy APIs are added. Given a worker, they select the dp_rank with the minimum number of tokens.
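The load-management and policy pieces above can be sketched together in a few lines. This is a hypothetical Python sketch of the idea, not the gateway's actual implementation: keep a per-(worker, dp_rank) token count, pick the least-loaded dp_rank for each request, and optimistically bump it so requests arriving between two load reports do not all pile onto one group.

```python
class DPLoadManager:
    """Toy model of minimum-token dp_rank selection (illustrative only)."""

    def __init__(self) -> None:
        self._loads: dict[str, dict[int, int]] = {}  # worker -> {dp_rank: tokens}

    def update_dp_loads(self, worker: str, loads: dict[int, int]) -> None:
        # Called periodically by the load monitor with freshly scraped loads.
        self._loads[worker] = dict(loads)

    def get_lowest_dp_load(self, worker: str) -> int:
        # Return the dp_rank with the fewest in-use tokens for this worker.
        loads = self._loads[worker]
        return min(loads, key=loads.get)

    def load_increment(self, worker: str, dp_rank: int, tokens: int) -> None:
        # Optimistic bump between two load reports, so back-to-back requests
        # spread out instead of all targeting the same dp_rank.
        self._loads[worker][dp_rank] += tokens

mgr = DPLoadManager()
mgr.update_dp_loads("prefill-0", {0: 500, 1: 100})
rank = mgr.get_lowest_dp_load("prefill-0")   # rank 1 has the fewest tokens
mgr.load_increment("prefill-0", rank, 450)   # next request now prefers rank 0
```

The worker name "prefill-0" and the token numbers are made up; the point is only the select-then-increment flow between two load reports.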
4. Configuration Parameters
dp-minimum-tokens-scheduler has been added to enable scheduling to the minimum-load DP group. worker-load-check-interval has been added to specify the interval for load collection. Previously, the load collection interval reused the worker-startup-check-interval configuration; now this new configuration item is separate from the startup check.

Benchmarking and Profiling
Comparing several scheduling strategies on variable-length datasets before and after enabling dp-minimum-tokens-scheduler, Mean TPOT improved by approximately 9%, and Mean TTFT (with max_out_len=1 set to eliminate the impact of decoding on prefill) improved by about 15%. The specific data is as follows:
router

prefill
python3 -m sglang.launch_server --model-path ${MODEL_PATH} --tp-size 1 --dp-size 2 --base-gpu-id 8 --disaggregation-mode prefill --trust-remote-code --attention-backend ascend --disaggregation-transfer-backend ascend --device npu --quantization modelslim --watchdog-timeout 9000 --host 141.61.29.204 --grpc-mode --metrics-port 10000 --port 6699 --cuda-graph-bs 8 16 24 28 32 36 --mem-fraction-static 0.71 --max-running-requests 144 --chunked-prefill-size -1 --dtype bfloat16 --load-balance-method round_robin --disable-overlap-schedule --enable-metrics

decode
python3 -m sglang.launch_server --model-path ${MODEL_PATH} --tp-size 1 --dp-size 2 --base-gpu-id 12 --disaggregation-mode decode --trust-remote-code --attention-backend ascend --disaggregation-transfer-backend ascend --device npu --quantization modelslim --watchdog-timeout 9000 --host 141.61.29.204 --grpc-mode --metrics-port 10001 --port 7699 --cuda-graph-bs 8 16 24 28 32 36 --mem-fraction-static 0.71 --max-running-requests 144 --chunked-prefill-size -1 --dtype bfloat16 --load-balance-method round_robin --disable-overlap-schedule --prefill-round-robin-balance --enable-metrics

benchmark
The model evaluation tool I used is AISBench. Variable-length datasets are prone to load-imbalance scenarios, so I constructed a variable-length dataset for testing. When using the benchmark's --dataset-name random option to specify the dataset, the input prompts were being split, so I used AISBench to replace the original dataset with a variable-length dataset during the test runs.

Checklist