[feat] support minimum token load balance in dp attention #7379

WANG-GH · 2025-06-20T05:30:40Z

Motivation & Modifications

When DP attention is enabled, the system decides which DP group to dispatch a request to based on the current load of each DP group.

The load data from all DP groups is gathered once to the TP0 node, where TP0 interacts with the dispatcher via shared memory to share the load information in real time.

The load data consists of two parts:

holding_tokens: the number of tokens currently being processed by each DP group's scheduler.
onfly_req: the requests that have been dispatched by the dispatcher but have not yet been accepted by the scheduler (a scheduler only accepts a request when it reaches the recv_req phase).

The dispatcher sums these two parts and selects the DP group with the lowest load.

The advantage of using onfly is that we only need to track the tokens currently being processed by each worker, without worrying about whether it's MTP, overlap, etc.
I added an extra field dp_balance_id: int to the request, so we just need to gather this integer.
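The selection rule described above can be sketched as follows. This is a minimal illustration, not the PR's code: the function names and the `{dp_balance_id: token_count}` layout for on-the-fly requests are assumptions made for the example.

```python
from typing import Dict, List


def pick_dp_group(
    holding_tokens: List[int],
    onfly_info: List[Dict[int, int]],
) -> int:
    """Return the index of the DP group with the smallest combined load.

    holding_tokens[i]: tokens the scheduler of group i is processing.
    onfly_info[i]: {dp_balance_id: token_count} for requests dispatched to
    group i that its scheduler has not accepted yet (assumed layout).
    """
    loads = [
        held + sum(onfly.values())
        for held, onfly in zip(holding_tokens, onfly_info)
    ]
    # min() returns the first minimum, so ties go to the lowest index.
    return min(range(len(loads)), key=loads.__getitem__)


def dispatch(onfly_info, target, dp_balance_id, num_tokens) -> None:
    """Record a request as on-the-fly for the chosen group (hypothetical helper)."""
    onfly_info[target][dp_balance_id] = num_tokens


def acknowledge(onfly_info, worker, received_ids) -> None:
    """Drop the ids that a worker's scheduler reported as received."""
    for balance_id in received_ids:
        onfly_info[worker].pop(balance_id, None)
```

Because only the integer id travels back from the schedulers, the dispatcher can keep the token counts on its own side and simply delete entries as ids are acknowledged.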

To enable this feature, simply add --load-balance-method minimum_tokens to the startup arguments.

Performance

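Both TBT and TTFT improve: in the 100-prompt runs below, mean TTFT drops from 6945.23 ms (rr) to 4767.87 ms, and mean ITL drops from 75.57 ms to 63.03 ms.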

python3 -m sglang.bench_serving --backend sglang --num-prompts 100

minimum_token:

python3 -m sglang.launch_server \
    --model-path /DeepSeek-R1 \
    --tp 16 \
    --nnodes 2 --node-rank 0 --trust-remote-code \
    --load-balance-method minimum_tokens \
    --dp-size 16 --cuda-graph-max-bs 128 \
    --enable-dp-attention 

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     100       
Benchmark duration (s):                  57.58     
Total input tokens:                      34308     
Total generated tokens:                  21395     
Total generated tokens (retokenized):    21292     
Request throughput (req/s):              1.74      
Input token throughput (tok/s):          595.83    
Output token throughput (tok/s):         371.57    
Total token throughput (tok/s):          967.40    
Concurrency:                             31.53     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   18153.78  
Median E2E Latency (ms):                 15861.65  
---------------Time to First Token----------------
Mean TTFT (ms):                          4767.87   
Median TTFT (ms):                        4254.61   
P99 TTFT (ms):                           9760.97   
---------------Inter-Token Latency----------------
Mean ITL (ms):                           63.03     
Median ITL (ms):                         41.31     
P95 ITL (ms):                            57.08     
P99 ITL (ms):                            864.98    
Max ITL (ms):                            7880.85   
==================================================

rr:

python3 -m sglang.launch_server \
    --model-path /DeepSeek-R1 \
    --tp 16  \
    --nnodes 2 --node-rank 0 --trust-remote-code \
    --dp-size 16 --cuda-graph-max-bs 128 \
    --enable-dp-attention 

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     100       
Benchmark duration (s):                  63.01     
Total input tokens:                      34308     
Total generated tokens:                  21395     
Total generated tokens (retokenized):    21309     
Request throughput (req/s):              1.59      
Input token throughput (tok/s):          544.44    
Output token throughput (tok/s):         339.52    
Total token throughput (tok/s):          883.97    
Concurrency:                             36.49     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   22992.97  
Median E2E Latency (ms):                 20915.81  
---------------Time to First Token----------------
Mean TTFT (ms):                          6945.23   
Median TTFT (ms):                        7699.87   
P99 TTFT (ms):                           14878.23  
---------------Inter-Token Latency----------------
Mean ITL (ms):                           75.57     
Median ITL (ms):                         41.34     
P95 ITL (ms):                            55.28     
P99 ITL (ms):                            1460.84   
Max ITL (ms):                            12318.38  
==================================================

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.

gemini-code-assist

Summary of Changes

Hello @WANG-GH, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the data parallel attention mechanism by introducing a new, more intelligent load balancing strategy. The core idea is to minimize the total token load across DP groups by considering both currently processed and in-flight requests, leveraging shared memory for efficient state synchronization. This change is expected to improve overall serving performance and resource utilization.

Highlights

New Load Balancing Method: I've introduced a new load balancing method for Data Parallel (DP) attention called minimum_tokens. This method aims to distribute incoming requests more efficiently by dispatching them to the DP group with the lowest total token load.
Load Calculation Logic: The 'minimum_tokens' method calculates the total load for each DP group by summing two components: holding_tokens (tokens currently being processed by the scheduler) and onfly_req (requests dispatched by the controller but not yet accepted by the scheduler). This provides a more comprehensive view of each group's real-time load.
Shared Memory for Load Data: To facilitate real-time load balancing, I've implemented a shared memory mechanism (DPBalanceMeta). The TP0 node (dispatcher) gathers load data from all DP groups via this shared memory, allowing it to make informed dispatching decisions. A unique dp_balance_id is added to each request to track its status in the 'on-the-fly' queue.
Performance Improvements: Initial benchmarks show that using the minimum_tokens load balancing method leads to improvements in both Time-to-First-Token (TTFT) and Time Between Tokens (TBT) compared to the default round-robin method, indicating better overall system performance.
Shared Memory Cleanup: I've added explicit cleanup for the POSIX shared memory segments (sglang_dp_balance_onfly_info, sglang_dp_balance_local_tokens) during process termination. This prevents shared memory leaks, which could otherwise cause issues with subsequent server launches.

gemini-code-assist

Code Review

This pull request introduces a new load balancing strategy, minimum_tokens, for Data Parallel (DP) attention. This method aims to improve performance by dispatching requests to the DP group with the fewest combined holding and in-flight tokens. Key changes include:

A new DPBalanceMeta class to manage shared memory for tracking token loads (onfly_info and local_tokens) across DP workers, including initialization and cleanup logic.
Modifications to the DataParallelController to incorporate the new scheduling strategy, initialize DPBalanceMeta, and ensure shared memory is released.
Updates to the Scheduler to interact with DPBalanceMeta, gather load information from DP workers, and update shared memory via the TP0 rank.
Additions to io_struct to include dp_balance_id in request objects for tracking.
Enhanced kill_process_tree utility to clean up the newly introduced shared memory segments.

The core logic for load balancing and shared memory management appears sound, with proper locking mechanisms. However, there's a critical issue in how local_tokens shared memory is handled (missing size header and assertion), and some areas could benefit from improved clarity with constants and variable naming. The performance benchmarks provided in the PR description show improvements with the new method.

gemini-code-assist · 2025-06-20T05:32:32Z

python/sglang/srt/managers/data_parallel_meta.py

There's a critical issue in how local_tokens shared memory is handled:

set_shared_local_tokens writes the pickled data but does not prefix it with its size, unlike set_shared_onfly_info.

get_shared_local_tokens reads the entire shared memory buffer (bytes(shm.buf)) and attempts to unpickle it. If the actual pickled data is smaller than self.local_tokens_size, pickle.loads will try to interpret trailing garbage data, leading to potential errors or incorrect deserialization.

set_shared_local_tokens does not assert that the serialized_data fits within self.local_tokens_size.

To fix this, set_shared_local_tokens should be modified to first write the size of the serialized data (using struct.pack like in set_shared_onfly_info), and get_shared_local_tokens should read this size to correctly slice the buffer before unpickling. An assertion for data size should also be added to set_shared_local_tokens.
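A minimal sketch of the size-prefixed layout suggested above. The helper names and the 8-byte little-endian header are assumptions for illustration; the functions accept any writable buffer, so a `bytearray` stands in here for the shared memory buffer.

```python
import pickle
import struct

# 8-byte little-endian length prefix (assumed header format).
HEADER = struct.Struct("<Q")


def write_sized(buf, obj) -> None:
    """Write len(payload) followed by the pickled payload into buf."""
    payload = pickle.dumps(obj)
    total = HEADER.size + len(payload)
    assert total <= len(buf), f"payload needs {total} bytes, buffer has {len(buf)}"
    HEADER.pack_into(buf, 0, len(payload))
    buf[HEADER.size:total] = payload


def read_sized(buf):
    """Read the length prefix, then unpickle exactly that many bytes."""
    (size,) = HEADER.unpack_from(buf, 0)
    return pickle.loads(bytes(buf[HEADER.size:HEADER.size + size]))
```

With the prefix in place, the reader no longer depends on whatever bytes follow the payload, and the assertion turns an oversized write into an immediate failure instead of a truncated one.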

gemini-code-assist · 2025-06-20T05:32:32Z

python/sglang/srt/managers/data_parallel_meta.py

The calculation for onfly_info_size and local_tokens_size uses a magic number 8 (commented as dByte).

For onfly_info_size, 512 represents max_onfly_req_per_worker. The * 8 seems to be an estimate for the pickled size of dictionary entries. While the assertion in set_shared_onfly_info provides safety, consider making MAX_ONFLY_REQ_PER_WORKER = 512 and ESTIMATED_BYTES_PER_ENTRY = 8 (or a more descriptive name if 8 has a specific meaning) as named constants for better readability and maintainability.

For local_tokens_size = num_workers * 8, the pickled size of a list of num_workers integers does not scale as 8 bytes per entry. For example, pickle.dumps([0]*16) is roughly 50 bytes rather than 16*8=128, but pickle also adds a fixed header, so the budget can be too small when num_workers is small. More importantly, set_shared_local_tokens currently lacks an assertion to check if the pickled data fits within local_tokens_size. This, combined with the issue in get_shared_local_tokens (see separate comment), is problematic.
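One alternative, sketched here as an assumption rather than as what the PR does, is to drop pickle for local_tokens and store fixed-width integers with `struct`. The encoded size is then exactly `num_workers * 8`, which gives the constant a concrete meaning. The names below are hypothetical.

```python
import struct
from typing import List

BYTES_PER_TOKEN_COUNT = 8  # one signed 64-bit integer per worker


def local_tokens_size(num_workers: int) -> int:
    """Exact buffer size needed for num_workers token counts."""
    return num_workers * BYTES_PER_TOKEN_COUNT


def encode_local_tokens(tokens: List[int]) -> bytes:
    """Pack the counts as little-endian int64 values."""
    return struct.pack(f"<{len(tokens)}q", *tokens)


def decode_local_tokens(buf, num_workers: int) -> List[int]:
    """Read num_workers int64 values from the start of buf."""
    return list(struct.unpack_from(f"<{num_workers}q", buf, 0))
```

Since the reader knows `num_workers`, it reads a fixed number of bytes and needs neither a length prefix nor a fit assertion for this buffer.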

gemini-code-assist · 2025-06-20T05:32:32Z

python/sglang/srt/managers/data_parallel_controller.py

This log message uses logger.info and includes potentially large data structures (local_tokens, onfly_info). If requests are frequent, this could lead to excessive logging and performance overhead. Consider changing this to logger.debug or making it conditional, for example, logging only every N requests or if a specific debug flag is enabled.
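The two options mentioned above (debug level, or logging only every N requests) can be combined. The following is a small sketch; the class name and the interval are hypothetical and not taken from the PR.

```python
import logging

logger = logging.getLogger(__name__)
LOG_EVERY_N = 1000  # hypothetical sampling interval


class DispatchLogger:
    """Emit the load snapshot only on every N-th dispatch."""

    def __init__(self, every_n: int = LOG_EVERY_N):
        self.every_n = every_n
        self.count = 0

    def maybe_log(self, local_tokens, onfly_info) -> bool:
        """Return True when this call was sampled for logging."""
        self.count += 1
        if self.count % self.every_n != 0:
            return False
        # Lazy %-formatting: the message is only built if DEBUG is enabled.
        logger.debug(
            "dp balance: local_tokens=%s onfly_info=%s", local_tokens, onfly_info
        )
        return True
```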

gemini-code-assist · 2025-06-20T05:32:32Z

python/sglang/srt/managers/data_parallel_meta.py

This line self.shm1.name accesses the name attribute of the shared memory object but does not use the value. It has no side effect and can be removed if it's not intended for a specific purpose (e.g., debugging during development).

gemini-code-assist · 2025-06-20T05:32:33Z

python/sglang/srt/managers/scheduler.py

The parameter holding_tokens_list in gather_dp_balance_info is misleading. Based on its usage in handle_dp_balance_data (lines 1854-1855), it's a single integer representing the current worker's holding tokens, not a list.

Later, within gather_dp_balance_info (line 1814), holding_tokens_list is re-assigned to be a list of holding tokens gathered from all workers if self.tp_rank == 0.

This dual meaning and misnaming can cause confusion. Consider renaming the parameter to something like current_worker_holding_tokens: int and the locally gathered list to all_workers_holding_tokens: List[int].

gemini-code-assist · 2025-06-20T05:32:33Z

python/sglang/srt/managers/scheduler.py

The values 511 and 512 (for gather_tensor_size) are used here and seem related to max_onfly_req_per_worker defined implicitly in DPBalanceMeta. It would be better to define these as named constants, possibly in data_parallel_meta.py or a shared constants module, and import them here. This improves readability and makes it easier to update if the underlying limits change.

gemini-code-assist · 2025-06-20T05:32:33Z

python/sglang/srt/managers/scheduler.py

The return type hint Union[None, List[List[int]]] for gather_dp_balance_info is incorrect. The function actually returns a tuple: (gathered_id_list_per_worker, holding_tokens_list).
Based on the logic:

If self.tp_rank != 0, it returns (None, input_argument_holding_tokens).

If self.tp_rank == 0, it returns (List[List[int]], List[int]) (where the second list is the gathered holding tokens for all workers).

A more accurate type hint would be Tuple[Optional[List[List[int]]], List[int]].

Suggested change

def gather_dp_balance_info(holding_tokens_list) -> Union[None, List[List[int]]]:
    """gather recv_dp_balance_id_this_term and holding tokens per worker for dp balance"""
    recv_list = self.recv_dp_balance_id_this_term
    assert len(recv_list) <= 511, (
        "The number of requests received this round is too large. "
        "Please increase gather_tensor_size and onfly_info_size."
    )

    gather_tensor_size = 512
    # recv_tensor: | holding_tokens | len(recv_dp_balance_id) | recv_dp_balance_ids
    recv_tensor = torch.zeros(gather_tensor_size, dtype=torch.int32)
    recv_tensor[0] = holding_tokens_list
    recv_tensor[1] = len(
        recv_list
    )  # The first element is the length of the list.
    recv_tensor[2 : len(recv_list) + 2] = torch.tensor(
        recv_list, dtype=torch.int32
    )

    if self.tp_rank == 0:
        gathered_list = [
            torch.zeros(gather_tensor_size, dtype=torch.int32)
            for _ in range(self.balance_meta.num_workers)
        ]
    else:
        gathered_list = None

    torch.distributed.gather(
        recv_tensor, gathered_list, group=self.tp_cpu_group
    )

    gathered_id_list_per_worker = None
    if self.tp_rank == 0:
        gathered_id_list_per_worker = []
        holding_tokens_list = []
        for tensor in gathered_list:
            holding_tokens_list.append(tensor[0].item())
            list_length = tensor[1].item()
            gathered_id_list_per_worker.append(
                tensor[2 : list_length + 2].tolist()
            )

    return gathered_id_list_per_worker, holding_tokens_list

def gather_dp_balance_info(current_worker_holding_tokens: int) -> Tuple[Optional[List[List[int]]], List[int]]:
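To make the rank-0 return shape concrete, here is a torch-free sketch of the unpacking step, using plain lists in place of the gathered tensors. The function name and `HEADER_SLOTS` constant are hypothetical; only the row layout (holding tokens, id count, ids) comes from the code above.

```python
from typing import List, Tuple

HEADER_SLOTS = 2  # slot 0: holding tokens, slot 1: number of ids that follow


def unpack_gathered(rows: List[List[int]]) -> Tuple[List[List[int]], List[int]]:
    """Split each gathered row into (received ids, holding tokens)."""
    ids_per_worker: List[List[int]] = []
    holding_tokens_per_worker: List[int] = []
    for row in rows:
        holding_tokens_per_worker.append(row[0])
        count = row[1]
        ids_per_worker.append(row[HEADER_SLOTS:HEADER_SLOTS + count])
    return ids_per_worker, holding_tokens_per_worker
```

On ranks other than 0 the original function returns the integer it was given rather than a list, so a fully accurate hint for the second element would need to cover both `int` and `List[int]`.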

Edenzzzz · 2025-06-23T03:05:42Z

python/sglang/srt/managers/data_parallel_controller.py

might be more formal to name these as "on_the_fly" or "in_flight"?

Sure, we can first discuss the current implementation of the algorithm, and then I’ll make a unified update accordingly.

Edenzzzz · 2025-06-23T03:08:32Z

I see, previously it only uses round robin without checking token count

JustinTong0323 · 2025-06-30T08:14:46Z

I think you should clean the commit history🤣

WANG-GH · 2025-06-30T08:53:30Z

I've just cleaned up the commit history and resolved the conflicts with the main branch. Could you please review this PR?
@JustinTong0323

JustinTong0323 · 2025-06-30T23:20:55Z

Plz fix the lint, you could ref to contribution guide.

WANG-GH · 2025-07-01T03:15:58Z

Hi, I fixed the lint errors. Could you please trigger the CI?

Qiaolin-Yu · 2025-07-24T05:07:20Z

python/sglang/srt/managers/scheduler.py

+                "Please increase gather_tensor_size and onfly_info_size."
+            )
+
+            gather_tensor_size = 512


Could you add explanation for this value?

it means the maximum size of the tensor used for gathering data.

Qiaolin-Yu · 2025-07-24T05:09:31Z

python/sglang/srt/managers/data_parallel_controller.py

+        def get_next_global_balance_id() -> int:
+            INT32_MAX = 2147483647
+            current_id = self.global_balance_id
+            self.global_balance_id = (self.global_balance_id + 1) % INT32_MAX


Could you explain the meaning of this variable?

sure, this variable corresponds to the balance_id in TokenizedGenerateReqInput.
We use it to control the number of onfly tokens (requests dispatched to workers but not yet received).
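The wrap-around in the diff above keeps every id within the int32 range, which matters because the ids are later gathered in an int32 tensor. A standalone sketch of the same counter follows; the class name is hypothetical.

```python
INT32_MAX = 2147483647


class BalanceIdCounter:
    """Hand out ids in [0, INT32_MAX - 1], wrapping back to 0."""

    def __init__(self, start: int = 0):
        self._next = start

    def next_id(self) -> int:
        current = self._next
        # The modulo means INT32_MAX itself is never handed out.
        self._next = (self._next + 1) % INT32_MAX
        return current
```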

Qiaolin-Yu · 2025-07-24T05:13:17Z

python/sglang/srt/managers/data_parallel_meta.py

+    def __init__(self, num_workers: int):
+        self.num_workers = num_workers
+        self._manager = mp.Manager()
+        self.mutex = self._manager.Lock()


I was thinking if we could abstract some methods to python/sglang/srt/utils.py and python/sglang/srt/distributed/parallel_state.py ?

Sure, I moved it to managers/utils.py.

WANG-GH · 2025-07-25T12:08:17Z

I tested on 2 H20 nodes with 4096 requests.

# test command
python3 -m sglang.bench_serving --backend sglang --num-prompts 4096 --dataset-path /sgl-workspace/sharegpt.json

rr:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     4096      
Benchmark duration (s):                  263.31    
Total input tokens:                      1294121   
Total generated tokens:                  787217    
Total generated tokens (retokenized):    783897    
Request throughput (req/s):              15.56     
Input token throughput (tok/s):          4914.74   
Output token throughput (tok/s):         2989.65   
Total token throughput (tok/s):          7904.38   
Concurrency:                             2123.68   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   136522.09 
Median E2E Latency (ms):                 130958.79 
---------------Time to First Token----------------
Mean TTFT (ms):                          60330.23  
Median TTFT (ms):                        63801.38  
P99 TTFT (ms):                           113752.40 
---------------Inter-Token Latency----------------
Mean ITL (ms):                           396.08    
Median ITL (ms):                         143.72    
P95 ITL (ms):                            654.97    
P99 ITL (ms):                            1512.92   
Max ITL (ms):                            105775.01 
==================================================

minimum token:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     4096      
Benchmark duration (s):                  262.36    
Total input tokens:                      1294121   
Total generated tokens:                  787217    
Total generated tokens (retokenized):    783888    
Request throughput (req/s):              15.61     
Input token throughput (tok/s):          4932.61   
Output token throughput (tok/s):         3000.52   
Total token throughput (tok/s):          7933.12   
Concurrency:                             2136.54   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   136851.55 
Median E2E Latency (ms):                 129838.93 
---------------Time to First Token----------------
Mean TTFT (ms):                          61737.04  
Median TTFT (ms):                        64347.30  
P99 TTFT (ms):                           111977.50 
---------------Inter-Token Latency----------------
Mean ITL (ms):                           390.17    
Median ITL (ms):                         143.37    
P95 ITL (ms):                            567.51    
P99 ITL (ms):                            1079.40   
Max ITL (ms):                            104110.49 
==================================================

…p_balance

ltdo111 · 2025-07-31T01:29:03Z

I've been following your PR for a long time. I was wondering when it will be merged into the main repository. Do you have a plan for that?

WANG-GH · 2025-07-31T02:39:05Z

Thank you for the reply! I also hope to merge it into the main branch as soon as possible. Over the past few weeks, I've been actively communicating with Qiaolin and Cheng Wan from the SGLang community, and they have provided many great suggestions for this PR. We are currently confirming with the PD separation team to check if there are any conflicts between our designs.
@ch-wan @Qiaolin-Yu

ltdo111 · 2025-07-31T03:15:20Z

I've read through your PR thoroughly and have two points I'd like to confirm with you:

The first question: There is a get_load method in the scheduler, but it isn't reused in this PR. After reading through your code, my understanding is that the current minimum token load balancing method takes into account the load brought by output token IDs during the decode phase, making the load statistics more accurate. I wonder if this understanding is correct.

The second question is about the implementation of data_parallel_controller and IPC communication. The community originally adopted a unified communication method using zmq, but this PR introduces a new mp API to implement communication between dpc and the scheduler. Have you considered switching the communication method from mp to IPC communication? This would be more consistent with the community style, provide a single unified implementation for inter-process communication, and make it easier for subsequent open-source contributors to read the code.

PanXun2 · 2025-07-31T03:16:14Z

@WANG-GH Glad to know this is actively developed. Currently, the Router has a load balancer that calls get_loads on each sglang instance to collect workloads. Do you have any plan to reuse this load information, or to share code between these similar functions?

WANG-GH · 2025-07-31T03:57:03Z

Thanks for asking.

For the first question: I hadn't noticed the get_load function before. When calculating the load, this function, combined with the current batch size, can better accommodate PD separation.

For the second question: IPC communication is more suitable for the producer-consumer model, but the current controller needs to frequently check the global instance's status. This programming paradigm is more like using locks to maintain a global state, which doesn't align well with the producer-consumer model. Changing to IPC would be quite awkward.

WANG-GH · 2025-07-31T04:00:33Z

I hadn't noticed the get_load function before. When calculating the load, this function, combined with the current batch size, can better accommodate PD separation. I will refactor the handle_dp_balance_data func to use get_loads before gather.

ltdo111 · 2025-07-31T13:03:43Z

Thanks for the reply, I see.

ollybbmonster

Thx for your work on this feature, it helps me a lot.

ollybbmonster · 2025-08-11T07:11:58Z

python/sglang/srt/managers/data_parallel_controller.py

+        with self.balance_meta.mutex:
+            # 1. local_tokens represents the tokens currently inferring on the worker,
+            #  while onfly refers to the requests dispatched by the dispatcher but not yet received by the scheduler.
+            onfly_info = self.balance_meta.get_shared_onfly()


I’m wondering how long onfly_reqs can actually survive.
It seems the scheduler receives reqs from DPC almost immediately.
Meanwhile, onfly_reqs are appended in process_input_requests and then excluded at the end of get_next_batch_to_run every 40 iterations.

Given this flow, the comment here might be misleading.

ollybbmonster · 2025-08-11T07:17:03Z

python/sglang/srt/managers/utils.py

+        return list(self.shared_state.local_tokens)
+
+    def set_shared_local_tokens(self, data: List[int]):
+        self.shared_state.local_tokens = data


Not sure if this is intentional, but in both set_* functions, assigning a plain Python list over a multiprocessing.Manager().list() could replace the managed object, losing the cross-process synchronization.
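If the shared object is a managed list proxy, one way to avoid rebinding it is to update its contents in place with slice assignment. This is a sketch under that assumption; whether `shared_state.local_tokens` in the PR is such a proxy is not confirmed here, and the helper name is hypothetical.

```python
from typing import List, MutableSequence


def update_in_place(shared: MutableSequence[int], data: List[int]) -> None:
    """Replace the contents of `shared` without rebinding the object."""
    shared[:] = data


# Usage with a managed list (this starts a manager process, so run it
# under `if __name__ == "__main__":`):
#     import multiprocessing as mp
#     manager = mp.Manager()
#     tokens = manager.list([0, 0])
#     update_in_place(tokens, [5, 9])  # holders of `tokens` see [5, 9]
```

The same helper works on an ordinary list, which makes the identity-preserving behaviour easy to check without starting a manager.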

ollybbmonster · 2025-08-11T07:18:22Z

python/sglang/srt/managers/scheduler.py

+                "Please increase gather_tensor_size and onfly_info_size."
+            )
+            # The maximum size of the tensor used for gathering data from all workers.
+            gather_tensor_size = 512


Should this be assert len(recv_list) < 511?
Otherwise recv_tensor could need 1 + 1 + 511 slots, which would exceed 512.

Also, as the Gemini bot mentioned, holding_tokens_list is misleading when it’s actually length of tokens rather than a list — especially since it’s later mixed with real list operations.
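The off-by-one concern above is easier to see when the limit is derived from the tensor size instead of being written as a second literal. Below is a torch-free sketch with plain lists; the constant and function names are hypothetical.

```python
from typing import List

GATHER_TENSOR_SIZE = 512
HEADER_SLOTS = 2  # holding tokens + id count
MAX_IDS_PER_ROUND = GATHER_TENSOR_SIZE - HEADER_SLOTS  # 510


def pack_row(holding_tokens: int, recv_ids: List[int]) -> List[int]:
    """Lay out one worker's row: | holding_tokens | len(ids) | ids... |"""
    assert len(recv_ids) <= MAX_IDS_PER_ROUND, (
        "The number of requests received this round is too large."
    )
    row = [0] * GATHER_TENSOR_SIZE
    row[0] = holding_tokens
    row[1] = len(recv_ids)
    # Equal-length slice assignment keeps the row at GATHER_TENSOR_SIZE.
    row[HEADER_SLOTS:HEADER_SLOTS + len(recv_ids)] = recv_ids
    return row
```

With two header slots, 510 ids is the most that fits in 512 entries, which matches the `< 511` bound proposed in the comment.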

…t#7379)

whybeyoung · 2025-09-09T03:23:40Z

LGTM

(sgl-project#8035) * fix: modality length mismatch with image_data (sgl-project#7887) * Update CODEOWNERS (sgl-project#8044) * perf: add qwen3-30b-a3b fused moe tuning config for h20 * [feat]Support fusion kernel for constructing quant input and scale factor for fp8_blockwise_scaled_grouped_mm (sgl-project#8023) * feat: update multimodal data handling in engine entrypoint (sgl-project#8002) * fix: remove redundant rotary embedding cache recomputation in MiniCPM (sgl-project#8022) * Fix the input tools format and history tool_calls in OpenAI API (sgl-project#6556) * fix: resolve arm build issue (sgl-project#8052) * concurrently load weights of DeepseekV2ForCausalLM (sgl-project#7943) * H20 tune config for Kimi (sgl-project#8047) * Update amd docker image. (sgl-project#8045) * feat: replace Decord with video_reader-rs (sgl-project#5163) * remove kv_a.congigous in DeepseekV2AttentionMLA (sgl-project#8058) * update transformers to 4.53.2 (sgl-project#8029) * Fix different device type adjustment in PP (sgl-project#7760) * Use device_group for all_gather when disabling overlap scheduling (sgl-project#8001) * Revert "feat: replace Decord with video_reader-rs" (sgl-project#8077) * Fix CI xeon test with triton 3.3.1 (sgl-project#8086) * fix greenctx stream compability (sgl-project#8090) * [misc] update nvshmem and pin deepEP commit hash (sgl-project#8098) * [Feature] Layer-wise Prefill (sgl-project#7634) * [1/n] chore: decouple quantization implementation from vLLM dependency (sgl-project#7992) * refactor: unify names of the feature field of MultimodalDataItem (sgl-project#8075) * feat: add tp_rank, pp_rank and dp_rank labels for scheduler metrics (sgl-project#7597) * [ci] limit cmake build nproc (sgl-project#8100) * [ci] disable memory imbalance check for draft worker (sgl-project#8108) * [Fix] ensure DeepGEMM is only enabled for FP8_W8A8 models (sgl-project#8110) * [ci] recover 8-gpu deepep test (sgl-project#8105) * Refactor: move all quantization-related code to 
`srt/layer/quantization` (sgl-project#7989) * [kernel] opt moe align block kernel by block/warp scan algorithm (sgl-project#7884) * Super tiny fix typo (sgl-project#8046) * fix: update HostKVCache init to report correct msg when available memory is not enough (sgl-project#8102) * [Hunyuan]: Fix Dense Model Support (sgl-project#8117) * feat: add production metric for retracted requests due to insufficient kvcache (sgl-project#7030) * refactor: simply MultimodalTokens logic (sgl-project#7924) * [Fix][Ready]Fix register spilling in cutlass nvfp4 gemm kernel on Blackwell (sgl-project#8127) * Feat: Support Granite 3.0 MoE in SGLang (sgl-project#7959) * load draft model fix (sgl-project#7506) * [CPU][Llama4] Fix Llama4 MoE inputs with "apply_router_weight_on_input" (sgl-project#7889) * [Quantization][w8a8_int8] Fix weight loading issue for w8a8_int8 path with "ignore" layer list in quantization config (sgl-project#7820) * Hicache Storage Layer Prototype (sgl-project#7704) * Revert "Fix different device type adjustment in PP" (sgl-project#8141) * feat: enchance green context stream creation robust with backward compatibility (sgl-project#8136) * fix compressed tensors WNA16 imports (sgl-project#8142) * [Bugfix] Fix w8a8_int8 import error on NPU (sgl-project#8147) * [3/n] chore: decouple AWQ implementation from vLLM dependency (sgl-project#8113) * [router] Refactor router and policy traits with dependency injection (sgl-project#7987) * [AMD] Add triton awq_dequantize kernel to support AWQ on ROCm (sgl-project#7661) * [Doc] Steps to add a new attention backend (sgl-project#8155) * chore: tune mem fraction static for vlm (sgl-project#6881) * Support NVFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs (sgl-project#7302) * Feat: Support audio in Phi4-mm model (sgl-project#8048) * [PD] Support non-MLA models PD different TP with DP attention (sgl-project#7931) * [health_generate] fix: fix the /health_generate always success bug (sgl-project#8028) * [router] router metrics 
cleanup (sgl-project#8158) * [router] allow router to have empty workers (sgl-project#8160) * Add GB200 wide-EP docker (sgl-project#8157) * [1/N] MoE Refactor: refactor `select_experts` (sgl-project#7966) * chore: bump sgl-kernel v0.2.6 (sgl-project#8165) * chore: upgrade sgl-kernel 0.2.6 (sgl-project#8166) * [theta] sync bailing * Fix suffix mismatch for the metrics. (sgl-project#8168) * Update README.md (sgl-project#8171) * Clean up server args (sgl-project#8161) * Fix LoRA buffer contamination during adapter eviction (sgl-project#8103) * Fix Dockerfile.gb200 (sgl-project#8169) * [router] add ut for worker and errors (sgl-project#8170) * bugfix: fix sglang crash in NVIDIA MIG container (sgl-project#8167) * Support start up LoRA server without initial adapters (sgl-project#8019) * Clean warning logs for gate_proj loading in Lora (sgl-project#8172) * Fix tuning_fused_moe_triton.py (sgl-project#8175) * [Feature] Simple Improve Health Check Mechanism for Production-Grade Stability (sgl-project#8115) * Add bf16 output option for dsv3_router_gemm kernel (sgl-project#7999) * Enable FlashInfer support encoder models and add head_dim padding workaround (sgl-project#6230) * Add get_hidden_dim to qwen3.py for correct lora (sgl-project#7312) * feat: add h200 tp 16 kimi k2 moe config (sgl-project#8176) * feat: add b200 tp 16 kimi k2 moe config (sgl-project#8178) * fix moe gate dtype, fix tbo, fix fake dispatch (sgl-project#7825) * Revert "[Feature] Simple Improve Health Check Mechanism for Production-Grade Stability" (sgl-project#8181) * feat: update nccl 2.27.6 (sgl-project#8182) * Feat: Support for Persimmon Model (sgl-project#7983) * feat: add h200 tp 16 kimi k2 moe config (sgl-project#8183) * Fix eagle3 cuda graph (sgl-project#8163) * fix: fix the bug of loading Internvl3 (sgl-project#8067) * Fix dtype error in CI (sgl-project#8197) * Cherry-pick commit 2dc5de40 "perf: add bailing mo..." 
到当前分支 * [router] add ut for pd request, metrics and config (sgl-project#8184) * [feature] enable NPU CI (sgl-project#7935) * [fix] fix modelopt fp4 on b200 (sgl-project#8195) * chore: bump sgl-kernel v0.2.6.post1 (sgl-project#8200) * Apply fused sorted token ids padding (sgl-project#8193) * [Refactor] simplify multimodal data processing (sgl-project#8107) * [theta] feat vl name * [router] add ut for pd router (sgl-project#8208) * [router] upgade router version to 0.1.6 (sgl-project#8209) * Remve router gemm output dtype conversion (sgl-project#8204) * chore: upgrade sgl-kernel 0.2.6.post1 (sgl-project#8202) * [Feature] Add a test for Layer-wise Prefill (sgl-project#8231) * docs: update 2025 h2 roadmap (sgl-project#8237) * fix: retrieve mm token by modality, raise error if none (sgl-project#8221) * [AMD] Remove vllm's scaled_fp8_quant and moe_sum when SGLANG_USE_AITER=1 (sgl-project#7484) * [theta] tune h20 config for qwen3 235b * [theta] tune h20 config for qwen3 235b * fix: sgl-router remove dead code (sgl-project#8257) * [fix] benchmark : routed_scaling_factor is None (sgl-project#8059) * [Benchmark] add disable-auto-run param for hicache/bench_multiturn (sgl-project#7822) * Preliminary Support for Qwen3XMLDetector (sgl-project#8260) * chore: bump v0.4.9.post3 (sgl-project#8265) * PullRequest: 178 perf: add qwen235b h20-3e fused moe kernel config * [theta] tune h20 config for qwen3 480b * Skip llama4 vision module loading when multimodal disabled (sgl-project#8272) * PullRequest: 180 新增Qwen480B和Qwen235B在NVIDIA H20-3e上的Fused MoE Triton配置 * Fix sgl-kernel ci test (sgl-project#8284) * [theta] tune h200 config for qwen3 480b * Introduce Stable LoRA ID System for Overlapped Updates and Prefix Caching (sgl-project#8261) * Hicache IO kernel refactoring (sgl-project#8264) * bug fix and tag (sgl-project#8282) * HiCache Fix (sgl-project#8288) * [sgl-kernel] Opt per_token_quant_fp8 with warp reduce (sgl-project#8130) * [router] add common ut infra to mock worker and app 
(sgl-project#8295) * fix: workaround for deepgemm warmup issue (sgl-project#8302) * [Performance][PD Disaggregation] optimize TokenToKVPoolAllocator by sorting free pages (sgl-project#8133) * Fix the issue of incorrect finish reason in final stream response chunk returned during tool call (sgl-project#7708) * fix: match chat-template for internvl3 (sgl-project#8262) * Fix gemma3n with hybrid swa (sgl-project#8240) * chore: upgrade sgl-kernel 0.2.7 (sgl-project#8304) * fix: prevent crashes due to logit bias dimension mismatch (sgl-project#7685) * feat(function call): complete utility method for KimiK2Detector and enhance documentation (sgl-project#8043) * Fix incomplete tool call capture issue in streaming response of DeepSeek-V3 when enable MTP (sgl-project#7562) * [AMD] Pull latest image for AMD CI (sgl-project#8070) * Pin the version of petit kernel to fix the APIs (sgl-project#8235) * [bug] fix pd completion protocol for batching support (sgl-project#8317) * [router] fix pd model completion request (sgl-project#8303) * fix bug when eos_ids==0 (sgl-project#8315) * [router] add endpoint unit test (sgl-project#8298) * [code style] Clean dead triton kernel code in fused_moe and useless vllm_ops import (sgl-project#8310) * chore: upgrade flashinfer v0.2.9rc1 (sgl-project#8301) * [router] add streaming unit test (sgl-project#8299) * [router] add request format unit test (sgl-project#8300) * HiCache Storage TP Refinement (sgl-project#8307) * breakdown kernel update (sgl-project#8334) * support idle batch for TBO (sgl-project#8233) * [Feature] Integrate quick allreduce and select the best allreduce implementation (sgl-project#6619) * DP Enhancement (sgl-project#8280) * fix: Fix failed functional tests https://github.com/meta-llama/llama-stack-evals (sgl-project#8266) * [AMD] Add silu_and_mul, gelu_and_mul, gelu_tanh_and_mul, and gelu_quick kernels for AMD GPUs (sgl-project#7135) * [CPU] Add tutorial docs for SGL on CPU (sgl-project#8000) * chore: upgrade mooncake 0.3.5 
(sgl-project#8341) * [torch.compile bug] avoid biased_grouped_topk_impl func repeatedly triggering `torch.compile` in forward pass (sgl-project#8353) * [P/D] Support ipv6 in P/D scenario (sgl-project#7858) * Add H20-3e fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct (sgl-project#8344) * [Bugfix][Feat] Add XML-ish grammar in EBNFComposer and fix misc bugs in Qwen3 detector (sgl-project#8357) * Clean up server_args, triton cache manager (sgl-project#8332) * fix: upgrade nccl version (sgl-project#8359) * [Feat] Add reasoning parser for Qwen/Qwen3-235B-A22B-Thinking-2507 (sgl-project#8363) * fix: kimi k2 xgrammar crash (sgl-project#8367) * Fix FP4 MoE accuracy from missing routed_scaling_factor (sgl-project#8333) * [CI] Fix flaky threshold (sgl-project#8370) * chore: bump v0.4.9.post4 (sgl-project#8305) * Fix test_moe_fused_gate_combined sgl-kernel ci test (sgl-project#8374) * Uodate Dockerfile.gb200 to latest sglang (sgl-project#8356) * chore: improve mmmu benchmark (sgl-project#7000) * Save peak memory in logits processor (sgl-project#8343) * Extract update_weights from RL Engine to SGLang to keep simplicity and fix torch reduce (sgl-project#8267) * chore: improvements on mm_utils (sgl-project#7737) * vlm: optimize tensor transport (sgl-project#6003) * Tiny assert EPLB is used together with expert parallel (sgl-project#8381) * model: support intern-s1 (sgl-project#8350) * Add perf tests for LoRA (sgl-project#8314) * Remove slot usage in code to be backward-compatible with python 3.9 (sgl-project#8396) * Add docker release flow for gb200 (sgl-project#8394) * HiCache, check before terminate prefetching (sgl-project#8372) * Add nvfp4 scaled mm benchmark. 
(sgl-project#8401) * Urgent Fix: intern-s1 chat-template matching (sgl-project#8403) * Tool to dump and compare internal activation tensors (sgl-project#7976) * Minor tool for comparison of benchmark results (sgl-project#7974) * Fix bench script making input data on L2 cache (sgl-project#7739) * [NVIDIA] Add Flashinfer MoE blockscale fp8 backend (sgl-project#8036) * Update Cutlass in sgl-kernel to v4.1 (sgl-project#8392) * fix: minor fix TransportProxyTensor under tp (sgl-project#8382) * [router] add different policies for p node and d node (sgl-project#8395) * Add A800 fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct (sgl-project#8351) * fix: fix the missing metrics on non-rank0 nodes (sgl-project#7720) * [2/N] MoE Refactor: Unify weight loader and quant methods (sgl-project#8397) * Use FlashInfer FP4 gemm. (sgl-project#8241) * Support precomputed_embeddings for Llama 4 (sgl-project#8156) * [hotfix] fix merge conflicts in FlashInferEPMoE (sgl-project#8405) * chore: update CODEOWNERS (sgl-project#8407) * chore: upgrade flashinfer v0.2.9rc2 (sgl-project#8406) * Support triton kernels v3.4.0 for fused_moe (sgl-project#8258) * [Bugfix] Prevent PD server crash from invalid grammar (sgl-project#8062) * Change to use native arm runner (sgl-project#8414) * Support overlapped lora updates (sgl-project#8213) * Support ue8m0 for triton quant kernel (sgl-project#7603) * Fix: Improve test_openai_function_calling unit test and fix reasoning_parser.py think_start_token logic (sgl-project#8316) * bugfix: Fix multiple finish_reason chunks and tool_calls finish reason check (sgl-project#8417) * Fix test_openai_server (sgl-project#8419) * Fix docker buildx push error (sgl-project#8425) * bugfix: Fix XGrammar backend to use model's EOS tokens for constrained generation (sgl-project#8422) * [router] improve router logs and request id header (sgl-project#8415) * [feat] Support different attention backends for prefill and decode (sgl-project#6338) * chore: bump 
transformer to 4.54.0 (sgl-project#8416) * [PD] Fix abort_request for PD disaggregation (sgl-project#8352) * GLM-4.5 Model Support (sgl-project#8224) * Remove zstd compression for building Dockerfile.gb200 (sgl-project#8442) * doc: add bench_one_batch_server in the benchmark doc (sgl-project#8441) * GLM-4.5 Model Support Follow-up (sgl-project#8445) * fix GLM4_MOE launch with compressed_tensor quant model (sgl-project#8456) * Fix per_token_group_quant_8bit when hidden_dim // group_size is not divided by 4. (sgl-project#8449) * Revert "[kernel] opt moe align block kernel by block/warp scan algorithm" (sgl-project#8457) * chore: bump v0.4.9.post5 (sgl-project#8458) * fix:reorder topk experts to ensure shared expert replaces minimal score (sgl-project#8125) * perf: add kimi k2 h200 fused moe config (extracted from theta-asap-sglang-049) * Cherry-pick commit 4a75e015 "Add draft model fuse..." 到当前分支 * Update PR template (sgl-project#8465) * feat: throttle requests at scheduler based on --max_queued_requests (sgl-project#7565) * [theta] tuning script for glm4 moe * perf: add fused moe kernel config glm4.5,h20-3e,tp8 * [theta] tuning script for glm4 moe h20 * fix: update dep (sgl-project#8467) * [NVIDIA] Change to use `num_local_experts` (sgl-project#8453) * Fix parsing ChatCompletionMessage (sgl-project#7273) * [3/N] MoE Refactor: Simplify DeepEP Output (sgl-project#8421) * feat: support glm4 tuning (sgl-project#8473) * Fix DEEPEP BF16 compatibility for Deepseek Style model like GLM 4.5 (sgl-project#8469) * Update codeowner (sgl-project#8476) * chore: add glm4 fp8 tp8 config (sgl-project#8478) * chore: add glm 4.5 fp8 tp4 config (sgl-project#8480) * [CI]Add genai-bench Performance Validation for PD Router (sgl-project#8477) * Update CODEOWNERS (sgl-project#8485) * Rename the last step in pr-test.yml as pr-test-finish (sgl-project#8486) * Reduce memory usage for fp4 moe (sgl-project#8413) * Tiny add warnings for DeepEP when it is suboptimal (sgl-project#8426) * Support 
colocating requests (sgl-project#7973) * Fix incorrect KV cache allocation for MTP models. (sgl-project#8482) * Add PVC and update resource limits in k8s config (sgl-project#8489) * chore: bump v0.4.9.post6 (sgl-project#8517) * Always trigger pr-test (sgl-project#8527) * Update README.md (sgl-project#8528) * [sgl-kernel performace] fix fp8 quant kernels dispatch __nv_fp8_e4m3 bug to improve performance 10%-20% (sgl-project#8499) * Update cutlass_moe.py (sgl-project#8535) * Fix moe align kernel test (sgl-project#8531) * Split the scheduler into multiple mixin classes to reduce the file size (sgl-project#8483) * bring back kimi vl ci (sgl-project#8537) * fix: temporarily disable cuda-ipc for mm data tensor (sgl-project#8431) * Support EPLB in FusedMoE (sgl-project#8448) * feat(hicache): support file backend reading directory config form env. (sgl-project#8498) * feature(pd-hicache): Prefill instances support reusing the RemoteStorage Cache via HiCache. (sgl-project#8516) * [router] allow longer time out for router e2e (sgl-project#8560) * Update cutlass_moe.py (sgl-project#8545) * Update CODEOWNERS (sgl-project#8562) * [feature] [sgl-router] Add a dp-aware routing strategy (sgl-project#6869) * [Hot-Fix] moe_aligned_block_size CI failed in AMD (sgl-project#8461) * Cherry-pick commit 4fdc06a9 "add fp8a8 kimi-k2 dr..." 到当前分支 * [Model] Add support for Arcee Foundational Model (sgl-project#8154) * Revert "Fix the input tools format and history tool_calls in OpenAI API (sgl-project#6556)" (sgl-project#8584) * Add hf3fs support for hicache storage (based on sgl-project#7704) (sgl-project#7280) * [router] migrate router from actix to axum (sgl-project#8479) * [Fix]Fix index oob in get_group_gemm_starts kernel. (sgl-project#8564) * Bump transfomers to 4.54.1 to fix Gemma cache issue. (sgl-project#8541) * Add GKE's default CUDA runtime lib location to PATH and LD_LIBRARY_PATH. 
(sgl-project#8544) * Bug: Fix google gemma3n-mm audio input not working bug (sgl-project#8365) * update sgl-kernel for EP: kernel part (sgl-project#8514) * chore: bump sgl-kernel v0.2.8 (sgl-project#8599) * [bugfix] Fix 2 minor bugs in the hicache storage layer (sgl-project#8404) * fix incorrect increase of hit count (sgl-project#8533) * Support l3 cache (mooncake store) for hiradix cache (sgl-project#7211) * [theta] Conditionally import HiCacheHF3FS sgl-project#8598 * update sgl-kernel for EP: python part (sgl-project#8550) * add SVG logo (sgl-project#8603) * [4/N] MoE Refactor: Unified Triton Kernel for FusedMoE and EPMoE (sgl-project#8515) * fix: fork should not run pypi router (sgl-project#8604) * model: support Step3V (sgl-project#8583) * [Feature] Hybrid EP and TP (sgl-project#8590) * chore: bump v0.4.10 (sgl-project#8608) * [PD] Use batch transfer for rdma transport and add notes for mnnvl usage (sgl-project#8595) * [bugifx] QWen-1M context support[2/3] using current cuda stream in the DCA's kernel for bugfix. 
(sgl-project#8611) * Fix hf3fs_fuse import error (sgl-project#8623) * Update step3v default config (sgl-project#8626) * [ci] fix genai-bench execution cmd (sgl-project#8629) * [router] update router pypi version (sgl-project#8628) * [Optimization][Perf] Disable the GC during CUDA graph capture to speed up by up to 3x (sgl-project#8577) * Fix typos in py_test/test_launch_server.py (sgl-project#6227) * misc: Remove debug print to logger.info (sgl-project#8633) * SGLang HiCache NIXL Connector (sgl-project#8488) * [bug] remove pdlb from minilb since its no longer available (sgl-project#8634) * [bugfix] Fix flashinfer cutlass EP moe after MoE refactor (sgl-project#8630) * Conditionally import HiCacheHF3FS (sgl-project#8598) * TRTLLM Gen MLA Decode Kernel Integration (same as sgl-project#7938) (sgl-project#8632) * Fix nan value generated after custom all reduce (sgl-project#8532) * Revert "Fix nan value generated after custom all reduce (sgl-project#8532)" (sgl-project#8642) * Feature/modelscope model download (sgl-project#8083) * chore: speedup NPU CI by cache (sgl-project#8270) * [Bugfix] fix w8a8_int8 load issue (sgl-project#8308) * [bugfix] fix router python parser for pd urls (sgl-project#8644) * [router] add basic usage doc (sgl-project#8640) * [router] upgrade router version to 0.1.8 (sgl-project#8645) * [NVIDIA] Enable Flashinfer MoE blockscale fp8 backend for TP MoE (sgl-project#8450) * HiCache, fixing hash value indexing (sgl-project#8636) * Interface change for kvcache io to support page first layout (sgl-project#8318) * Update batch size limitation of dsv3_router_gemm kernel to 16 (sgl-project#8051) * chore: bump v0.4.10.post1 (sgl-project#8652) * Add hf3fs_utils.cpp to package-data (sgl-project#8653) * Fix chat template handling for OpenAI serving (sgl-project#8635) * Bug: apply final_hidden_states*=self.routed_scaling_factor at MoE lay… (sgl-project#8511) * [5/N] MoE Refactor: Update MoE parallelism arguments (sgl-project#8658) * Increase tolerance to 
address CI failures (sgl-project#8643) * [Kimi K2] dsv3_router_gemm supports NUM_EXPERTS == 384 (sgl-project#8013) * [DOC]Update sgl-kernel README (sgl-project#8665) * fix per token cuda kernel hidden dim cannot divide by 16 (sgl-project#8543) * fix arg typo for --disaggregation-transfer-backend (sgl-project#8664) * [fix] fix pd disagg error of vlms (sgl-project#8094) * Disable tp for shared experts under expert parallelism for GLM4.5 model (sgl-project#8647) (sgl-project#8647) * [bugfix] Fix page size for create_flashmla_kv_indices_triton() for cutlass mla (sgl-project#8685) * [bug] limit bootstrap room to to [0, 2^63 - 1] (sgl-project#8684) * Update CODEOWNERS (sgl-project#8686) * Fix deepgemm masked grouped gemm jit compile (sgl-project#8679) * Fix FP8 block quantization when N or K is not multiples of 128 (sgl-project#8648) * bugfix(hicache): Fix 'MooncakeStore' not defined error. (sgl-project#8668) * upgrade xgrammar 0.1.22 (sgl-project#8522) * [bugfix] Add 'disaggregation_mode' parameter to warmup function when compile deep_gemm manually (sgl-project#8618) * Add support for NCCL symmetric memory for TP allreduces (sgl-project#8238) * [1/2] sgl-kernel: Fuse routed scaling factor into select_experts (sgl-project#8364) * chore(gb200): update dockerfile to handle fp4 disaggregation (sgl-project#8694) * [bugfix] Apply routed scaling factor to cutlass_fused_experts_fp8 (sgl-project#8688) * Fix: resolve prefill of retracted request out-of-memory issue when ignore_eos is enabled (sgl-project#7434) * model: adapt mllama4 to VisionAttention (sgl-project#8512) * Add tensor.detach() back to update weight util (sgl-project#8691) * [Doc] Polish sgl-kernel readme for cu126 build error (sgl-project#8704) * [theta] merge 0802-3 * Revert "[1/2] sgl-kernel: Fuse routed scaling factor into select_experts" (sgl-project#8706) * [router] minor code clean up and and refactoring (sgl-project#8711) * [Bug] fix green context's incompatibility with `cuda < 12.4` (sgl-project#8701) * 
chore: bump sgl-kernel v0.2.9 (sgl-project#8713) * Remove assertions about per group quant fp8 (sgl-project#8717) * [FIX] Fix the nightly CI by disabling swa mem pool for gemma2 (sgl-project#8693) * Fix triton moe error caused by TopK refactor (sgl-project#8705) * [router] Implement HTTP Dependency Injection Pattern for Router System (sgl-project#8714) * [Feature] Radix Tree in C++ (sgl-project#7369) * [Perf]Use Cooperative Schedule for H100 & H200 & H800 in fp8_blockwise_scaled_grouped_mm (sgl-project#8722) * Fix fused MoE when `routed_scaling_factor is None` (sgl-project#8709) * Tiny fix CI pytest error (sgl-project#8524) * [hotfix] fix mixtral with tensor-level compressed-tensor quantization (sgl-project#8721) * Support limiting max loaded loras in CPU. (sgl-project#8650) * Reduce memory accumulation in long-running server (sgl-project#8306) * HiCache storage, style change and bug fix (sgl-project#8719) * [feat] support minimum token load balance in dp attention (sgl-project#7379) * Do layernorm before allgather for DP attention (sgl-project#8631) * [fix] Fix divide by zero error for llama4. 
(sgl-project#8683) * feat: Add new moe triton for NVIDIA RTX 6000 Ada (sgl-project#8547) * [Improvements] Merge health check route (sgl-project#8444) * chore: bump sgl-kernel 0.3.0 with torch 2.8.0 (sgl-project#8718) * Save cuda graph memory for fa3 (sgl-project#8567) * [CUDA Graph] save cuda graph memory by using next_token_logits_buffer (sgl-project#8579) * [DP] fix the compatibility issue between DP attention and `--attention-backend triton` (sgl-project#8723) * chore: bump v0.4.10.post2 (sgl-project#8727) * feat: Support DP Attention for step3_vl (sgl-project#8699) * [RL] fix update weight for FusedMoE with EP (sgl-project#8676) * use fp32 for e_score_correction_bias in GLM-4.5 (sgl-project#8729) * Fix triton kernels topk with keyword arguments (sgl-project#8732) * feat: support cutlass_moe_fp8 kernel for fusedmoe in sm90 (sgl-project#8678) * Fix the missing 'lof' choice of --schedule-policy server args (sgl-project#7114) * fix args typo in memory_pool_host (sgl-project#8662) * [CI] Do not trigger pd-disaggregation CI in draft PR (sgl-project#8737) * [MoE] Enable `renormalize=False` in Triton kernels (sgl-project#8735) * Replace torch.jit.script with torch.compile in get_masked_input_and_mask to fix benchmark underreporting (sgl-project#8733) * Fix bug of refactoring TopKOutput in w4afp8 (sgl-project#8745) * Rename lora_path to lora_id in batches (sgl-project#8437) * [sgl-kernel] avoid per_token_quant_fp8.cu hardcode sm_count (sgl-project#8738) * [CI] Ascend NPU CI enhancement (sgl-project#8294) * [bugfix] fix import path in HiCacheController (sgl-project#8749)

WANG-GH requested review from ByronHsu, Ying1123, hnyls2002, ispobock, merrymercy, xiezhq-hermann, zhaochenyang20 and zhyncs as code owners June 20, 2025 05:30

gemini-code-assist bot reviewed Jun 20, 2025

Edenzzzz reviewed Jun 23, 2025

WANG-GH requested review from BBuf, CatherineSue, FlamingoPg, Fridge003, HaiShaw, HandH1998, JustinTong0323, ch-wan, kssteven418, mickqian, rkooo567, slin1237 and yizhang2077 as code owners June 30, 2025 07:00

WANG-GH force-pushed the dp_balance branch from 74457fb to 67c21df Compare June 30, 2025 08:46

Qiaolin-Yu reviewed Jul 24, 2025

guanyewang added 2 commits July 28, 2025 10:26

move DPBalanceMeta to manager/utils.py and add some comments

d8b8672

Merge branch 'dp_balance' of https://github.com/WANG-GH/sglang into dp_balance

854ab53

ch-wan self-assigned this Jul 31, 2025

Merge remote-tracking branch 'upstream/main' into dp_balance

457de67

WANG-GH force-pushed the dp_balance branch from 81b5fd6 to 457de67 Compare August 1, 2025 03:47

add ci and remove dp log

464db43

ch-wan added and removed the ready-to-merge label ("The PR is ready to merge after the CI is green.") Aug 3, 2025

add doc

0b03b3d

ch-wan merged commit f7b2853 into sgl-project:main Aug 3, 2025

ollybbmonster reviewed Aug 11, 2025

narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025

[feat] support minimum token load balance in dp attention (sgl-project#7379)

c24a8a1

MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Sep 8, 2025

[feat] support minimum token load balance in dp attention (sgl-project#7379)

9381a01

-        def gather_dp_balance_info(holding_tokens_list) -> Union[None, List[List[int]]]:
-            """gather recv_dp_balance_id_this_term and holding tokens per worker for dp balance"""
-            recv_list = self.recv_dp_balance_id_this_term
-            assert len(recv_list) <= 511, (
-                "The number of requests received this round is too large. "
-                "Please increase gather_tensor_size and onfly_info_size."
-            )
-            gather_tensor_size = 512
-            # recv_tensor: | holding_tokens | len(recv_dp_balance_id) | recv_dp_balance_ids
-            recv_tensor = torch.zeros(gather_tensor_size, dtype=torch.int32)
-            recv_tensor[0] = holding_tokens_list
-            recv_tensor[1] = len(
-                recv_list
-            )  # The first element is the length of the list.
-            recv_tensor[2 : len(recv_list) + 2] = torch.tensor(
-                recv_list, dtype=torch.int32
-            )
-            if self.tp_rank == 0:
-                gathered_list = [
-                    torch.zeros(gather_tensor_size, dtype=torch.int32)
-                    for _ in range(self.balance_meta.num_workers)
-                ]
-            else:
-                gathered_list = None
-            torch.distributed.gather(
-                recv_tensor, gathered_list, group=self.tp_cpu_group
-            )
-            gathered_id_list_per_worker = None
-            if self.tp_rank == 0:
-                gathered_id_list_per_worker = []
-                holding_tokens_list = []
-                for tensor in gathered_list:
-                    holding_tokens_list.append(tensor[0].item())
-                    list_length = tensor[1].item()
-                    gathered_id_list_per_worker.append(
-                        tensor[2 : list_length + 2].tolist()
-                    )
-            return gathered_id_list_per_worker, holding_tokens_list
+        def gather_dp_balance_info(current_worker_holding_tokens: int) -> Tuple[Optional[List[List[int]]], List[int]]:
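
The buffer layout used by the removed lines above (slot 0 for the holding-token count, slot 1 for the length of the ID list, then the IDs, zero-padded to a fixed size) can be sketched without `torch.distributed`. This is only an illustration under stated assumptions, not the PR's code: `pack_balance_info`, `unpack_balance_info`, and `GATHER_SIZE` are hypothetical names, and plain Python lists stand in for the int32 tensors and for the gather onto TP rank 0.

```python
from typing import List, Tuple

# Assumed fixed buffer length, mirroring gather_tensor_size = 512 in the diff.
GATHER_SIZE = 512


def pack_balance_info(holding_tokens: int, recv_ids: List[int]) -> List[int]:
    """Pack one worker's info as | holding_tokens | len(ids) | ids... | zero padding |."""
    # Two slots are used as a header, so only GATHER_SIZE - 2 ids fit.
    if len(recv_ids) > GATHER_SIZE - 2:
        raise ValueError("too many request ids for the fixed-size buffer")
    buf = [0] * GATHER_SIZE
    buf[0] = holding_tokens
    buf[1] = len(recv_ids)
    buf[2 : 2 + len(recv_ids)] = recv_ids
    return buf


def unpack_balance_info(
    gathered: List[List[int]],
) -> Tuple[List[List[int]], List[int]]:
    """Split the gathered buffers back into per-worker id lists and token counts."""
    ids_per_worker: List[List[int]] = []
    holding_tokens_per_worker: List[int] = []
    for buf in gathered:
        holding_tokens_per_worker.append(buf[0])
        n = buf[1]  # number of valid ids stored after the two header slots
        ids_per_worker.append(buf[2 : 2 + n])
    return ids_per_worker, holding_tokens_per_worker


# Stand-in for what rank 0 would hold after the gather: one buffer per worker.
gathered = [pack_balance_info(100, [7, 8]), pack_balance_info(40, [])]
ids, tokens = unpack_balance_info(gathered)
```

Because two header slots are reserved, a 512-element buffer holds at most 510 IDs, which is why the sketch checks against `GATHER_SIZE - 2`. The `<= 511` bound in the removed assertion appears to be slightly looser than that.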

WANG-GH commented Jun 20, 2025 • edited

gemini-code-assist bot left a comment

Summary of Changes

Highlights

Footnotes

gemini-code-assist bot left a comment

Code Review

gemini-code-assist bot Jun 20, 2025

gemini-code-assist bot Jun 20, 2025

gemini-code-assist bot Jun 20, 2025

gemini-code-assist bot Jun 20, 2025

gemini-code-assist bot Jun 20, 2025

gemini-code-assist bot Jun 20, 2025

gemini-code-assist bot Jun 20, 2025

Edenzzzz Jun 23, 2025 • edited

Edenzzzz commented Jun 23, 2025

JustinTong0323 commented Jun 30, 2025

WANG-GH commented Jun 30, 2025

JustinTong0323 commented Jun 30, 2025

WANG-GH commented Jul 1, 2025


WANG-GH commented Jul 25, 2025

ltdo111 commented Jul 31, 2025

WANG-GH commented Jul 31, 2025

ltdo111 commented Jul 31, 2025

PanXun2 commented Jul 31, 2025

WANG-GH commented Jul 31, 2025
