[Feat] Multi-stream for eplb heat collection and aggregation #4214
MengqingCao merged 8 commits into vllm-project:main
Conversation
Signed-off-by: daishixun <dsxsteven@sina.com>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces an asynchronous stream to handle MoE expert load (heat) collection, aiming to optimize performance by overlapping it with other computations. The changes involve creating a dedicated stream and switching to it for MoE load accumulation and gathering.
My review has identified a critical race condition and a high-severity performance issue in vllm_ascend/eplb/eplb_updator.py. The race condition is due to missing synchronization between the main computation stream and the new asynchronous stream, which could lead to incorrect load balancing. The performance issue is related to a buffer being re-allocated on every call, which should be addressed for efficiency. Please see the detailed comments for suggestions on how to fix these issues.
```python
with npu_stream_switch(moe_load_async_stream()):
    self.world_size = dist.get_world_size()
    self.device = local_load.device
    if self._gather_buffer is None:
        shape = (self.world_size, *local_load.shape)
        self._gather_buffer = torch.empty(shape,
                                          dtype=local_load.dtype,
                                          device=self.device)

    dist.all_gather_into_tensor(self._gather_buffer, local_load)

    moe_load = self._gather_buffer.permute(1, 0, 2)
    self.shared_dict["moe_load"] = moe_load.cpu()
    logger.debug(
        f"[ModelRunner] Updated shared_dict['moe_load'] shape={moe_load.shape}"
    )
```
There is a race condition here. The moe_load tensors are updated asynchronously on moe_load_async_stream in fused_moe.py. However, self.adaptor.get_rank_expert_workload() is called on the default stream (on line 152, before this block) to read these tensors without any synchronization. This can lead to reading stale or incomplete data, causing incorrect load balancing. To fix this, you must synchronize the streams before reading moe_load. For example, you could add moe_load_async_stream().synchronize() before the call to self.adaptor.get_rank_expert_workload() on line 152.
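The hazard described above can be illustrated with a stdlib-only sketch. Here a `threading.Event` stands in for the suggested `moe_load_async_stream().synchronize()` call, and `async_collect` is a hypothetical producer standing in for the load accumulation in `fused_moe.py`; without the `wait()`, the reader could observe the stale initial value.

```python
import threading

# Hypothetical stand-in for the moe_load tensors updated on the async stream.
moe_load = {"value": 0}
done = threading.Event()  # plays the role of moe_load_async_stream().synchronize()

def async_collect():
    # Producer: accumulates expert load on the "async stream".
    moe_load["value"] = 42
    done.set()  # marks the async work as finished

t = threading.Thread(target=async_collect)
t.start()

# Consumer: block until the producer is done before reading, mirroring the
# suggested synchronize() before get_rank_expert_workload().
done.wait()
observed = moe_load["value"]
t.join()
print(observed)  # 42, never a stale 0
```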
```python
if self._gather_buffer is None:
    shape = (self.world_size, *local_load.shape)
    self._gather_buffer = torch.empty(shape,
                                      dtype=local_load.dtype,
                                      device=self.device)
```
The self._gather_buffer is reset to None on every call to compute_and_set_moe_load (on line 154). This makes this condition always true, causing the buffer to be re-allocated on every invocation, which is inefficient. To avoid this performance issue, self._gather_buffer should be initialized to None in the __init__ method of the EplbUpdator class, and the line self._gather_buffer = None should be removed from this method.
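The allocate-once pattern suggested here can be sketched without torch. `GatherBuffer` is a hypothetical stand-in for `EplbUpdator`, and `bytearray` stands in for `torch.empty`; the key point is that the `None` sentinel is set only in `__init__`, so the buffer is allocated exactly once.

```python
class GatherBuffer:
    """Sketch of the suggested fix: initialize the buffer slot once in
    __init__ and allocate lazily on first use, never resetting it."""

    def __init__(self):
        self._gather_buffer = None  # moved here from compute_and_set_moe_load

    def compute_and_set_moe_load(self, nbytes: int) -> bytearray:
        # No `self._gather_buffer = None` here, so the buffer survives
        # across calls and the allocation happens only on the first one.
        if self._gather_buffer is None:
            self._gather_buffer = bytearray(nbytes)
        return self._gather_buffer

u = GatherBuffer()
first = u.compute_and_set_moe_load(1024)
second = u.compute_and_set_moe_load(1024)
print(first is second)  # True: the same buffer object is reused
```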
```python
    logger.debug(
        f"[ModelRunner] Updated shared_dict['moe_load'] shape={moe_load.shape}"
    )

with npu_stream_switch(moe_load_async_stream()):
```
Maybe it would be better to make `moe_load_async_stream` a class attribute of `EplbUpdator`.
This function has already been moved to the eplb module; since other files also need this stream, keeping it in the eplb utils is better.
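The shared-stream accessor the discussion refers to could look roughly like the lazy module-level singleton below (a stdlib sketch: `_MOE_LOAD_ASYNC_STREAM` and the `object()` placeholder are assumptions standing in for `torch_npu.npu.Stream()`). Placing it in a shared utils module lets `eplb_updator.py` and `fused_moe.py` retrieve the same stream object.

```python
_MOE_LOAD_ASYNC_STREAM = None

def moe_load_async_stream():
    """Return the shared per-process stream, creating it on first call."""
    global _MOE_LOAD_ASYNC_STREAM
    if _MOE_LOAD_ASYNC_STREAM is None:
        # Stand-in for torch_npu.npu.Stream(); allocated once per process.
        _MOE_LOAD_ASYNC_STREAM = object()
    return _MOE_LOAD_ASYNC_STREAM

s1 = moe_load_async_stream()
s2 = moe_load_async_stream()
print(s1 is s2)  # True: every caller sees the same stream
```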
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Force-pushed from 6c7c895 to e09650a
```python
    return _SHARED_EXPERTS_CALCULATION_STREAM


def moe_load_async_stream() -> torch_npu.npu.Stream:
```
Move this function to the eplb module.
Force-pushed from cf110ae to d5b59ad
…oject#4214)

### What this PR does / why we need it?
This PR optimizes multistream for eplb heat collection and aggregation

- vLLM version: v0.12.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.12.0

Signed-off-by: daishixun <dsxsteven@sina.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
…llm-project#4214)" This reverts commit 9a885d0.
…llm-project#4214)" This reverts commit 9a885d0.
Signed-off-by: Wangyibo1005 <2633333316@qq.com>
What this PR does / why we need it?
This PR optimizes multistream for eplb heat collection and aggregation
Does this PR introduce any user-facing change?
No
How was this patch tested?
Co-authored-by: Skywalker-EP <173723846@qq.com>, <walterchenchn@outlook.com>