[Op] DeepSeekV3.2 support bmm_transpose operator #4631
wangxiyuan merged 3 commits into vllm-project:main from
Conversation
Signed-off-by: ZYang6263 <zy626375@gmail.com>
Code Review
This pull request introduces a new bmm_transpose operator to replace torch_npu.npu_transpose_batchmatmul in _v_up_proj. While the Python-side change is straightforward, my review of the C++ implementation backing the new operator (torch.ops._C_ascend.batch_matmul_transpose) revealed critical issues related to out-of-bounds indexing and multi-device safety. These issues, detailed in the line comment, can lead to runtime errors and must be addressed before this change is merged.
```python
res = torch.empty((b, self.num_heads, self.v_head_dim),
                  dtype=x.dtype,
                  device=x.device)
torch.ops._C_ascend.batch_matmul_transpose(x, self.W_UV, res)
```
The C++ implementation of the batch_matmul_transpose operator has a couple of critical issues that could lead to runtime errors.
- Potential for out-of-bounds indexing: In `csrc/batch_matmul_transpose/op_host/batch_matmul_transpose.h`, the tiling data for different batch sizes is cached in a static array `global_tiling_data` of size `MAX_CAPTURE_NUM` (1024). The index into this array, `batchIdx`, is derived from the number of tokens (`opShape.m`). If the number of tokens exceeds 1024, which is common during prefill, the index fails the bounds check and the operator raises a runtime error.

  ```cpp
  // csrc/batch_matmul_transpose/op_host/batch_matmul_transpose.h:104
  int32_t batchIdx = opShape.m - 1;
  ...
  if (batchIdx >= 0 && batchIdx < MAX_CAPTURE_NUM) {  // MAX_CAPTURE_NUM is 1024
      ...
  } else {
      TORCH_CHECK(false, "batchIdx is out of range: ", batchIdx);
  }
  ```
- Not safe for multi-device execution: `global_tiling_data` is a `static` variable, so it is initialized only once, and its device is fixed to the device of the input tensor on the first call. In a multi-device environment where workers on different devices call this operator, any call not on the initial device will hit a device mismatch error.

  ```cpp
  // csrc/batch_matmul_transpose/op_host/batch_matmul_transpose.h:106
  static auto global_tiling_data = at::empty(
      {tilingSize * MAX_CAPTURE_NUM},
      at::TensorOptions().dtype(at::kByte).device(tensor_a.options().device()));
  ```
A potential solution for the multi-device issue is a thread-safe map from device index to the tiling-data tensor, for example a `std::map` guarded by a `std::mutex`.
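A minimal sketch of that approach, using a stand-in `TilingBuffer` struct in place of the real `at::Tensor` so it compiles without ATen (all names here are illustrative, not from the PR): each device gets its own lazily created buffer, and the mutex makes concurrent first calls from different worker threads safe.

```cpp
#include <cstddef>
#include <map>
#include <mutex>
#include <vector>

// Stand-in for the per-device tiling tensor; the real code would hold an
// at::Tensor allocated with .device(tensor_a.options().device()).
struct TilingBuffer {
    int device_index;
    std::vector<unsigned char> bytes;
};

// Thread-safe cache: one tiling buffer per device, created lazily on the
// first call from that device instead of once globally for all devices.
TilingBuffer& tiling_for_device(int device_index, std::size_t size) {
    static std::map<int, TilingBuffer> cache;
    static std::mutex mtx;
    std::lock_guard<std::mutex> lock(mtx);
    auto it = cache.find(device_index);
    if (it == cache.end()) {
        it = cache.emplace(device_index,
                           TilingBuffer{device_index,
                                        std::vector<unsigned char>(size)})
                 .first;
    }
    return it->second;  // stable reference: std::map never relocates nodes
}
```

Returning a reference is safe here because `std::map` nodes are never relocated by later insertions, so cached buffers stay valid across calls.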
Given these issues, the underlying implementation of this operator needs to be revised before it can be safely used.
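For the out-of-bounds issue, one possible revision (sketched below with a simplified stand-in `TilingData`; the names and the fallback strategy are illustrative, not the PR's code) is to keep the static cache only for token counts within the capture range and compute tiling on demand for larger `m`, rather than raising:

```cpp
#include <cstdint>

constexpr int32_t MAX_CAPTURE_NUM = 1024;  // size of the static cache

// Simplified stand-in for the operator's tiling data.
struct TilingData {
    int32_t m = 0;  // the real struct also carries tile/block parameters
};

// Hypothetical on-demand tiling computation.
TilingData compute_tiling(int32_t m) {
    TilingData t;
    t.m = m;
    return t;
}

// Cache tiling for m <= MAX_CAPTURE_NUM (the graph-capture range) and fall
// back to computing it per call for larger m, e.g. long prefill batches.
TilingData get_tiling(int32_t m) {
    static TilingData cache[MAX_CAPTURE_NUM];
    static bool cached[MAX_CAPTURE_NUM] = {};
    const int32_t idx = m - 1;
    if (idx >= 0 && idx < MAX_CAPTURE_NUM) {
        if (!cached[idx]) {
            cache[idx] = compute_tiling(m);
            cached[idx] = true;
        }
        return cache[idx];
    }
    return compute_tiling(m);  // no abort when m > MAX_CAPTURE_NUM
}
```

This trades a per-call tiling computation for correctness on large prefill batches; whether that cost is acceptable would need measurement against the real tiling routine.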
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to the Contributing and Testing guides.
This op may have some problems on higher CANN versions; consider merging it later.
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
Signed-off-by: ZYang6263 <50876451+ZYang6263@users.noreply.github.com>
### What this PR does / why we need it?
DeepSeekV3.2 support bmm_transpose operator.
- vLLM version: v0.12.0
- vLLM main: vllm-project/vllm@ad32e3e

Signed-off-by: ZYang6263 <zy626375@gmail.com>
Signed-off-by: ZYang6263 <50876451+ZYang6263@users.noreply.github.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: tanqingshan (A) <50050625@china.huawei.com>
What this PR does / why we need it?
Adds bmm_transpose operator support for DeepSeekV3.2.
Does this PR introduce any user-facing change?
How was this patch tested?