[Bugfix][Kernel] fix bias adding in triton kernel implemented fused moe#31676
Conversation
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Originated from PR #29008.
Code Review
This pull request addresses a correctness issue in the fused MoE Triton kernel by ensuring that bias addition and routed weight multiplication are performed after dequantization. The change correctly moves these operations to follow the scaling of the accumulator, which aligns with the standard mathematical formulation for quantized operations. This fix is crucial for numerical accuracy and appears to be implemented correctly.
@mgoin PTAL. Thank you.
@mgoin Hey, could you take a look at this PR? Thanks!
This PR breaks ROCm at:
…oe (vllm-project#31676) Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Purpose
Since the bias is typically not quantized, it must be added after dequantization, as the final step:
y = s_x * s_w * (Wq - zw) * (xq - zx) + bias
where:
s_x, s_w: scaling factors for activation and weight
Wq, xq: quantized weight and activation
zw, zx: zero points for weight and activation
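The formula above can be illustrated with a minimal NumPy sketch (not the actual Triton kernel; names and the symmetric int8 scheme with zx = zw = 0 are assumptions for illustration). It contrasts the correct order, where the bias is added to the dequantized accumulator, with the buggy order this PR fixes, where the bias is added before scaling and therefore gets multiplied by s_x * s_w:

```python
import numpy as np

# Hypothetical illustration only: per-tensor symmetric int8 quantization,
# so the zero points zx and zw are both 0.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)).astype(np.float32)   # activation
W = rng.standard_normal((8, 3)).astype(np.float32)   # weight
bias = rng.standard_normal(3).astype(np.float32)     # unquantized bias

s_x = np.abs(x).max() / 127.0                        # activation scale
s_w = np.abs(W).max() / 127.0                        # weight scale
xq = np.round(x / s_x).astype(np.int32)              # quantized activation
Wq = np.round(W / s_w).astype(np.int32)              # quantized weight

acc = xq @ Wq                                        # int32 accumulator

# Correct order: dequantize the accumulator first, then add the bias.
y_correct = s_x * s_w * acc + bias

# Buggy order: bias added before scaling, so it is wrongly
# multiplied by s_x * s_w during dequantization.
y_buggy = s_x * s_w * (acc + bias)
```

The same ordering argument applies to the routed-weight multiplication mentioned in the review: any unquantized term must be applied after the accumulator has been scaled back to the floating-point domain.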
Test Plan
Test Result
Example (TP=2):
