
[ROCm][Quantization] Enable moe_wna16 on ROCm via Triton fallback#35596

Open
brucechanglongxu wants to merge 1 commit into vllm-project:main from brucechanglongxu:amd/enable-moe-wna16

Conversation


@brucechanglongxu brucechanglongxu commented Feb 28, 2026

moe_wna16 (W4A16/W8A16 MoE quantization, used by GPTQ/AWQ-quantized Mixtral, DeepSeek, etc.) is blocked on ROCm by two issues:

  1. It is not listed in RocmPlatform.supported_quantization, so platform verification rejects it outright.

  2. Even if you bypass that, should_moe_wna16_use_cuda() in fused_moe.py returns True on ROCm because it checks current_platform.is_cuda(), which returns True for ROCm under the current platform model. This routes into invoke_fused_moe_wna16_cuda_kernel (ops.moe_wna16_gemm), a CUDA-only C++ op that isn't registered in ROCm builds. The Triton fallback path (invoke_fused_moe_wna16_triton_kernel) would work fine but is never reached.

The fix is two lines:

  • Add "moe_wna16" to supported_quantization in vllm/platforms/rocm.py
  • Add and not current_platform.is_rocm() to should_moe_wna16_use_cuda() in vllm/model_executor/layers/fused_moe/fused_moe.py so the Triton kernel is used instead
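The two changes above can be sketched as follows. This is a standalone illustration, not the actual vLLM code: the platform stub only mimics the behavior described in the PR (is_cuda() also returning True on ROCm), the quantization set is an abbreviated stand-in for the real list in vllm/platforms/rocm.py, and the real should_moe_wna16_use_cuda() takes additional arguments.

```python
# Hypothetical stand-in for vllm's current_platform object; its is_cuda()
# deliberately reproduces the misbehavior described in the PR.
class PlatformStub:
    def __init__(self, name: str) -> None:
        self.name = name

    def is_cuda(self) -> bool:
        # Under the current platform model, ROCm also reports is_cuda().
        return self.name in ("cuda", "rocm")

    def is_rocm(self) -> bool:
        return self.name == "rocm"


# Fix 1: add "moe_wna16" to the ROCm supported-quantization list
# (illustrative set; the real list is longer).
rocm_supported_quantization = {"awq", "gptq", "fp8", "moe_wna16"}


# Fix 2: exclude ROCm so the Triton fallback is selected there
# (simplified signature; the real function checks more conditions).
def should_moe_wna16_use_cuda(platform: PlatformStub) -> bool:
    return platform.is_cuda() and not platform.is_rocm()


print(should_moe_wna16_use_cuda(PlatformStub("cuda")))  # True  -> CUDA op
print(should_moe_wna16_use_cuda(PlatformStub("rocm")))  # False -> Triton
```

Without the `and not platform.is_rocm()` guard, the ROCm case above would also return True and dispatch into the unregistered CUDA-only op.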

The linear layers within moe_wna16 models already handle ROCm correctly: check_marlin_supports_layer() in marlin_utils.py returns False on ROCm (lines 213-214), so MoeWNA16Config.get_quant_method() falls through to the non-Marlin AWQ/GPTQ paths, which have working ROCm support via Exllama/Conch.

The Triton WNA16 MoE kernel is the same one used on CUDA when should_moe_wna16_use_cuda() returns False (the W8A16 case, or large batch sizes), so it is already well-exercised in existing CI.
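As a rough illustration of that routing, a simplified selector might look like this. The bit-width and batch-size conditions mirror the description above, but the cutoff value, function name, and string labels are made-up placeholders, not vLLM's real code.

```python
# Hypothetical, simplified view of WNA16 MoE kernel selection: the CUDA-only
# op is chosen only for W4A16 at small batch sizes on genuine CUDA; every
# other case (W8A16, large batches, ROCm) takes the Triton kernel.

SMALL_BATCH_CUTOFF = 256  # placeholder value, not vLLM's actual threshold


def select_wna16_kernel(bit: int, num_tokens: int, platform: str) -> str:
    use_cuda_op = (
        platform == "cuda"                   # excludes ROCm, as in the fix
        and bit == 4                         # W8A16 already used Triton
        and num_tokens <= SMALL_BATCH_CUTOFF # large batches use Triton too
    )
    return "moe_wna16_gemm" if use_cuda_op else "triton_wna16"


print(select_wna16_kernel(4, 32, "cuda"))  # moe_wna16_gemm
print(select_wna16_kernel(8, 32, "cuda"))  # triton_wna16
print(select_wna16_kernel(4, 32, "rocm"))  # triton_wna16
```

The point of the sketch is that ROCm only ever lands on branches the Triton kernel already serves on CUDA, which is why the fallback is low-risk.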

@mergify mergify bot added the rocm Related to AMD ROCm label Feb 28, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Feb 28, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enables moe_wna16 quantization on ROCm by ensuring the Triton fallback kernel is used instead of the CUDA-only one, and by adding moe_wna16 to the list of supported quantization methods for the ROCm platform. The changes are correct and address the issue. I have one suggestion to make the platform check more direct and robust.

Comment on lines 1219 to +1220
current_platform.is_cuda()
and not current_platform.is_rocm()


high

To make the platform check more direct and robust against potential inconsistencies in is_cuda() behavior across environments, consider using current_platform.device_name == 'cuda'. This directly checks for the CUDA platform and is less prone to misinterpretation.

Suggested change:
  - current_platform.is_cuda()
  - and not current_platform.is_rocm()
  + current_platform.device_name == "cuda"

… path

Enable WNA16 (W4A16/W8A16) MoE quantization on ROCm by:
- Adding "moe_wna16" to RocmPlatform.supported_quantization
- Excluding ROCm from should_moe_wna16_use_cuda() so the Triton
  fallback kernel (invoke_fused_moe_wna16_triton_kernel) is used
  instead of the CUDA-only moe_wna16_gemm op

The Triton WNA16 MoE kernel already works on ROCm. Linear layers
within moe_wna16 models fall through to non-Marlin AWQ/GPTQ paths
since check_marlin_supports_layer returns False on ROCm.

This enables popular 4-bit quantized MoE models (Mixtral, DeepSeek,
etc.) with GPTQ/AWQ quantization on AMD GPUs.

Signed-off-by: Bruce Changlong Xu <brucechanglongxu@gmail.com>