[ROCm][Quantization] Enable moe_wna16 on ROCm via Triton fallback #35596
brucechanglongxu wants to merge 1 commit into vllm-project:main from
Conversation
Contributor
Code Review
This pull request enables `moe_wna16` quantization on ROCm by ensuring the Triton fallback kernel is used instead of the CUDA-only one, and by adding `moe_wna16` to the list of supported quantization methods for the ROCm platform. The changes are correct and address the issue. I have one suggestion to make the platform check more direct and robust.
Comment on lines 1219 to 1220

```python
current_platform.is_cuda()
and not current_platform.is_rocm()
```
To make the platform check more direct and robust against potential inconsistencies in `is_cuda()` behavior across environments, consider using `current_platform.device_name == "cuda"`. This directly checks for the CUDA platform and is less prone to misinterpretation.
Suggested change

```diff
-current_platform.is_cuda()
-and not current_platform.is_rocm()
+current_platform.device_name == "cuda"
```
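To make the trade-off concrete, here is a minimal standalone sketch of the two check styles being discussed. The `Platform` stub below is hypothetical (vLLM's real `current_platform` object carries much more state), but it reproduces the scenario the PR describes, where `is_cuda()` can report `True` on ROCm as well:

```python
from dataclasses import dataclass


@dataclass
class Platform:
    """Hypothetical stand-in for vLLM's current_platform object."""
    device_name: str  # e.g. "cuda" or "rocm"

    def is_cuda(self) -> bool:
        # Models the behavior described in the PR: under the current
        # platform model this can return True on ROCm too.
        return self.device_name in ("cuda", "rocm")

    def is_rocm(self) -> bool:
        return self.device_name == "rocm"


def use_cuda_kernel_indirect(p: Platform) -> bool:
    # The PR's approach: keep is_cuda() but exclude ROCm explicitly.
    return p.is_cuda() and not p.is_rocm()


def use_cuda_kernel_direct(p: Platform) -> bool:
    # The reviewer's suggestion: compare the device name directly.
    return p.device_name == "cuda"


# Both formulations agree for the platforms in question; the direct
# check simply does not depend on how is_cuda() is defined.
for name in ("cuda", "rocm"):
    p = Platform(name)
    assert use_cuda_kernel_indirect(p) == use_cuda_kernel_direct(p)
```

Either form keeps the CUDA-only op off ROCm; the suggested one-liner just removes the dependency on `is_cuda()` semantics.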
… path

Enable WNA16 (W4A16/W8A16) MoE quantization on ROCm by:

- Adding `"moe_wna16"` to `RocmPlatform.supported_quantization`
- Excluding ROCm from `should_moe_wna16_use_cuda()` so the Triton fallback kernel (`invoke_fused_moe_wna16_triton_kernel`) is used instead of the CUDA-only `moe_wna16_gemm` op

The Triton WNA16 MoE kernel already works on ROCm. Linear layers within `moe_wna16` models fall through to non-Marlin AWQ/GPTQ paths since `check_marlin_supports_layer` returns `False` on ROCm.

This enables popular 4-bit quantized MoE models (Mixtral, DeepSeek, etc.) with GPTQ/AWQ quantization on AMD GPUs.

Signed-off-by: Bruce Changlong Xu <brucechanglongxu@gmail.com>
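The first change (adding `"moe_wna16"` to the supported list) is a gate in platform verification. A rough illustrative sketch, where `verify_quantization` and the list entries other than `"moe_wna16"` are hypothetical stand-ins for vLLM's real `RocmPlatform.supported_quantization` machinery:

```python
# Hypothetical example list; only "moe_wna16" is taken from this PR.
SUPPORTED_QUANTIZATION = ["awq", "gptq", "moe_wna16"]


def verify_quantization(method: str) -> None:
    # Mirrors the rejection behavior: unsupported methods fail early,
    # before any kernel is ever selected.
    if method not in SUPPORTED_QUANTIZATION:
        raise ValueError(f"{method} quantization is not supported on ROCm")


verify_quantization("moe_wna16")  # no longer raises with the PR applied
```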
Force-pushed from 2f63add to 080d293
`moe_wna16` (W4A16/W8A16 MoE quantization, used by GPTQ/AWQ-quantized Mixtral, DeepSeek, etc.) is blocked on ROCm by two issues:

1. Not in `RocmPlatform.supported_quantization`: platform verification rejects it outright.
2. Even if you bypass that, `should_moe_wna16_use_cuda()` in `fused_moe.py` returns `True` on ROCm because it checks `current_platform.is_cuda()`, which returns `True` for ROCm under the current platform model. This routes into `invoke_fused_moe_wna16_cuda_kernel` → `ops.moe_wna16_gemm`, a CUDA-only C++ op that isn't registered on ROCm builds. The Triton fallback path (`invoke_fused_moe_wna16_triton_kernel`) would work fine but never gets reached.

The fix is two lines:

- Add `"moe_wna16"` to `supported_quantization` in `vllm/platforms/rocm.py`
- Add `and not current_platform.is_rocm()` to `should_moe_wna16_use_cuda()` in `vllm/model_executor/layers/fused_moe/fused_moe.py` so the Triton kernel is used instead

The linear layers within `moe_wna16` models already handle ROCm correctly: `check_marlin_supports_layer()` in `marlin_utils.py` returns `False` on ROCm (lines 213-214), so `MoeWNA16Config.get_quant_method()` falls through to the non-Marlin AWQ/GPTQ paths, which have working ROCm support via Exllama/Conch.

The Triton WNA16 MoE kernel is the same one used on CUDA when `should_moe_wna16_use_cuda()` returns false (the W8A16 case, or large batch sizes), so it's well-exercised in existing CI.
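The dispatch described above can be sketched as a standalone Python snippet. The function and kernel names follow the PR, but the extra parameters and the batch-size threshold are illustrative assumptions, not vLLM's actual signature or tuning values:

```python
def should_moe_wna16_use_cuda(is_cuda: bool, is_rocm: bool,
                              num_bits: int, batch_size: int) -> bool:
    # The CUDA-only moe_wna16_gemm op only serves W4A16 at small batch
    # sizes on CUDA proper; "and not is_rocm" is the guard this PR adds.
    # The num_bits/batch_size conditions and the threshold of 32 are
    # illustrative placeholders.
    return is_cuda and not is_rocm and num_bits == 4 and batch_size <= 32


def pick_kernel(is_cuda: bool, is_rocm: bool,
                num_bits: int, batch_size: int) -> str:
    if should_moe_wna16_use_cuda(is_cuda, is_rocm, num_bits, batch_size):
        return "invoke_fused_moe_wna16_cuda_kernel"
    # Triton fallback: W8A16, large batches, and (after this PR) all of ROCm.
    return "invoke_fused_moe_wna16_triton_kernel"


# On ROCm (where is_cuda can also be True per the PR's description),
# the Triton fallback is now always chosen:
assert pick_kernel(True, True, 4, 8) == "invoke_fused_moe_wna16_triton_kernel"
```

The same `pick_kernel` shape also shows why the change is low-risk: the Triton branch it routes ROCm into is the branch CUDA already takes for W8A16 and large batches.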