[Quantization] enable MXFP4 Triton backend on SM120 (Blackwell)#31089
janreges wants to merge 3 commits into vllm-project:main
Conversation
- Add SM120 to the triton_kernels_supported condition in both backend selection functions (get_mxfp4_backend, get_mxfp4_backend_with_lora)
- Use StridedLayout for SM120 to avoid the "Must use persistent kernel" error caused by unsupported cluster TMA operations
- Configure SM120-specific constraints: is_persistent=False, num_stages=1

Tested on NVIDIA RTX PRO 6000 Blackwell (compute capability 12.0). Requires Triton fix: triton-lang/triton#8498
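The SM120-specific constraints above amount to a capability-gated kernel configuration; the sketch below illustrates the idea only (`KernelConfig` and `select_mxfp4_config` are hypothetical names, not vLLM's actual API):

```python
from dataclasses import dataclass


@dataclass
class KernelConfig:
    layout: str
    is_persistent: bool
    num_stages: int


def select_mxfp4_config(capability: tuple[int, int]) -> KernelConfig:
    """Pick per-architecture kernel constraints (illustrative sketch)."""
    if capability[0] == 12:
        # SM120 (Blackwell consumer): cluster TMA operations are unsupported,
        # so the persistent-kernel path fails ("Must use persistent kernel").
        # Fall back to StridedLayout with a single pipeline stage.
        return KernelConfig("StridedLayout", is_persistent=False, num_stages=1)
    # SM90/SM100: the default persistent, multi-stage path works.
    return KernelConfig("default", is_persistent=True, num_stages=3)
```

The actual layout classes and stage counts for the non-SM120 path are placeholders; only the SM120 branch reflects the values stated in the PR description.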
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of tests runs automatically, and you can ask your reviewers to trigger select CI tests on top of those. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request enables the MXFP4 Triton backend for NVIDIA Blackwell (SM120) GPUs. The changes involve updating the device capability checks in mxfp4.py to include SM120 and adding specific configurations for Blackwell in mxfp4_utils.py to handle its architectural differences, such as disabling persistent kernels.
My main feedback is to refactor the duplicated logic for checking Triton kernel support in mxfp4.py into a helper function. This will improve code maintainability and prevent potential inconsistencies in the future. The rest of the changes look good and are well-commented.
```diff
 triton_kernels_supported = (
     has_triton_kernels()
     and is_torch_equal_or_newer("2.8.0")
-    # NOTE: triton_kernels are only confirmed to work on SM90 and SM100
+    # NOTE: triton_kernels are confirmed to work on SM90, SM100, and SM120
     # SM110 fails with this error: https://github.com/vllm-project/vllm/issues/29317
-    # SM120 needs this fix: https://github.com/triton-lang/triton/pull/8498
-    and (9, 0) <= current_platform.get_device_capability() < (11, 0)
+    # SM120 support added after Triton fix: https://github.com/triton-lang/triton/pull/8498
+    and (
+        (9, 0) <= current_platform.get_device_capability() < (11, 0)
+        or current_platform.is_device_capability_family(120)
+    )
 )
```
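The gating in the diff relies on Python's lexicographic tuple comparison; the following standalone sketch reproduces the same check, with `is_capability_family` as an illustrative stand-in for `current_platform.is_device_capability_family`:

```python
def is_capability_family(capability: tuple[int, int], family: int) -> bool:
    # Stand-in for current_platform.is_device_capability_family:
    # family 120 means any 12.x compute capability.
    return capability[0] == family // 10


def triton_mxfp4_capability_ok(capability: tuple[int, int]) -> bool:
    # Lexicographic tuple comparison: SM90/SM100 pass the range check,
    # SM110 is excluded, and the SM120 family is allowed explicitly.
    return (9, 0) <= capability < (11, 0) or is_capability_family(capability, 120)


print(triton_mxfp4_capability_ok((9, 0)))   # True  (SM90)
print(triton_mxfp4_capability_ok((11, 0)))  # False (SM110)
print(triton_mxfp4_capability_ok((12, 0)))  # True  (SM120)
```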
The logic to determine if Triton kernels are supported is duplicated here and in get_mxfp4_backend (lines 155-165). This can lead to maintenance issues, as a future change might only be applied to one of the locations.
Additionally, the current implementation does not handle the case where current_platform.get_device_capability() returns None, which would cause a TypeError.
To improve maintainability, avoid code duplication, and fix the potential TypeError, I suggest extracting this logic into a new helper function.
For example, you could add the following helper function at the module level:

```python
def _is_triton_mxfp4_supported_on_cuda() -> bool:
    """Checks if the Triton MXFP4 kernels are supported on CUDA."""
    capability = current_platform.get_device_capability()
    if capability is None:
        return False
    # NOTE: triton_kernels are confirmed to work on SM90, SM100, and SM120
    # SM110 fails with this error: https://github.com/vllm-project/vllm/issues/29317
    # SM120 support added after Triton fix: https://github.com/triton-lang/triton/pull/8498
    is_sm90_or_sm100 = (9, 0) <= capability < (11, 0)
    is_sm120 = current_platform.is_device_capability_family(120)
    return (has_triton_kernels() and is_torch_equal_or_newer("2.8.0")
            and (is_sm90_or_sm100 or is_sm120))
```

Then, you can simplify the code here and in get_mxfp4_backend by calling this new function:

```python
triton_kernels_supported = _is_triton_mxfp4_supported_on_cuda()
```
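The `None` guard matters because comparing `None` against a tuple raises at runtime; this standalone sketch (function names are illustrative) demonstrates the failure mode the helper avoids:

```python
def unsafe_check(capability):
    # Without a guard, a None capability crashes the chained comparison:
    # (9, 0) <= None raises TypeError.
    return (9, 0) <= capability < (11, 0)


def safe_check(capability):
    # Guarded version: treat an unknown capability as unsupported.
    if capability is None:
        return False
    return (9, 0) <= capability < (11, 0)


try:
    unsafe_check(None)
except TypeError as exc:
    print(f"TypeError: {exc}")

print(safe_check(None))     # False
print(safe_check((10, 0)))  # True
```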
Hi @janreges, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
…nction

- Extract duplicated logic for checking Triton MXFP4 support on CUDA into a new _is_triton_mxfp4_supported_on_cuda() helper function
- Fix potential TypeError when get_device_capability() returns None
- Simplify code in get_mxfp4_backend() and get_mxfp4_backend_with_lora()

Addresses PR review feedback to improve maintainability and avoid code duplication.

Signed-off-by: jan.reges <jan.reges@siteone.cz>
This would be really nice to see merged!
Hi, can someone merge this? I would like to use FP4 on an NVIDIA RTX PRO 6000 Blackwell. Currently I also see the following warning:

> (EngineCore_DP0 pid=280) WARNING 03-05 17:16:17 [marlin_utils_fp4.py:338] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
yewentao256 left a comment:
Please merge from main and resolve the pre-commit issues and conflicts.
cc @janreges
@geraldstanje1 Sorry for the confusion, but this warning isn't actually something to be concerned about, since GPT-OSS is already a weight-only checkpoint, i.e. MXFP4 W4A16. From the PR description, performance is actually worse with this backend.

Can you try benchmarking locally to see if there is any reason to use this kernel?
Closing this PR for now, as this kernel seems slower than Marlin on SM120 and achieves the same result as MXFP4 W4A16. The warning message shared was just a source of user confusion, and it has been removed on main.
Purpose
Enable MXFP4 Triton kernel backend on NVIDIA Blackwell consumer GPUs (SM120, compute capability 12.0).
Test Plan
Tested with a version compiled from current source code on NVIDIA RTX PRO 6000 Blackwell 96GB:
```
vllm serve \
  "openai/gpt-oss-120b" \
  --async-scheduling \
  --trust-remote-code \
  --gpu-memory-utilization 0.91 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --tensor-parallel-size 1 \
  --max-num-batched-tokens 32768 \
  --max-model-len 131072 \
  --max-num-seqs 512 \
  --disable-log-requests \
  --reasoning-parser openai_gptoss \
  --enable-auto-tool-choice \
  --tool-call-parser openai \
  --port 8000
```

Test Result
Tested on NVIDIA RTX PRO 6000 Blackwell (compute capability 12.0) with an AMD EPYC 9554 processor: the model openai/gpt-oss-120b loads and runs successfully with the MXFP4 Triton backend. However, performance is worse than with the Marlin backend: batch 1 = 160 flow/s with Triton vs. 201 flow/s with Marlin under the same configuration.
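The reported numbers put the Triton backend roughly 20% behind Marlin; a quick back-of-the-envelope check using the figures from the test result above:

```python
triton_tps = 160.0  # batch 1 throughput with the MXFP4 Triton backend
marlin_tps = 201.0  # same configuration with the Marlin backend

# Relative slowdown of Triton vs. Marlin
slowdown = 1.0 - triton_tps / marlin_tps
print(f"Triton is {slowdown:.1%} slower than Marlin")  # → Triton is 20.4% slower than Marlin
```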
Essential Elements of an Effective PR Description Checklist

- Update supported_models.md and examples for a new model.