
Conversation

@zyongye (Member) commented Aug 7, 2025

Needs nightly torch and Triton main to work.

Don't merge yet; waiting on the accuracy test.

github-actions bot commented Aug 7, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces support for mxfp4 quantization on Hopper GPUs by integrating a new Triton kernel for MoE layers. The changes include adding the kernel wrappers, modifying the mxfp4 quantization path to use it, and adding corresponding tests. The implementation looks solid, but I have two high-level concerns. First, the number of warps for the Triton kernel is configured statically based on an environment variable, which might not be optimal or correct for dynamic batch sizes at runtime. Second, a utility function modifies a global configuration flag, which is a risky pattern that could lead to hard-to-debug side effects. Addressing these points would improve the robustness and maintainability of this new feature.

Comment on lines +301 to +306
# FIXME warp need to be adjusted based on batch size
# only apply to batched mode
if self.moe.use_ep:
    num_warps = 4 if envs.VLLM_MOE_DP_CHUNK_SIZE <= 512 else 8
else:
    num_warps = 8

Severity: high

The FIXME comment on line 301 indicates that num_warps should be adjusted based on the batch size. The current implementation determines num_warps based on the static environment variable VLLM_MOE_DP_CHUNK_SIZE, which may not reflect the dynamic batch size at runtime. This static configuration could lead to suboptimal performance or potential correctness issues if the Triton kernel has strict requirements for num_warps based on the input size. This value is used during weight loading to swizzle the weights, so it cannot be changed dynamically per batch without re-swizzling. This suggests a potential design issue that should be addressed for robust performance and correctness.
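
To make the concern concrete, here is a hypothetical helper that would key num_warps on the runtime token count instead of the environment variable. Illustrative only: as noted above, the chosen value also feeds weight swizzling at load time, so it cannot simply be recomputed per batch without re-swizzling.

def pick_num_warps(num_tokens: int, use_ep: bool) -> int:
    # Hypothetical: mirrors the static heuristic in the diff above, but keyed
    # on the actual number of tokens in the batch rather than
    # VLLM_MOE_DP_CHUNK_SIZE. Any per-batch change would require re-swizzling
    # the weights, so this is a sketch of the concern, not a drop-in fix.
    if use_ep:
        return 4 if num_tokens <= 512 else 8
    return 8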

Comment on lines +25 to +31
if current_platform.is_cuda() and \
        current_platform.is_device_capability(100):
    constraints = {
        "is_persistent": True,
        "epilogue_subtile": 1,
    }
    opt_flags.update_opt_flags_constraints(constraints)

Severity: high

The function _swizzle_mxfp4 modifies a global state by calling opt_flags.update_opt_flags_constraints(constraints). Modifying global state within a utility function is a dangerous pattern as it can introduce non-local side effects that are difficult to debug, especially in a system that might handle multiple models or requests concurrently. This could cause issues if different models or layers have conflicting requirements for these optimization flags. It would be safer to manage this global state with more care, for example, by using a context manager to set and restore the flags, or by passing constraints as parameters to the underlying kernel if the API supports it.
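
As an illustration of the context-manager approach suggested above, here is a minimal sketch. The import path mirrors how the PR appears to use opt_flags, and the mechanism for reading back and restoring the previous constraints is an assumption; triton_kernels may not expose such an API, in which case the restore step would need to be adapted.

from contextlib import contextmanager

from triton_kernels.matmul_ogs_details import opt_flags  # import path assumed


@contextmanager
def opt_flags_constraints(constraints: dict):
    # Apply opt_flags constraints for the duration of a block, then restore.
    # The "_opt_flags_constraints" attribute used for the snapshot is
    # hypothetical; adjust to whatever triton_kernels actually exposes.
    previous = dict(getattr(opt_flags, "_opt_flags_constraints", None) or {})
    opt_flags.update_opt_flags_constraints(constraints)
    try:
        yield
    finally:
        opt_flags.update_opt_flags_constraints(previous)


# Sketch of usage inside _swizzle_mxfp4:
# with opt_flags_constraints({"is_persistent": True, "epilogue_subtile": 1}):
#     ...  # convert_layout / kernel setup that needs the constraints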

zyongye added 2 commits August 7, 2025 21:52
Comment on lines +3246 to +3249
def has_triton_kernels() -> bool:
    """Whether the optional `triton_kernels` package is available."""

    return _has_module("triton_kernels")
A collaborator commented:

QQ: How can I install this?

@zyongye (Member, Author) replied Aug 8, 2025

We need to install it directly from the Triton repo:

uv pip install triton/python/triton_kernels --no-deps

There's no PyPI wheel yet.
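
For context on the check quoted above, availability probes like _has_module are typically thin wrappers around importlib; the sketch below shows the usual shape, though the actual helper in vLLM may be implemented differently.

from importlib.util import find_spec


def _has_module(name: str) -> bool:
    # True if the package is importable, without actually importing it.
    return find_spec(name) is not None


def has_triton_kernels() -> bool:
    """Whether the optional `triton_kernels` package is available."""
    return _has_module("triton_kernels")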

@WoosukKwon merged commit e789cad into vllm-project:main Aug 8, 2025
11 of 14 checks passed
@minosfuture (Contributor) commented:

hmm, this broke the trunk

(APIServer pid=1847171)   File "/data/users/yming/gitrepos/vllm2/vllm/config.py", line 1173, in _verify_quantization
(APIServer pid=1847171)     method = me_quant.get_quantization_config(name)
(APIServer pid=1847171)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1847171)   File "/data/users/yming/gitrepos/vllm2/vllm/model_executor/layers/quantization/__init__.py", line 114, in get_quantization_config
(APIServer pid=1847171)     from .mxfp4 import Mxfp4Config
(APIServer pid=1847171)   File "/data/users/yming/gitrepos/vllm2/vllm/model_executor/layers/quantization/mxfp4.py", line 11, in <module>
(APIServer pid=1847171)     from vllm.model_executor.layers.fused_moe.gpt_oss_triton_kernels_moe import (
(APIServer pid=1847171)   File "/data/users/yming/gitrepos/vllm2/vllm/model_executor/layers/fused_moe/gpt_oss_triton_kernels_moe.py", line 13, in <module>
(APIServer pid=1847171)     import triton_kernels.swiglu
(APIServer pid=1847171) ModuleNotFoundError: No module named 'triton_kernels'

@zyongye (Member, Author) commented Aug 8, 2025

Are you running gpt-oss or another model?

You need to install triton_kernels from the Triton repo:

git clone https://github.com/triton-lang/triton
uv pip install triton/python/triton_kernels --no-deps

@zyongye (Member, Author) commented Aug 8, 2025

Pushed a fix #22529
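
For anyone hitting the same ModuleNotFoundError: the usual shape of such a fix is to guard the optional import so that models which never touch the mxfp4 path do not require triton_kernels at import time. A sketch only; the actual change in #22529 may be structured differently, and the import location of has_triton_kernels is assumed here.

from vllm.utils import has_triton_kernels  # location assumed for illustration

if has_triton_kernels():
    # Only pull in the optional dependency when it is installed, so unrelated
    # model paths (e.g. llama4, deepseek) don't fail at import time.
    import triton_kernels.swiglu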

@huydhn (Contributor) commented Aug 8, 2025

Just FYI, the error also shows up on the llama4 benchmark run https://github.com/pytorch/pytorch-integration-testing/actions/runs/16834994069/job/47692144587#step:14:3962, so it affects other models too.

@minosfuture (Contributor) commented:

Yeah, I was running deepseek; the code path is shared. Thanks for the quick fix!

jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025
noamgat pushed a commit to noamgat/vllm that referenced this pull request Aug 9, 2025
@yiliu30 (Contributor) commented Aug 13, 2025

Hi @zyongye, can we use this kernel on Blackwell? If so, could you provide the Triton commit? I hit the following failures when running the unit tests locally.

============================== short test summary info ==============================
FAILED test_gpt_oss_triton_kernels.py::test_equiv[1-2-bf16-mx4] - NotImplementedError: Must use persistent kernel and be TMA-compliant for native ...
FAILED test_gpt_oss_triton_kernels.py::test_equiv[2-2-bf16-mx4] - NotImplementedError: Must use persistent kernel and be TMA-compliant for native ...
FAILED test_gpt_oss_triton_kernels.py::test_equiv[4-2-bf16-mx4] - NotImplementedError: Must use persistent kernel and be TMA-compliant for native ...
FAILED test_gpt_oss_triton_kernels.py::test_equiv[8-2-bf16-mx4] - NotImplementedError: Must use persistent kernel and be TMA-compliant for native ...
FAILED test_gpt_oss_triton_kernels.py::test_triton_kernel_batched_moe[1-64-bf16-mx4] - triton.compiler.errors.CompilationError: at 277:23:
FAILED test_gpt_oss_triton_kernels.py::test_triton_kernel_batched_moe[2-64-bf16-mx4] - triton.compiler.errors.CompilationError: at 277:23:
FAILED test_gpt_oss_triton_kernels.py::test_triton_kernel_batched_moe[4-64-bf16-mx4] - triton.compiler.errors.CompilationError: at 277:23:
FAILED test_gpt_oss_triton_kernels.py::test_triton_kernel_batched_moe[8-64-bf16-mx4] - triton.compiler.errors.CompilationError: at 277:23:
============================ 8 failed, 1 passed in 4.86s ============================

@mgoin (Member) commented Aug 13, 2025

Hi @yiliu30, for Blackwell (SM100) we have kernels from FlashInfer available; please see the recipe for details: https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#b200

paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
@zyongye deleted the hopper-mxfp4 branch August 15, 2025 05:44
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
yiliu30 pushed a commit to yiliu30/vllm-fork that referenced this pull request Aug 19, 2025
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
Comment on lines +49 to +53
quant_tensor = convert_layout(wrap_torch_tensor(quant_tensor, dtype=FP4),
                              value_layout, **value_layout_opts)
scale = convert_layout(wrap_torch_tensor(scale), scale_layout,
                       **scale_layout_opts)
return quant_tensor, InFlexData(), scale
A contributor commented:

Is it safe to unwrap from triton_kernels.tensor.Tensor from here? Could we avoid it in the first place?

@zyongye (Member, Author) replied:

This is a util function from triton_kernels; it is designed to take a triton_kernels.Tensor rather than a torch.Tensor.

@mergify bot added the gpt-oss (Related to GPT-OSS models) label Nov 12, 2025
del layer.w2_weight
layer.w13_weight = None
layer.w2_weight = None
torch.cuda.empty_cache()
A contributor commented:

link
