[Kernel] Add Sonic MoE integration for Hopper GPUs #31548
clocksmith wants to merge 10 commits into vllm-project:main
Conversation
Code Review
This pull request integrates Sonic MoE for Hopper GPUs, which is a great performance enhancement. The implementation is well-structured, following existing patterns in vLLM for MoE layers, and includes necessary components like weight permutation and support-checking utilities. The accompanying tests cover the new logic, including platform detection and weight permutation correctness. I've identified a couple of areas for improvement: one is a performance optimization to cache the created MoE kernel, and the other is a potential logic issue in one of the new tests. Overall, this is a solid contribution.
Force-pushed from 4022ab9 to cdc35ec
Requested a review to run the sonicmoe-related tests in the buildkite/ci/pr CI item.
In this PR, the kernel is not hooked up to any MoeMethod, so it cannot be run by any users other than in the unit test.

EDIT: Correct, this is intentional: kernel and unit tests first, to get the implementation reviewed. You are right though; I am happy to add the wiring now. UPDATE: @robertgshaw2-redhat added the wiring, PTAL. Still pending the …
zyongye left a comment
Thank you for your contribution. I think the biggest issue is that our current framework can't take router logits directly and needs to do additional bookkeeping to generate the metadata (and that process is not optimized). I think we can leave that for now, wait for the MoE refactor to complete, and change this PR accordingly.
The moe refactor issue tracker is here
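For context, the bookkeeping being discussed is roughly: given raw router logits, the framework has to derive top-k expert ids and normalized routing weights before a kernel can consume them. A stdlib-only sketch of that metadata step (function and variable names are illustrative, not vLLM's actual API, and the real version runs on GPU tensors):

```python
import math

def router_metadata(logits: list[float], topk: int) -> tuple[list[int], list[float]]:
    """Derive top-k expert ids and softmax-normalized weights from raw router
    logits. Illustrative only: this is the metadata a kernel that accepted
    logits directly could generate internally instead."""
    # Pick the top-k experts by logit value.
    ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:topk]
    # Softmax over just the selected logits (numerically stabilized).
    m = max(logits[i] for i in ids)
    exps = [math.exp(logits[i] - m) for i in ids]
    total = sum(exps)
    return ids, [e / total for e in exps]
```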
```python
_up_projection_forward(
    x=hidden_states,
    w1=w1_sonic,
    z=z,
```
We probably don't need this since they fuse swiglu with up proj.
I think this is needed: _up_projection_forward performs both the up projection and swiglu in one fused kernel (which is usually more efficient, and is core to Sonic MoE on Hopper). swiglu is not being applied separately; is this correct?
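For reference, the activation being fused here computes silu on the gate half and multiplies by the up half, i.e. silu(x @ w_gate) * (x @ w_up) in one kernel. A stdlib-only scalar sketch of just the activation (helper names are mine, not Sonic's):

```python
import math

def silu(x: float) -> float:
    # silu(x) = x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def swiglu(gate: float, up: float) -> float:
    """SwiGLU as applied inside a fused up-projection kernel:
    the gate half goes through SiLU, then multiplies the up half."""
    return silu(gate) * up
```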
Thanks! Got it. Is it common within vLLM to use comments like …?
Yeah, I think it's fine for now.
Force-pushed from 8c5f5af to 1479564
Hi @clocksmith, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Force-pushed from 1479564 to f4d8a42
Force-pushed from cf15b99 to b665cb3
This should simplify the testing issue so this PR can be tested in place (kernel-only), with e2e potentially following: #31606
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 27c689b to 00d9a35
Merged and updated, with the CI build step added to this PR directly. Closed #31606 (review). Can't wait to see how this benchmarks!
Still a mismatch that I am debugging, probably in the activation, but getting closer:
WOOHOO! H100 benchmark update (tuned Triton configs, eager/eager):

CSV columns: `dtype,e,topk,m,k,n,triton_us,sonic_us,speedup,mode,rel_err`

- e=8, topk=2, m=256, k=512, n=4096
Signed-off-by: X <x@simulatte.world>
I cannot get into the vLLM Slack; what should be done with this PR? Can it be safely merged (and not used), with the decision on how to route to prefill-only made next?
```python
sonic_requested = envs.VLLM_USE_SONIC_MOE
sonic_supported = False
if sonic_requested:
    from vllm.model_executor.layers.fused_moe.sonic_moe import (
        is_sonic_moe_supported,
    )

    sonic_supported = is_sonic_moe_supported()
sonic_enabled = (
    sonic_supported
    and sonic_requested
    and is_act_and_mul
    and not has_bias
    and not use_ep
    and not moe_config.moe_parallel_config.is_sequence_parallel
    and moe_config.experts_per_token <= 16
    and moe_config.in_dtype in (torch.float16, torch.bfloat16)
    and moe_config.activation in ("silu", "silu_and_mul")
)
if sonic_requested and sonic_supported and not sonic_enabled:
    if use_ep:
        logger.debug_once(
            "Sonic MoE disabled because expert parallelism is enabled."
        )
    elif has_bias:
        logger.debug_once("Sonic MoE disabled because MoE biases are enabled.")
    elif not is_act_and_mul:
        logger.debug_once("Sonic MoE disabled because is_act_and_mul is False.")
    elif moe_config.moe_parallel_config.is_sequence_parallel:
        logger.debug_once(
            "Sonic MoE disabled because sequence parallelism is enabled."
        )
    elif moe_config.experts_per_token > 16:
        logger.debug_once("Sonic MoE disabled because topk > 16.")
    elif moe_config.in_dtype not in (torch.float16, torch.bfloat16):
        logger.debug_once(
            "Sonic MoE disabled because input dtype is unsupported: %s",
            moe_config.in_dtype,
        )
    elif moe_config.activation not in ("silu", "silu_and_mul"):
        logger.debug_once(
            "Sonic MoE disabled because activation is unsupported: %s",
            moe_config.activation,
        )
```
Can we put all of this in sonic_moe.py?
Yeah, it would be great if you could move this into SonicMoeExperts::_supports_current_device, _supports_no_act_and_mul, etc. and rely on FusedMoEExperts::is_supported_config check for Sonic support
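The suggested restructuring would turn the flat boolean chain above into small named predicates on the experts class, driven by a shared support check. A hedged stdlib sketch of that shape (the method names follow the comment, but the real `FusedMoEExperts` interface may differ; `cfg` is a plain dict here for illustration):

```python
class SonicMoeExperts:
    """Sketch: each unsupported feature gets its own predicate so the
    framework can report exactly which check failed."""

    SUPPORTED_DTYPES = ("float16", "bfloat16")
    MAX_TOPK = 16

    @classmethod
    def _supports_act_and_mul(cls, cfg) -> bool:
        return cfg["is_act_and_mul"] and cfg["activation"] in ("silu", "silu_and_mul")

    @classmethod
    def _supports_parallelism(cls, cfg) -> bool:
        return not cfg["use_ep"] and not cfg["is_sequence_parallel"]

    @classmethod
    def is_supported_config(cls, cfg):
        # Return (supported, reason) so the caller can log one precise message.
        checks = [
            (cls._supports_act_and_mul(cfg), "activation/is_act_and_mul unsupported"),
            (not cfg["has_bias"], "MoE biases are enabled"),
            (cls._supports_parallelism(cfg), "EP or sequence parallelism enabled"),
            (cfg["topk"] <= cls.MAX_TOPK, "topk > 16"),
            (cfg["dtype"] in cls.SUPPORTED_DTYPES, "unsupported input dtype"),
        ]
        for ok, reason in checks:
            if not ok:
                return False, reason
        return True, None
```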
```yaml
- pip install -r /tmp/sonic-moe/requirements.txt
# Override SonicMoE pins: older CUTLASS DSL can generate invalid LLVM IR on H100;
# newer CUTLASS DSL triggers import-time annotation evaluation errors in SonicMoE.
- pip install -U nvidia-cutlass-dsl==4.3.5
- pip install -U --force-reinstall --no-deps git+https://github.com/Dao-AILab/quack.git@4210c0abcb20a7126775661640e79de425a55206
- pip install -e /tmp/sonic-moe --no-deps
- pytest -v -s kernels/moe/test_sonic_moe.py
```
I don't know if this installation will mess up other packages. Since it's not enabled by default, maybe we don't need it in the CI?
| f"got {activation}" | ||
| ) | ||
|
|
||
| w1_sonic, w2_sonic = self._ensure_weights_ready(w1, w2) |
Can we move it out of the critical path? It can be done completely at weight-loading time.
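One way to honor this suggestion is to permute once when the weights load and cache the result, so the forward path only reads cached tensors. A stdlib-only sketch of that caching shape (the `process_weights_after_loading` name mirrors vLLM's post-load hook; the placeholder "permutation" here just reverses a list for demonstration):

```python
class SonicWeights:
    """Sketch: do the one-time weight permutation at load time, not per forward."""

    def __init__(self, w1, w2):
        self._w1_sonic = None
        self._w2_sonic = None
        self.process_weights_after_loading(w1, w2)

    def process_weights_after_loading(self, w1, w2):
        # Placeholder for the real concatenated -> interleaved permutation;
        # runs exactly once, outside the inference critical path.
        self._w1_sonic = list(reversed(w1))
        self._w2_sonic = list(w2)

    def forward_weights(self):
        # Critical path: no permutation work, just return the cached weights.
        return self._w1_sonic, self._w2_sonic
```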
```python
        ) from None


def sonic_moe_forward(
```
Is this function solely used for tests?
Also, please fix the pre-commit.
yzong-rh left a comment
Thank you for the contribution.
There has been some refactoring, which caused those pre-commit complaints:

- `FusedMoEModularKernel` -> `FusedMoEKernel` (now supports both modular and monolithic)
- `MoEPrepareAndFinalizeNoEP` -> `MoEPrepareAndFinalizeNoDPEP` (more accurate naming)
- `FusedMoEPermuteExpertsUnpermute` -> `FusedMoEExpertsModular` or `FusedMoEExpertsMonolithic`
I think a lot of it was done in #32564
FlashInferExperts is a good example of the new flow.
Purpose
Integrate Sonic MoE for Hopper GPUs. (paper) 🎸 🚀
Weights are permuted from vLLM's concatenated format to Sonic's interleaved format during model loading (no inference runtime cost).
Addresses #31039
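The load-time permutation described above can be illustrated on row indices: vLLM stores w1 as the gate rows concatenated with the up rows, while an interleaved layout alternates gate and up rows. A stdlib-only index sketch, assuming simple row-wise gate/up interleaving (the real permutation operates on GPU tensors and Sonic's exact layout may differ):

```python
def concat_to_interleaved(rows: list) -> list:
    """Reorder [g0..g{n-1}, u0..u{n-1}] into [g0, u0, g1, u1, ...]."""
    assert len(rows) % 2 == 0, "expected concatenated gate+up halves"
    half = len(rows) // 2
    out = []
    # Pair row i of the gate half with row i of the up half.
    for g, u in zip(rows[:half], rows[half:]):
        out.extend([g, u])
    return out
```

Because this runs once during model loading, the interleaving adds no inference-runtime cost.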
Key Changes
Usage