Flashinfer_CUTLASS_MOE fuses quantization for TP #27223
pavanimajety merged 6 commits into vllm-project:main from
Conversation
Force-pushed from 1057c27 to 4c37b7d
bnellnm
left a comment
LGTM. Alternatively, you could prevent modular kernels from being created in this case and fall through to the direct call to flashinfer_cutlass_moe_fp4?
Yes. But do you prefer to have modular kernels anyway?
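For illustration, a minimal sketch of the fall-through bnellnm describes, assuming modular kernels are only built when all2all communication is involved; all names here are stand-ins, not exact vLLM APIs:

```python
# Sketch only: `maybe_make_modular_kernel` is a hypothetical helper, and
# `flashinfer_cutlass_moe_fp4` is stubbed to keep the example runnable.

def flashinfer_cutlass_moe_fp4(hidden_states):
    # Stand-in for the fused FlashInfer CUTLASS FP4 MoE entry point.
    return hidden_states

def maybe_make_modular_kernel(uses_all2all: bool):
    # Only build a modular kernel when all2all communication is involved.
    return (lambda x: x) if uses_all2all else None

def moe_forward(hidden_states, uses_all2all: bool):
    kernel = maybe_make_modular_kernel(uses_all2all)
    if kernel is None:
        # Fall-through path: no modular kernel was created, so call the
        # fused op directly.
        return flashinfer_cutlass_moe_fp4(hidden_states)
    return kernel(hidden_states)
```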
Referenced diff:

    assert self.moe_quant_config is not None
    ...
    return flashinfer_cutlass_moe_fp4(
Can you update for compressed-tensors too?
But by deleting this elif clause (and, per @mgoin's suggestion, applying this change to compressed-tensors), doesn't it force the FlashInfer CUTLASS implementation to go through the modular kernels?
I'm just trying to understand if this is the plan for all cases that use FlashInfer, regardless of distributed strategies, or whether self.flashinfer_moe_backend is FlashinferMoeBackend.TENSORRT_LLM or FlashinferMoeBackend.CUTLASS.
I don't see any reason to use the modular kernels for cases that aren't using some kind of all2all communication. In this particular case I think @wenscarl and @leejnau figured out that this was dead code because the CUTLASS case always created a modular kernel. I'm not sure if the same holds true for compressed_tensors.
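A hedged sketch of that rule of thumb; the enum mirrors FlashinferMoeBackend from the thread, and the helper name is hypothetical:

```python
from enum import Enum

class FlashinferMoeBackend(Enum):
    # Mirrors the backend enum referenced above (values are illustrative).
    TENSORRT_LLM = "trtllm"
    CUTLASS = "cutlass"

def needs_modular_kernel(backend: FlashinferMoeBackend,
                         uses_all2all: bool) -> bool:
    # Per bnellnm's point: modular kernels are justified only when some
    # all2all communication is in play, independent of which FlashInfer
    # backend is selected (hence `backend` is deliberately unused).
    return uses_all2all
```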
@bnellnm this PR is updated with an additional quant_dtype: nvfp4_skip_quantization.
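Roughly, a skip-quantization quant_dtype could be consumed like this; a sketch only, where the config class and helper names are assumptions rather than vLLM's actual definitions:

```python
from dataclasses import dataclass

@dataclass
class MoEQuantConfig:
    quant_dtype: str  # e.g. "nvfp4" or "nvfp4_skip_quantization"

def quantize_nvfp4(x):
    # Stand-in for an explicit host-side NVFP4 quantization step.
    return x

def prepare_activations(hidden_states, cfg: MoEQuantConfig):
    if cfg.quant_dtype == "nvfp4_skip_quantization":
        # Leave activations unquantized; the fused FlashInfer CUTLASS
        # kernel performs NVFP4 quantization internally.
        return hidden_states
    return quantize_nvfp4(hidden_states)
```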
> I'm just trying to understand if this is the plan for all cases that use FlashInfer
I vote for that, since FlashInfer CUTLASS MoE is at least a better option than the normal cutlass_moe, and TRTLLM MoE can sometimes even win over that.
I've no preference for this particular case since there's no communication going on. My only concern would be that the …
Force-pushed from c78959c to 4c37b7d
Referenced diff:

    FlashInferExperts(
        out_dtype=hidden_states.dtype,
        quant_config=quant_config,
        use_dp=False,
Why are we always setting use_dp=False? Doesn't flashinfer_cutlass_moe_fp4 also support DP?
flashinfer_cutlass_moe_fp4(dead code) is removed in this PR. Previously flashinfer_cutlass_moe_fp4 is meant for TP case only. If DP, the fused expert is assemble elsewhere.
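For context, a sketch of how use_dp might be derived; the config fields and function names are assumptions, not vLLM's actual parallel config:

```python
from dataclasses import dataclass

@dataclass
class ParallelSetup:
    dp_size: int  # data-parallel world size
    tp_size: int  # tensor-parallel world size

def use_dp_for_experts(setup: ParallelSetup) -> bool:
    # TP-only deployments (dp_size == 1) take the use_dp=False path shown
    # in the snippet above; DP deployments assemble the fused expert
    # elsewhere with use_dp=True.
    return setup.dp_size > 1
```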
How is TP + FlashInfer CUTLASS handled now? Could we remove this method altogether, or do we see any cases where it would be used?
https://github.com/vllm-project/vllm/pull/27223/files#diff-5bb9585da825481e1ae1534657a703f846325c7dd72cfddb0c41f878db33d78aR82 differentiates TP vs. DP, but in either case a fused MoE expert is assembled. I agree that flashinfer_cutlass_moe_fp4 should be removed, but compressed-tensors TP could still depend on it, so its removal is out of scope for this PR.
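A sketch of that differentiation: in both cases a fused MoE expert is assembled, and only the DP case adds an all2all prepare/finalize wrapper. All names below are illustrative stand-ins:

```python
def build_flashinfer_experts(quant_config, use_dp: bool):
    # Stand-in for constructing FlashInferExperts.
    return {"quant_config": quant_config, "use_dp": use_dp}

def wrap_with_all2all(experts):
    # Stand-in for attaching an all2all prepare/finalize step.
    return ("all2all", experts)

def assemble_fused_moe_expert(quant_config, use_dp: bool):
    experts = build_flashinfer_experts(quant_config, use_dp)
    if use_dp:
        # DP: activations are exchanged across ranks around the expert call.
        return wrap_with_all2all(experts)
    # TP: no cross-rank activation exchange is needed.
    return experts
```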
Referenced diff:

    elif (
        self.allow_flashinfer
        and self.flashinfer_moe_backend == FlashinferMoeBackend.CUTLASS
    ):
Will this break PP mode? When running DeepSeek with PP, it goes to the last clause, which uses cutlass_moe_fp4 and will break on SM120, while on SGLang flashinfer_cutlass works with SM120 PP.
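A sketch of the dispatch concern raised here; string tags stand in for the real kernel objects, and the SM check is illustrative:

```python
def select_fp4_moe_kernel(allow_flashinfer: bool, backend: str, sm: int) -> str:
    if allow_flashinfer and backend == "cutlass":
        # FlashInfer CUTLASS path: reported to work on SM120, including PP.
        return "flashinfer_cutlass_moe"
    # Last clause: without the elif above, a DeepSeek PP run lands here.
    if sm >= 120:
        raise RuntimeError("cutlass_moe_fp4 is not supported on SM120")
    return "cutlass_moe_fp4"
```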
Well, #27123 added this back and solved my issue.
Co-authored by @leejnau.
cc @bnellnm
Purpose
For the TP case, the NVFP4 quantization is fused with the flashinfer_cutlass_moe call. This fixes the accuracy issue for the nvidia/Deepseek-R1-0528-FP4-v2 model.
For the DP case, the fix should rely on #26135.
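As a rough before/after sketch of the TP fix, under the assumption that the accuracy issue came from quantizing activations outside the kernel; function names are stand-ins and the exact vLLM/FlashInfer signatures differ:

```python
def quantize_nvfp4(x):
    # Stand-in for host-side NVFP4 quantization; returns (quantized, scale).
    return x, 1.0

def flashinfer_cutlass_moe(x, weights, input_scale=None):
    # Stand-in for the fused FlashInfer CUTLASS MoE call.
    return x

def moe_tp_before(x, weights):
    # Old TP path: activations were quantized outside the kernel.
    xq, scale = quantize_nvfp4(x)
    return flashinfer_cutlass_moe(xq, weights, input_scale=scale)

def moe_tp_after(x, weights):
    # New TP path: pass high-precision activations straight in; NVFP4
    # quantization is fused into the FlashInfer CUTLASS MoE kernel.
    return flashinfer_cutlass_moe(x, weights)
```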
Test Plan
Test Result
Previously:
With this PR: