[Perf] Fuse stride preparation for NVFP4 cutlass_moe (#31837)
Conversation
Signed-off-by: mgoin <mgoin64@gmail.com>
Code Review
This pull request introduces a performance optimization for the nvfp4 CUTLASS MoE kernels by fusing the stride tensor initialization. Instead of using torch::full to create and initialize stride tensors, which launches separate CUDA kernels, this change allocates uninitialized tensors with torch::empty and performs the initialization inside the existing __get_group_gemm_starts kernel. This effectively reduces kernel launch overhead, leading to the performance improvements shown in the benchmarks. The changes are applied consistently for both sm100 and sm120 architectures. The implementation is correct and the optimization is a good practice for CUDA programming. I have no further comments.
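The pattern the review describes can be sketched as a CPU analogy (hypothetical names and shapes; in the real code the per-expert loop body lives in the `__get_group_gemm_starts` CUDA kernel, roughly one thread per expert, and the buffers are device tensors):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Per-expert argument arrays for a grouped GEMM. In the real code these
// are device tensors; plain vectors keep the sketch dependency-free.
struct GroupGemmArgs {
  std::vector<const void*> a_ptrs, b_ptrs;
  std::vector<int64_t> a_strides, b_strides, c_strides;
};

// Fused setup: the stride buffers stand in for torch::empty allocations
// (their contents are only meaningful after this pass writes them), and
// the strides are filled in the same per-expert pass that already sets
// up the pointer arrays -- instead of three separate fill passes, the
// torch::full launches this PR removes.
void fused_group_gemm_starts(GroupGemmArgs& args,
                             const std::vector<const void*>& a_base,
                             const std::vector<const void*>& b_base,
                             int64_t k, int64_t n, int num_experts) {
  args.a_ptrs.resize(num_experts);
  args.b_ptrs.resize(num_experts);
  args.a_strides.resize(num_experts);  // torch::empty analogue
  args.b_strides.resize(num_experts);
  args.c_strides.resize(num_experts);
  for (int e = 0; e < num_experts; ++e) {
    args.a_ptrs[e] = a_base[e];
    args.b_ptrs[e] = b_base[e];
    // Fused stride initialization: the values are constant for a given
    // call, so no dedicated fill kernel is needed.
    args.a_strides[e] = k;
    args.b_strides[e] = k;
    args.c_strides[e] = n;
  }
}
```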
pavanimajety
left a comment
Thanks! We may be able to further reduce these by initializing in post processing AOT because the values remain constant and are deterministic.
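The suggestion above could look roughly like this (a hypothetical host-side sketch, not the vLLM implementation): because the stride values are deterministic for a given expert count and problem shape, they can be built once ahead of time and reused across calls, removing even the fused per-call initialization.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

// Per-expert stride arrays for one grouped-GEMM shape.
struct StrideSet {
  std::vector<int64_t> a, b, c;
};

// Cache keyed by (num_experts, k, n). Since the strides are constant
// and deterministic for a given shape, repeat calls return the same
// precomputed set instead of re-initializing anything.
const StrideSet& cached_strides(int num_experts, int64_t k, int64_t n) {
  static std::map<std::tuple<int, int64_t, int64_t>, StrideSet> cache;
  auto key = std::make_tuple(num_experts, k, n);
  auto it = cache.find(key);
  if (it == cache.end()) {
    StrideSet s{std::vector<int64_t>(num_experts, k),
                std::vector<int64_t>(num_experts, k),
                std::vector<int64_t>(num_experts, n)};
    it = cache.emplace(key, std::move(s)).first;
  }
  return it->second;
}
```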
nice work
Purpose
When profiling another PR, I noticed that there were always 3
vectorized_elementwise_kernelkernels when calling theops.cutlass_fp4_moe_mmfunction and found thesetorch::fullconstructors. The kernel__get_group_gemm_startsalready runs once per-expert to set up pointers. We can add stride initialization there instead of launching 3 separatetorch::fullkernels. This improves latency by ~5% and affects throughput by a smaller amount.Before (based on #31832 already fusing silu_and_mul):

After:

Test Plan
Test Result
Latency Benchmark
Throughput Benchmark
Eval
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.