[NVIDIA] Add Low Latency NVFP4 decode kernels from Flashinfer#8552
Conversation
| # Additional parameter needed for TRT-LLM | ||
| layer.g1_scale_c = Parameter( | ||
| (layer.w2_input_scale_quant * layer.g1_alphas).to(torch.float32), |
FC1 is nvfp4 x nvfp4 -> nvfp4, so the scaleC factor for FC1 is dequantA * dequantB * quantC. I am not sure what g1_alphas is here.
g1_alphas is used for scaleGated. ScaleGate should be dequantA * dequantB.
scaleC for FC2 must be dequantA * dequantB, as it takes nvfp4 inputs and outputs bf16. Just checking that the logic is as expected.
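The scale-factor algebra being discussed can be sketched in plain Python (a hedged illustration only: per-tensor scaling is assumed for clarity, and the names follow the quoted code, not the kernel's actual scale layout):

```python
# Hedged sketch of how the epilogue scale factors compose for the two MoE
# GEMMs discussed above. Simple per-tensor scaling is an assumption.

def fc1_scale_c(dequant_a: float, dequant_b: float, quant_c: float) -> float:
    # FC1 is nvfp4 x nvfp4 -> nvfp4: the accumulator is dequantized by both
    # input scales, then re-quantized for the nvfp4 output.
    return dequant_a * dequant_b * quant_c

def scale_gate(dequant_a: float, dequant_b: float) -> float:
    # scaleGate (what g1_alphas feeds): the gate branch is consumed in higher
    # precision, so only the two dequant factors apply.
    return dequant_a * dequant_b

def fc2_scale_c(dequant_a: float, dequant_b: float) -> float:
    # FC2 is nvfp4 x nvfp4 -> bf16: no output re-quantization, so scaleC is
    # just dequantA * dequantB.
    return dequant_a * dequant_b
```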
Seems to work. I ran some quick evals: flashinfer trtllmgen MoE vs. the baseline and flashinfer CUTLASS MoE (CUDA graph disabled for the latter, as it was segfaulting at high question counts):
@azhurkevich @kushanam Could you help rebase this? Thanks!
Yeah, I'll rebase it. Working on perf now.
Commands used to launch and bench the server, for repro steps:
- Default backend:
- CUTLASS fused_moe flashinfer backend:
- trtllmgen backend:
- Universal command to run benchmarking:
Default backend perf:

CUTLASS fused_moe perf:

trtllmgen perf:
@ch-wan Let me help with resolving the merge conflict.
Force-pushed 04f3c1f to 0ff4840.
@ch-wan oof, sorry, just noticed your comment. I squashed everything and am doing the final rebase. Do you want me to do it, or would you like to?

@azhurkevich go ahead
Force-pushed 0ff4840 to bc9fb6c.
| self.disable_shared_experts_fusion = True | ||
| logger.warning( | ||
| "FlashInfer TRTLLM MoE is enabled. --disable-shared-experts-fusion is automatically set." | ||
| ) |
Is this not required anymore?
Hi @trevor-m, can you help check this? Thanks! If so, could you submit a PR for this, @pavanimajety?
What flashinfer_python version does it require? I tried 0.2.3 through 0.2.7.post1; they lack RoutingMethodType, which is newly added in this PR.
@yuan-luo you need at least flashinfer v0.2.9rc1. These kernels didn't exist in flashinfer before that release.
Is it possible to make this feature optional if we are not using fp4? I guess it may impact lots of users; for example, the local flashinfer repo mirror in my environment only goes up to v0.2.7.post1.
…oject#8552) Co-authored-by: Cheng Wan <cwan@x.ai>


Motivation
Bring the best low-latency NVFP4 kernels for Blackwell MoE. Currently enabling DSR1.
Modifications
Changing some weight-preprocessing logic as well as exposing these kernels, plus various plumbing to make it work.
Accuracy Test
Ran accuracy tests; see the description and repro steps below.
Benchmark & Profiling
Added below with repros.
Checklist