Add fused top-K softmax kernel for MoE #2769
Conversation
@Yard1 @cadedaniel @pcmoritz Can any one of you review the PR?
Yes, happy to review! Thanks a lot for writing this :)
@pcmoritz Thanks!
Btw, I did a bit of benchmarking on this PR, and without touching any of the parameters introduced in the PR I'm already seeing a 1.5%-3.5% end-to-end latency improvement. The improvement is larger in the low-latency regime. Concretely, I tested with TP=2 on H100 with 1000 input and 50 output tokens on Mixtral. So it seems worth merging this even though the low-level kernel code is not easy to follow; most people can probably just treat it as a black box, so it shouldn't have a big impact on maintainability.
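For context, the setup described above can be reproduced roughly as follows with vLLM's offline Python API. This is a minimal sketch only; the model name, the dummy prompt construction, and the single-request measurement are assumptions, not the exact benchmark script behind the numbers quoted.

```python
# Rough end-to-end latency measurement matching the setup described above
# (Mixtral, TP=2, 1000 input tokens, 50 output tokens). A sketch, not the
# exact benchmark used for the quoted 1.5%-3.5% improvement.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2)
params = SamplingParams(temperature=0.0, max_tokens=50, ignore_eos=True)

# 1000 dummy input tokens for one request; a real benchmark would use actual
# prompts and average over many iterations.
prompt_token_ids = [[0] * 1000]

start = time.perf_counter()
llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=params)
print(f"end-to-end latency: {time.perf_counter() - start:.3f} s")
```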
I will spend some more time trying to understand the implementation in topk_softmax_kernels.cu, but no need to block on that since it's mostly the upstream code from https://github.com/NVIDIA/TensorRT-LLM/blob/v0.7.1/cpp/tensorrt_llm/kernels/mixtureOfExperts/moe_kernels.cu, and in any case we should probably keep it close to upstream and not change it :)
@pcmoritz Thanks again for your review! Yes, I think we don't have to worry too much about the implementation details, at least for the moment, since I only made a minor change to the kernel.
Sounds good, the PR looks great :)
This PR ports a fused top-k softmax kernel from TensorRT-LLM v0.7.1.
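For reference, the routing step the fused kernel implements is a softmax over the per-token router logits followed by top-k expert selection. The sketch below expresses that computation unfused in PyTorch; the function name, the `renormalize` flag, and the shapes are illustrative, not the kernel's actual interface.

```python
# Unfused reference for the routing step the kernel fuses: softmax over the
# per-token router logits, then top-k expert selection. Names and the optional
# renormalization are illustrative, not the kernel's exact semantics.
import torch


def topk_softmax_reference(router_logits: torch.Tensor, top_k: int,
                           renormalize: bool = False):
    # router_logits: [num_tokens, num_experts]
    probs = torch.softmax(router_logits, dim=-1, dtype=torch.float32)
    topk_weights, topk_ids = torch.topk(probs, top_k, dim=-1)
    if renormalize:
        # Some MoE models (e.g. Mixtral) renormalize the selected weights.
        topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return topk_weights, topk_ids  # each [num_tokens, top_k]


# Example: route 4 tokens over 8 experts, picking the top 2 experts per token.
weights, ids = topk_softmax_reference(torch.randn(4, 8), top_k=2)
```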
TODO: