[EPLB] Support EPLB w/ NVFP4 #29804
Conversation
Code Review
This pull request adds support for Expert Parallel Load Balancing (EPLB) with NVFP4 quantization. The changes include a new test case for this functionality and modifications to ModelOptNvFp4FusedMoE to handle the EPLB path, along with a new kernel wrapper flashinfer_trtllm_fp4_routed_moe. The implementation is largely correct, but I've identified a critical issue where the routing method type is hardcoded in the new kernel wrapper. This would lead to incorrect behavior for MoE models that use different routing mechanisms. I have provided comments with suggestions to address this issue by dynamically determining the routing method.
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Instead, only a small and essential subset of CI tests (fastcheck) runs to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging, for example by adding the ready label to the PR. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Signed-off-by: Andrew Briand <abriand@nvidia.com>
Force-pushed from cf61fec to e642217
Does this support the Marlin kernel?

Yes, this should work since Marlin accepts topk_ids from
):
    # Pack top k ids and expert weights into a single int32 tensor, as
    # required by TRT-LLM
    packed_tensor = (topk_ids.to(torch.int32) << 16) | topk_weights.to(
Maybe hide this packing operation inside flashinfer_trtllm_fp4_routed_moe, i.e., let flashinfer_trtllm_fp4_routed_moe take topk_ids and topk_weights directly, making its interface closer to Marlin's.

Additionally, the packing will be removed in the flashinfer API in the near future, so we can then just pass topk_ids and topk_weights to flashinfer.
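For intuition, the bit layout being discussed can be sketched in pure Python: the expert id occupies the high 16 bits of an int32 and the bfloat16 bit pattern of the routing weight occupies the low 16 bits (helper names here are hypothetical; the actual PR does this on whole torch tensors in one expression).

```python
import struct

def bf16_bits(x: float) -> int:
    """Bit pattern of x truncated to bfloat16 (the upper 16 bits of float32)."""
    (u32,) = struct.unpack("<I", struct.pack("<f", x))
    return u32 >> 16

def bits_to_float(bits: int) -> float:
    """Re-expand 16 bfloat16 bits back into a Python float."""
    (f,) = struct.unpack("<f", struct.pack("<I", bits << 16))
    return f

def pack_entry(expert_id: int, weight: float) -> int:
    # Expert id in the high 16 bits, bf16 weight bits in the low 16 bits,
    # mirroring (topk_ids << 16) | topk_weights-as-bf16 from the diff above.
    return (expert_id << 16) | bf16_bits(weight)

def unpack_entry(packed: int) -> tuple[int, float]:
    return packed >> 16, bits_to_float(packed & 0xFFFF)

# 0.625 is exactly representable in bfloat16, so the round trip is exact.
eid, w = unpack_entry(pack_entry(42, 0.625))
```

This also shows why hiding the packing inside the wrapper is attractive: callers can keep passing separate topk_ids and topk_weights, and the bit-level encoding becomes an internal detail that can be dropped once flashinfer's API no longer requires it.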
Signed-off-by: Andrew Briand <abriand@nvidia.com>
CC @tlrmchlsmth
IwakuraRein left a comment
LGTM. Thanks for the contribution.
…re comms Signed-off-by: Andrew Briand <abriand@nvidia.com>
Signed-off-by: Andrew Briand <abriand@nvidia.com>
Hi @andrewbriand, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch. For future commits, pre-commit will run automatically before each commit.
Signed-off-by: Andrew Briand <abriand@nvidia.com>
weight[src],
# Move to device in case the weights have been offloaded to CPU
weight[src].to(torch.cuda.current_device()),
Can we submit this change separately? I don't see the need to prioritize supporting CPU offloading with EPLB, and this may have complications.
Sure, I will revert this for now.
…GPU before comms" This reverts commit c3a7ea1. Signed-off-by: Andrew Briand <abriand@nvidia.com>
pavanimajety left a comment
LGTM, thanks for the PR!
Signed-off-by: Andrew Briand <abriand@nvidia.com> Co-authored-by: Andrew Briand <abriand@nvidia.com>
Signed-off-by: Andrew Briand <abriand@nvidia.com> Co-authored-by: Andrew Briand <abriand@nvidia.com> Signed-off-by: Ubuntu <mjtaheri68@gmail.com>
Signed-off-by: Andrew Briand <abriand@nvidia.com> Co-authored-by: Andrew Briand <abriand@nvidia.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Purpose
Support EPLB in combination with NVFP4.
Test Plan
Added a test, test_eplb_fused_moe_layer_dep_nvfp4.py, which ensures that NVFP4 backends correctly route tokens to physical experts based on their logical expert ids.

Test Result
Tests pass on GB200.
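As background on the property the test exercises: EPLB can replicate hot logical experts across multiple physical expert slots, and correct routing means every token lands on a physical slot that actually serves its logical expert. A minimal sketch of that invariant (all names, the mapping table, and the round-robin replica choice are hypothetical illustrations, not the PR's implementation):

```python
# Hypothetical logical-to-physical indirection: logical expert 1 is "hot"
# and gets two physical replicas (slots 1 and 3).
logical_to_physical = {0: [0], 1: [1, 3], 2: [2]}

def route(logical_id: int, token_idx: int) -> int:
    """Pick a physical expert for a token (round-robin over replicas)."""
    replicas = logical_to_physical[logical_id]
    return replicas[token_idx % len(replicas)]

def physical_to_logical(physical_id: int) -> int:
    """Invert the mapping: which logical expert does a physical slot serve?"""
    for logical, replicas in logical_to_physical.items():
        if physical_id in replicas:
            return logical
    raise KeyError(physical_id)

# The invariant a routing test checks: routing never changes the logical
# expert a token was assigned to, regardless of which replica is chosen.
for tok in range(8):
    for lid in logical_to_physical:
        assert physical_to_logical(route(lid, tok)) == lid
```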
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.