[Bugfix] make moe_align_block_size compliable with cuda graph #12036
jinzhen-lin wants to merge 7 commits into vllm-project:main
With CUDA graph enabled, the generation speed of DeepSeek-V3 (W4A16, 8×A100-80G, bs=1) increases from 5 tokens/s to 10 tokens/s.
Force-pushed: b3cba63 to 83f0926
This pull request has merge conflicts that must be resolved before it can be merged.
Closing this PR, as I have created a new PR with better
The `moe_align_block_size` kernel is used by many MoE models (e.g. DeepSeek-V3), but it is currently not compatible with CUDA graph. This PR fixes that.

Reference: sgl-project/sglang@77d1210 sgl-project/sglang@6e53051
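
For readers unfamiliar with the op, here is a rough pure-Python sketch of what a block-alignment step for MoE routing typically computes. It is not the kernel touched by this PR. The function name `align_block_size_reference`, its signature, and the padding convention (using the number of slots as a sentinel) are assumptions made for illustration only. The one point relevant to CUDA graph is that the output buffer is sized to a worst-case bound, so its shape does not depend on how tokens happen to be routed.

```python
from typing import List, Tuple


def align_block_size_reference(
    topk_ids: List[int], block_size: int, num_experts: int
) -> Tuple[List[int], List[int], int]:
    """Hypothetical reference: group flat token slots by expert and pad each
    group to a multiple of block_size. Illustrative only, not vLLM's kernel."""
    pad_id = len(topk_ids)  # assumed sentinel meaning "padding slot"

    # Bucket the flat slot indices by the expert each slot was routed to.
    buckets: List[List[int]] = [[] for _ in range(num_experts)]
    for slot, expert in enumerate(topk_ids):
        buckets[expert].append(slot)

    sorted_ids: List[int] = []
    expert_ids: List[int] = []  # one entry per block of block_size slots
    for expert, slots in enumerate(buckets):
        num_blocks = -(-len(slots) // block_size)  # ceiling division
        sorted_ids.extend(slots)
        sorted_ids.extend([pad_id] * (num_blocks * block_size - len(slots)))
        expert_ids.extend([expert] * num_blocks)
    num_tokens_post_padded = len(sorted_ids)

    # Each expert adds at most block_size - 1 padding slots, so this capacity
    # is a routing-independent upper bound on the output length.
    max_len = len(topk_ids) + num_experts * (block_size - 1)
    sorted_ids.extend([pad_id] * (max_len - len(sorted_ids)))
    return sorted_ids, expert_ids, num_tokens_post_padded


if __name__ == "__main__":
    ids, experts, n = align_block_size_reference([1, 0, 1, 2, 1], 2, 3)
    print(ids, experts, n)
```

A downstream grouped GEMM would then read only the first `num_tokens_post_padded` entries and skip any slot equal to the sentinel, while the buffer length itself stays the same for every input of a given size.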