[Model] Introduce CUDA Graph support for DeepSeek v3 #12204
houseroad wants to merge 1 commit into vllm-project:main
Conversation
Signed-off-by: Lu Fang <lufang@fb.com>
wow that's amazing!
@houseroad Before Woosuk takes a look at this PR, |
tlrmchlsmth left a comment
LGTM, and thanks for the fix! (I'm approving, but I don't have a system that can run DeepSeek v3, so I can't verify the fix end to end. The changes look good regardless.)
```diff
- if (num_experts >= 256) {
+ if (!use_shared_memory) {
```

Why was this change needed for the fix, BTW?
```diff
- const int32_t mem_tokens_cnts =
-     ((num_experts + 1) * num_experts) * sizeof(int32_t);
- const int32_t mem_cumsum = (num_experts + 1) * sizeof(int32_t);
- // allocate global memory
- int32_t* tokens_cnts;
- int32_t* cumsum;
- cudaMalloc(&tokens_cnts, mem_tokens_cnts);
- cudaMalloc(&cumsum, mem_cumsum);
+ torch::Tensor token_cnts =
+     torch::empty({(num_experts + 1) * num_experts},
+                  torch::TensorOptions()
+                      .dtype(torch::kInt)
+                      .device(topk_ids.device()));
+ torch::Tensor cumsum =
+     torch::empty({num_experts + 1}, torch::TensorOptions()
+                                         .dtype(torch::kInt)
+                                         .device(topk_ids.device()));
```
Makes sense to me: during CUDA graph capture, some operations, such as cudaMalloc, are unsafe, whereas allocations that go through the PyTorch caching allocator can be captured.
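To illustrate the pattern the diff above adopts, here is a minimal Python sketch of the same idea: size the workspace buffers with `torch.empty` (which goes through the caching allocator) instead of raw `cudaMalloc`. The `allocate_workspace` helper name is hypothetical, not from this PR, and the sketch uses a CPU device so it runs without a GPU; on a CUDA device the same calls are allocator-backed and graph-capture friendly.

```python
import torch

def allocate_workspace(num_experts: int, device: torch.device):
    # Hypothetical helper (not from the PR): pre-size the two workspace
    # buffers via torch.empty, mirroring the torch::empty calls in the
    # C++ diff, rather than calling cudaMalloc inside the kernel launch
    # path, which is unsafe during CUDA graph capture.
    token_cnts = torch.empty((num_experts + 1) * num_experts,
                             dtype=torch.int32, device=device)
    cumsum = torch.empty(num_experts + 1, dtype=torch.int32, device=device)
    return token_cnts, cumsum

# CPU device so the sketch runs anywhere; swap in torch.device("cuda")
# on a GPU system to exercise the caching allocator.
token_cnts, cumsum = allocate_workspace(8, torch.device("cpu"))
print(token_cnts.numel(), cumsum.numel())  # (8 + 1) * 8 = 72 and 8 + 1 = 9
```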
It is recommended to merge main, and use
I actually think PR #12222 has a better implementation of this optimization. Could you please help review, @houseroad?
Kudos to @jianyuh, who introduced CUDA graph support for DeepSeek v3. Overall throughput almost doubled in our testing.