[Performance] DeepSeek V3.2 multi-stream indexer overlap #35968
haosdent wants to merge 2 commits into vllm-project:main
Conversation
Code Review
This pull request introduces a performance optimization for the DeepSeek V3.2 Indexer by using a secondary CUDA stream to overlap computations. The overall approach is sound and correctly implemented. I've suggested one improvement to further enhance parallelism and better align with the stated goal of the PR, which should lead to better performance.
Note: Security Review did not run due to the size of the PR.
I prefer the TRTLLM style of `maybe_execute_in_parallel`, if you think it's feasible to implement something similar here. Having to maintain two code paths for single-stream and multi-stream is bound to cause issues and duplicate work. See: https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/modules/attention.py#L1393
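For reference, a minimal sketch of what such a helper might look like (hypothetical standalone code, not vLLM's or TensorRT-LLM's actual implementation; the sequential fallback keeps a single code path at the call site when no auxiliary stream is available):

```python
import torch


def maybe_execute_in_parallel(fn_a, fn_b, event_a, event_b, aux_stream=None):
    """Run fn_a on the current stream and fn_b on aux_stream, if one is given.

    Hypothetical sketch modeled on TensorRT-LLM's helper: with no aux stream
    (e.g. CUDA unavailable) both callables run sequentially, so call sites
    keep one code path instead of separate single-/multi-stream branches.
    """
    if aux_stream is None:
        return fn_a(), fn_b()
    # Let the aux stream wait until the main stream has produced the inputs.
    event_a.record(torch.cuda.current_stream())
    with torch.cuda.stream(aux_stream):
        aux_stream.wait_event(event_a)
        out_b = fn_b()
        event_b.record(aux_stream)
    out_a = fn_a()  # overlaps with fn_b on the aux stream
    # The main stream must not consume out_b before the aux stream finishes.
    torch.cuda.current_stream().wait_event(event_b)
    return out_a, out_b
```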
There are concerns that multi-stream done in this naive way will break torch.compile, and that custom ops would be required to avoid breaking the graph. Have you observed this? Do the decodes still run in a single full graph with compilation active?
There's a similar implementation for dual-stream hiding, but it seems to need a fake op to bypass a torch.compile error, or else the dual stream won't show up in an actual run. Does this implementation have that issue? Also, related discussion here: #32828 (comment). (I also vote for TRTLLM's `maybe_execute_in_parallel`.)
Got it, let me research how TensorRT-LLM works. @benchislett @jhaotingc

Yes, we did need to use custom ops, as TensorRT-LLM does. @benchislett
@jhaotingc thanks a lot for your useful references!
Not sure; since `weights_proj` and `wk + norm` are relatively small in our case, I guess the impact is limited. But only a test could prove this.
I get this error when running DSV3.2 NVFP4 on 8xB200 in TP8:

Please include tests for the custom op in addition to testing `maybe_execute_in_parallel`.
I got it working using this fake implementation instead:

On second thought, I changed the function to return `k` only and split outside, to work around the issue.
@benchislett Could you review again when you are available? Thank you in advance. Once you think the direction of this change is correct, I will run a benchmark.
The change seems reasonable. I have no qualms other than those stated in my review of #36795.
Note, though, that this is just one of many opportunities for multi-streaming in the DSV3.2 indexer. Hopefully, once torch.compile + multi-stream becomes standard, we can expand to many more cases.
Force-pushed from 367c11c to 415566d
Thanks @benchislett, I have rebased and pushed.
benchislett left a comment:
A few more changes please; see comments.
Overlap `weights_proj` with `wk + k_norm` in the Indexer forward pass using a secondary CUDA stream. Wrapped as a custom op (`torch.ops.vllm.indexer_weights_and_k_proj`) so stream/event operations are opaque to torch.compile and do not cause graph breaks. The custom op returns `(weights, k)`, both contiguous tensors. The `torch.split` to produce `k_pe`/`k_nope` happens outside the op boundary, where torch.compile traces it natively with correct strides.

Signed-off-by: haosdent <haosdent@gmail.com>
Co-authored-by: Xin Yang <xyangx@amazon.com>
Co-authored-by: Ben Chislett <chislett.ben@gmail.com>
Thanks @benchislett, I have fixed the comments, except one that needs clarification.
Do we have any numbers on the performance benefits?
```python
def _indexer_weights_and_k_proj_impl(
    hidden_states: torch.Tensor,
    layer_name: str,
```
Is there a way to avoid passing the `layer_name` as a string? This will regress cold compile times (something like 4x, usually). Alternatively, is it possible for this to wait until after vLLM upgrades to PyTorch 2.11? (next Monday/Tuesday, probably)
To be clear, I need to ship a quick PR to vLLM after the PyTorch 2.11 update, and then we should be able to use something like the `layer_name` as an input to the custom operator.
Thanks @zou3519, then I will rebase after your PR is merged.
@LucasWilkinson Sorry that I haven't finished the benchmark yet. I ran into issues finding GPUs and setting up the environment recently. I will post the results once it finishes.
We will need to see benchmark numbers before moving forward. I am also exploring an optimization which fuses `weights_proj` and the `wk` projection, which involves upcasting the `wk` matrix to FP32. We will need to compare these to see which one gives more speedup.
Purpose
Overlap `weights_proj` with `wk + k_norm` in the DeepSeek V3.2 `Indexer` forward pass using a secondary CUDA stream. The `weights_proj` GEMM is small (hidden_size → n_head, i.e. 7168 → 64) and underutilizes GPU SMs, so it can run concurrently with `wk + k_norm` on the auxiliary stream, removing it from the critical path.

torch.compile compatibility

The dual-stream execution is wrapped in a custom op (`torch.ops.vllm.indexer_weights_and_k_proj`) registered via `direct_register_custom_op` in `deepseek_v2.py`, following the same pattern as `gdn_in_proj` in PR #36795. This makes stream/event operations opaque to torch.compile, preventing graph breaks.

The custom op returns only `(weights, k)`, both contiguous tensors. The `torch.split` that produces `k_pe`/`k_nope` happens outside the op boundary, in `Indexer.forward()`, where torch.compile traces it natively with correct strides. This avoids stride-mismatch issues where non-contiguous `torch.split` views would leak across the op boundary.

Uses the global `aux_stream()` singleton (from `vllm.utils.torch_utils`) and `maybe_execute_in_parallel` (from `vllm.utils.multi_stream_utils`), consistent with existing vLLM conventions.

Closes #35226
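The shape of the resulting call site can be sketched as follows (illustrative names only; the norm is omitted and plain matmuls stand in for the projection layers):

```python
import torch


def indexer_forward_sketch(hidden_states, w_weights, w_k, pe_dim):
    # Inside the custom-op boundary: weights_proj and wk (+ k_norm, omitted
    # here) would run on two CUDA streams; both outputs come back contiguous.
    weights = hidden_states @ w_weights.t()  # small GEMM, hidden -> n_head
    k = (hidden_states @ w_k.t()).contiguous()
    # torch.split stays OUTSIDE the op boundary, so torch.compile traces the
    # (non-contiguous) k_pe/k_nope views natively with correct strides.
    nope_dim = k.shape[-1] - pe_dim
    k_pe, k_nope = torch.split(k, [pe_dim, nope_dim], dim=-1)
    return weights, k_pe, k_nope
```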
Test Plan

- `tests/utils_/test_indexer_dual_stream.py` validating output shapes and contiguous strides across multiple dimension combinations

Test Result