[Performance] Remove redundant clone() calls in cutlass_mla #24891
Conversation
Code Review
This pull request removes two redundant .clone() calls on the q_nope and q_pe tensors within the _sm100_forward_decode function. This is a good performance optimization as it avoids unnecessary data copies. The change is correct because the underlying sm100_cutlass_mla_decode custom operation can handle non-contiguous tensors by using their strides, as long as the innermost dimension is contiguous, which is the case for the tensors here. The optimization is safely scoped to the newer sm100 execution path, leaving the legacy implementation untouched. The change improves performance and is safe to merge.
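To make the stride argument concrete, here is a minimal PyTorch sketch (shapes and split sizes are illustrative assumptions, not taken from the PR) of views whose innermost dimension remains contiguous:

```python
import torch

# Hypothetical MLA-style query tensor: [batch, heads, head_dim].
q = torch.randn(32, 128, 576)
# Split into the "nope" and rope parts; these are strided views, not copies.
q_nope, q_pe = q.split([512, 64], dim=-1)

# The views are non-contiguous overall...
assert not q_nope.is_contiguous() and not q_pe.is_contiguous()
# ...but their innermost dimension is dense (stride 1), which is the
# condition the kernel needs to read them directly via strides.
assert q_nope.stride(-1) == 1 and q_pe.stride(-1) == 1
```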
mgoin left a comment
Nice find
Force-pushed from 80280fd to 9abae15
Removed the contiguous() call in _sm100_cutlass_mla_decode(); this gains an additional 0.8%, for a total improvement of 2.4% (TPOT 18.7ms vs 19.15ms).
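For context, a minimal sketch (assuming standard PyTorch semantics; shapes are illustrative) of the copy that a contiguous() call on a strided view incurs, which is the overhead removed here:

```python
import torch

x = torch.randn(32, 128, 576)
view = x[..., :512]          # strided view into x; stride(-1) == 1

copied = view.contiguous()   # materializes a fresh, densely packed copy
assert copied.data_ptr() != view.data_ptr()

# Passing `view` straight to a stride-aware kernel skips this allocation
# and copy entirely.
```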
Signed-off-by: Alexander Matveev <[email protected]>
…mla decode Signed-off-by: Alexander Matveev <[email protected]>
Force-pushed from 9abae15 to 54e6cd5
```python
# Extract the subsets of the outputs
returned_lse = lse[:, :H].contiguous() if self.need_to_return_lse_for_decode else lse
out = out[:, :H]
```
I understand putting this in a conditional, but why can we remove the contiguous() for out if we can't for lse?
Most likely lse works as well; I was just staying on the safe side, since I don't know how to test it.
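A small PyTorch sketch (sizes are assumed for illustration) of the slicing in question: both subsets keep stride(-1) == 1, so the remaining question is only whether the consumers of lse accept a strided tensor:

```python
import torch

B, H_padded, H, D = 32, 128, 64, 512    # hypothetical sizes
out = torch.randn(B, H_padded, D)
lse = torch.randn(B, H_padded)

out_sub = out[:, :H]   # non-contiguous view, innermost dim still dense
lse_sub = lse[:, :H]   # likewise a view with stride(-1) == 1

assert out_sub.stride(-1) == 1 and lse_sub.stride(-1) == 1
assert not out_sub.is_contiguous() and not lse_sub.is_contiguous()
# contiguous() is kept for lse as the safe default until its consumers
# are verified to handle strided input.
```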
Signed-off-by: bbartels <[email protected]>
[gpt-oss] Add IncompleteDetails to ResponsesRepsonse (vllm-project#24561) Signed-off-by: Andrew Xia <[email protected]>
[gpt-oss][1a] create_responses stream outputs BaseModel type, api server is SSE still (vllm-project#24759) Signed-off-by: Andrew Xia <[email protected]>
[Performance] Remove redundant clone() calls in cutlass_mla (vllm-project#24891)
[Bug] Fix Cutlass Scaled MM Compilation Error (vllm-project#24887) Signed-off-by: yewentao256 <[email protected]>
[ci] fix wheel names for arm wheels (vllm-project#24898) Signed-off-by: simon-mo <[email protected]>
[Tests] fix initialization of kv hash in tests (vllm-project#24273) Signed-off-by: Mickael Seznec <[email protected]>
[Compile] Fix noop_elimination pass and add tests for noop_elimination (vllm-project#24880) Signed-off-by: zjy0516 <[email protected]>
Propagate entire tokens to connector for resumed preemptions Signed-off-by: Qier Li <[email protected]>
Fix pre-commit Signed-off-by: Qier Li <[email protected]>
Rename field and nullify empty lists Signed-off-by: Qier Li <[email protected]>
Update vllm/v1/core/sched/scheduler.py Co-authored-by: Nick Hill <[email protected]> Signed-off-by: Qier Li <[email protected]>
Add unit test for preemption resumption Signed-off-by: Qier Li <[email protected]>
…ject#24891) Signed-off-by: xuebwang-amd <[email protected]>
This PR removes 2 redundant clone() calls in the pre-attention cutlass MLA Python code (found during the profiling work). It is step 1 ("Remove unnecessary copies from Cutlass MLA") of the meta fusion issue #24629. For DeepSeek-R1 on 8xB200 GPUs with batch size 32, this improves decode performance by 2.4%, from 19.15ms TPOT to 18.7ms.
Verified that correctness is preserved via a manual check and also via lm_eval on GSM8K.
Command used:

```
lm_eval --model vllm --model_args pretrained=deepseek-ai/DeepSeek-R1-0528,tensor_parallel_size=8 --tasks gsm8k --num_fewshot 5 --batch_size auto
```