[Performance] Remove redundant clone() calls in cutlass_mla #24891
Conversation
Code Review
This pull request removes two redundant .clone() calls on the q_nope and q_pe tensors within the _sm100_forward_decode function. This is a good performance optimization as it avoids unnecessary data copies. The change is correct because the underlying sm100_cutlass_mla_decode custom operation can handle non-contiguous tensors by using their strides, as long as the innermost dimension is contiguous, which is the case for the tensors here. The optimization is safely scoped to the newer sm100 execution path, leaving the legacy implementation untouched. The change improves performance and is safe to merge.
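To make the stride argument concrete, here is a minimal PyTorch sketch (shapes and split sizes are illustrative assumptions, not taken from the PR) of views whose innermost dimension remains contiguous:

```python
import torch

# Hypothetical MLA-style query tensor: [batch, heads, head_dim].
q = torch.randn(32, 128, 576)
# Split into the "nope" and rope parts; these are strided views, not copies.
q_nope, q_pe = q.split([512, 64], dim=-1)

# The views are non-contiguous overall...
assert not q_nope.is_contiguous() and not q_pe.is_contiguous()
# ...but their innermost dimension is dense (stride 1), which is the
# condition the kernel needs to read them directly via strides.
assert q_nope.stride(-1) == 1 and q_pe.stride(-1) == 1
```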
mgoin left a comment
Nice find
Force-pushed from 80280fd to 9abae15
Removed the contiguous() call in _sm100_cutlass_mla_decode(); this gains an additional 0.8%, for a total improvement of 2.4% (TPOT 18.7ms vs 19.15ms).
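For context, a minimal sketch (assuming standard PyTorch semantics; shapes are illustrative) of the copy that a contiguous() call on a strided view incurs, which is the overhead removed here:

```python
import torch

x = torch.randn(32, 128, 576)
view = x[..., :512]          # strided view into x; stride(-1) == 1

copied = view.contiguous()   # materializes a fresh, densely packed copy
assert copied.data_ptr() != view.data_ptr()

# Passing `view` straight to a stride-aware kernel skips this allocation
# and copy entirely.
```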
Signed-off-by: Alexander Matveev <[email protected]>
…mla decode Signed-off-by: Alexander Matveev <[email protected]>
Force-pushed from 9abae15 to 54e6cd5
```python
# Extract the subsets of the outputs
returned_lse = lse[:, :H].contiguous() if self.need_to_return_lse_for_decode else lse
out = out[:, :H]
```
I understand putting this in a conditional, but why can we remove the contiguous() for out if we can't for lse?
Most likely lse works as well; I was just staying on the safe side, since I don't know how to test it.
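A small PyTorch sketch (sizes are assumed for illustration) of the slicing in question: both subsets keep stride(-1) == 1, so the remaining question is only whether the consumers of lse accept a strided tensor:

```python
import torch

B, H_padded, H, D = 32, 128, 64, 512    # hypothetical sizes
out = torch.randn(B, H_padded, D)
lse = torch.randn(B, H_padded)

out_sub = out[:, :H]   # non-contiguous view, innermost dim still dense
lse_sub = lse[:, :H]   # likewise a view with stride(-1) == 1

assert out_sub.stride(-1) == 1 and lse_sub.stride(-1) == 1
assert not out_sub.is_contiguous() and not lse_sub.is_contiguous()
# contiguous() is kept for lse as the safe default until its consumers
# are verified to handle strided input.
```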
Signed-off-by: bbartels <[email protected]>
[gpt-oss] Add IncompleteDetails to ResponsesRepsonse (vllm-project#24561) Signed-off-by: Andrew Xia <[email protected]>
[gpt-oss][1a] create_responses stream outputs BaseModel type, api server is SSE still (vllm-project#24759) Signed-off-by: Andrew Xia <[email protected]>
[Performance] Remove redundant clone() calls in cutlass_mla (vllm-project#24891)
[Bug] Fix Cutlass Scaled MM Compilation Error (vllm-project#24887) Signed-off-by: yewentao256 <[email protected]>
[ci] fix wheel names for arm wheels (vllm-project#24898) Signed-off-by: simon-mo <[email protected]>
[Tests] fix initialization of kv hash in tests (vllm-project#24273) Signed-off-by: Mickael Seznec <[email protected]>
[Compile] Fix noop_elimination pass and add tests for noop_elimination (vllm-project#24880) Signed-off-by: zjy0516 <[email protected]>
Propagate entire tokens to connector for resumed preemptions Signed-off-by: Qier Li <[email protected]>
Fix pre-commit Signed-off-by: Qier Li <[email protected]>
Rename field and nullify empty lists Signed-off-by: Qier Li <[email protected]>
Update vllm/v1/core/sched/scheduler.py Co-authored-by: Nick Hill <[email protected]> Signed-off-by: Qier Li <[email protected]>
Add unit test for preemption resumption Signed-off-by: Qier Li <[email protected]>
…ject#24891) Signed-off-by: xuebwang-amd <[email protected]>
This PR removes 2 redundant clone() calls in the pre-attention cutlass MLA Python code (found during the profiling work). It is step 1 ("Remove unnecessary copies from Cutlass MLA") of the meta fusion issue #24629. For DeepSeek-R1 on 8xB200 GPUs with batch size 32, this improves decode performance by 2.4%, from 19.15ms TPOT to 18.7ms.
Verified that correctness is preserved via a manual check and also via lm_eval on GSM8K.
Command used:

```
lm_eval --model vllm --model_args pretrained=deepseek-ai/DeepSeek-R1-0528,tensor_parallel_size=8 --tasks gsm8k --num_fewshot 5 --batch_size auto
```