
[Performance] DeepSeek V3.2 multi-stream indexer overlap#35968

Open
haosdent wants to merge 2 commits into vllm-project:main from haosdent:fix-35226

Conversation

@haosdent
Contributor

@haosdent haosdent commented Mar 4, 2026

Purpose

Overlap weights_proj with wk + k_norm in the DeepSeek V3.2 Indexer forward pass using a secondary CUDA stream. The weights_proj GEMM is small (hidden_size → n_head, i.e. 7168→64) and underutilizes GPU SMs, so it can run concurrently with wk + k_norm on the auxiliary stream, removing them from the critical path.
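As a back-of-envelope illustration of the underutilization (the token count and tile size below are illustrative assumptions, not measured values):

```python
# Illustrative tile count for the weights_proj GEMM (7168 -> 64).
hidden_size, n_head = 7168, 64     # weights_proj: hidden_size -> n_head
num_tokens = 256                   # hypothetical decode batch
tile = 128                         # typical GEMM output tile


def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


# Independent output tiles (roughly one CTA, i.e. one SM, each):
tiles = ceil_div(num_tokens, tile) * ceil_div(n_head, tile)
# Only 2 tiles of work for a GPU with 100+ SMs: most of the chip idles
# while weights_proj runs alone, so overlapping it with wk + k_norm on
# an auxiliary stream is nearly free.
```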

torch.compile compatibility

The dual-stream execution is wrapped in a custom op (torch.ops.vllm.indexer_weights_and_k_proj) registered via direct_register_custom_op in deepseek_v2.py, following the same pattern as gdn_in_proj in PR #36795. This makes stream/event operations opaque to torch.compile, preventing graph breaks.

The custom op returns only (weights, k) — both contiguous tensors. The torch.split that produces k_pe/k_nope happens outside the op boundary in Indexer.forward(), where torch.compile traces it natively with correct strides. This avoids stride mismatch issues where non-contiguous torch.split views would leak across the op boundary.

Uses the global aux_stream() singleton (from vllm.utils.torch_utils) and maybe_execute_in_parallel (from vllm.utils.multi_stream_utils), consistent with existing vLLM conventions.
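A minimal sketch of the dual-stream pattern (the helper name and signature here are hypothetical; vLLM's actual maybe_execute_in_parallel in vllm.utils.multi_stream_utils differs in detail):

```python
from typing import Callable, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def run_in_parallel(
    fn_main: Callable[[], T],
    fn_aux: Callable[[], U],
    aux_stream=None,
) -> tuple:
    """Run fn_aux on a secondary CUDA stream so it overlaps fn_main on the
    current stream; fall back to plain sequential execution when no aux
    stream is given (CPU, or multi-stream disabled)."""
    if aux_stream is None:
        # Single-stream fallback: plain sequential execution.
        return fn_main(), fn_aux()

    import torch  # only needed on the CUDA path

    # The aux stream must wait until the inputs produced on the current
    # stream are ready before launching its work.
    ready = torch.cuda.Event()
    ready.record(torch.cuda.current_stream())
    with torch.cuda.stream(aux_stream):
        aux_stream.wait_event(ready)
        out_aux = fn_aux()       # e.g. the small weights_proj GEMM
    out_main = fn_main()         # e.g. wk + k_norm on the main stream
    # The current stream must not consume aux results before they exist.
    torch.cuda.current_stream().wait_stream(aux_stream)
    return out_main, out_aux
```

Wrapping this whole sequence inside a custom op is what keeps the stream and event calls opaque to torch.compile.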

Closes #35226

Test Plan

  • FakeTensorMode tests for the custom op in tests/utils_/test_indexer_dual_stream.py validating output shapes and contiguous strides across multiple dimension combinations

Test Result

tests/utils_/test_indexer_dual_stream.py::TestIndexerWeightsAndKProjOp::test_fake_output_shapes_and_strides PASSED
tests/utils_/test_indexer_dual_stream.py::TestIndexerWeightsAndKProjOp::test_fake_output_shapes_parametrized[1-64-128] PASSED
tests/utils_/test_indexer_dual_stream.py::TestIndexerWeightsAndKProjOp::test_fake_output_shapes_parametrized[16-64-128] PASSED
tests/utils_/test_indexer_dual_stream.py::TestIndexerWeightsAndKProjOp::test_fake_output_shapes_parametrized[128-64-128] PASSED
tests/utils_/test_indexer_dual_stream.py::TestIndexerWeightsAndKProjOp::test_fake_output_shapes_parametrized[256-32-64] PASSED

======================== 5 passed, 6 warnings in 0.82s =========================

@mergify mergify bot added the deepseek Related to DeepSeek models label Mar 4, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a performance optimization for the DeepSeek V3.2 Indexer by using a secondary CUDA stream to overlap computations. The overall approach is sound and correctly implemented. I've suggested one improvement to further enhance parallelism and better align with the stated goal of the PR, which should lead to better performance.

Note: Security Review did not run due to the size of the PR.

@benchislett
Collaborator

I prefer the TRTLLM style of maybe_execute_in_parallel, if you think it's feasible to implement something similar here. Having to maintain two code paths for single-stream and multi-stream is bound to cause issues and duplicate work. See: https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/modules/attention.py#L1393

@benchislett
Collaborator

There are concerns that multi-stream in this naive way will break torch.compile and custom-ops would be required to avoid breaking the graph. Have you observed this? Do the decodes still run in a single full graph with compilation active?

@jhaotingc
Contributor

jhaotingc commented Mar 5, 2026

#33505

There's a similar implementation of dual-stream hiding there, but it seems to need a fake op to bypass the torch.compile error, or else the dual stream won't show up in the actual run.

Does this implementation have this issue?

Also, relative discussion here: #32828 (comment)
I guess torch.compile optimizes some GEMMs aggressively, but if they're inside a custom op that bypasses torch.compile, they'll run slower.

(I also vote for TRTLLM's maybe_execute_in_parallel)

@haosdent
Contributor Author

haosdent commented Mar 5, 2026

Got it, let me research how TensorRT-LLM works @benchislett @jhaotingc

@haosdent
Contributor Author

haosdent commented Mar 5, 2026

There are concerns that multi-stream in this naive way will break torch.compile and custom-ops would be required to avoid breaking the graph. Have you observed this? Do the decodes still run in a single full graph with compilation active?

Yes, we did need to use custom ops, the same way TensorRT-LLM does. @benchislett

@haosdent
Contributor Author

haosdent commented Mar 5, 2026

@jhaotingc thanks a lot for your useful references!

Does this implementation have this issue?

Not sure. Because weights_proj and wk + k_norm are relatively small in our case, I guess the impact is limited, but only a test could prove this.

@benchislett
Collaborator

I get this error when running DSV3.2 NVFP4 on 8xB200 in TP8:

(Worker pid=3157294) (Worker_TP5 pid=3157294) ERROR 03-11 00:34:25 [multiproc_executor.py:932]   File "/tmp/torchinductor_root/pe/cpet3wrv5gxzezciaqngn3apgvewozoe67jpmwcakyztnyeuaduh.py", line 1699, in call
(Worker pid=3157294) (Worker_TP5 pid=3157294) ERROR 03-11 00:34:25 [multiproc_executor.py:932]     assert_size_stride(buf10, (s72, 64), (64, 1), 'torch.ops.vllm.indexer_dual_stream.default')
(Worker pid=3157294) (Worker_TP5 pid=3157294) ERROR 03-11 00:34:25 [multiproc_executor.py:932] AssertionError: expected size 16384==16384, stride 128==64 at dim=0
(Worker pid=3157294) (Worker_TP5 pid=3157294) ERROR 03-11 00:34:25 [multiproc_executor.py:932] Error in op: torch.ops.vllm.indexer_dual_stream.default
(Worker pid=3157294) (Worker_TP5 pid=3157294) ERROR 03-11 00:34:25 [multiproc_executor.py:932] This error most often comes from a incorrect fake (aka meta) kernel for a custom op.
(Worker pid=3157294) (Worker_TP5 pid=3157294) ERROR 03-11 00:34:25 [multiproc_executor.py:932] Use torch.library.opcheck to test your custom op.

Please include tests for the custom op in addition to testing the maybe_execute_in_parallel helper.

@benchislett
Collaborator

I got it working using this fake implementation instead:

def _indexer_dual_stream_fake(
    hidden_states: torch.Tensor,
    layer_name: str,
    n_head: int,
    head_dim: int,
    rope_dim: int,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
    """Fake implementation for torch.compile shape inference."""
    num_tokens = hidden_states.shape[0]
    dtype = hidden_states.dtype
    device = hidden_states.device
    
    # weights: contiguous, shape (N, 64) -> stride (64, 1)
    weights = torch.empty_strided(
        (num_tokens, n_head), (n_head, 1), dtype=dtype, device=device
    )
    
    # k: contiguous, shape (N, 128) -> stride (128, 1)
    k_stride = (head_dim, 1)
    k = torch.empty_strided(
        (num_tokens, head_dim), k_stride, dtype=dtype, device=device
    )
    
    # k_pe and k_nope: shape (N, 64), but inherit k's stride -> (128, 1)
    k_pe = torch.empty_strided(
        (num_tokens, rope_dim), k_stride, dtype=dtype, device=device
    )
    k_nope = torch.empty_strided(
        (num_tokens, head_dim - rope_dim), k_stride, dtype=dtype, device=device
    )
    
    return weights, k, k_pe, k_nope
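The stride mismatch in the error above can be reproduced without torch: a fresh (N, 64) allocation is C-contiguous with stride (64, 1), while a (N, 64) view split out of a (N, 128) tensor inherits the parent's row stride of 128. A small framework-free sketch (the contiguous_strides helper is hypothetical, for illustration):

```python
def contiguous_strides(shape):
    """Row-major (C-contiguous) strides, in elements, for a given shape."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return tuple(strides)


# What Inductor's assert_size_stride expects for a fresh (16384, 64) tensor:
expected = contiguous_strides((16384, 64))    # (64, 1)

# What a (16384, 64) view split out of a (16384, 128) parent actually has:
parent = contiguous_strides((16384, 128))     # (128, 1)
actual = (parent[0], 1)                       # the view keeps row stride 128

# This is the "stride 128==64 at dim=0" assertion: the fake kernel promised
# contiguous outputs, but the real op returned non-contiguous split views.
```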

@haosdent
Contributor Author

(Worker pid=3157294) (Worker_TP5 pid=3157294) ERROR 03-11 00:34:25 [multiproc_executor.py:932] AssertionError: expected size 16384==16384, stride 128==64 at dim=0
I got it working using this fake implementation instead:

On second thought, I changed the function to return k only and do the split outside the op, to work around the issue.

@haosdent
Contributor Author

haosdent commented Mar 16, 2026

@benchislett Could you help review again when you're available? Thank you in advance.

Then I can run a benchmark once you think the direction of this change is correct.

@haosdent haosdent marked this pull request as ready for review March 16, 2026 10:26
@haosdent haosdent changed the title [WIP] [Performance] DeepSeek V3.2 multi-stream indexer overlap [Performance] DeepSeek V3.2 multi-stream indexer overlap Mar 16, 2026
@benchislett
Collaborator

The change seems reasonable. I have no qualms other than those stated in my review of #36795.

@benchislett
Collaborator

Note, though, that this is just one of many opportunities for multi-streaming in the DSV3.2 indexer. Hopefully, once torch.compile + multi-stream becomes standard, we can expand to many more cases.

@benchislett
Collaborator

@haosdent maybe_execute_in_parallel has been added in #36795 which just merged. Please update and resolve the conflict (use the existing implementation)

@haosdent haosdent force-pushed the fix-35226 branch 2 times, most recently from 367c11c to 415566d Compare March 19, 2026 04:12
@haosdent
Copy link
Contributor Author

Thanks @benchislett, I have rebased and pushed.

Collaborator

@benchislett benchislett left a comment


A few more changes please; see comments.

Overlap `weights_proj` with `wk + k_norm` in the Indexer forward pass
using a secondary CUDA stream. Wrapped as a custom op
(`torch.ops.vllm.indexer_weights_and_k_proj`) so stream/event
operations are opaque to torch.compile and do not cause graph breaks.

The custom op returns `(weights, k)` — both contiguous tensors.
`torch.split` to produce `k_pe`/`k_nope` happens outside the op
boundary where torch.compile traces it natively with correct strides.

Signed-off-by: haosdent <haosdent@gmail.com>
Co-authored-by: Xin Yang <xyangx@amazon.com>
Co-authored-by: Ben Chislett <chislett.ben@gmail.com>
@haosdent
Contributor Author

Thanks @benchislett, I have addressed the comments, except one that needs clarification.

@benchislett benchislett added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 23, 2026
@LucasWilkinson
Collaborator

Do we have any numbers on the performance benefits?


def _indexer_weights_and_k_proj_impl(
    hidden_states: torch.Tensor,
    layer_name: str,
Collaborator


Is there a way to avoid passing the layer_name as a string? This will regress cold compile times (something like 4x usually). Alternatively, is it possible for this to wait until after vLLM upgrades to PyTorch 2.11? (next Monday/Tuesday probably)

Collaborator


To be clear, I need to ship a quick PR to vLLM after the PyTorch 2.11 update, and then we should be able to use something like the layer_name as an input to the custom operator.

Contributor Author


Thanks @zou3519, then I'll rebase after your PR is merged.

Collaborator


This is the PR, btw: #38123

@haosdent
Contributor Author

Do we have any numbers on the performance benefits?

@LucasWilkinson Sorry, I haven't finished the benchmark yet. I ran into issues finding GPUs and setting up the environment recently. I'll post the results once it finishes.

@benchislett
Collaborator

We will need to see benchmark numbers before moving forward.

I am also exploring an optimization which fuses the weights_proj and the wk projection, which involves upcasting the wk matrix to FP32. We will need to compare these to see which one gives more speedup


Labels

deepseek Related to DeepSeek models ready ONLY add when PR is ready to merge/full CI is needed


Development

Successfully merging this pull request may close these issues.

[Performance]: DeepSeek 3.2 Multi-stream indexer

6 participants