Integrate DeepGeMM MegaMoE #40843
Conversation
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: yasong.wang <yasong.wang@inferact.ai>
Signed-off-by: Zhewen Li <zhewenli@inferact.ai>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Documentation preview: https://vllm--40843.org.readthedocs.build/en/40843/
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request adds support for the Deepseek V4 model architecture, including a horizontally-fused MLA kernel, an MTP draft model for speculative decoding, and a custom tokenizer. It introduces optimized CUDA and Triton kernels for softplus-sqrt Top-K gating and MHC blocks. Feedback centers on performance and memory safety: the reviewer recommends replacing frequent GPU tensor allocations in metadata builders and fused kernels with pre-allocated buffers to support CUDA graph capture and reduce overhead. Additionally, the reviewer identified missing record_stream calls for tensors executed on auxiliary streams and advised making hardcoded sccache settings in the Dockerfile configurable.
```
    && export SCCACHE_BUCKET=inferact-sccache \
    && export SCCACHE_REGION=us-west-2 \
    && export SCCACHE_S3_NO_CREDENTIALS=0 \
```
```python
        lambda: self.indexer(
            hidden_states, qr, positions, self.indexer_rotary_emb
        ),
        kv_insert_and_compress,
        self.ln_events[0],
        self.ln_events[1],
        self.aux_stream,
    )
elif self.compressor is not None:
```
In attention_impl, several tensors are used on an auxiliary stream via maybe_execute_in_parallel without being recorded on that stream. To ensure memory safety and prevent the caching allocator from reusing memory while the auxiliary stream is still active, you must call record_stream(self.aux_stream) on hidden_states, qr, and positions before the parallel execution block.
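A minimal sketch of the requested fix, assuming the auxiliary stream is a plain `torch.cuda.Stream`; the `run_indexer_on_aux_stream` wrapper is illustrative and does not reproduce the PR's `maybe_execute_in_parallel` helper:

```python
import torch

# Record every tensor the auxiliary stream will read before launching work
# on it, so the caching allocator cannot recycle that memory while the
# stream is still running. Tensor and stream names mirror the snippet above.
def run_indexer_on_aux_stream(
    hidden_states: torch.Tensor,
    qr: torch.Tensor,
    positions: torch.Tensor,
    indexer,
    rotary_emb,
    aux_stream: torch.cuda.Stream,
) -> torch.Tensor:
    # Mark each input as in use on aux_stream; when these tensors are later
    # freed, the allocator will wait for aux_stream to pass this point.
    hidden_states.record_stream(aux_stream)
    qr.record_stream(aux_stream)
    positions.record_stream(aux_stream)

    # Order aux_stream after the producing (current) stream, then run there.
    aux_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(aux_stream):
        out = indexer(hidden_states, qr, positions, rotary_emb)
    return out  # caller must sync with aux_stream before consuming `out`
```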
```python
    # SWA-only layer: no compressor, no overlap.
    self._fused_qnorm_rope_kv_insert(q, kv, positions, attn_metadata)
```
```python
    # Handle dummy run (no metadata).
    if not isinstance(attn_metadata, dict):
        # Reserve _forward_prefill's bf16-gather workspace; the dummy
        # run returns before mla_attn runs, so without this the shared
        # workspace locks below the real prefill size.
        sub = self.mla_attn
```
```python
global_indices, topk_lens = compute_global_topk_indices_and_lens(
    self.topk_indices_buffer[:num_decode_tokens],
    swa_metadata.token_to_req_indices,
    attn_metadata.block_table[:num_decodes],
    block_size,
    is_valid,
)
```
```python
combined_indices, combined_lens = combine_topk_swa_indices(
    topk_indices[query_start:query_end],
    query_start_loc[
        num_decodes + chunk_start : num_decodes + chunk_end + 1
    ],
    seq_lens[chunk_start:chunk_end],
    gather_lens[chunk_start:chunk_end],
    self.window_size,
    self.compress_ratio,
    top_k,
    M,
    N,
)
```
```python
post_mix = torch.empty(
    num_tokens,
    hc_mult,
    dtype=torch.float32,
    device=residual.device,
)
comb_mix = torch.empty(
    num_tokens,
    hc_mult2,
    dtype=torch.float32,
    device=residual.device,
)
layer_input = torch.empty(
    num_tokens,
    hidden_size,
    dtype=torch.bfloat16,
    device=residual.device,
)

gemm_out_mul = torch.empty(
    n_splits,
    num_tokens,
    hc_mult3,
    dtype=torch.float32,
    device=residual.device,
)
gemm_out_sqrsum = torch.empty(
    n_splits,
    num_tokens,
    dtype=torch.float32,
    device=residual.device,
)
```
The mhc_pre function performs multiple GPU tensor allocations (post_mix, comb_mix, layer_input, gemm_out_mul, gemm_out_sqrsum) on every forward pass. These should be moved to the layer's initialization or managed via a persistent workspace to avoid allocation overhead and support CUDA graph capture.
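A minimal sketch of the suggested pattern, assuming an engine-level `max_num_tokens` bound is known at init time; the `MHCPreWorkspace` class and its `view` helper are hypothetical names, not the PR's actual API:

```python
import torch

# Allocate the buffers once at layer construction, sized to the worst case,
# and hand out per-step slices. Slices are views into stable device memory,
# so nothing is allocated inside a CUDA-graph-captured region.
class MHCPreWorkspace:
    def __init__(self, max_num_tokens: int, hc_mult: int, hc_mult2: int,
                 hidden_size: int, device: torch.device):
        self.post_mix = torch.empty(
            max_num_tokens, hc_mult, dtype=torch.float32, device=device)
        self.comb_mix = torch.empty(
            max_num_tokens, hc_mult2, dtype=torch.float32, device=device)
        self.layer_input = torch.empty(
            max_num_tokens, hidden_size, dtype=torch.bfloat16, device=device)

    def view(self, num_tokens: int):
        # Views of the persistent buffers; device addresses never change
        # across steps, which is what CUDA graph replay requires.
        return (
            self.post_mix[:num_tokens],
            self.comb_mix[:num_tokens],
            self.layer_input[:num_tokens],
        )
```

The GEMM-split buffers (`gemm_out_mul`, `gemm_out_sqrsum`) would follow the same pattern with an extra leading `n_splits` dimension.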
```python
output_scale_packed = torch.zeros(
    (num_packed_groups, tma_aligned_M),
    dtype=torch.int32,
    device=input.device,
).T[:M, :]
```
```python
token_to_seq = torch.empty(total_seq_lens, dtype=torch.int32, device=device)

cu_seq_lens = torch.empty(num_reqs + 1, dtype=torch.int32, device=device)
# Assigning to slice avoids cpu sync.
cu_seq_lens[:1] = 0
torch.cumsum(compressed_seq_lens[start_idx:end_idx], dim=0, out=cu_seq_lens[1:])

query_start_loc = (
    query_start_loc[start_idx : end_idx + 1] - query_start_loc[start_idx]
)

total_query_len = int(
    (query_start_loc_cpu[end_idx] - query_start_loc_cpu[start_idx]).item()
)
if query_slice is not None:
    qs_start = query_slice.start
    qs_stop = query_slice.stop
else:
    qs_start = 0
    qs_stop = total_query_len
output_query_len = qs_stop - qs_start

cu_seq_len_ks = torch.empty(output_query_len, dtype=torch.int32, device=device)
cu_seq_len_ke = torch.empty(output_query_len, dtype=torch.int32, device=device)
```
The build_prefill_chunk_metadata function allocates several GPU tensors (token_to_seq, cu_seq_lens, cu_seq_len_ks, cu_seq_len_ke) during the metadata build phase. In the V1 architecture, metadata builders should avoid GPU allocations to maintain performance and ensure CUDA graph stability. These should be pre-allocated.
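A hedged sketch of one way to satisfy this, assuming `max_num_tokens` and `max_num_reqs` bounds are available when the builder is constructed; `PrefillChunkBuffers` and its methods are illustrative, not the PR's actual API:

```python
import torch

# Pre-allocate the metadata buffers once in the builder's __init__ and fill
# slices of them on every build call, so the build phase performs no device
# allocations. Buffer names mirror the snippet above.
class PrefillChunkBuffers:
    def __init__(self, max_num_tokens: int, max_num_reqs: int,
                 device: torch.device):
        self.token_to_seq = torch.empty(
            max_num_tokens, dtype=torch.int32, device=device)
        self.cu_seq_lens = torch.empty(
            max_num_reqs + 1, dtype=torch.int32, device=device)
        self.cu_seq_len_ks = torch.empty(
            max_num_tokens, dtype=torch.int32, device=device)
        self.cu_seq_len_ke = torch.empty(
            max_num_tokens, dtype=torch.int32, device=device)

    def build_cu_seq_lens(self, compressed_seq_lens: torch.Tensor,
                          num_reqs: int) -> torch.Tensor:
        cu = self.cu_seq_lens[: num_reqs + 1]
        cu[:1] = 0  # assigning to a slice avoids a CPU sync, as in the diff
        torch.cumsum(compressed_seq_lens, dim=0, out=cu[1:])
        return cu
```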
```python
pfx_gather_lens = torch.empty(
    num_prefills, dtype=torch.int32, device=seq_lens.device
)
```

```python
index_q_fp8 = torch.empty_like(index_q, dtype=torch.float8_e4m3fn)
_fused_indexer_q_rope_quant_kernel[(num_tokens, num_index_q_heads)](
    positions,
    index_q,
    index_q.stride(0),
    index_q.stride(1),
    index_q_cos_sin_cache,
    index_q_cos_sin_cache.stride(0),
    index_q_cos_sin_cache.shape[-1] // 2,
    index_q_fp8,
```
The fused_indexer_q_rope_quant function performs multiple GPU tensor allocations (index_weights_out, index_q_packed, index_q_scale, index_q_fp8) on every call. These allocations should be moved to the layer's initialization or handled via a persistent workspace to avoid overhead and support CUDA graph capture.
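A short sketch of the same persistent-workspace idea applied to the FP8 output buffer, assuming the head count and head dim are fixed at init; `IndexerQuantWorkspace` is a hypothetical name, and the other buffers the comment lists (`index_weights_out`, `index_q_packed`, `index_q_scale`) would follow the same pattern:

```python
import torch

# One persistent FP8 buffer sized to the token bound; per-call slices are
# views, so CUDA graph replay always sees the same device address.
class IndexerQuantWorkspace:
    def __init__(self, max_num_tokens: int, num_index_q_heads: int,
                 head_dim: int, device: torch.device):
        self.index_q_fp8 = torch.empty(
            max_num_tokens, num_index_q_heads, head_dim,
            dtype=torch.float8_e4m3fn, device=device,
        )

    def fp8_out(self, num_tokens: int) -> torch.Tensor:
        return self.index_q_fp8[:num_tokens]
```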
No description provided.