Enable Piecewise CUDA Graph for NemotronH Hybrid (Mamba+Attention) Models#19903

Merged
ispobock merged 22 commits into sgl-project:main from vedantjh2:vjhaveri/fix_nemotron
Mar 12, 2026
Conversation

Contributor

@vedantjh2 vedantjh2 commented Mar 4, 2026

Motivation

Piecewise CUDA graph (PCG) was previously disabled for NemotronH models because the layer detection logic required every layer to use standard GQA attention. NemotronH is a hybrid architecture (4 attention + 24 Mamba + 24 MLP across 52 layers) in which every layer exposes its sublayer via a mixer attribute instead of self_attn, so detection failed with:

Disable piecewise CUDA graph because some layers do not apply Standard GQA

Changes

model_runner.py

  • Added mixer attribute detection for NemotronH-style hybrid models
  • Every layer now gets an entry in attention_layers (None for non-attention layers), enabling PCG to handle sparse attention architectures
  • Relaxed validation: PCG is only disabled when no attention layers are found, rather than when any layer lacks attention
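Roughly, the relaxed detection can be sketched as follows (the function name and exact attribute checks here are illustrative assumptions, not the actual model_runner.py code):

```python
def collect_attention_layers(layers):
    """Illustrative sketch of the relaxed layer detection.

    Every layer gets a slot: attention layers store their attention
    module, everything else (Mamba, MLP) stores None, so that indexing
    by absolute layer_id stays aligned.
    """
    attention_layers = []
    for layer in layers:
        if hasattr(layer, "self_attn"):  # standard GQA layer
            attention_layers.append(layer.self_attn)
        elif hasattr(layer, "mixer"):  # NemotronH-style hybrid layer
            mixer = layer.mixer
            # Treat the mixer as attention only if it exposes an attention
            # interface; Mamba/MLP mixers become None placeholders.
            attention_layers.append(mixer if hasattr(mixer, "attn") else None)
        else:
            attention_layers.append(None)
    # Relaxed validation: bail out only when *no* attention layers exist,
    # instead of requiring every layer to be an attention layer.
    if not any(l is not None for l in attention_layers):
        raise RuntimeError("Disable piecewise CUDA graph: no attention layers found")
    return attention_layers
```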

nemotron_h.py

  • Extracted _forward_mamba() method from NemotronHMambaDecoderLayer.forward()
  • Added nemotron_mamba2_with_output split op (using register_custom_op + register_split_op) to enable graph breaks around Mamba layers during PCG capture
  • Added token slicing in the split op to handle padded CUDA graph buffers (Mamba asserts on exact token counts)
  • Changed Layers from union type (A | B | C | D) to tuple for torch.compile compatibility
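The union-to-tuple change is a small compatibility fix; a minimal illustration (class names are placeholders, not the real NemotronH types):

```python
# Placeholder classes standing in for the real NemotronH layer types.
class AttentionLayer: ...
class MambaLayer: ...
class MLPLayer: ...

# Before: a PEP 604 union object, e.g. AttentionLayer | MambaLayer | MLPLayer.
# isinstance() accepts it at runtime, but torch.compile's tracer handles a
# plain tuple of types more reliably.

# After: a tuple of types, which isinstance() also accepts directly.
LayerTypes = (AttentionLayer, MambaLayer, MLPLayer)

def is_decoder_layer(obj) -> bool:
    # isinstance with a tuple checks membership in any of the listed types.
    return isinstance(obj, LayerTypes)
```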

Benchmark Results

Model: NVIDIA-Nemotron-Nano-9B-v2 (H100, bfloat16)
Benchmark: GSM8K (200 samples)

| Configuration | Throughput (tok/s) | Latency (s) | Accuracy |
| --- | --- | --- | --- |
| Baseline (no PCG) | 1521 | 15.74 | 88.0% |
| With PCG (inductor) | 1681 | 14.29 | 89.0% |
| With PCG (eager) | 1542.8 | 15.60 | 89.0% |
| Improvement vs baseline (inductor) | +10.5% | -9.2% | |
| Improvement vs baseline (eager) | +1.4% | -0.9% | |

Notes

  • PCG provides the best gains when paired with the inductor compiler backend in this setup.
  • PCG eager achieves essentially the same accuracy as PCG+inductor (89.0%), with smaller performance gains.

Launch commands

# Baseline
python -m sglang.launch_server \
  --model-path nvidia/NVIDIA-Nemotron-Nano-9B-v2 \
  --host 0.0.0.0 --port 30000

# With PCG (inductor)
python -m sglang.launch_server \
  --model-path nvidia/NVIDIA-Nemotron-Nano-9B-v2 \
  --host 0.0.0.0 --port 30000 \
  --enable-piecewise-cuda-graph \
  --piecewise-cuda-graph-compiler inductor

# With PCG (eager)
python -m sglang.launch_server \
  --model-path /shared/public/elr-models/nvidia/NVIDIA-Nemotron-Nano-9B-v2 \
  --host 0.0.0.0 --port 30000 \
  --enable-piecewise-cuda-graph \
  --piecewise-cuda-graph-compiler eager


# captured graph size. Slice to actual token count for Mamba forward.
attn_backend = forward_batch.attn_backend
metadata = attn_backend.linear_attn_backend.forward_metadata
num_actual_tokens = metadata.num_prefill_tokens + (
Collaborator

why does only mamba need this special shape handle, can't we know the exact output shape before?

Contributor Author

During CUDA graph replay, `hidden_states` is padded to the captured graph batch size. Attention handles this naturally via the KV cache and masks, but Mamba processes tokens sequentially through conv/SSM states and asserts `num_actual_tokens == projected_states.shape[0]`. The slicing must happen inside the split op (not the caller) because `torch.compile(fullgraph=True)` requires static tensor shapes within each compiled segment; the split op acts as the graph break where runtime metadata is accessible.
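A minimal pure-Python sketch of that idea (the names and the list-based "tensors" are assumptions for illustration; the real split op operates on torch tensors and backend metadata):

```python
def mamba_split_op(hidden_states, num_prefill_tokens, num_decode_tokens, mamba_forward):
    """Sketch: hidden_states may be padded out to the captured CUDA graph
    size. Slice to the actual token count before calling Mamba, then keep
    the padded tail so downstream compiled segments see a static shape.
    """
    num_actual_tokens = num_prefill_tokens + num_decode_tokens
    actual = hidden_states[:num_actual_tokens]  # slice off graph padding
    out = mamba_forward(actual)                 # Mamba sees the exact count
    # Re-attach the padding (torch.cat in the real tensor version).
    return out + hidden_states[num_actual_tokens:]
```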

elif hasattr(layer, "_forward_mamba"):
# Mamba layer with split op support - store the layer itself
attn_layer = layer
# attn_layer is None for non-attention layers (e.g. Mamba, MLP-only)
Collaborator

why do we need to store non-attention layers as None in attention_layers? Could we only store attention layers as previously, but we could insert mamba attention layers into attention_layers for mamba models or change the field name in nemotron_h to make it compatible with existing logic.

Contributor Author

attention_layers is indexed by layer_id in the split ops (e.g., attention_layers[layer_id]). NemotronH's layer_id is the absolute model layer index (0–51), so each position must map correctly. Without None placeholders, layer_id=10 (a Mamba layer) would index into the wrong entry.

For non-hybrid models this is backward-compatible — every entry is an attention layer and no None values appear. Open to alternatives if you have a preferred approach (e.g., using a dict instead of a list).
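A toy example of the alignment concern (the string labels stand in for real layer modules):

```python
# Hypothetical 5-layer hybrid model: attention_layers must be indexable
# by the absolute layer_id, so non-attention positions need placeholders.
pattern = ["mamba", "mamba", "attn", "mamba", "attn"]

# Without placeholders, positions collapse and layer_id indexing breaks:
compact = [p for p in pattern if p == "attn"]
# compact[1] is an attention layer, but there is no way to tell that it
# corresponds to layer_id 4.

# With None placeholders, attention_layers[layer_id] is always correct:
aligned = [p if p == "attn" else None for p in pattern]
```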

Collaborator

I think this is a good design. Should keep it

Collaborator

The only concern here is that this changes the attention_layers capture logic for all other models as well: we now add None for non-attention decoder layers. In theory this should be fine, since all other models should only have attention or linear-attention decoder layers; only nemotron_h has MLP in its decoder layers. We can verify this by checking that all PCG CIs of other models pass.

Collaborator

Agree @zminglei

Collaborator
zminglei commented Mar 5, 2026

/tag-and-rerun-ci

@github-actions github-actions bot added the run-ci label Mar 5, 2026
@vedantjh2 vedantjh2 changed the title fix nemotron to be able to use pcg fix nemotron to be able to use piecewise cuda graph Mar 5, 2026

use_triton_causal_conv=True, # TODO: investigate need of `use_triton_causal_conv`
)
return output, residual
if forward_batch.forward_mode.is_extend() and get_forward_context() is not None:
Collaborator
use is_in_piecewise_cuda_graph() context

@vedantjh2 vedantjh2 changed the title fix nemotron to be able to use piecewise cuda graph Enable Piecewise CUDA Graph for NemotronH Hybrid (Mamba+Attention) Models Mar 5, 2026
Contributor Author

/rerun-failed-ci

@ispobock ispobock merged commit 25bd830 into sgl-project:main Mar 12, 2026
167 of 173 checks passed
liubiyongge pushed a commit to liubiyongge/sglang that referenced this pull request Mar 13, 2026
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
@he-weiwen

Hi, I was profiling NVIDIA-Nemotron-Nano-9B-v2-FP8 locally and noticed that PCG is not actually enabled successfully by default; instead I see Disable piecewise CUDA graph because some layers do not apply Standard GQA in my logs.

Reading the source code, it seems that only the Mamba and attention layers are accounted for, while the MLP layers are completely skipped in the hasattr matching, so it falls through on the condition len(self.attention_layers) < self.model_config.num_hidden_layers.

Can someone confirm the actual status of this? I hit another crash after skipping the check on the number of attention layers, so I don't think the fix is trivial.

@Oasis-Git
Collaborator

cc @vedantjh2

@vedantjh2
Contributor Author

The layer discovery loop in init_piecewise_cuda_graphs only appended to attention_layers when it found an attention or Mamba layer. NemotronH's MLP and MoE layers use the mixer attribute but have neither .attn nor ._forward_mamba, so they were silently skipped. This made len(attention_layers) < num_hidden_layers (e.g., ~30 out of 56), triggering the "some layers do not apply Standard GQA" bail-out.

The fix: when a mixer-based layer is neither attention nor mamba, append None as a positional placeholder so the list stays aligned with layer indices and the length check passes.

Fix in PR #21436
