Enable Piecewise CUDA Graph for NemotronH Hybrid (Mamba+Attention) Models#19903
ispobock merged 22 commits into sgl-project:main from vjhaveri/fix_nemotron
Conversation
```python
# captured graph size. Slice to actual token count for Mamba forward.
attn_backend = forward_batch.attn_backend
metadata = attn_backend.linear_attn_backend.forward_metadata
num_actual_tokens = metadata.num_prefill_tokens + (
```
Why does only Mamba need this special shape handling? Can't we know the exact output shape beforehand?
During CUDA graph replay, `hidden_states` is padded to the captured graph batch size. Attention handles this naturally via KV cache and masks, but Mamba processes tokens sequentially through conv/SSM states and asserts `num_actual_tokens == projected_states.shape[0]`. The slicing must happen inside the split op (not the caller) because `torch.compile(fullgraph=True)` requires static tensor shapes within each compiled segment; the split op acts as the graph break where we can access runtime metadata.
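A minimal sketch of the padding/slice behavior described above. Plain Python lists stand in for tensors, and `slice_to_actual` is an illustrative name, not sglang's actual split op:

```python
def slice_to_actual(hidden_states, num_actual_tokens):
    # During CUDA graph replay the batch is padded to the captured graph
    # size; slice back to the real token count before the Mamba scan,
    # which asserts num_actual_tokens == projected_states.shape[0].
    return hidden_states[:num_actual_tokens]

padded = [[0.0] * 4 for _ in range(8)]   # captured graph size: 8 tokens
real = slice_to_actual(padded, 5)        # only 5 real tokens this step
```

The key point is that the slice happens inside the op boundary, so each compiled segment still sees static shapes.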
```python
elif hasattr(layer, "_forward_mamba"):
    # Mamba layer with split op support - store the layer itself
    attn_layer = layer
# attn_layer is None for non-attention layers (e.g. Mamba, MLP-only)
```
Why do we need to store non-attention layers as `None` in `attention_layers`? Could we store only attention layers as before, and insert Mamba layers into `attention_layers` for Mamba models, or change the field name in nemotron_h to make it compatible with the existing logic?
`attention_layers` is indexed by `layer_id` in the split ops (e.g., `attention_layers[layer_id]`). NemotronH's `layer_id` is the absolute model layer index (0–51), so each position must map correctly. Without `None` placeholders, `layer_id=10` (a Mamba layer) would index into the wrong entry.
For non-hybrid models this is backward-compatible: every entry is an attention layer and no `None` values appear. Open to alternatives if you have a preferred approach (e.g., using a dict instead of a list).
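A toy illustration of why the placeholders keep absolute `layer_id` indexing correct; the layer names and type layout here are made up:

```python
# Hypothetical hybrid layout: absolute layer_id indexes the list directly.
layer_types = ["attn", "mamba", "mlp", "attn"]          # illustrative
attention_layers = [
    f"layer{i}" if t == "attn" else None                # placeholder stands in for the module
    for i, t in enumerate(layer_types)
]
# attention_layers[3] is the attention layer at absolute index 3;
# without the None placeholders it would sit at index 1 and misalign.
```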
I think this is a good design. Should keep it
The only concern here is that we're changing the attention-layer capturing logic for all other models as well, i.e., we are now adding `None` for non-attention decoder layers. Theoretically it should be fine, since all other models should only have attention/linear-attention decoder layers; only nemotron_h has MLP in its decoder layers. We can verify this by checking that all PCG CIs of other models pass.
/tag-and-rerun-ci
…sglang into vjhaveri/fix_nemotron
```python
    use_triton_causal_conv=True,  # TODO: investigate need of `use_triton_causal_conv`
)
return output, residual
if forward_batch.forward_mode.is_extend() and get_forward_context() is not None:
```
use `is_in_piecewise_cuda_graph()` context
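A sketch of the suggested guard. `is_in_piecewise_cuda_graph()` is stubbed here, and the surrounding control flow is illustrative rather than sglang's actual code:

```python
_PCG_ACTIVE = False  # toggled by the (stubbed) PCG capture context

def is_in_piecewise_cuda_graph() -> bool:
    # Stub standing in for the helper named in the review comment.
    return _PCG_ACTIVE

def forward(mode_is_extend: bool) -> str:
    # Suggested guard: gate the split-op path on the PCG context helper
    # rather than on `get_forward_context() is not None`.
    if mode_is_extend and is_in_piecewise_cuda_graph():
        return "split_op_path"
    return "regular_path"

outside = forward(mode_is_extend=True)   # PCG not active
_PCG_ACTIVE = True
inside = forward(mode_is_extend=True)    # PCG capture active
```

The explicit helper makes the intent clearer than checking whether a forward context merely exists.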
…de.is_extend() and get_forward_context() is not None
/rerun-failed-ci
Hi - I was profiling NVIDIA-Nemotron-Nano-9B-v2-FP8 locally and I noticed that PCG wasn't actually enabled successfully by default. Reading the source code, it seems that only the Mamba and attention layers are accounted for, while the MLP layers are completely skipped in the `hasattr` matching, so it falls through. Can someone confirm the actual status of this? I hit another crash after skipping the check on the number of attention layers, so I don't think the fix is trivial.
|
cc @vedantjh2 |
The layer discovery loop in `init_piecewise_cuda_graphs` only appended to `attention_layers` when it found an attention or Mamba layer. NemotronH's MLP and MoE layers use the `mixer` attribute but have neither `.attn` nor `._forward_mamba`, so they were silently skipped. This made `len(attention_layers) < num_hidden_layers` (e.g., ~30 out of 56), triggering the "some layers do not apply Standard GQA" bail-out. The fix: when a `mixer`-based layer is neither attention nor Mamba, append `None` as a positional placeholder so the list stays aligned with layer indices and the length check passes. Fix in PR #21436
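The described fix can be sketched as follows; the layer classes and loop are illustrative stand-ins for sglang's `init_piecewise_cuda_graphs` logic, not the actual code:

```python
class Attn:
    def __init__(self):
        self.attn = object()          # standard attention module

class Mamba:
    def _forward_mamba(self):         # Mamba layer with split-op support
        pass

class MLP:
    pass                              # neither .attn nor ._forward_mamba

def collect_attention_layers(layers):
    attention_layers = []
    for layer in layers:
        if hasattr(layer, "attn"):
            attention_layers.append(layer)
        elif hasattr(layer, "_forward_mamba"):
            attention_layers.append(layer)
        else:
            # The fix: append None so the list stays aligned with absolute
            # layer indices and len(attention_layers) == num_hidden_layers.
            attention_layers.append(None)
    return attention_layers

layers = [Attn(), Mamba(), MLP(), Attn()]
result = collect_attention_layers(layers)
```

Before the fix, the `else` branch was missing, so the MLP layer was silently dropped and the length check failed.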
Motivation
Piecewise CUDA graph (PCG) was previously disabled for NemotronH models because the layer detection logic required all layers to use standard GQA attention. NemotronH is a hybrid architecture (4 Attention + 24 Mamba + 24 MLP across 52 layers) where all sublayers use a `mixer` attribute instead of `self_attn`, causing the detection to fail.
Changes
model_runner.py
- `mixer` attribute detection for NemotronH-style hybrid models
- Positional placeholders in `attention_layers` (`None` for non-attention layers), enabling PCG to handle sparse attention architectures

nemotron_h.py
- Extracted a `_forward_mamba()` method from `NemotronHMambaDecoderLayer.forward()`
- Registered a `nemotron_mamba2_with_output` split op (using `register_custom_op` + `register_split_op`) to enable graph breaks around Mamba layers during PCG capture
- Changed `Layers` from a union type (A | B | C | D) to a tuple for `torch.compile` compatibility
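A small illustration of the union-to-tuple change for `Layers`; the classes are placeholders, and the `torch.compile` motivation is taken from the PR description:

```python
class A:
    pass

class B:
    pass

# Before (per the PR description): Layers was a union type, e.g. A | B.
# After: a plain tuple of classes, which isinstance also accepts and
# which torch.compile traces more reliably.
Layers = (A, B)
ok = isinstance(A(), Layers)
```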
Model: NVIDIA-Nemotron-Nano-9B-v2 (H100, bfloat16)
Benchmark: GSM8K (200 samples)
Notes
Launch commands