fix nemotron capture for non attention layers #21436

Fridge003 merged 7 commits into sgl-project:main
Conversation
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

/tag-run-ci-label
I just tried applying this fix on my local RTX 4090 and I got this: Is it working on your side? *Edit: Seems to be a flashinfer backend issue described in #21218
@he-weiwen yes it works on my end. Does the issue you linked need to be merged before I can successfully run CI for this?
@he-weiwen This issue is fixed in #21452
/rerun-ut test_nvidia_nemotron_nano_v2.py

/rerun-ut test_nvidia_nemotron_3_nano.py

✅

/rerun-ut test_nvidia_nemotron_3_super_bf16.py

✅

✅

/rerun-ut test_nvidia_nemotron_3_super_nvfp4.py

✅
Motivation
Piecewise CUDA graph capture was silently disabled for NemotronH hybrid models (e.g., NVIDIA-Nemotron-Nano-9B-v2). The `init_piecewise_cuda_graphs` layer discovery loop only appended to `attention_layers` when it found an attention or mamba layer. NemotronH's pure MLP (`-` pattern) and MoE layers use the `mixer` attribute but have neither `.attn` nor `._forward_mamba`, so they were skipped entirely. This caused `len(attention_layers) < num_hidden_layers`, triggering the early bail-out with "Disable piecewise CUDA graph because some layers do not apply Standard GQA".

Modifications
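A minimal sketch of the fixed discovery loop, using simplified stand-in classes rather than SGLang's real layer objects (only `init_piecewise_cuda_graphs`, `attention_layers`, `mixer`, `.attn`, and `._forward_mamba` are names from the PR; everything else here is hypothetical):

```python
# Hedged sketch of the layer-discovery fix described above. The real loop
# lives in init_piecewise_cuda_graphs in
# python/sglang/srt/model_executor/model_runner.py; these classes are
# simplified stand-ins for SGLang's actual layer types.

class Attn:
    """Stand-in for a standard GQA attention op."""

class MambaMixer:
    """Stand-in for a mamba mixer (has _forward_mamba)."""
    def _forward_mamba(self):
        pass

class MLPMixer:
    """Stand-in for a NemotronH pure MLP/MoE mixer: no .attn, no _forward_mamba."""

class HybridLayer:
    def __init__(self, mixer):
        self.mixer = mixer

def discover_attention_layers(layers):
    attention_layers = []
    for layer in layers:
        if hasattr(layer, "attn"):
            attention_layers.append(layer.attn)
        elif hasattr(layer, "mixer"):
            mixer = layer.mixer
            if hasattr(mixer, "attn"):
                attention_layers.append(mixer.attn)
            elif hasattr(mixer, "_forward_mamba"):
                attention_layers.append(mixer)
            else:
                # NemotronH pure MLP/MoE layer: append None as a positional
                # placeholder so indices stay aligned with layer_id.
                attention_layers.append(None)
    return attention_layers

layers = [HybridLayer(MambaMixer()), HybridLayer(MLPMixer()), HybridLayer(MambaMixer())]
result = discover_attention_layers(layers)
assert len(result) == len(layers)  # list stays aligned, no early bail-out
assert result[1] is None           # the MLP layer holds the slot for layer_id 1
```

Before the fix, the MLP layer would simply be skipped, shrinking the list and tripping the `len(attention_layers) < num_hidden_layers` check.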
`python/sglang/srt/model_executor/model_runner.py`

In the `init_piecewise_cuda_graphs` method, updated the layer discovery logic to handle NemotronH-style hybrid models where some layers are pure MLP/MoE (accessed via `mixer` but without `.attn` or `._forward_mamba`):

- When a layer has a `mixer` attribute but is neither attention nor mamba, append `None` as a positional placeholder to `attention_layers`. This keeps the list aligned with layer indices so that split ops like `nemotron_mamba2_with_output` can correctly index by `layer_id`.
- The `None` placeholder is only appended for layers that enter the `mixer` branch; models without a `mixer` attribute (e.g., LFM2 conv layers) are unaffected, preserving the existing safety check that disables piecewise CUDA graph for unsupported architectures.

Accuracy Tests
GSM8K accuracy is identical with and without piecewise CUDA graph:
Benchmarking and Profiling
Model: `NVIDIA-Nemotron-Nano-9B-v2` on a single GPU.
Benchmark: `benchmark/gsm8k/bench_sglang.py` (200 samples)
Baseline (`--disable-piecewise-cuda-graph`):
PCG Enabled:
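A hedged sketch of how the two configurations could be launched; only `--disable-piecewise-cuda-graph` and the benchmark script path are named in this PR, and the other flags (model path, `--num-questions`) are assumptions:

```shell
# Baseline: launch the server with piecewise CUDA graph disabled
python3 -m sglang.launch_server \
  --model-path nvidia/NVIDIA-Nemotron-Nano-9B-v2 \
  --disable-piecewise-cuda-graph

# PCG enabled: same launch without the flag
python3 -m sglang.launch_server \
  --model-path nvidia/NVIDIA-Nemotron-Nano-9B-v2

# In another shell, run the GSM8K benchmark against the running server
# (--num-questions is an assumed flag of bench_sglang.py)
python3 benchmark/gsm8k/bench_sglang.py --num-questions 200
```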
Checklist
Review Process
`/tag-run-ci-label`, `/rerun-failed-ci`, `/tag-and-rerun-ci`