[Bug Fix] Reject incompatible combination of --disable-cuda-graph-padding and --enable-torch-compile by ppraneth · Pull Request #23903 · sgl-project/sglang

ppraneth · 2026-04-28T03:41:27Z

Motivation

When a user passes both --disable-cuda-graph-padding and --enable-torch-compile together, the engine never finishes initializing. It silently hangs for many minutes (or indefinitely on larger models) with no indication that anything is wrong.

The root cause is a batch size explosion in the CUDA/CPU graph capture step. With padding enabled, the capture list is a small fixed set of bucket sizes like [1, 2, 4, 8, 12, 16, 24]. With padding disabled, _generate_cuda_graph_batch_sizes and _generate_cpu_graph_batch_sizes both expand this to every integer from 1 up to cuda_graph_max_bs (for example, 1 through 24 or up to 160 on small GPUs). Because CPUGraphRunner runs a full torch.compile and Triton AUTOTUNE cycle for each batch size, the number of kernel benchmarks grows from roughly 28 (7 buckets x 4 matmul shapes) to hundreds or thousands. In the reproduction on an RTX 4090, the process was killed after 3 minutes while still at batch size 11 out of 24.
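To make the growth concrete, here is a minimal sketch of the arithmetic described above. It is not the code in `_generate_cuda_graph_batch_sizes`: the helper names are hypothetical, the bucket list is the example list from this description, and the count assumes one benchmark per (batch size, matmul shape) pair, which is the same assumption behind the "7 buckets x 4" estimate.

```python
# Illustrative only: helper names and constants are assumptions, not SGLang APIs.
PADDED_BUCKETS = [1, 2, 4, 8, 12, 16, 24]  # example bucket list from the description
MATMUL_SHAPES_PER_SIZE = 4                 # figure used in the "7 buckets x 4" estimate


def capture_batch_sizes(max_bs: int, disable_padding: bool) -> list[int]:
    """Return the batch sizes that would be captured (and compiled)."""
    if disable_padding:
        # Without padding, every integer batch size gets its own graph.
        return list(range(1, max_bs + 1))
    # With padding, only the fixed buckets up to max_bs are captured.
    return [bs for bs in PADDED_BUCKETS if bs <= max_bs]


def autotune_workload(max_bs: int, disable_padding: bool) -> int:
    """Rough count of kernel benchmarks: one per (batch size, matmul shape)."""
    return len(capture_batch_sizes(max_bs, disable_padding)) * MATMUL_SHAPES_PER_SIZE


if __name__ == "__main__":
    print(autotune_workload(24, disable_padding=False))
    print(autotune_workload(24, disable_padding=True))
    print(autotune_workload(160, disable_padding=True))
```

Under these assumptions the workload scales linearly with `cuda_graph_max_bs` once padding is disabled. Real autotune runs may benchmark several candidate configurations per shape, so the true count can be higher.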

The two flags are also semantically incompatible at runtime. --disable-cuda-graph-padding means the server will route exact incoming batch sizes without rounding up. If those exact sizes were not pre-compiled, torch.compile would trigger a new compilation on the fly during request serving, causing latency spikes for real traffic.
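The runtime mismatch can be sketched the same way. The snippet below is a simplified model, not SGLang's dispatch code: `dispatch_size` and `needs_recompile` are hypothetical names, the bucket list is the example from this description, and raising `ValueError` for oversized batches is a simplification for illustration.

```python
from bisect import bisect_left

# Illustrative only: assume just these bucket sizes were pre-compiled.
COMPILED_BUCKETS = [1, 2, 4, 8, 12, 16, 24]


def dispatch_size(batch_size: int, disable_padding: bool) -> int:
    """Return the batch size the graph runner would be asked to serve."""
    if disable_padding:
        # Exact routing: the incoming size is used as-is.
        return batch_size
    # Padded routing: round up to the smallest bucket >= batch_size.
    idx = bisect_left(COMPILED_BUCKETS, batch_size)
    if idx == len(COMPILED_BUCKETS):
        raise ValueError(f"batch size {batch_size} exceeds the largest bucket")
    return COMPILED_BUCKETS[idx]


def needs_recompile(batch_size: int, disable_padding: bool) -> bool:
    """True if the served size was not pre-compiled."""
    return dispatch_size(batch_size, disable_padding) not in COMPILED_BUCKETS


if __name__ == "__main__":
    misses = [bs for bs in range(1, 25) if needs_recompile(bs, disable_padding=True)]
    print(f"{len(misses)} of 24 batch sizes would compile during serving")
```

In this model, padded routing never misses, while exact routing misses for every size that is not itself a bucket. Those misses are where on-the-fly compilation would land on real traffic.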

Modifications

Added a validation check in ServerArgs.check_server_args() in python/sglang/srt/server_args.py that raises an AssertionError immediately at engine startup when both --disable-cuda-graph-padding and --enable-torch-compile are passed together.
The error message explains the cause (O(max_batch_size) autotune explosion) and tells the user which flag to remove, so there is no silent hang and no confusion about what went wrong.
No behavior change for any other flag combination.
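For readers who want the shape of the guard without opening the diff, a minimal sketch follows. It is an approximation, not the merged code: `ServerArgsSketch` is a hypothetical stand-in for `ServerArgs`, the field names are assumed to mirror the CLI flags, and the error text is paraphrased from the description above.

```python
from dataclasses import dataclass


@dataclass
class ServerArgsSketch:
    """Hypothetical stand-in for ServerArgs; only the two relevant flags."""

    disable_cuda_graph_padding: bool = False
    enable_torch_compile: bool = False

    def check_server_args(self) -> None:
        # Fail fast at startup instead of hanging during graph capture.
        assert not (self.disable_cuda_graph_padding and self.enable_torch_compile), (
            "--disable-cuda-graph-padding and --enable-torch-compile cannot be "
            "combined: without padding, every batch size up to the maximum is "
            "compiled and autotuned (O(max_batch_size) work). "
            "Remove one of the two flags."
        )


if __name__ == "__main__":
    # Either flag alone passes validation.
    ServerArgsSketch(enable_torch_compile=True).check_server_args()
    ServerArgsSketch(disable_cuda_graph_padding=True).check_server_args()
    try:
        ServerArgsSketch(
            disable_cuda_graph_padding=True, enable_torch_compile=True
        ).check_server_args()
    except AssertionError as exc:
        print(f"rejected at startup: {exc}")
```

Note that a bare `assert` is skipped when Python runs with `-O`. The sketch uses one only because the description says the check raises an `AssertionError`.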

Accuracy Tests

Speed Tests and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

gemini-code-assist · 2026-04-28T03:41:31Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

ppraneth · 2026-04-28T06:38:57Z

@ShangmingCai Can you please review this pr?

ppraneth · 2026-05-12T05:30:08Z

@b8zhong Can you review this pr?

ShangmingCai

Thx for the PR. I am not an expert on CUDA graphs and Torch compile. This PR appears to be a guard rather than a fix, but the use case explanation sounds reasonable, so I think we can merge this. If people are requiring the mixed use of these configs, but hitting this assertion error in the future, then they could propose a real fix and remove this assertion at that time.

…ack) Brings in upstream sgl-project/sglang main commits since 096ad02 (merge base, Laguna-XS.2 model support). Total: 28 upstream commits composed.

Custom-stack files preserved intact (entirely-ours, byte-identical to origin/main):
- Blackwell CuTe kernel suite (warp_decode_cute, g1_attention_cute, gated_norm_cute, layersplit_cute, fused_store_index_cache)
- TurboQuant 2.5-bit dense KV cache path
- HIGGS 2-bit dense KV cache path (with split-K decode)
- NVFP4 IndexCache dispatcher (active gate)
- quantization_config_dispatch (HF-config-driven runtime routing)
- All custom server-args flags and runtime methods preserved

Verification:
- 200+ merged Python files compile cleanly
- Dispatcher symbol presence verified
- HIGGS pool / TurboQuant pool classes present at expected lines
- compressed_tensors_w4a4_nvfp4_moe imports clean
- All custom server-args flags present (enable_higgs_dense_2bit_kv_cache, enable_turboquant_dense_kv_cache, turboquant_dense_kv_preset, indexer_quantization_declared, higgs_mla_decode_num_splits, etc.)

Manual-merged shared files (auto-merge gave broken/mixed output; cleaned up post-merge):
- python/sglang/srt/disaggregation/mooncake/conn.py: upstream's PR#24932 refactored maybe_send_extra into a state-types-loop. Replayed our LayerSplit NSA state-index-length-mismatch check inside the SWA/NSA branch of the new loop body.
- sgl-kernel/python/sgl_kernel/__init__.py: upstream's PR#23449 (Apple Silicon Metal kernel) wrapped the entire module body in `if darwin/arm64: from sgl_kernel.metal import * else: ...`. The auto-merge duplicated the file body; rewrote cleanly with upstream's structure and re-injected our `g1_gate_forward`, `warp_decode_cute_moe_forward`, and `warp_decode_cute_moe_packed_forward` imports plus `g1_gate_forward` in _DEBUG_EXPORT_NAMES.
- python/sglang/srt/managers/scheduler_output_processor_mixin.py: line 628 still referenced `result.num_accepted_drafts` (renamed by PR sgl-project#25038 to `num_correct_drafts`). Renamed in place.
- python/sglang/srt/observability/scheduler_metrics_mixin.py: a block around the spec-decode logging path had mixed old/new names from auto-merge (lines 553/557/560). Renamed `spec_num_accepted_tokens` -> `spec_num_accept_tokens` and local `num_accepted_drafts` -> `num_correct_drafts` to match the rest of the file.
- test/test_smc_info.py: stub Req mock used the old field names `spec_accepted_drafts` and `update_spec_acceptance_histogram`. Renamed to `spec_num_correct_drafts` and `update_spec_correct_drafts_histogram` per PR sgl-project#24081.

Auto-merge cleanly integrated upstream changes to:
- server_args.py (new fields: prefill_only_disable_kv_cache, weight_loader_drop_cache_after_load, prefill_delayer_queue_min_ratio, prefill_delayer_max_delay_ms, speculative_draft_window_size, etc.)
- mem_cache/memory_pool.py (new NoOpMHATokenToKVPool)
- model_executor/model_runner_kv_cache_mixin.py (NoOpMHATokenToKVPool pool factory + _validate_prefill_only_disable_kv_cache_pool_family)
- layers/attention/nsa_backend.py (spec rename num_accepted_drafts -> num_correct_drafts; num_accepted_tokens -> num_accept_tokens)
- layers/attention/nsa/nsa_indexer.py (new _apply_q_scale_and_softmax_scale compile method; torch.mm replaces deep_gemm wrapper)
- 28+ disaggregation/spec/runner files with mostly clean upstream-side-only integration.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

----- upstream commit subjects (28) -----
fd3eb77 [Cookbook]: add Laguna-XS.2 (Poolside) (sgl-project#24730)
6be1a45 Fix swa component host hit (sgl-project#25085)
693f497 [NPU] use causal_conv1d_update_v2 for performance (sgl-project#24595)
1efe9e2 [Bug Fix] Reject incompatible combination of --disable-cuda-graph-padding and --enable-torch-compile (sgl-project#23903)
8d27ce7 Optimize uvicorn startup command (sgl-project#25041)
b35fd5f [fix] skip legacy minicpmv conv template for MiniCPM-V 4.6 (sgl-project#24998)
7582237 [Tiny Fix] Disable BCG when inner layer_model unresolved (sgl-project#25021)
ca3bc05 Deepseek-v4-Pro share expert tp1 (sgl-project#24949)
a72d3ae [Spec] Multi-layer mamba scatter cleanup; fix positional call bug (sgl-project#25030)
7128533 Revert "Migrate Intel CPU cases to the test/registered." (sgl-project#25044)
1f985c5 [Spec] Rename `accepted_indices` -> `accept_indices`; drop `_token_id` suffix per Rule 5 (sgl-project#25038)
ecf5d84 Migrate Intel CPU cases to the test/registered. (sgl-project#22670)
d7f4761 [PD] Refactor hybrid state transfer (sgl-project#24932)
91907b7 [UnifiedTree]: Fix Unified HiCache tombstone lock release replay (sgl-project#24972)
4ad63ad [Spec] Rename `accepted_drafts` -> `correct_drafts` for unambiguous naming (sgl-project#24081)
6bfb365 [PD] Rate limit prefill inflight polling warnings (sgl-project#24967)
6bb79c1 [Linear Attn] Add CUSTOM enum and plugin extensibility for kernel backends (sgl-project#24937)
cfc41d5 Fix kimi k2.5 mla eagle + dp attention (sgl-project#25033)
0f3932c [Fix] Qwen3-ASR config: set thinker_config before super().__init__ (sgl-project#24187)
f526e3f [Spec] Mamba scatter cleanup; fix multi-layer positional bug; dflash naming (sgl-project#25029)
10375a1 [NIXL][XPU] Fix uint64 overflow for mismatched P/D TP sizes (e.g. prefill_tp=1, decode_tp=2) (sgl-project#24648)
0a37d24 [diffusion] hardware: support sage attention backend on MUSA (attn backend, 21/N) (sgl-project#24752)
5495026 [HiCache] feat: default storage prefetch timeout (sgl-project#23309)
186eb42 Feat: Support SWA (Sliding Window Attention) for EAGLE-3 drafter (sgl-project#24664)
a75b79e Feat: Support newer EAGLE-3 drafters (sgl-project#24663)
f3a8189 [Spec] Internal rename per N2 v2 naming rule (sgl-project#25014)
bfc2eda [MUSA] Use MUSA-optimized operators in piecewise CUDA graph (sgl-project#23633)
74d70af [Apple Silicon] Add Metal kernel support in sgl-kernel (sgl-project#23449)

…ding and --enable-torch-compile (sgl-project#23903)

fix bug

746ff3c

ShangmingCai approved these changes May 12, 2026

ShangmingCai merged commit 1efe9e2 into sgl-project:main May 12, 2026
57 of 65 checks passed

xjpang pushed a commit to xjpang/sglang that referenced this pull request May 13, 2026

[Bug Fix] Reject incompatible combination of --disable-cuda-graph-pad…

2391096

…ding and --enable-torch-compile (sgl-project#23903)

Shunkangz pushed a commit to Shunkangz/sglang that referenced this pull request May 27, 2026

[Bug Fix] Reject incompatible combination of --disable-cuda-graph-pad…

3b430e8

…ding and --enable-torch-compile (sgl-project#23903)

ShangmingCai merged 1 commit into sgl-project:main from ppraneth:bug6