[Refactor] Cuda Graph Runner/Backend Refactor #23906

Open
Oasis-Git wants to merge 106 commits into sgl-project:main from Oasis-Git:cg-refactor

@Oasis-Git
Collaborator

@Oasis-Git Oasis-Git commented Apr 28, 2026

Motivation

#23004

[WIP]

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

CI States

Latest PR Test (Base): ❌ Run #26602343060
Latest PR Test (Extra): ❌ Run #26602342965

Oasis-Git and others added 12 commits April 27, 2026 22:34
…ils} packages

Sets up the empty package skeletons for the CUDA graph refactor without
changing any behavior.

- Create cuda_graph_runner/ package; relocate legacy cuda_graph_runner.py
  to cuda_graph_runner/legacy.py and re-export verbatim from __init__.py
  so the 31 existing import sites (model_runner, eagle_worker, lora,
  memory_pool, etc.) keep working transparently.
- Create cuda_graph_backend/ package with Base/Full/Breakable/TCPiecewise
  CudaGraphBackend skeleton classes (no implementations yet).
- Create cuda_graph_backend_utils/{breakable_cuda_graph,piecewise_cuda_graph}/
  empty subpackages for primitives that move in Phase 1.
- Add ServerArgs.cuda_graph_mode: Optional[Dict[str, str]] = None field
  for the upcoming canonical per-phase config; legacy flags still drive
  behavior.
- Add cuda_graph_runner.config_resolution.resolve_cuda_graph_config()
  no-op stub; real pipeline lands in Phase 1.

No code path uses the new abstractions yet. See refactor/plan.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… primitives

Splits the conflated context state in compilation/piecewise_context_manager.py
along its real seam: the cuda-graph-capture flag (used by both BCG and tcpcg)
moves to model_executor/, while the torch.compile-warmup flag (tcpcg-internal)
stays in compilation/.

Renames:
  is_in_piecewise_cuda_graph        -> is_in_cuda_graph_capture
  enable_piecewise_cuda_graph       -> enable_cuda_graph_capture
  is_in_pcg_torch_compile           -> is_in_torch_compile_warmup
  enable_piecewise_cuda_graph_compile -> enable_torch_compile_warmup
  PIECEWISE_CUDA_GRAPH_CAPTURE_FAILED_MSG -> CUDA_GRAPH_CAPTURE_FAILED_MSG

Relocations:
  compilation/piecewise_context_manager.py
    -> model_executor/cuda_graph_backend_utils/piecewise_cuda_graph/context_manager.py
       (capture flag, ForwardContext, set/get_forward_context)
    -> compilation/compile_phase.py
       (warmup flag, pcg_capture_stream)
  model_executor/breakable_cuda_graph/{breakable_cuda_graph,context,cuda_utils}.py
    -> model_executor/cuda_graph_backend_utils/breakable_cuda_graph/{...}

The two old paths (compilation/piecewise_context_manager.py and
model_executor/breakable_cuda_graph/) are kept as transition shims that
re-export from the new homes under both old and new names. Removed in
Phase 6.

Audited 38 callsites across 16 production files. All switched to the
renamed primitives at their new import paths. Behavior preserved
mechanically (bucket A everywhere). Two bucket-C candidates flagged
with TODO comments for follow-up:
  - models/nemotron_h.py:_forward_core (CUDA stream overlap path —
    genuine dynamo-tracing constraint, not capture-or-replay)
  - models/deepseek_common/.../forward_mla.py (non-contiguous-output
    bmm form — required by dynamo, not by capture)

Verified clean: 22 audited modules import OK, identity preserved
through both shims, BCG test path still resolves.

See refactor/plan.md §6.5 for context-flag semantics + audit rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ig_resolution

Moves the 18-condition ``_handle_piecewise_cuda_graph`` from server_args.py
to ``cuda_graph_runner.config_resolution`` and converts the long if/if
cascade into a data table of ``_PiecewiseDisableRule(name, predicate)``
entries — easier to read, audit, and extend.
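
A minimal sketch of the table-driven shape described above. Only the ``_PiecewiseDisableRule(name, predicate)`` form comes from this commit; the ``Args`` stand-in, the two example rules, and the ``first_disable_reason`` helper are hypothetical and do not reproduce any of the actual 18 conditions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Args:
    """Hypothetical stand-in for ServerArgs; feature_a/feature_b are made up."""
    feature_a: bool = False
    feature_b: bool = False
    enforce_piecewise_cuda_graph: bool = False


@dataclass(frozen=True)
class _PiecewiseDisableRule:
    name: str
    predicate: Callable[[Args], bool]  # True means "disable piecewise CUDA graph"


# Illustrative entries only; the real table has 18 rules.
_RULES: List[_PiecewiseDisableRule] = [
    _PiecewiseDisableRule("example_feature_a", lambda a: a.feature_a),
    _PiecewiseDisableRule("example_feature_b", lambda a: a.feature_b),
]


def first_disable_reason(args: Args) -> Optional[str]:
    """Return the name of the first matching rule, or None if none fires."""
    if args.enforce_piecewise_cuda_graph:
        return None  # the enforce flag bypasses the whole table
    for rule in _RULES:
        if rule.predicate(args):
            return rule.name
    return None
```

Keeping each rule as a named entry means a parity matrix can iterate over the table and exercise one rule per case.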

Wires ``resolve_cuda_graph_config(self)`` from ``ServerArgs.__post_init__``
in place of the old method call. The old method is removed (no callers left).

Phase 1 only implements stage 3 (compatibility checks) of the four-stage
pipeline described in plan §3. Stages 1 (parse), 2 (default), and 4
(validate) remain stubs that land in Phase 4 alongside the new CLI surface.
GPU-memory-based defaulting still lives in ``_handle_gpu_memory_settings``
until then.

Behavior parity verified against an 18-case matrix covering every rule
plus the ``enforce_piecewise_cuda_graph`` bypass path. ``--enforce`` still
overrides the entire table for testing per Q3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Locks down the contract that backend extractions in Phases 2b–2d must
satisfy. The ABC has four methods:

  prepare(runner)                     # one-time setup
  capture_one(shape_key, forward_fn)  # capture for one shape
  replay(shape_key)                   # replay artifact at shape
  cleanup()                           # default no-op

plus a class attribute ``captures_attn_metadata`` that lets the runner
know whether ``init_forward_metadata_capture_cuda_graph`` should run
inside the captured region (full-graph style) or outside on every
replay (PCG/BCG style).

Three concrete subclasses declared with the right metadata flag but
NotImplementedError bodies:
  - FullCudaGraphBackend (captures_attn_metadata=True)  — Phase 2b
  - BreakableCudaGraphBackend (captures_attn_metadata=False) — Phase 2c
  - TCPiecewiseCudaGraphBackend (captures_attn_metadata=False) — Phase 2d

Bodies will be lifted from cuda_graph_runner/legacy.py,
breakable_cuda_graph_runner.py, piecewise_cuda_graph_runner.py
respectively.
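
The contract above could be sketched roughly as follows. The four method names and ``captures_attn_metadata`` are taken from this commit message; the type hints and the ``RecordingBackend`` toy subclass are assumptions added purely for illustration, and no real CUDA graph is involved.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, Hashable


class BaseCudaGraphBackend(ABC):
    # Whether attention-metadata init runs inside the captured region.
    captures_attn_metadata: bool = False

    @abstractmethod
    def prepare(self, runner: Any) -> None:
        """One-time setup."""

    @abstractmethod
    def capture_one(self, shape_key: Hashable, forward_fn: Callable[[], Any]) -> Any:
        """Capture for one shape."""

    @abstractmethod
    def replay(self, shape_key: Hashable) -> Any:
        """Replay the artifact captured for this shape."""

    def cleanup(self) -> None:
        """Default no-op."""


class RecordingBackend(BaseCudaGraphBackend):
    """Hypothetical toy backend: 'captures' by caching forward_fn's output."""

    captures_attn_metadata = True

    def __init__(self) -> None:
        self._outputs: Dict[Hashable, Any] = {}

    def prepare(self, runner: Any) -> None:
        self._outputs.clear()

    def capture_one(self, shape_key: Hashable, forward_fn: Callable[[], Any]) -> Any:
        self._outputs[shape_key] = forward_fn()
        return self._outputs[shape_key]

    def replay(self, shape_key: Hashable) -> Any:
        return self._outputs[shape_key]
```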

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 0 added the ``ServerArgs.cuda_graph_mode`` field but missed the
corresponding ``parser.add_argument`` registration. ``from_cli_args``
maps every dataclass field name back from the argparse Namespace, so
launching the server crashed with ``AttributeError: 'Namespace' object
has no attribute 'cuda_graph_mode'``.

Adds a minimal ``--cuda-graph-mode`` flag that accepts a JSON object
and parses it to ``Dict[str, str]``. Validation of allowed values
(full/breakable/tcpcg/disabled) per phase lands in Phase 4; for now
the field is still unread.
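
A sketch of how such a flag might be registered with ``argparse``. The flag name comes from this commit; the ``parse_cuda_graph_mode`` and ``build_parser`` helpers, and the string-to-string type check, are assumptions rather than the actual ``server_args.py`` code.

```python
import argparse
import json
from typing import Dict


def parse_cuda_graph_mode(raw: str) -> Dict[str, str]:
    """Parse a JSON object such as '{"decode": "full"}' into Dict[str, str]."""
    value = json.loads(raw)
    if not isinstance(value, dict) or not all(
        isinstance(k, str) and isinstance(v, str) for k, v in value.items()
    ):
        raise argparse.ArgumentTypeError("expected a JSON object mapping str to str")
    return value


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Registering the flag is what makes `namespace.cuda_graph_mode` exist.
    parser.add_argument("--cuda-graph-mode", type=parse_cuda_graph_mode, default=None)
    return parser
```

Without the ``add_argument`` call the parsed ``Namespace`` has no ``cuda_graph_mode`` attribute at all, which is the failure mode described above.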

Caught by trying to launch a baseline Qwen3-8B server for mgsm_en
validation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lifts the full-CUDA-graph capture primitives out of
``legacy.CudaGraphRunner`` and into ``cuda_graph_backend/full.py`` as
two runner-coupling-free static methods:

  FullCudaGraphBackend.make_graph()
    -> torch.cuda.CUDAGraph()
       (was the FULL branch of CudaGraphRunner._create_device_graph)

  FullCudaGraphBackend.capture_into(graph, pool, stream,
                                    device_module, memory_saver_adapter,
                                    run_once_fn)
    -> opens the appropriate graph capture context (memory-saver-aware)
       and runs run_once_fn under it, returns its output
       (was the FULL branch of CudaGraphRunner._capture_graph)

``legacy.CudaGraphRunner._create_device_graph`` and ``_capture_graph``
keep their env-var-driven Breakable branch inline (Phase 2c lifts that)
and delegate to the new primitives on the FULL branch. Net runtime
behavior: identical bytecode path, plus one extra Python call frame at
*startup* (per-shape capture); zero per-request cost.

The ABC methods (prepare/capture_one/replay) stay NotImplementedError —
the runner still owns the dict-based dispatch and the buffer setup;
Phase 3 wires the backend to be driven through the abstract interface
when runners get unified.

Validated against the cg-refactor baseline (commit 8e79e6e,
mgsm_en N=200 on Qwen3-8B = 0.865, latency 63.18s, throughput 3439.7
tok/s):

  Phase 2b: score 0.840, latency 64.22s, throughput 3418.8 tok/s

Score delta -0.025 = 1σ noise at p=0.85, N=200 (~5 samples flipped
between greedy-decoding runs, plausible from kernel-level
non-determinism). Latency/throughput within ±2% noise.
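
The noise estimate above can be reproduced with the standard error of a proportion. This is a back-of-the-envelope check under a binomial assumption (independent samples), not part of the evaluation harness.

```python
import math

p, n = 0.85, 200                      # approximate accuracy and sample count
sigma = math.sqrt(p * (1 - p) / n)    # standard error of a proportion
delta = 0.840 - 0.865                 # Phase 2b score minus baseline
flipped = abs(delta) * n              # samples that changed outcome

print(f"sigma={sigma:.4f}, delta={delta:+.3f}, flipped={flipped:.1f}")
```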

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lifts the breakable-CUDA-graph capture primitives out of the env-var
branch of ``legacy.CudaGraphRunner._capture_graph`` into
``cuda_graph_backend/breakable.py`` as runner-coupling-free static
methods:

  BreakableCudaGraphBackend.make_graph()
    -> BreakableCUDAGraph()
       (HIP guard moves into the backend)

  BreakableCudaGraphBackend.capture_into(graph, pool, stream,
                                         run_once_fn,
                                         *, debug_eager,
                                         memory_saver_adapter)
    -> opens BreakableCUDAGraphCapture, optionally wraps with
       eager_on_graph(True) for --debug-cuda-graph mode, raises on
       memory-saver incompatibility.

Both branches of ``legacy.CudaGraphRunner._create_device_graph`` and
``_capture_graph`` now delegate (Full path → FullCudaGraphBackend, Breakable
env-var path → BreakableCudaGraphBackend). The runner is now a thin
dispatcher over the two backends; only the dict-based per-shape
storage and the prefill BCG class remain non-extracted, both of which
land in Phase 3.

ABC methods (prepare/capture_one/replay) stay NotImplementedError —
runner unification in Phase 3 wires them.

Validation (cg-refactor baseline = 0.865, score floor 0.80 per user):
  Phase 2c default Full path: score 0.845, latency 64.79s,
  throughput 3397.0 tok/s — well within floor.

The breakable env-var decode path is not directly exercised by
mgsm_en (default config uses Full); the migration is byte-equivalent
and shares a path that was working pre-refactor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lifts the torch.compile setup logic out of
``PiecewiseCudaGraphRunner`` and into ``cuda_graph_backend/tcpcg.py``
as runner-coupling-free static methods:

  TCPiecewiseCudaGraphBackend.build_compilation_config(server_args)
    -> CompilationConfig
       Validates --piecewise-cuda-graph-compiler choice, builds the
       config, registers the MoE A2A split-op when DeepEP/Mooncake is
       in use. Mirrors PiecewiseCudaGraphRunner.__init__ lines that
       previously did this inline.

  TCPiecewiseCudaGraphBackend.install_compile(language_model,
                                              compile_config, graph_pool,
                                              fullgraph=True,
                                              dynamic_arg_dims=None)
    -> wraps language_model with install_torch_compiled. Mirrors the
       call site in PiecewiseCudaGraphRunner.capture().

PCG runner now imports from the backend (deferred to call site to avoid
import-cycle risk) instead of holding the construction logic inline.
The unused top-of-file imports (CompilationConfig, install_torch_compiled,
get_moe_a2a_backend) are removed; they're now reached only through the
backend.

ABC methods (prepare/capture_one/replay) stay NotImplementedError —
runner unification in Phase 3 wires them.

Validation (cg-refactor baseline = 0.865, score floor 0.80 per user):
  Phase 2d: score 0.850, latency 64.09s, throughput 3406.3 tok/s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Establishes the canonical phase-named runner classes; today they are
thin wrappers/factories over the legacy classes so behavior is
unchanged. Phase 3b/c will migrate bodies into them.

cuda_graph_runner/decode_runner.py:
  class DecodeCudaGraphRunner(CudaGraphRunner): pass

  Subclasses the legacy decode runner. New code should refer to this
  name; the legacy class continues to host the implementation.

cuda_graph_runner/prefill_runner.py:
  class PrefillCudaGraphRunner:
      def __new__(cls, model_runner):
          if model_runner.server_args.enable_breakable_cuda_graph:
              return BreakableCudaGraphRunner(model_runner)
          return PiecewiseCudaGraphRunner(model_runner)

  Factory that selects the prefill backend (breakable vs tcpcg) by
  today's server-arg flag. Phase 4 will drive the selection from the
  canonical ``cuda_graph_mode`` config.
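
A self-contained illustration of the ``__new__``-as-factory pattern used above. The two runner classes here are empty stand-ins, and ``SimpleNamespace`` replaces the real ``model_runner``; both are assumptions for demonstration only.

```python
from types import SimpleNamespace


class BreakableCudaGraphRunner:
    """Stand-in for the real breakable prefill runner."""

    def __init__(self, model_runner):
        self.model_runner = model_runner


class PiecewiseCudaGraphRunner:
    """Stand-in for the real piecewise (tcpcg) prefill runner."""

    def __init__(self, model_runner):
        self.model_runner = model_runner


class PrefillCudaGraphRunner:
    def __new__(cls, model_runner):
        # __new__ returns an instance of a different class, so Python
        # does not call PrefillCudaGraphRunner.__init__ on the result.
        if model_runner.server_args.enable_breakable_cuda_graph:
            return BreakableCudaGraphRunner(model_runner)
        return PiecewiseCudaGraphRunner(model_runner)


def fake_model_runner(breakable: bool) -> SimpleNamespace:
    return SimpleNamespace(
        server_args=SimpleNamespace(enable_breakable_cuda_graph=breakable)
    )
```

One consequence of this design is that ``isinstance(runner, PrefillCudaGraphRunner)`` is false for the returned object, so callers should not rely on that check.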

model_runner.py:
  - decode path uses DecodeCudaGraphRunner instead of CudaGraphRunner
    (defaultdict default; CPU/NPU paths unchanged).
  - prefill path uses PrefillCudaGraphRunner factory instead of an
    inline if/else.

Late imports avoid circular-dependency risk: model_executor modules
import from cuda_graph_runner; cuda_graph_runner imports the legacy
runner module which in turn imports model_executor primitives.
Localized imports inside the factory and inside model_runner sidestep
the cycle.

External readers of ``model_runner.graph_runner`` (eagle workers,
hardware stubs) continue to work since DecodeCudaGraphRunner is-a
CudaGraphRunner via inheritance.

Validation (cg-refactor baseline = 0.865, score floor 0.80 per user):
  Phase 3a: score 0.835, latency 65.17s, throughput 3371.2 tok/s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements stages 1 (parse) and 4 (validate) of the resolver pipeline
in ``cuda_graph_runner.config_resolution`` and switches
``PrefillCudaGraphRunner`` to drive backend selection from the
canonical ``cuda_graph_mode`` field.

Stage 1 (``_parse_canonical``):
  - Translates today's legacy flags to a canonical
    ``Dict[str, str]`` covering both phases:
      ``--disable-cuda-graph``         -> decode = "disabled"
      ``--disable-piecewise-cuda-graph`` -> prefill = "disabled"
      ``--enable-breakable-cuda-graph`` -> prefill = "breakable"
      otherwise                        -> defaults
        {decode: "full", prefill: "tcpcg"}.
  - Explicit ``--cuda-graph-mode`` JSON wins per-phase (Q8); when the
    JSON conflicts with a legacy convenience flag, emits a warning
    naming both, per plan §6 Q8.
  - Re-runs after compatibility checks so any auto-disable flips
    (e.g. ``disable_piecewise_cuda_graph = True`` from the 18-rule
    table) propagate into ``cuda_graph_mode``.

Stage 4 (``_validate_canonical``):
  - Rejects unknown phases (only ``decode``/``prefill`` allowed).
  - Rejects unknown backends per phase.
  - Raises ``NotImplementedError`` for the (prefill, full) cell with a
    pointer to use breakable/tcpcg instead — plan §6 Q1.
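
The three Stage 4 checks might look roughly like this. The phase and backend names come from this PR; the function name ``validate_canonical``, the use of ``ValueError`` for the first two rejections, and the message texts are assumptions.

```python
from typing import Dict

_PHASES = ("decode", "prefill")
_BACKENDS = ("full", "breakable", "tcpcg", "disabled")


def validate_canonical(mode: Dict[str, str]) -> None:
    """Raise if the canonical mode dict contains an unsupported entry."""
    for phase, backend in mode.items():
        if phase not in _PHASES:
            raise ValueError(f"unknown phase {phase!r}; expected one of {_PHASES}")
        if backend not in _BACKENDS:
            raise ValueError(f"unknown backend {backend!r} for phase {phase!r}")
        if (phase, backend) == ("prefill", "full"):
            # The (prefill, full) cell is a deliberate stub.
            raise NotImplementedError(
                "full CUDA graph is not available for prefill; "
                "use 'breakable' or 'tcpcg' instead"
            )
```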

PrefillCudaGraphRunner factory now reads
``cuda_graph_mode["prefill"]`` instead of
``enable_breakable_cuda_graph`` directly. Decode side is unchanged
since the only available decode backend in v1 is ``full``.

Validation:
  - Unit tests confirmed default + breakable + decode-disabled mappings,
    JSON-vs-flag override warning, and validator rejection of
    (prefill, full), unknown phase, unknown backend.
  - mgsm_en N=200 on Qwen3-8B: score 0.835, latency 64.19s,
    throughput 3420.8 tok/s — above 0.80 floor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…da_graph_mode

Turns ``DecodeCudaGraphRunner`` into a factory (matching
``PrefillCudaGraphRunner``) that consults
``cuda_graph_mode["decode"]``. Three branches:

  - "full" (default): returns a ``CudaGraphRunner`` instance. No
    behavior change vs Phase 3a.
  - "breakable" (experimental, plan §2.4): bridges to today's
    ``SGLANG_USE_BREAKABLE_CUDA_GRAPH`` env-var path inside
    ``CudaGraphRunner._capture_graph`` / ``_create_device_graph``.
    Sets the env var if not already set. Phase 3b/c will replace the
    env-var read with a constructor parameter.
  - "tcpcg": not implemented for the decode phase in v1; logs a
    one-shot warning and falls back to "full" so the server still
    boots. Tracked as a Phase-3+ follow-up in refactor/progress.md.

The (prefill, full) cell continues to raise NotImplementedError from
the validator (Phase 4a, plan §6 Q1). The matrix is now:

  (decode, full)        — implemented (default)
  (decode, breakable)   — experimental, env-var bridge
  (decode, tcpcg)       — falls back to full + warning (TODO)
  (prefill, breakable)  — implemented
  (prefill, tcpcg)      — implemented (default)
  (prefill, full)       — NotImplementedError stub

Validation:
  - mgsm_en N=200 on Qwen3-8B (default mode = decode:full +
    prefill:tcpcg): score 0.835, latency 64.14s, throughput 3423.4
    tok/s. Above 0.80 floor.
  - The breakable/tcpcg decode branches are exercised at construction
    time but not driven by the default eval; explicit tests for them
    are deferred to a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…i-model PCG/BCG test pass

The transition shim at
``model_executor/breakable_cuda_graph/breakable_cuda_graph.py`` lists
its underscore-prefixed re-exports explicitly, but ``_copy_output``
was missed. The PCG/BCG test suite at
``test/registered/breakable_cuda_graph/test_breakable_cuda_graph.py``
imports it directly from the legacy path, so ``TestCopyOutput.setUpClass``
errored with ImportError under the cg-refactor branch.

Adds ``_copy_output`` to the explicit-imports list so the shim
remains 1:1 with the legacy module surface. (The ``*`` import covers only
non-underscore symbols; private symbols need explicit listing.)
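
The star-import behavior behind this bug can be shown in isolation. The module name ``demo_new_home`` and the ``public_helper`` function are hypothetical; only ``_copy_output`` is a name from this PR, and its body here is a placeholder.

```python
import sys
import types

# Build a throwaway module with one public and one private symbol.
new_home = types.ModuleType("demo_new_home")
exec(
    "def public_helper():\n    return 'public'\n"
    "def _copy_output():\n    return 'private'\n",
    new_home.__dict__,
)
sys.modules["demo_new_home"] = new_home

# Shim that relies on the star import alone: the private name is skipped
# because the module defines no __all__.
shim_star = {}
exec("from demo_new_home import *", shim_star)

# Shim that also lists the private symbol explicitly.
shim_fixed = {}
exec("from demo_new_home import *\nfrom demo_new_home import _copy_output", shim_fixed)

print("_copy_output" in shim_star, "_copy_output" in shim_fixed)
```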

End-to-end test pass (CI-registered tests + multi-model coverage at
HEAD ``a7ae66efc`` plus this fix):

PCG suite (test/registered/piecewise_cuda_graph/):
  - TestPiecewiseCudaGraphQwen25VL (Qwen2.5-VL-7B-Instruct
    --enforce-piecewise-cuda-graph + --disable-radix-cache, gsm8k):
    score 0.818 ≥ 0.80 ✓
  - TestPiecewiseCudaGraphInternVL25 (InternVL2.5-8B same setup,
    gsm8k): score 0.575 ≥ 0.54 ✓
  - TestPiecewiseCudaGraphQwen25VLEmbedding (Qwen2.5-VL-3B-Instruct
    embedding, enforce vs disable):
    max_abs_diff 0.0078 < 1e-2 ✓

BCG suite (test/registered/breakable_cuda_graph/):
  - TestBreakableCUDAGraphBasic + TestCopyOutput +
    TestBreakGraphHelper (unit): 11 tests pass ✓
  - TestBreakableCudaGraph (Qwen3-8B
    --enable-breakable-cuda-graph, mgsm_en N=1319):
    score 0.856 ≥ 0.80 ✓

Plus the multi-model spot-checks already in progress.md:
  Qwen3-8B PCG default (Phase 5 commit): 0.835
  Nemotron-H Mamba: 0.280 (parity with BCG-notes 0.310 base model)
  Qwen3-30B-A3B MoE: 0.950

Spec-decoding PCG test (Qwen3.5-35B-A3B + NEXTN, 2-GPU, FP8) deferred
— model not cached, suite is "stage-b-test-2-gpu-large", out of scope
for this single-GPU validation pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds the four convenience flags from plan §3.2 as sugar over
``--cuda-graph-mode``:

  --prefill-cuda-graph-backend {full,breakable,tcpcg,disabled}
  --decode-cuda-graph-backend  {full,breakable,tcpcg,disabled}
  --prefill-disable-cuda-graph   (== ...-backend disabled)
  --decode-disable-cuda-graph    (== ...-backend disabled)

Each translates to a single-phase entry in the canonical
``cuda_graph_mode`` dict — no decode change when only prefill is set,
and vice versa.

ServerArgs gets four new fields with the same names; CLI registration
sits next to ``--cuda-graph-mode`` in ``add_cli_args``.

Precedence in ``_parse_canonical`` (highest first; warning emitted on
override):
  1. ``--cuda-graph-mode`` JSON.
  2. Per-phase convenience flags above.
  3. Legacy ``--enable-breakable-cuda-graph`` /
     ``--disable-piecewise-cuda-graph`` / ``--disable-cuda-graph``.
  4. Defaults: {decode: full, prefill: tcpcg}.
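
The precedence order above can be sketched as a per-phase lookup through ordered layers. The ``resolve_mode`` function, its three-dict input shape, and the exact warning condition (warn only when a lower layer disagrees with the winner) are assumptions; the real ``_parse_canonical`` reads ``ServerArgs`` fields directly.

```python
from typing import Dict, List, Optional, Tuple

DEFAULTS: Dict[str, str] = {"decode": "full", "prefill": "tcpcg"}
Layer = Optional[Dict[str, str]]


def resolve_mode(
    json_mode: Layer, convenience: Layer, legacy: Layer
) -> Tuple[Dict[str, str], List[str]]:
    """Resolve each phase from the highest-precedence layer that sets it."""
    layers = [
        ("--cuda-graph-mode", json_mode or {}),
        ("per-phase flag", convenience or {}),
        ("legacy flag", legacy or {}),
    ]
    resolved: Dict[str, str] = {}
    warnings: List[str] = []
    for phase, default in DEFAULTS.items():
        chosen = [(src, layer[phase]) for src, layer in layers if phase in layer]
        if not chosen:
            resolved[phase] = default
            continue
        win_src, win_val = chosen[0]
        resolved[phase] = win_val
        for src, val in chosen[1:]:
            if val != win_val:
                warnings.append(
                    f"{phase}: {win_src}={win_val!r} overrides {src}={val!r}"
                )
    return resolved, warnings
```

Resolving each phase independently is what keeps a prefill-only flag from touching the decode entry, and vice versa.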

Validation:
  - Unit tests covering 7 scenarios (default, single-phase
    convenience, JSON-vs-convenience override warning,
    convenience-vs-legacy override warning) all pass.
  - mgsm_en N=200 on Qwen3-8B with ``--prefill-cuda-graph-backend
    breakable``: score 0.825, latency 64.83s, throughput 3416.2 tok/s.
    Above 0.80 floor; the convenience flag drives the same path that
    ``--enable-breakable-cuda-graph`` does (both set
    ``cuda_graph_mode["prefill"] = "breakable"``).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Oasis-Git and others added 2 commits April 28, 2026 05:11
…h constructor param

Removes the hacky env-var bridge that ``DecodeCudaGraphRunner`` factory
used to forward ``cuda_graph_mode["decode"] == "breakable"`` into
``CudaGraphRunner._capture_graph`` / ``_create_device_graph``.

CudaGraphRunner.__init__ now accepts ``use_breakable_capture: Optional[bool]``.
Default ``None`` keeps backwards compatibility with users who set
``SGLANG_USE_BREAKABLE_CUDA_GRAPH=1`` directly — when the kwarg is None,
the env var is consulted as a fallback.
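
A sketch of the tri-state fallback described above. The kwarg and env-var names come from this commit; the ``resolve_use_breakable`` helper and the choice to treat only the value ``"1"`` as enabled are assumptions about parsing that the commit message does not specify.

```python
import os
from typing import Mapping, Optional


def resolve_use_breakable(
    use_breakable_capture: Optional[bool] = None,
    env: Mapping[str, str] = os.environ,
) -> bool:
    """An explicit kwarg wins; None falls back to the environment variable."""
    if use_breakable_capture is not None:
        return use_breakable_capture
    # Assumption: only "1" enables the breakable path.
    return env.get("SGLANG_USE_BREAKABLE_CUDA_GRAPH", "0") == "1"
```

Passing the mapping in as a parameter keeps the fallback testable without mutating ``os.environ``.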

The DecodeCudaGraphRunner factory now passes ``use_breakable_capture=True``
when ``cuda_graph_mode["decode"] == "breakable"``; no os.environ
mutation. The (decode, tcpcg) fallback warning is unchanged.

Fixes one of the cleanup items flagged in
``refactor/progress.md`` "Open issues" section.

Validation: mgsm_en N=200 on Qwen3-8B (default decode=full) = 0.850,
latency 65.10s, throughput 3397.9 tok/s. Above 0.80 floor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e to dynamo-only

Two model-side gates were tagged ``TODO(cg-refactor)`` because their
explicit comments named torch.compile / dynamo as the constraint, but
they were gated on the broader ``is_in_cuda_graph_capture()`` (which
fires during both capture and replay). Narrowed to
``torch.compiler.is_compiling()`` — fires only during dynamo tracing
(compile time), letting replay take the fast path.

Sites touched:

  models/nemotron_h.py:_forward_core
    Comment: "torch.compile cannot trace CUDA streams". The Mamba
    decoder layer's stream-overlap path was disabled during both
    capture and replay; now it's only disabled during compile.

  models/deepseek_common/.../forward_mla.py
    Comment: "torch dynamo requires out= op was called where output
    tensor was non-contiguous". The non-contiguous-output bmm form
    was used during both capture and replay; now it's only used
    during compile.

Both gates were preserving correctness because the broader gate was a
strict superset of the dynamo-only gate. Narrowing them improves
replay performance without affecting capture-time correctness.

Validation: mgsm_en N=200 on Nemotron-H-8B-Base-8K (which exercises
the nemotron_h.py gate via the hybrid Mamba path): score 0.28, in line
with the pre-audit reference (0.310 in the BCG notes at tp2; this run is tp1).
The ``forward_mla.py`` change touches the DeepSeek MLA path; full
validation against DS-Coder-V2-Lite + flashinfer backend is a Phase 6
follow-up.

Removes the imports of ``is_in_cuda_graph_capture`` from both files
since they are no longer used after the narrowing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Oasis-Git Oasis-Git changed the title [Refactor] Cuda Graph Runner/Backend Refactor [WIP][Refactor] Cuda Graph Runner/Backend Refactor Apr 28, 2026
Oasis-Git and others added 7 commits April 28, 2026 05:33
Strips refactor-process commentary out of production code:

* Removes ``Phase N`` / ``cg-refactor`` mentions from docstrings,
  comments, and ``NotImplementedError`` messages across
  cuda_graph_runner, cuda_graph_backend, cuda_graph_backend_utils,
  the transition shims, server_args, model_runner,
  piecewise_cuda_graph_runner, and the two model-side gates in
  nemotron_h / forward_mla.

* Deletes three speculative-scaffolding files that were never wired:
  - ``cuda_graph_runner/base_runner.py`` (empty ``BaseCudaGraphRunner``
    placeholder; only used as a TYPE_CHECKING reference in stub
    methods that are now also gone)
  - ``cuda_graph_runner/buffers.py`` (no contents, no users)
  - ``cuda_graph_backend/base.py`` (``BaseCudaGraphBackend`` ABC with
    abstract ``prepare`` / ``capture_one`` / ``replay`` slots that
    nothing calls)

* Strips each backend down to the static methods that are actually
  used: ``FullCudaGraphBackend.{make_graph, capture_into}``,
  ``BreakableCudaGraphBackend.{make_graph, capture_into}``,
  ``TCPiecewiseCudaGraphBackend.{build_compilation_config, install_compile}``.
  Drops the ABC inheritance and the ``captures_attn_metadata`` flag
  (which nothing read), the ``prepare`` / ``capture_one`` / ``replay``
  ``NotImplementedError`` stubs, and the ``__init__`` re-exports for
  ``BaseCudaGraphBackend``.

* Rewrites the docstrings of ``cuda_graph_runner/__init__.py``,
  ``config_resolution.py``, ``compile_phase.py``, and the
  ``cuda_graph_backend_utils`` package init to describe the current
  architecture without referring to refactor phases.

Validation:
  - 42/42 module imports clean (every audited path).
  - mgsm_en N=200 on Qwen3-8B (default decode=full, prefill=tcpcg) =
    0.840, latency 64.34s, throughput 3410.4 tok/s. Above 0.80 floor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1a renamed is_in_piecewise_cuda_graph -> is_in_cuda_graph_capture
with the intent of "umbrella" semantics, but the umbrella was fiction:
only TCPCG ever sets the flag. Full-decode never sets a sglang flag,
and BCG has its own is_in_breakable_cuda_graph(). The generalized
name made callsites lie about what they were checking.

Revert all 35 callsites back to the original explicit names:

  is_in_cuda_graph_capture()           -> is_in_piecewise_cuda_graph()
  enable_cuda_graph_capture(...)       -> enable_piecewise_cuda_graph(...)
  CUDA_GRAPH_CAPTURE_FAILED_MSG        -> PIECEWISE_CUDA_GRAPH_CAPTURE_FAILED_MSG
  _in_cuda_graph_capture (private)     -> _in_piecewise_cuda_graph

is_in_breakable_cuda_graph() and is_in_torch_compile_warmup() unchanged
(both names describe what they actually test).

Callsites that genuinely need umbrella semantics will use explicit
inline `is_in_piecewise_cuda_graph() or is_in_breakable_cuda_graph()`
in subsequent commits — no helper alias.
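
A simplified picture of two explicit flags and an inline umbrella check. The flag and context-manager names follow this PR, but the module-level booleans, the argument-free context managers, and the ``needs_graph_safe_path`` callsite are assumptions for illustration.

```python
from contextlib import contextmanager

_in_piecewise_cuda_graph = False
_in_breakable_cuda_graph = False


def is_in_piecewise_cuda_graph() -> bool:
    return _in_piecewise_cuda_graph


def is_in_breakable_cuda_graph() -> bool:
    return _in_breakable_cuda_graph


@contextmanager
def enable_piecewise_cuda_graph():
    global _in_piecewise_cuda_graph
    prev, _in_piecewise_cuda_graph = _in_piecewise_cuda_graph, True
    try:
        yield
    finally:
        _in_piecewise_cuda_graph = prev  # restore even if the body raises


@contextmanager
def enable_breakable_cuda_graph():
    global _in_breakable_cuda_graph
    prev, _in_breakable_cuda_graph = _in_breakable_cuda_graph, True
    try:
        yield
    finally:
        _in_breakable_cuda_graph = prev


def needs_graph_safe_path() -> bool:
    """Hypothetical callsite that wants umbrella semantics, spelled out inline."""
    return is_in_piecewise_cuda_graph() or is_in_breakable_cuda_graph()
```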

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds three new module-level pieces, no behavior change. The legacy
runners still own the actual capture/replay bodies; this commit lands
the base classes that subsequent phases lift those bodies into.

  cuda_graph_runner/base_runner.py     BaseCudaGraphRunner ABC, freeze_gc,
                                       get_batch_sizes_to_capture
  cuda_graph_runner/buffers.py         DecodeInputBuffers, PrefillInputBuffers
                                       (dataclasses + populate_from_forward_batch
                                       + _grouped_foreach_copy_)
  cuda_graph_backend/base.py           BaseCudaGraphBackend ABC: prepare /
                                       can_run / capture_one / replay /
                                       cleanup

Buffer dataclass copies are duplicates for now — legacy.py and
piecewise_cuda_graph_runner.py still hold their originals; subsequent
phases switch their imports across and delete the duplicates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…backends

Replaces the static-method backend helpers + legacy CudaGraphRunner with
a real Decode runner backed by stateful backends.

  cuda_graph_backend/{full,breakable}.py  → DELETED (static-method shims)
  cuda_graph_backend/full_cudagraph_backend.py    NEW
  cuda_graph_backend/breakable_cudagraph_backend.py NEW
    - Each implements BaseCudaGraphBackend's 5 methods plus capture_session()
      ctx mgr.  Owns _graphs[shape], _outputs[shape], pool, and (Full only)
      memory_saver_adapter.
    - Replay path is uniform from runner POV: backend.replay(shape, fb, **kw)
      returns the captured output; static_fb is unused for these two.

  cuda_graph_runner/legacy.py  → DELETED
  cuda_graph_runner/decode_runner.py → real DecodeCudaGraphRunner(BaseCudaGraphRunner)
    - Lifts the entire CudaGraphRunner body (init, can_run, capture,
      capture_one_batch_size, recapture_if_needed, replay_prepare, replay,
      get_spec_info).
    - Backend dispatched off cuda_graph_mode["decode"]: full | breakable;
      tcpcg falls back to full with a one-shot warning.

  cuda_graph_runner/capture_mode.py    NEW — model_capture_mode + lora-variant globals
  cuda_graph_runner/pool.py            NEW — get/set_global_graph_memory_pool
                                       (used by speculative-draft runners)
  cuda_graph_runner/deepep_adapter.py  NEW — DeepEPCudaGraphRunnerAdapter
  compilation/torch_compile_decoration.py NEW — patch_model + _to_torch +
                                       set_torch_compile_config

Speculative draft runners no longer reuse `CudaGraphRunner.capture(self)` —
each inlines its own ~25-line capture loop using the relocated freeze_gc /
graph_capture / get_tensor_model_parallel_rank helpers.  Their imports
swap from `cuda_graph_runner` package re-exports to direct module paths.

`CudaGraphRunner` symbol is gone; eagle_worker / adaptive_runtime_state /
NPUGraphRunner all updated to `DecodeCudaGraphRunner`.

NPUGraphRunner override of `_create_device_graph` / `_capture_graph` is now
dead code — the new Decode runner uses backend.capture_one() and never
calls those methods.  NPU support needs a follow-up NPUCudaGraphBackend;
out of scope for this PR (CUDA H100 testing only).

Validation:
  Qwen3-8B mgsm_en N=200 (decode CG only, --disable-piecewise-cuda-graph):
  score 0.835, latency 63.58s, throughput 3,455 tok/s
  Within noise of prior commit's 0.835 baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ckend

Replaces the legacy BCG + PCG runners with a single PrefillCudaGraphRunner
that picks between BreakableCudaGraphBackend and TCPiecewiseCudaGraphBackend
off cuda_graph_mode["prefill"].

  cuda_graph_backend/tcpcg.py  → DELETED (static-method shim)
  cuda_graph_backend/tcpcg_cudagraph_backend.py  NEW
    - Stateful TCPiecewiseCudaGraphBackend(BaseCudaGraphBackend).
    - prepare(runner): builds CompilationConfig, multi-platform-op walks
      the language model into compile mode, runs a dummy warmup, calls
      install_torch_compiled, then runs the warmup_compile loop over
      every shape so torch.compile finishes JIT compilation before the
      cuda-graph capture session opens.
    - capture_session(stream): sets enable_piecewise_cuda_graph +
      set_pcg_capture_stream so the piecewise backend captures on the
      right stream.
    - capture_one(shape, fn, dummies): runs forward_fn twice (jit warm
      + cuda-graph capture) — both inside the capture_session.
    - replay(shape, static_fb, **kw): invokes the wrapped
      model_runner.model.forward (NOT language_model.model.forward —
      the former is what builds LogitsProcessorOutput; storing the
      latter in self._compiled_fn was a bug caught by the smoke run).
    - runtime_session(): enable_piecewise_cuda_graph for replay path.

  breakable_cuda_graph_runner.py  → DELETED
  piecewise_cuda_graph_runner.py  → DELETED
  cuda_graph_runner/prefill_runner.py
    - Replaces the prior factory with a real
      PrefillCudaGraphRunner(BaseCudaGraphRunner).
    - Owns PrefillInputBuffers, capture_num_tokens, attention_layers /
      moe_layers / moe_fusions snapshots, and per-bs static_* buffers
      that BCG segments read at replay (allocated regardless of active
      backend; trivial cost).
    - can_run() enforces the BCG-prefill bs<=1 constraint via
      isinstance check; rejects target_verify (tcpcg-prefill captured
      with EXTEND only); per-token-count cap.
    - replay() opens backend.runtime_session(), runs replay_prepare
      (pad/populate/build static_forward_batch), inits attn metadata,
      opens set_forward_context, calls backend.replay(num_tokens,
      static_fb), slices output to raw_num_tokens.
    - _run_warmup_forward(): the per-shape warmup hook that
      TCPiecewiseCudaGraphBackend.prepare calls during install_compile.

  cuda_graph_backend/{full,breakable,tcpcg}_cudagraph_backend.py
    - All three now do their own jit warmup (2x forward_fn) inside
      capture_one rather than expecting the runner to drive warmup
      separately. Decode runner's explicit pre-warmup loop dropped.
    - Each backend stashes self._tp_group during prepare() so the
      barrier between warmup runs works without going through the
      runner.
    - Added BaseCudaGraphBackend.runtime_session() — default no-op for
      Full; opens enable_breakable_cuda_graph for BCG; opens
      enable_piecewise_cuda_graph for tcpcg. Decode/prefill runners
      wrap their replay paths in backend.runtime_session() so model
      code reads the correct is_in_*_cuda_graph flag.
    - has_shape() added to all three backends; tcpcg's always returns
      True, since torch.compile dispatches by tensor shape internally.
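
A minimal sketch of the runtime_session() / has_shape() contract described
above. The class names, the module-level mode variable, and the
replay_sketch helper are hypothetical stand-ins rather than the PR's code;
only the split between a default no-op and a backend-specific override is
taken from the notes.

```python
from contextlib import contextmanager, nullcontext

# Hypothetical stand-in for the real is_in_*_cuda_graph state.
_active_mode = None


def active_mode():
    return _active_mode


@contextmanager
def _enable(mode):
    """Set the mode flag for the duration of the block, then restore it."""
    global _active_mode
    previous = _active_mode
    _active_mode = mode
    try:
        yield
    finally:
        _active_mode = previous


class BaseBackendSketch:
    def runtime_session(self):
        # Default: nothing to toggle (the "Full" case in the notes).
        return nullcontext()

    def has_shape(self, shape):
        raise NotImplementedError


class FullBackendSketch(BaseBackendSketch):
    def __init__(self):
        self._graphs = {}  # shape -> captured graph (empty in this sketch)

    def has_shape(self, shape):
        return shape in self._graphs


class PiecewiseBackendSketch(BaseBackendSketch):
    def runtime_session(self):
        return _enable("piecewise")

    def has_shape(self, shape):
        # Shape dispatch is assumed to happen inside the compiled module.
        return True


def replay_sketch(backend, shape):
    """Runner-side wrapper: model code inside the block sees the flag."""
    with backend.runtime_session():
        return active_mode(), backend.has_shape(shape)
```

The runner never branches on the backend type here; it only opens whatever
session the backend hands back.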

ModelRunner now imports DecodeCudaGraphRunner + PrefillCudaGraphRunner
from cuda_graph_runner package directly. set_torch_compile_config moved
to compilation/torch_compile_decoration.

Validation:
  Qwen3-8B mgsm_en N=200, default cuda_graph_mode={'decode':'full', 'prefill':'tcpcg'}:
  score 0.850, latency 64.82s, throughput 3,380 tok/s
  Above baseline 0.835.

Bug caught + fixed during smoke run: TCPiecewise.prepare initially
stored self._compiled_fn = language_model.model.forward (the inner
torch-compiled module). replay() invokes through this directly,
skipping the outer Qwen3ForCausalLM.forward wrapper that builds
LogitsProcessorOutput — so output came back as raw hidden states and
the runner's isinstance-dispatch hit the PPProxyTensors-only fallback
(AssertionError). Fix: store runner.model_runner.model.forward (the
outer wrapper). Validated against Qwen3-8B PCG path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…l,full) reject

Field/method renames for symmetry between phases:
  ModelRunner.graph_runner               → decode_cuda_graph_runner
  ModelRunner.piecewise_cuda_graph_runner → prefill_cuda_graph_runner
  ModelRunner.init_device_graphs()       → init_decode_cuda_graph()
  ModelRunner.init_piecewise_cuda_graphs() → init_prefill_cuda_graph()

Speculative-worker callers (eagle_worker / eagle_worker_v2 /
multi_layer_eagle_worker_v2 / eagle_info_v2 / adaptive_runtime_state)
all updated to access ``model_runner.decode_cuda_graph_runner`` instead
of the old ``graph_runner`` field.

(prefill, full) reject:
  - _downgrade_unsupported_combinations renamed to
    _reject_unsupported_combinations and now raises NotImplementedError
    at config-resolution time. Previous behavior silently downgraded
    to (prefill, disabled) with a warning. Per refactor goal: explicit
    over implicit; the user gets a clear error pointing to breakable
    or tcpcg.
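
A rough sketch of the reject-instead-of-downgrade behavior, assuming the
resolved mode is a plain dict keyed by phase. The function name carries a
_sketch suffix and the error text is illustrative, not the message the PR
emits.

```python
def reject_unsupported_combinations_sketch(mode):
    """Fail at config-resolution time instead of silently downgrading."""
    if mode.get("prefill") == "full":
        raise NotImplementedError(
            "(prefill, full) is not supported; "
            "use 'breakable' or 'tcpcg' for the prefill phase."
        )
    return mode
```

Raising early means a misconfigured launch stops before any graph is
captured, rather than running with prefill graphs quietly disabled.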

Shim deletion:
  - python/sglang/srt/model_executor/breakable_cuda_graph/  → DELETED
    (4-file shim re-exporting from cuda_graph_backend_utils/breakable_cuda_graph/)
  - python/sglang/srt/compilation/piecewise_context_manager.py  → DELETED
    (re-exports superseded by direct imports from new homes)
  - test/registered/breakable_cuda_graph/test_breakable_cuda_graph.py
    repointed to the real location.

Validation:
  Qwen3-8B mgsm_en N=200, default tcpcg+full:
  score 0.835, latency 64.24s, throughput 3,418 tok/s
  Within noise of Phase E+F's 0.850.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  test/registered/breakable_cuda_graph/  → test/registered/cuda_graph/breakable/
  test/registered/piecewise_cuda_graph/  → test/registered/cuda_graph/piecewise/

No content changes; pure file moves. Test imports were already
repointed to the real (non-shim) source locations in Phase G.

Note: .github/CODEOWNERS line 47 still references the deleted file
``piecewise_cuda_graph_runner.py``; left untouched to avoid touching
governance config without explicit ask.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1A — BaseCudaGraphRunner now owns shared init (model_runner ref,
device, parallel sizes, attn-tp coords, tbo plugin) and the
_pad_to_bucket helper used by both Decode and Prefill replay_prepare.
Decode/Prefill subclasses call super().__init__(model_runner) and skip
the redundant field assignments. The padding helper has an explicit
assert that documents the can_run/replay_prepare contract.

Phase 1B — capture_session(stream) is now declared on the
BaseCudaGraphBackend ABC. Every backend already implemented it; the
declaration just makes the contract explicit.

Phase 2D — tcpcg phase boundaries clarified. prepare() runs steps 1+2
(JIT activate + install_compile + compile-loop pass inside
enable_torch_compile_warmup); capture_one() runs steps 3+4 (per-shape
warmup forward + capture forward), matching Full/BCG's 2x warmup + 1x
record pattern. _run_warmup_forward is renamed to _run_dummy_forward,
since it serves both jit-activate and compile-loop callers and "warmup"
was misleading.

Phase 2L — BCG static prefill buffers (static_seq_lens, static_extend_*,
static_req_pool_indices, static_orig_seq_lens) move from
PrefillCudaGraphRunner into BreakableCudaGraphBackend. Three new
default-no-op hooks on BaseCudaGraphBackend let the runner stay
uniform: setup_prefill_state, populate_prefill_dummy_inputs,
commit_prefill_serving_inputs. The _is_breakable_backend isinstance
flag and the ad-hoc bs>1 guard both go away — the latter moves into
BCG's can_run.

Phase 2M — cuda_graph_backend.factory.{resolve_decode_backend,
resolve_prefill_backend} replace the per-runner _resolve_*_backend
functions. Phase / backend constants (PHASE_*, BACKEND_*,
ALLOWED_BACKENDS_PER_PHASE, DEFAULT_CUDA_GRAPH_MODE) move to factory.py
and are exported from cuda_graph_backend.

Phase 3G — PrefillInputBuffers gains create() factory and
populate_from_forward_batch() method, parallel to DecodeInputBuffers.
The 50-line allocation block and 50-line population block in
prefill_runner.py shrink to ~15 + ~15 lines. swa_translator is passed
as a callback so the buffers module stays free of model_runner deps.

Phase 3H — ForwardContext is now a real @dataclass: fields are declared
at class level, with no custom __init__ and none of the five set_*
setters. set_forward_context constructs it with kwargs.
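
One way the dataclass-plus-context-manager shape could look. The field
names are guesses based on the snapshots mentioned earlier in this log,
and every identifier with a _sketch / Sketch suffix is hypothetical.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class ForwardContextSketch:
    # Fields declared at class level; the dataclass generates __init__.
    forward_batch: Optional[Any] = None
    attention_layers: List[Any] = field(default_factory=list)
    moe_layers: List[Any] = field(default_factory=list)


_current_context = None


def get_forward_context_sketch():
    return _current_context


@contextmanager
def set_forward_context_sketch(**kwargs):
    """Build the context from kwargs and restore the previous one on exit."""
    global _current_context
    previous = _current_context
    _current_context = ForwardContextSketch(**kwargs)
    try:
        yield _current_context
    finally:
        _current_context = previous
```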

Phase 4O — resolve_cuda_graph_config moves out of
cuda_graph_runner/config_resolution.py (deleted) into
ServerArgs._resolve_cuda_graph_config. Single-pass parser: compat
rules now write directly to cuda_graph_mode["prefill"] = "disabled"
instead of mutating the legacy disable_piecewise_cuda_graph flag,
which is then derived once from the resolved mode. The double-parse
hack is gone.

Phase 4Q — BACKEND_FULL is no longer in the prefill allowed set.
_validate_canonical raises with the historical NotImplementedError-
style message when (prefill, full) is requested explicitly. The
separate _reject_unsupported_combinations stage is gone (validate
covers it).

Phase 4I — NPUCudaGraphBackend (mirrors FullCudaGraphBackend but uses
torch.npu.NPUGraph + torch.npu.graph + an async NPUGraph.update path
for variable seq_lens at replay) lives in
hardware_backend/npu/graph_runner/. The factory dispatches to it when
device == "npu". NPUGraphRunner trims to a thin subclass that handles
NPU-specific patch_model monkey-patch, the int32 cache_loc dtype, the
disk-backed profile context, and the async-update replay branch — the
dead _create_device_graph / _capture_graph overrides and the
self.graphs[bs].update calls (which referenced a field that moved to
self.backend._graphs in v1) are removed. Smoke import only — no NPU
hardware available on the test box.

Phase 4 P/J — torch_compile_decoration.py docstring clarified (calls
out the duplication-by-design with tcpcg's _toggle_multi_platform_ops).
cuda_graph_runner/__init__.py docstring refreshed (PrefillCudaGraphRunner
is real now, not a "factory wrapping legacy BCG/PCG").

Phase 3R — _pad_to_bucket asserts raw_size <= max(buckets) so the
upstream can_run/replay_prepare contract is local rather than implicit.
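
A small sketch of bucket padding with the local assertion, assuming the
buckets are kept as a sorted list of ints. The name pad_to_bucket_sketch
and the use of bisect are illustrative choices, not necessarily what
_pad_to_bucket does internally.

```python
import bisect


def pad_to_bucket_sketch(raw_size, buckets):
    """Return the smallest captured bucket that can hold raw_size."""
    assert buckets and list(buckets) == sorted(buckets), "buckets must be sorted"
    # can_run() is expected to have filtered oversized batches already;
    # the assert keeps that contract visible at the point of use.
    assert raw_size <= buckets[-1], (
        f"raw_size={raw_size} exceeds the largest bucket {buckets[-1]}"
    )
    return buckets[bisect.bisect_left(buckets, raw_size)]
```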

Validation (Qwen3-8B tp1, mgsm_en N=200, --num-threads 32):
  default tcpcg+full: 0.855 (v1 baseline 0.835, +0.020 ≈ 0.8σ)
  BCG (run 1):        0.810 (v1 baseline 0.840, -0.030 ≈ 1.2σ)
  BCG (run 2):        0.840 (v1 baseline 0.840, match)
Both within 1.5σ of historical baselines (1σ ≈ 0.026 at p=0.85, N=200).
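
For reference, the σ figure can be approximated with the usual binomial
standard error, assuming each of the N=200 questions is an independent
pass/fail trial. This is a back-of-the-envelope check, not the script used
for the numbers above; it gives roughly 0.025, in line with the ≈ 0.026
quoted.

```python
import math


def binomial_std_error(p, n):
    """Standard error of an accuracy estimate over n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)


sigma = binomial_std_error(0.85, 200)

# Deltas from the validation table, expressed in units of the quoted 0.026.
delta_default_in_sigma = 0.020 / 0.026
delta_bcg_run1_in_sigma = 0.030 / 0.026
```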

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Oasis-Git and others added 16 commits May 25, 2026 21:03
…chema

Use ``bs`` / ``max_bs`` as the single shape knob per phase. For prefill,
``bs`` carries the captured shape (token count for tc_piecewise; request
count for breakable) — physical interpretation depends on the backend but
the user-facing knob name is unified.

Schema now:
  decode:  {backend, max_bs, bs}
  prefill: {backend, max_bs, bs, tc_compiler}

Dropped four convenience CLI flags (the prefill ones had no other reader;
the decode ones were placeholders): --cuda-graph-max-num-tokens-decode,
--cuda-graph-max-num-tokens-prefill, --cuda-graph-num-tokens-decode,
--cuda-graph-num-tokens-prefill.

Legacy CLI aliases redirected:
  --piecewise-cuda-graph-tokens     -> --cuda-graph-bs-prefill
  --piecewise-cuda-graph-max-tokens -> --cuda-graph-max-bs-prefill

Renamed in _handle_gpu_memory_settings autotune (prefill["max_num_tokens"]
-> prefill["max_bs"], prefill["num_tokens"] -> prefill["bs"]), validator,
tc_piecewise backend, prefill runner, and model_runner.init_prefill_cuda_graph.

Also switched _validate_cuda_graph_config from logger.error+sys.exit(1)
to ``raise ValueError`` (one of three open follow-ups from the prior
commit) since sglang convention is overwhelmingly ``raise`` (~100x).
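
A sketch of a validator that raises instead of exiting, using the key sets
from the schema above. The function name and the error wording are
hypothetical; the real _validate_cuda_graph_config likely checks more than
key names.

```python
ALLOWED_KEYS_SKETCH = {
    "decode": {"backend", "max_bs", "bs"},
    "prefill": {"backend", "max_bs", "bs", "tc_compiler"},
}


def validate_cuda_graph_config_sketch(config):
    """Raise ValueError on unknown phases or unknown per-phase keys."""
    for phase, phase_config in config.items():
        if phase not in ALLOWED_KEYS_SKETCH:
            raise ValueError(f"unknown cuda graph phase: {phase!r}")
        unknown = set(phase_config) - ALLOWED_KEYS_SKETCH[phase]
        if unknown:
            raise ValueError(
                f"unknown keys for phase {phase!r}: {sorted(unknown)}"
            )
    return config
```

Raising lets Python-API callers catch the error, which a sys.exit(1)
would not allow.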

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The module's contents (Phase/Backend types, default config dict,
``check_cuda_graph_backend``, ``parse_cuda_graph_config_arg``) all use
the ``config`` naming after the prior rename; the module filename was
the last hold-out. Updates 33 importers and a handful of stale
docstring references.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… fields

These dataclass fields had no phase suffix — ambiguous about whether
they targeted decode, prefill, or both. Replaced by the explicit
``cuda_graph_max_bs_decode`` / ``cuda_graph_bs_decode`` family.

The legacy CLI flags ``--cuda-graph-max-bs`` and ``--cuda-graph-bs``
keep working via ``DeprecatedAliasStoreAction`` with ``dest=`` pointing
at the renamed fields (so existing launch scripts are unaffected; users
get a one-line deprecation warning telling them which new flag to use).
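
A sketch of the alias-with-``dest=`` pattern using plain ``argparse``. The
class below is a hypothetical stand-in for ``DeprecatedAliasStoreAction``
(its real signature is not shown in this log), and ``new_flag`` is an
invented keyword used only to build the warning text.

```python
import argparse
import warnings


class DeprecatedAliasStoreSketch(argparse.Action):
    """Store the value under the new field and warn about the old flag."""

    def __init__(self, option_strings, dest, new_flag=None, **kwargs):
        self.new_flag = new_flag
        super().__init__(option_strings, dest, **kwargs)

    def __call__(self, parser, namespace, values, option_string=None):
        warnings.warn(
            f"{option_string} is deprecated; use {self.new_flag} instead.",
            DeprecationWarning,
        )
        setattr(namespace, self.dest, values)


def build_parser_sketch():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--cuda-graph-max-bs-decode",
        type=int,
        dest="cuda_graph_max_bs_decode",
        default=None,
    )
    # Legacy spelling: same dest, so no translation step is needed later.
    parser.add_argument(
        "--cuda-graph-max-bs",
        type=int,
        dest="cuda_graph_max_bs_decode",
        action=DeprecatedAliasStoreSketch,
        new_flag="--cuda-graph-max-bs-decode",
    )
    return parser
```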

Touched: ``bench_one_batch.main`` and the ``auto_benchmark_lib``
search-space entry (their ``server_args.cuda_graph_max_bs`` reads
redirected to the new field). Dropped the legacy translation block in
``_parse_cuda_graph_config`` and rephrased the
``_handle_gpu_memory_settings`` docstring to refer to
``cuda_graph_config[decode].max_bs`` instead of the bare name.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ses via dest=

Same pattern as the cuda_graph_max_bs removal. Drop three more
internal-only fields whose sole purpose was receiving a deprecated CLI
flag, then translating into ``cuda_graph_config``:

  - piecewise_cuda_graph_max_tokens
  - piecewise_cuda_graph_tokens
  - tc_piecewise_cuda_graph_compiler

CLI flags ``--piecewise-cuda-graph-max-tokens``, ``--piecewise-cuda-graph-tokens``,
and ``--piecewise-cuda-graph-compiler`` keep working via
``DeprecatedAliasStoreAction`` with ``dest=`` pointing at the
``cuda_graph_max_bs_prefill`` / ``cuda_graph_bs_prefill`` /
``cuda_graph_tc_compiler_prefill`` fields directly. Removes the three
corresponding translation lines from ``_parse_cuda_graph_config``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…settings

``decode`` / ``prefill`` as bare local names shadow the
domain concepts. Rename to ``decode_config`` / ``prefill_config``
(mirrors the ``cuda_graph_config`` field name) — pure rename, no
behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… {decode,prefill}_cuda_graph_config

Pure rename for clarity: the local config-dict aliases inside
``_handle_gpu_memory_settings`` and friends now explicitly name what
they hold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  _generate_cuda_graph_batch_sizes        -> _generate_decode_cuda_graph_batch_sizes
  _generate_piecewise_cuda_graph_tokens   -> _generate_prefill_cuda_graph_batch_sizes

Both helpers now have an explicit phase prefix. The prefill one also
gets its parameter renamed (``max_num_tokens`` -> ``max_bs``) and its
docstring updated to reflect that prefill's ``bs`` carries the captured
token-bucket list under the unified schema.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The flag was a deprecated alias for ``--cuda-graph-config`` left over
from the cuda_graph_mode → cuda_graph_settings → cuda_graph_config
rename chain. No test passes it, and the underlying concept ("mode")
no longer matches the dict-of-settings shape. Dropping it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review followups on the cuda_graph_config block:

1. ``_parse_cuda_graph_config`` local variable ``settings`` -> ``config``
   (and ``phase_settings`` -> ``phase_config``) for naming consistency
   with the renamed field. Same in ``_validate_cuda_graph_config``.
2. ``_handle_pd_disaggregation``'s
   ``prefill_cg_disabled_by_user`` predicate missed the new
   ``cuda_graph_backend_prefill`` convenience flag. A user passing
   ``--cuda-graph-backend-prefill=disabled`` would have bypassed the
   PD-prefill escalation that forces ``disable_cuda_graph=True``. Now
   checks that flag too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	python/sglang/srt/model_executor/cuda_graph_runner/decode_runner.py
#	python/sglang/srt/speculative/eagle_info_v2.py
Three flags were ``DeprecatedAction`` (warn + discard, silent behavior
change vs. old semantics). User-friendlier to keep them functional:
auto-translate to the new flag, emit one deprecation warning per launch.

Adds a small ``DeprecatedStoreConstAction`` class (mirrors the existing
``DeprecatedStoreTrueAction`` but writes an arbitrary ``const_value``
into ``dest`` instead of ``True``).

Translation map:
  --disable-piecewise-cuda-graph -> cuda_graph_backend_prefill=disabled
  --enforce-piecewise-cuda-graph -> cuda_graph_backend_prefill=tc_piecewise
                                    (the explicit-set automatically skips
                                    the auto-disable cascade, matching the
                                    old ``--enforce`` contract)
  --enable-breakable-cuda-graph  -> cuda_graph_backend_prefill=breakable

Deleted: ``--enable-piecewise-cuda-graph`` (was an upstream no-op for
years; sole test usage in test_pcg_with_speculative_decoding_extra.py
also dropped — tc_piecewise is the default already).
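
A sketch of the store-const variant applied to the translation map above.
The class is a hypothetical approximation of ``DeprecatedStoreConstAction``;
``const_value`` follows the name used in this message, while ``new_flag``
and the parser helper are invented for illustration.

```python
import argparse
import warnings

# Legacy boolean flag -> value written into cuda_graph_backend_prefill.
LEGACY_PREFILL_FLAGS = {
    "--disable-piecewise-cuda-graph": "disabled",
    "--enforce-piecewise-cuda-graph": "tc_piecewise",
    "--enable-breakable-cuda-graph": "breakable",
}


class DeprecatedStoreConstSketch(argparse.Action):
    """Zero-argument flag that writes a fixed value and warns once."""

    def __init__(self, option_strings, dest, const_value=None,
                 new_flag=None, **kwargs):
        self.const_value = const_value
        self.new_flag = new_flag
        super().__init__(option_strings, dest, nargs=0, **kwargs)

    def __call__(self, parser, namespace, values, option_string=None):
        warnings.warn(
            f"{option_string} is deprecated; use "
            f"{self.new_flag}={self.const_value} instead.",
            DeprecationWarning,
        )
        setattr(namespace, self.dest, self.const_value)


def build_prefill_parser_sketch():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--cuda-graph-backend-prefill",
        dest="cuda_graph_backend_prefill",
        default=None,
    )
    for flag, value in LEGACY_PREFILL_FLAGS.items():
        parser.add_argument(
            flag,
            dest="cuda_graph_backend_prefill",
            action=DeprecatedStoreConstSketch,
            const_value=value,
            new_flag="--cuda-graph-backend-prefill",
        )
    return parser
```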

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move all CG-related CLI flags into a single contiguous block, grouped
by role:

  1. ``--cuda-graph-config`` (canonical JSON entry, listed first)
  2. NEW convenience flags: ``--cuda-graph-backend-{decode,prefill}``,
     ``--cuda-graph-max-bs-{decode,prefill}``,
     ``--cuda-graph-bs-{decode,prefill}``,
     ``--cuda-graph-tc-compiler-prefill``
  3. CG debug / profiling flags: ``--disable-cuda-graph-padding``,
     ``--enable-profile-cuda-graph``, ``--enable-cudagraph-gc``,
     ``--debug-cuda-graph``
  4. ALL deprecated aliases grouped at the bottom of the CG block:
     ``--cuda-graph-max-bs``, ``--cuda-graph-bs``,
     ``--disable-cuda-graph``, ``--enable-breakable-cuda-graph``,
     ``--{prefill,decode}-cuda-graph-backend``,
     ``--disable-{prefill,decode}-cuda-graph``,
     ``--disable-piecewise-cuda-graph``, ``--enforce-piecewise-cuda-graph``,
     ``--piecewise-cuda-graph-tokens``, ``--piecewise-cuda-graph-compiler``,
     ``--piecewise-cuda-graph-max-tokens``

Pulls the orphaned piecewise-deprecated cluster (previously living
between torch-compile flags) into the consolidated CG section so the
four sub-sections (config / convenience / debug / deprecated) all sit
together. ``--torch-compile-max-bs``, which was sandwiched in the
middle of the piecewise orphan cluster, stays where it belongs (with
the other torch-compile flags).

Pure reordering — no behavior change. All 17 CG-related flags still
parse correctly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update the section header above the 13 deprecated CG CLI aliases from
"kept for backward compat" to "Remove them later", flagging the block
as technical debt to retire.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ons is set

Files that have ``from __future__ import annotations`` evaluate
annotations lazily as strings; the quoted forward-ref form
(``"ForwardBatch"``) is then redundant. Strip the quotes in those
files for the in-annotation patterns (``: "ForwardBatch"``,
``-> "ForwardBatch"``, parametrized generics).

Touches 11 files: cuda_graph_backend/*.py, cuda_graph_runner/*.py,
cuda_graph_backend_utils/.../context_manager.py, model_runner,
mega_moe, trtllm_mla_backend, tokenspeed_mla_backend, and the NPU
backend that mirrors them.

Files that do NOT have ``from __future__ import annotations`` (e.g.
``mem_cache/sparsity/*``) retain the quoted form — quotes are still
required there to avoid runtime NameError at TYPE_CHECKING-only import.
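
A self-contained illustration of why the quotes become redundant. The
module source is compiled from a string only so the ``__future__`` import
can sit at the top of its own module; ``replay`` is a made-up function, and
``ForwardBatch`` is deliberately never imported or defined.

```python
# Source of a tiny module that opts in to string (postponed) annotations.
SOURCE = '''
from __future__ import annotations

def replay(batch: ForwardBatch) -> ForwardBatch:
    return batch
'''

namespace = {}
exec(compile(SOURCE, "<sketch>", "exec"), namespace)

# The unquoted annotations were stored as plain strings, so defining the
# function did not require ForwardBatch to exist at runtime.
annotations = namespace["replay"].__annotations__
```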

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…decode in test/runners

The removed-from-dataclass fields were also passed as Python kwargs to
``Engine(...)`` / ``SRTRunner(...)`` in many test files, which bypass
the CLI alias and go straight to ``ServerArgs.__init__`` — failing
with ``TypeError: ServerArgs.__init__() got an unexpected keyword
argument 'cuda_graph_max_bs'``.

Rename the kwarg uses in:
  - python/sglang/test/runners.py (signature + pass-through)
  - python/sglang/test/doc_patch.py
  - 10 test/registered/... files (model_loading, lora, quant, spec, rl)

Also flip one stale CLI flag literal in test_lora_update.py from
``--cuda-graph-max-bs`` to ``--cuda-graph-max-bs-decode`` (the deprecated
alias still works, but the test was the only place still emitting the
ambiguous flag from generated code).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the lora label May 26, 2026
@ch-wan ch-wan self-assigned this May 26, 2026
Oasis-Git and others added 3 commits May 26, 2026 21:36
…ise fields

``test_piecewise_cuda_graph_support_1_gpu.py:test_embedding`` passed
``enforce_piecewise_cuda_graph=True`` and ``disable_piecewise_cuda_graph=True``
as Python kwargs to ``Engine(...)`` — both fields were removed from
``ServerArgs`` when the CG knobs got consolidated. The CLI aliases keep
working, but kwarg-style construction bypasses CLI translation and
hits ``TypeError: ServerArgs.__init__() got an unexpected keyword
argument 'enforce_piecewise_cuda_graph'``.

Translate the two kwargs to the new field:
  enforce_piecewise_cuda_graph=True   -> cuda_graph_backend_prefill="tc_piecewise"
  disable_piecewise_cuda_graph=True   -> cuda_graph_backend_prefill="disabled"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…evel compat

The CLI deprecation paths
(``DeprecatedStoreConstAction`` / ``DeprecatedAliasStoreAction``) only
intercept ``argparse`` input. Python-API callers that do
``Engine(enforce_piecewise_cuda_graph=True, ...)`` go straight to
``ServerArgs(**kwargs)`` and hit
``TypeError: ServerArgs.__init__() got an unexpected keyword argument``.

Mirror the existing pattern (``disable_cuda_graph`` already lives on the
dataclass + translates inside ``_parse_cuda_graph_config``) for the
eight other deprecated names:

  enforce_piecewise_cuda_graph     -> cuda_graph_config[prefill].backend=tc_piecewise
  disable_piecewise_cuda_graph     -> cuda_graph_config[prefill].backend=disabled
  enable_breakable_cuda_graph      -> cuda_graph_config[prefill].backend=breakable
  cuda_graph_max_bs                -> cuda_graph_config[decode].max_bs
  cuda_graph_bs                    -> cuda_graph_config[decode].bs
  piecewise_cuda_graph_max_tokens  -> cuda_graph_config[prefill].max_bs
  piecewise_cuda_graph_tokens      -> cuda_graph_config[prefill].bs
  piecewise_cuda_graph_compiler    -> cuda_graph_config[prefill].tc_compiler

Both CLI and Python paths now arrive at the same per-phase config. The
CLI flag declarations keep their existing ``dest=`` redirects (still
write to the new fields directly); the restored legacy fields are the
kwarg-only entry point.
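
A reduced sketch of the kwarg-level compat idea, assuming the translation
runs once after construction. ``ServerArgsSketch`` models only a few of the
eight legacy names, the use of ``__post_init__`` is an assumption (the
message places the real logic in ``_parse_cuda_graph_config``), and the
precedence between conflicting legacy flags is invented.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


def _empty_config():
    return {"decode": {}, "prefill": {}}


@dataclass
class ServerArgsSketch:
    cuda_graph_config: Dict[str, Dict[str, Any]] = field(
        default_factory=_empty_config
    )
    # Legacy names kept on the dataclass as the kwarg-only entry point.
    enforce_piecewise_cuda_graph: bool = False
    disable_piecewise_cuda_graph: bool = False
    enable_breakable_cuda_graph: bool = False
    cuda_graph_max_bs: Optional[int] = None
    piecewise_cuda_graph_max_tokens: Optional[int] = None

    def __post_init__(self):
        decode = self.cuda_graph_config.setdefault("decode", {})
        prefill = self.cuda_graph_config.setdefault("prefill", {})
        # Assumed precedence for the sketch: disable > enforce > breakable.
        if self.enable_breakable_cuda_graph:
            prefill["backend"] = "breakable"
        if self.enforce_piecewise_cuda_graph:
            prefill["backend"] = "tc_piecewise"
        if self.disable_piecewise_cuda_graph:
            prefill["backend"] = "disabled"
        if self.cuda_graph_max_bs is not None:
            decode["max_bs"] = self.cuda_graph_max_bs
        if self.piecewise_cuda_graph_max_tokens is not None:
            prefill["max_bs"] = self.piecewise_cuda_graph_max_tokens
```

With this shape, ``ServerArgsSketch(enforce_piecewise_cuda_graph=True)``
no longer raises ``TypeError`` and ends up with the same per-phase config
a CLI user would get.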

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Oasis-Git
Collaborator Author

/rerun-failed-ci

# Conflicts:
#	python/sglang/srt/server_args.py
@Oasis-Git
Collaborator Author

/rerun-failed-ci

Oasis-Git and others added 5 commits May 27, 2026 22:08
Same shape as the earlier DeepseekV2/V3 registry-drop bug. ``deepseek_v4.py``
imports ``compile_in_capture_mode`` from ``sglang.srt.model_executor.cuda_graph_runner``,
which existed on main in the monolithic ``cuda_graph_runner.py`` but got
lost when cg-refactor split that module into a package.

Result: ``deepseek_v4.py`` fails at import time, gets silently dropped
from ``ModelRegistry``, loader falls back to the HF transformers path,
which doesn't know ``_DeepseekV4ConfigAlias`` and raises
``ValueError: Unrecognized configuration class``.

Restore the helper in ``cuda_graph_runner_utils/capture_mode.py`` next
to the rest of the capture-mode state, and re-export from
``cuda_graph_runner_utils.__init__`` and ``cuda_graph_runner.__init__``
so existing import sites keep working.

Verified: ``models/deepseek_v4`` imports cleanly,
``DeepseekV4ForCausalLM`` is back in ``ModelRegistry``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	test/registered/cp/test_qwen3_30b.py
# Conflicts:
#	python/sglang/srt/model_executor/cuda_graph_runner/decode_runner.py
#	python/sglang/srt/model_executor/piecewise_cuda_graph_runner.py
# Conflicts:
#	python/sglang/srt/model_executor/cuda_graph_runner/decode_runner.py
…obs in prefill replay

Upstream PR sgl-project#26551 ("Remove dead fields and always-False plumbing
across SB / FB / LogitsMetadata") deleted these two fields from
``ForwardBatch``. ``PrefillCudaGraphRunner.replay_prepare`` was still
passing them when rebuilding the static ForwardBatch, causing
``AttributeError: 'ForwardBatch' object has no attribute
'temp_scaled_logprobs'`` at first replay (which surfaced through the
piecewise-CG error message in the CI log).

Drop both kwargs; keep ``temperature`` and ``top_p`` which remain on
the dataclass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Oasis-Git
Collaborator Author

/rerun-failed-ci

4 participants