
Add new LHS datapoint for SGL-GB200-FP8-1k1k#148

Merged
kyleliang-nv merged 2 commits into main from kylliang/update-sgl-gb200-fp8-1k1k
Feb 5, 2026

Conversation

@kyleliang-nv
Collaborator

@kyleliang-nv kyleliang-nv commented Feb 5, 2026

Summary by CodeRabbit

  • Chores
    • Added a new runtime configuration enabling distributed inference across multiple frontends with GPU/node resource allocations. Introduces separate, tunable prefill and decode profiles with extensive performance knobs for parallelism, caching, memory, and execution modes, disaggregation backend selection, and benchmark settings for load testing, concurrency, and request rate.

@coderabbitai
Contributor

coderabbitai bot commented Feb 5, 2026

Caution

Review failed

The pull request is closed.

📝 Walkthrough

Adds a new YAML config recipes/gb200-fp8/1k1k/ultra-tpt.yaml defining a Dynamo multi-frontend GB200 FP8 setup with separate prefill and decode sglang_config blocks, disaggregation (nixl) backend, DeepEP/MoE and CUDA-graph options, resource allocations, and sa-bench benchmarking settings.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| New Recipe Configuration: `recipes/gb200-fp8/1k1k/ultra-tpt.yaml` | Adds a comprehensive YAML for gb200-fp8-1k1k-ultra-tpt: multi-frontend Dynamo topology (3 extra frontends), model image/path and GPU/node allocations, separate prefill and decode env and sglang_config subsections, disaggregation-mode fields (prefill/decode) with nixl backend, DeepEP/MoE and CUDA-graph flags, memory-fraction and token sizing parameters, and an sa-bench benchmark block (isl/osl/concurrency/req_rate). |
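To make the walkthrough concrete, here is a rough sketch of the recipe's overall shape. Only the `dynamo`, `frontend`, and `model` fields are quoted verbatim from the diff excerpts in this review; the `prefill`/`decode` and `benchmark` keys and values below are illustrative assumptions based on the summary, not the actual file contents.

```yaml
# Sketch of recipes/gb200-fp8/1k1k/ultra-tpt.yaml (partly hypothetical)
dynamo:
  version: 0.8.1

frontend:
  type: dynamo
  enable_multiple_frontends: true
  num_additional_frontends: 3
  nginx_container: nginx

model:
  path: "dsr1-fp8"
  container: "lmsysorg/sglang:v0.5.8-cu130"
  precision: "fp8"

# Separate prefill/decode profiles (key names below are assumptions)
prefill:
  sglang_config:
    disaggregation-mode: prefill      # nixl disaggregation backend
decode:
  sglang_config:
    disaggregation-mode: decode

benchmark:
  type: sa-bench                       # isl/osl/concurrency/req_rate knobs
```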

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant DynamoFE as Dynamo Frontends (3)
    participant Prefill as Prefill Worker
    participant Decode as Decode Worker
    participant Nixl as Disaggregation Backend (nixl)
    participant Model as Model Replica
    participant Bench as sa-bench

    Client->>DynamoFE: request (inference)
    DynamoFE->>Prefill: route prefill stage (sglang_config.prefill)
    Prefill->>Nixl: dispatch prefill ops (disaggregation: prefill)
    Nixl->>Model: fetch/compute prefill tensors
    Model-->>Nixl: prefill results (KV cache)
    Nixl-->>Prefill: return populated cache
    Prefill-->>DynamoFE: prefill complete

    DynamoFE->>Decode: start decode stage (sglang_config.decode)
    Decode->>Nixl: dispatch decode ops (disaggregation: decode)
    Nixl->>Model: decode token steps (CUDA-graph / DeepEP / MoE)
    Model-->>Nixl: decode token outputs
    Nixl-->>Decode: token results
    Decode-->>DynamoFE: final response
    Bench->>DynamoFE: benchmark traffic (req_rate / concurrencies)
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • ishandhanani
  • gracehonv

Poem

🐰 In YAML fields I hop and sing,
Prefill, decode — I tune each string.
DeepEP dances, CUDA graphs gleam,
Nixl threads pathways for the team.
Ultra-TPT — a hopping dream!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding a new configuration file (ultra-tpt.yaml) that serves as a datapoint for SGL-GB200-FP8-1k1k benchmarking. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |





@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@recipes/gb200-fp8/1k1k/ultra-tpt.yaml`:
- Line 163: The YAML uses the deprecated key prefill-round-robin-balance; remove
that key and instead add router-based load‑balancing settings under the Model
Gateway / Router section—for example set prefill_policy: round_robin and specify
decode_policy (or other documented router policies) in the router configuration
for the model gateway/router entry; update any related routing blocks that
reference prefill-round-robin-balance to use the new prefill_policy and
decode_policy keys so the gateway/router enforces round‑robin prefill behavior.
- Around line 6-18: The YAML uses a third-party SGLang image in model.container
which may mismatch the declared dynamo.version (0.8.1); update model.container
to the official NVIDIA Dynamo SGLang runtime that matches Dynamo v0.8.1 (for
example nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.1) so model.container aligns
with dynamo.version and the SGLang runtime bundled with that Dynamo release,
keeping model.path and precision unchanged.
🧹 Nitpick comments (1)
recipes/gb200-fp8/1k1k/ultra-tpt.yaml (1)

77-77: Minor: Trailing whitespace.

Line 77 has trailing whitespace after ep-size: 8.

🧹 Proposed fix

```diff
-      ep-size: 8 
+      ep-size: 8
```

Comment on lines +6 to +18
```yaml
dynamo:
  version: 0.8.1

frontend:
  type: dynamo
  enable_multiple_frontends: true
  num_additional_frontends: 3
  nginx_container: nginx

model:
  path: "dsr1-fp8"
  container: "lmsysorg/sglang:v0.5.8-cu130"
  precision: "fp8"
```

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, check if the file exists and read the relevant lines
head -20 recipes/gb200-fp8/1k1k/ultra-tpt.yaml

Repository: ishandhanani/srt-slurm

Length of output: 442


🏁 Script executed:

# Check Docker Hub for the sglang container image tag
curl -s "https://hub.docker.com/v2/repositories/lmsysorg/sglang/tags/?page_size=100&name=v0.5.8-cu130" | jq '.results[] | select(.name=="v0.5.8-cu130") | {name, tag_status}'



🌐 Web query:

dynamo version 0.8.1 release

💡 Result:

“Dynamo 0.8.1” most commonly refers to Autodesk Dynamo (for Revit):

  • Release date: July 3, 2015. [1]
  • Highlights: new hardware-accelerated graphics pipeline + colored geometry (e.g., Display.ByGeometryColor), Groups for graph organization, more robust Excel read/write, list/function updates (e.g., List.Clean, improved IF lacing), UI search/browse improvements, localization, and Revit node additions (e.g., FamilyInstance.SetRotation). [1][2]

If instead you meant NVIDIA Dynamo (AI inference platform), there is also a v0.8.1 line, but it’s a different product. [3][4]

Sources: [1] [2] [3] [4]


🏁 Script executed:

# Search for dynamo references in the repository
rg "dynamo" --type yaml --type markdown --type txt -A 2 -B 2 | head -100



🌐 Web query:

sglang dynamo 0.8.1 GB200

💡 Result:

NVIDIA Dynamo v0.8.1 includes an SGLang backend/runtime that explicitly lists GB200 support = ✅ (and WideEP full support on H100s/GB200). [1]

For Dynamo 0.8.1 (release date Jan 23, 2026), NVIDIA publishes these relevant artifacts: [2]

  • Container (recommended): nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.1 (CUDA 12.9), which bundles SGLang v0.5.6.post2. [2]
  • Experimental CUDA 13 container: nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.1-cuda13 (CUDA 13.0), also SGLang v0.5.6.post2. [2]
  • PyPI wheel: ai-dynamo==0.8.1. [2]

If you meant SGLang (pip) directly on B200/GB200 without Dynamo, there have been compatibility issues discussed (e.g., PyTorch CUDA capability support for B200, sm_100). [3]


🏁 Script executed:

# Check if there's documentation or README about dynamo
fd -e md -e rst -e txt | xargs grep -l "dynamo" 2>/dev/null | head -10



Use official NVIDIA Dynamo container for version alignment.

The container image lmsysorg/sglang:v0.5.8-cu130 exists and is active on Docker Hub, and Dynamo v0.8.1 (released Jan 23, 2026) includes explicit GB200 support. However, consider using the official NVIDIA container nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.1 instead, which aligns with Dynamo v0.8.1 and bundles the tested SGLang v0.5.6.post2. Using a third-party SGLang image may introduce version skew between your Dynamo and SGLang installations.

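The container swap the review suggests could look like the following sketch. The image tag comes from the Dynamo 0.8.1 release artifacts quoted in the web-query result above; the surrounding keys mirror the snippet from the diff, with `path` and `precision` left unchanged as the review recommends.

```yaml
model:
  path: "dsr1-fp8"
  container: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.1"  # bundles SGLang v0.5.6.post2
  precision: "fp8"
```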

```yaml
ep-dispatch-algorithm: "static"
moe-dense-tp-size: 1
enable-dp-lm-head: true
prefill-round-robin-balance: true
```

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, locate and examine the file to see the context around line 163
find . -name "ultra-tpt.yaml" -type f



🌐 Web query:

SGLang prefill-round-robin-balance configuration flag documentation

💡 Result:

--prefill-round-robin-balance is documented in SGLang’s Server Arguments page under Deprecated arguments, with the note that it “is deprecated now” (no further behavior/details are provided in the docs). [1]

If your goal is round-robin balancing for prefill in Prefill/Decode (PD) disaggregation, the current documented way is to set the router’s load-balancing policy (e.g., prefill_policy: round_robin, decode_policy: ...) via the Model Gateway / Router configuration and its round_robin policy. [2]

Sources:
[1] SGLang docs — Server Arguments (Deprecated arguments list) https://docs.sglang.io/advanced_features/server_arguments.html
[2] SGLang docs — Model Gateway / Router (Load Balancing Policies, PD config knobs) https://docs.sglang.io/advanced_features/router.html


🏁 Script executed:

cat -n ./recipes/gb200-fp8/1k1k/ultra-tpt.yaml | sed -n '150,175p'



Replace deprecated prefill-round-robin-balance with router-based load-balancing configuration.

The flag prefill-round-robin-balance is deprecated in SGLang. The modern approach is to configure load-balancing policies via the router configuration (e.g., prefill_policy: round_robin, decode_policy: ...) in the Model Gateway / Router section instead. Remove this flag and migrate to the documented router-based policy configuration.

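The migration the review describes could be sketched as follows. The `prefill_policy`/`decode_policy` names come from the SGLang router docs cited in the query result above; the name and placement of the router block within this recipe's schema is an assumption, since the recipe format isn't shown here.

```yaml
# Sketch: drop the deprecated flag and move load balancing to the router.
# (router block placement is hypothetical for this recipe schema)
router:
  prefill_policy: round_robin
  decode_policy: round_robin   # or another documented policy
```

The deprecated `prefill-round-robin-balance: true` line would then be removed from the prefill `sglang_config` block.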

@kyleliang-nv kyleliang-nv merged commit d228792 into main Feb 5, 2026
3 of 5 checks passed
kyleliang-nv added a commit that referenced this pull request Feb 6, 2026
…p8-1k1k

Add new LHS datapoint for SGL-GB200-FP8-1k1k