
Add new LHS datapoint for SGL-GB200-FP8-1k1k#148

Merged
kyleliang-nv merged 2 commits into main from kylliang/update-sgl-gb200-fp8-1k1k
Feb 5, 2026

Conversation

@kyleliang-nv
Collaborator

@kyleliang-nv kyleliang-nv commented Feb 5, 2026

Summary by CodeRabbit

  • Chores
    • Added a new runtime configuration enabling distributed inference across multiple frontends with GPU/node resource allocations. Introduces separate, tunable prefill and decode profiles with extensive performance knobs for parallelism, caching, memory, and execution modes, disaggregation backend selection, and benchmark settings for load testing, concurrency, and request rate.

@coderabbitai
Contributor

coderabbitai bot commented Feb 5, 2026

Caution

Review failed

The pull request is closed.

📝 Walkthrough

Adds a new YAML config recipes/gb200-fp8/1k1k/ultra-tpt.yaml defining a Dynamo multi-frontend GB200 FP8 setup with separate prefill and decode sglang_config blocks, disaggregation (nixl) backend, DeepEP/MoE and CUDA-graph options, resource allocations, and sa-bench benchmarking settings.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| New Recipe Configuration: `recipes/gb200-fp8/1k1k/ultra-tpt.yaml` | Adds a comprehensive YAML for gb200-fp8-1k1k-ultra-tpt: multi-frontend Dynamo topology (3 extra frontends), model image/path and GPU/node allocations, separate prefill and decode env and sglang_config subsections, disaggregation-mode fields (prefill/decode) with nixl backend, DeepEP/MoE and CUDA-graph flags, memory-fraction and token sizing parameters, and an sa-bench benchmark block (isl/osl/concurrency/req_rate). |
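To make the walkthrough concrete, here is a rough sketch of the recipe's overall shape. Only the `dynamo`, `frontend`, and `model` fields are quoted verbatim from the diff excerpts in this review; the `prefill`/`decode` and `benchmark` keys and values below are illustrative assumptions based on the summary, not the actual file contents.

```yaml
# Sketch of recipes/gb200-fp8/1k1k/ultra-tpt.yaml (partly hypothetical)
dynamo:
  version: 0.8.1

frontend:
  type: dynamo
  enable_multiple_frontends: true
  num_additional_frontends: 3
  nginx_container: nginx

model:
  path: "dsr1-fp8"
  container: "lmsysorg/sglang:v0.5.8-cu130"
  precision: "fp8"

# Separate prefill/decode profiles (key names below are assumptions)
prefill:
  sglang_config:
    disaggregation-mode: prefill      # nixl disaggregation backend
decode:
  sglang_config:
    disaggregation-mode: decode

benchmark:
  type: sa-bench                       # isl/osl/concurrency/req_rate knobs
```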

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant DynamoFE as Dynamo Frontends (3)
    participant Prefill as Prefill Worker
    participant Decode as Decode Worker
    participant Nixl as Disaggregation Backend (nixl)
    participant Model as Model Replica
    participant Bench as sa-bench

    Client->>DynamoFE: request (inference)
    DynamoFE->>Prefill: route prefill stage (sglang_config.prefill)
    Prefill->>Nixl: dispatch prefill ops (disaggregation: prefill)
    Nixl->>Model: fetch/compute prefill tensors
    Model-->>Nixl: prefill results (KV cache)
    Nixl-->>Prefill: return populated cache
    Prefill-->>DynamoFE: prefill complete

    DynamoFE->>Decode: start decode stage (sglang_config.decode)
    Decode->>Nixl: dispatch decode ops (disaggregation: decode)
    Nixl->>Model: decode token steps (CUDA-graph / DeepEP / MoE)
    Model-->>Nixl: decode token outputs
    Nixl-->>Decode: token results
    Decode-->>DynamoFE: final response
    Bench->>DynamoFE: benchmark traffic (req_rate / concurrencies)
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • ishandhanani
  • gracehonv

Poem

🐰 In YAML fields I hop and sing,
Prefill, decode — I tune each string.
DeepEP dances, CUDA graphs gleam,
Nixl threads pathways for the team.
Ultra-TPT — a hopping dream!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding a new configuration file (ultra-tpt.yaml) that serves as a datapoint for SGL-GB200-FP8-1k1k benchmarking. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |





@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@recipes/gb200-fp8/1k1k/ultra-tpt.yaml`:
- Line 163: The YAML uses the deprecated key prefill-round-robin-balance; remove
that key and instead add router-based load‑balancing settings under the Model
Gateway / Router section—for example set prefill_policy: round_robin and specify
decode_policy (or other documented router policies) in the router configuration
for the model gateway/router entry; update any related routing blocks that
reference prefill-round-robin-balance to use the new prefill_policy and
decode_policy keys so the gateway/router enforces round‑robin prefill behavior.
- Around line 6-18: The YAML uses a third-party SGLang image in model.container
which may mismatch the declared dynamo.version (0.8.1); update model.container
to the official NVIDIA Dynamo SGLang runtime that matches Dynamo v0.8.1 (for
example nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.1) so model.container aligns
with dynamo.version and the SGLang runtime bundled with that Dynamo release,
keeping model.path and precision unchanged.
🧹 Nitpick comments (1)
recipes/gb200-fp8/1k1k/ultra-tpt.yaml (1)

77-77: Minor: Trailing whitespace.

Line 77 has trailing whitespace after ep-size: 8.

🧹 Proposed fix

```diff
-      ep-size: 8 
+      ep-size: 8
```

Comment on lines +6 to +18
```yaml
dynamo:
  version: 0.8.1

frontend:
  type: dynamo
  enable_multiple_frontends: true
  num_additional_frontends: 3
  nginx_container: nginx

model:
  path: "dsr1-fp8"
  container: "lmsysorg/sglang:v0.5.8-cu130"
  precision: "fp8"
```

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, check if the file exists and read the relevant lines
head -20 recipes/gb200-fp8/1k1k/ultra-tpt.yaml

Repository: ishandhanani/srt-slurm

Length of output: 442


🏁 Script executed:

# Check Docker Hub for the sglang container image tag
curl -s "https://hub.docker.com/v2/repositories/lmsysorg/sglang/tags/?page_size=100&name=v0.5.8-cu130" | jq '.results[] | select(.name=="v0.5.8-cu130") | {name, tag_status}'



🌐 Web query:

dynamo version 0.8.1 release

💡 Result:

“Dynamo 0.8.1” most commonly refers to Autodesk Dynamo (for Revit):

  • Release date: July 3, 2015. [1]
  • Highlights: new hardware-accelerated graphics pipeline + colored geometry (e.g., Display.ByGeometryColor), Groups for graph organization, more robust Excel read/write, list/function updates (e.g., List.Clean, improved IF lacing), UI search/browse improvements, localization, and Revit node additions (e.g., FamilyInstance.SetRotation). [1][2]

If instead you meant NVIDIA Dynamo (AI inference platform), there is also a v0.8.1 line, but it’s a different product. [3][4]

Sources: [1] [2] [3] [4]


🏁 Script executed:

# Search for dynamo references in the repository
rg "dynamo" --type yaml --type markdown --type txt -A 2 -B 2 | head -100



🌐 Web query:

sglang dynamo 0.8.1 GB200

💡 Result:

NVIDIA Dynamo v0.8.1 includes an SGLang backend/runtime that explicitly lists GB200 support = ✅ (and WideEP full support on H100s/GB200). [1]

For Dynamo 0.8.1 (release date Jan 23, 2026), NVIDIA publishes these relevant artifacts: [2]

  • Container (recommended): nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.1 (CUDA 12.9), which bundles SGLang v0.5.6.post2. [2]
  • Experimental CUDA 13 container: nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.1-cuda13 (CUDA 13.0), also SGLang v0.5.6.post2. [2]
  • PyPI wheel: ai-dynamo==0.8.1. [2]

If you meant SGLang (pip) directly on B200/GB200 without Dynamo, there have been compatibility issues discussed (e.g., PyTorch CUDA capability support for B200, sm_100). [3]


🏁 Script executed:

# Check if there's documentation or README about dynamo
fd -e md -e rst -e txt | xargs grep -l "dynamo" 2>/dev/null | head -10



Use official NVIDIA Dynamo container for version alignment.

The container image lmsysorg/sglang:v0.5.8-cu130 exists and is active on Docker Hub, and Dynamo v0.8.1 (released Jan 23, 2026) includes explicit GB200 support. However, consider using the official NVIDIA container nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.1 instead, which aligns with Dynamo v0.8.1 and bundles the tested SGLang v0.5.6.post2. Using a third-party SGLang image may introduce version skew between your Dynamo and SGLang installations.

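The container swap the review suggests could look like the following sketch. The image tag comes from the Dynamo 0.8.1 release artifacts quoted in the web-query result above; the surrounding keys mirror the snippet from the diff, with `path` and `precision` left unchanged as the review recommends.

```yaml
model:
  path: "dsr1-fp8"
  container: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.1"  # bundles SGLang v0.5.6.post2
  precision: "fp8"
```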

```yaml
ep-dispatch-algorithm: "static"
moe-dense-tp-size: 1
enable-dp-lm-head: true
prefill-round-robin-balance: true
```

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, locate and examine the file to see the context around line 163
find . -name "ultra-tpt.yaml" -type f



🌐 Web query:

SGLang prefill-round-robin-balance configuration flag documentation

💡 Result:

--prefill-round-robin-balance is documented in SGLang’s Server Arguments page under Deprecated arguments, with the note that it “is deprecated now” (no further behavior/details are provided in the docs). [1]

If your goal is round-robin balancing for prefill in Prefill/Decode (PD) disaggregation, the current documented way is to set the router’s load-balancing policy (e.g., prefill_policy: round_robin, decode_policy: ...) via the Model Gateway / Router configuration and its round_robin policy. [2]

Sources:
[1] SGLang docs — Server Arguments (Deprecated arguments list) https://docs.sglang.io/advanced_features/server_arguments.html
[2] SGLang docs — Model Gateway / Router (Load Balancing Policies, PD config knobs) https://docs.sglang.io/advanced_features/router.html


🏁 Script executed:

cat -n ./recipes/gb200-fp8/1k1k/ultra-tpt.yaml | sed -n '150,175p'



Replace deprecated prefill-round-robin-balance with router-based load-balancing configuration.

The flag prefill-round-robin-balance is deprecated in SGLang. The modern approach is to configure load-balancing policies via the router configuration (e.g., prefill_policy: round_robin, decode_policy: ...) in the Model Gateway / Router section instead. Remove this flag and migrate to the documented router-based policy configuration.

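The migration the review describes could be sketched as follows. The `prefill_policy`/`decode_policy` names come from the SGLang router docs cited in the query result above; the name and placement of the router block within this recipe's schema is an assumption, since the recipe format isn't shown here.

```yaml
# Sketch: drop the deprecated flag and move load balancing to the router.
# (router block placement is hypothetical for this recipe schema)
router:
  prefill_policy: round_robin
  decode_policy: round_robin   # or another documented policy
```

The deprecated `prefill-round-robin-balance: true` line would then be removed from the prefill `sglang_config` block.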

@kyleliang-nv kyleliang-nv merged commit d228792 into main Feb 5, 2026
3 of 5 checks passed
kyleliang-nv added a commit that referenced this pull request Feb 6, 2026
…p8-1k1k

Add new LHS datapoint for SGL-GB200-FP8-1k1k