Update GB200-FP4 1k/8k configs#103

Merged
ishandhanani merged 1 commit into main from kylliang/update_gb200_fp4_1k8k_config
Jan 27, 2026

Conversation

Collaborator

@kyleliang-nv kyleliang-nv commented Jan 27, 2026

Summary by CodeRabbit

  • New Features

    • Added support for multiple frontend instances to enhance system scalability.
  • Improvements

    • Increased context window capacity from 9,200 to 10,000 tokens.
    • Updated inference engine to a newer stable release for improved performance and reliability.
    • Optimized backend processing configuration.


Contributor

coderabbitai bot commented Jan 27, 2026

📝 Walkthrough

Three GB200 FP4 recipe configuration files are updated with dynamo frontend settings, upgraded container image to v0.5.5.post2, increased context-length to 10000, and modified backend configurations across prefill and decode sections.

Changes

Cohort: GB200 FP4 1k8k Recipe Configuration Updates
Files: recipies/gb200-fp4/1k8k/low-latency.yaml, recipies/gb200-fp4/1k8k/max-tpt.yaml, recipies/gb200-fp4/1k8k/mid-curve.yaml
Summary: Added dynamo frontend block (v0.7.0) with multiple frontend support; upgraded model container from the nightly-dev variant to lmsysorg/sglang:v0.5.5.post2; increased context-length from 9200 to 10000 in prefill and decode sections; removed disaggregation-transfer-backend: nixl settings; removed or modified fp4-gemm-backend configurations; added the SGLANG_FLASHINFER_FP4_GEMM_BACKEND environment variable in applicable sections.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • Add GB200 DSR1-FP4 1k/8k recipies #85: Modifies the same GB200 FP4 1k8k recipe files with overlapping changes to container image, prefill/decode configuration blocks, and backend settings.
  • Fix config for 1k/8k #94: Updates the same gb200-fp4/1k8k YAML files with consistent modifications to model container image and FP4 GEMM backend environment settings.
  • files for 8k/1k-fp4 #54: Applies similar config-level changes across gb200-fp4 recipes including container upgrade, dynamo/frontend block additions, context-length increases, and disaggregation backend flag removals.

Suggested reviewers

  • ishandhanani

Poem

🐰 Hops along with configs bright,
Dynamo frontends now in sight,
Containers upgraded, contexts grow,
Backend tweaks make pipelines flow!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately summarizes the main change: updating GB200-FP4 configuration files for 1k/8k token contexts with specific version upgrades and parameter adjustments.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; skipping the check.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@recipies/gb200-fp4/1k8k/low-latency.yaml`:
- Around lines 71-74: The config sets context-length: 10000 but max-total-tokens: 8192, which can truncate or reject prefill. Raise max-total-tokens to at least 10000 (or lower context-length) so the two limits align.

In `@recipies/gb200-fp4/1k8k/max-tpt.yaml`:
- Around line 3-13: Remove the invalid SGLang server flag
enable_multiple_frontends (and related num_additional_frontends usage) from the
frontend block so the dynamo frontend invocation doesn't pass unrecognized
arguments, and change the model.container value from
"lmsysorg/sglang:v0.5.5.post2" to the Dynamo 0.7.0–pinned version
"lmsysorg/sglang:v0.5.3.post4" (or another version you have verified compatible)
so the model container matches dynamo.version 0.7.0.

In `@recipies/gb200-fp4/1k8k/mid-curve.yaml`:
- Around lines 3-13: Update the model container and remove unsupported multi-frontend flags: change model.container from "lmsysorg/sglang:v0.5.5.post2" to "lmsysorg/sglang:v0.5.3.post4", and remove the unsupported frontend keys enable_multiple_frontends and num_additional_frontends; for multi-tenant/multi-frontend serving, configure and route traffic through the SGLang Model Gateway (Router) instead.
🧹 Nitpick comments (1)
recipies/gb200-fp4/1k8k/mid-curve.yaml (1)

65-65: Use the --fp4-gemm-backend CLI flag instead of the deprecated environment variable.

The SGLANG_FLASHINFER_FP4_GEMM_BACKEND environment variable exists in v0.5.5.post2 but is deprecated and replaced by the --fp4-gemm-backend server CLI flag. Note that mm_fp4 is a GEMM backend for quantized linear operations and applies to both prefill and decode phases (not decode-only).
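A sketch of what that swap might look like in the recipe. The surrounding section layout is an assumption (not taken from the file); the flag and backend names are the ones this comment cites:

```yaml
# Before: backend selected via the (deprecated) environment variable
# env:
#   SGLANG_FLASHINFER_FP4_GEMM_BACKEND: mm_fp4
# After: pass the server CLI flag instead. Note it applies to both
# prefill and decode, since it selects the GEMM backend used for
# quantized linear operations.
fp4-gemm-backend: mm_fp4
```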

Comment on lines +71 to 74
      context-length: 10000
      mem-fraction-static: 0.95
      max-total-tokens: 8192
      chunked-prefill-size: 8192

⚠️ Potential issue | 🟠 Major

Align max-total-tokens with the new 10k context length.

context-length: 10000 exceeds max-total-tokens: 8192, which can truncate or reject longer contexts in prefill. Please set max-total-tokens ≥ 10000 (or reduce context-length).

🛠️ Proposed fix
-      max-total-tokens: 8192
+      max-total-tokens: 10000
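Mismatches like this are easy to lint for mechanically. A minimal stdlib-only sketch, assuming each recipe has been loaded into plain dicts/lists (e.g. via yaml.safe_load) and uses the context-length / max-total-tokens key names shown in this comment:

```python
def find_mismatches(node, path=""):
    """Recursively collect config sections where context-length
    exceeds max-total-tokens, returning (section-path, ctx, tot) tuples."""
    issues = []
    if isinstance(node, dict):
        ctx = node.get("context-length")
        tot = node.get("max-total-tokens")
        if isinstance(ctx, int) and isinstance(tot, int) and ctx > tot:
            issues.append((path or "<root>", ctx, tot))
        for key, child in node.items():
            issues += find_mismatches(child, f"{path}/{key}" if path else str(key))
    elif isinstance(node, list):
        for i, child in enumerate(node):
            issues += find_mismatches(child, f"{path}[{i}]")
    return issues
```

For example, find_mismatches({"prefill": {"context-length": 10000, "max-total-tokens": 8192}}) returns [("prefill", 10000, 8192)].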

Comment on lines +3 to +13
dynamo:
  version: 0.7.0

frontend:
  type: dynamo
  enable_multiple_frontends: true
  num_additional_frontends: 9

model:
  path: "dsr1"
-  container: "lmsysorg/sglang:nightly-dev-cu13-20260121-1e309030"
+  container: "lmsysorg/sglang:v0.5.5.post2"

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

sglang v0.5.5 release notes features documentation

💡 Result:

SGLang v0.5.5 — release notes (highlights / features)

From the official GitHub release “Release v0.5.5” (published Nov 6, 2025), the listed highlights are:

  • Day-0 support for Kimi-K2-Thinking and MiniMax-M2 models. [1]
  • Video + image generation (diffusion) support, linked to the LMSYS post announcing SGLang Diffusion. [1][2]
  • Blackwell kernel optimizations and MoE runner backend refactor. [1]
  • Overlap spec and prefill CUDA graph support expanded to more models. [1]
  • Gateway section: “Gateway Changes (48 commits)” and changelog range gateway-v0.2.1...gateway-v0.2.2. [1]

Documentation (official)

  • Main docs home: https://docs.sglang.io/ (also mirrored at https://docs.sglang.ai/). [3]
  • Install guide: https://docs.sglang.io/get_started/install.html. [4]
  • Diffusion models (SGLang Diffusion) docs: https://docs.sglang.io/supported_models/diffusion_models.html. [5]
  • Model Gateway / Router docs: https://docs.sglang.io/advanced_features/router.html and https://docs.sglang.io/advanced_features/sgl_model_gateway.html. [6][7]

Primary release-notes link

  • GitHub releases page (contains Release v0.5.5): https://github.com/sgl-project/sglang/releases. [1]

Sources: [1] GitHub Releases (v0.5.5) (github.com), [2] LMSYS diffusion announcement (lmsys.org), [3] Docs home (docs.sglang.io), [4] Install guide (docs.sglang.io), [5] Diffusion models docs (docs.sglang.io), [6] Model Gateway/Router docs (docs.sglang.io), [7] Model Gateway page (docs.sglang.io).


🌐 Web query:

Dynamo 0.7.0 sglang compatibility enable_multiple_frontends

💡 Result:

For NVIDIA Dynamo v0.7.0, the supported/pinned SGLang version is 0.5.3.post4 (per the Dynamo 0.7.0 support matrix). If you’re hitting an error around enable_multiple_frontends, it’s almost always a version/flag mismatch (i.e., a flag expected by one version isn’t recognized by the other). [1]

Key points:

  • Dynamo’s SGLang worker (python -m dynamo.sglang) uses SGLang’s native argument parser, so you should only pass valid SGLang server args for the SGLang version Dynamo is using. [2]
  • In the published SGLang server arguments documentation, there is no documented enable_multiple_frontends / --enable-multiple-frontends server flag, so if something is trying to pass it, it’s likely coming from a mismatched wrapper/script or a different SGLang fork/version. [3]
  • Fix: ensure you install and run matching Dynamo/SGLang versions (e.g., install ai-dynamo==0.7.0 with the sglang extra, which is intended to align dependencies), and remove/stop passing enable_multiple_frontends. [1][2]

Sources
[1] Dynamo 0.7.0 Support Matrix (SGLang build dependency version) https://docs.nvidia.com/dynamo/archive/0.7.0/reference/support-matrix.html
[2] “Running SGLang with Dynamo” (native SGLang args passthrough) https://docs.nvidia.com/dynamo/dev/backends/sglang/README.html
[3] SGLang “Server Arguments” reference https://docs.sglang.io/advanced_features/server_arguments.html


🌐 Web query:

sglang lmsysorg container v0.5.5 supported features

💡 Result:

lmsysorg/sglang:v0.5.5 (SGLang v0.5.5) supports, per the v0.5.5 release highlights/changes:

  • Day-0 model support: Kimi-K2-Thinking and MiniMax-M2. [1]
  • Diffusion (image + video generation): “SGLang Diffusion” acceleration, with OpenAI-compatible API/CLI/Python entry points and support for major open-source video/image diffusion models (e.g., Wan, Hunyuan, Qwen-Image/Qwen-Image-Edit, Flux). [1][2]
  • NVIDIA Blackwell-focused improvements: kernel optimizations plus MoE runner backend refactor. [1]
  • Broader overlap/prefill improvements: “overlap spec” and prefill CUDA graph support for more models. [1]
  • Notable items in “What’s Changed”: decoupled quantization from vLLM dependency (GGUF SRT), direct video inference, Llama 4 + TensorRT-LLM MHA enablement, and multiple MoE runner/perf updates (incl. flashinfer_trtllm backend). [1]

Sources: [1] SGLang GitHub release v0.5.5, [2] LMSYS blog “SGLang Diffusion: Accelerating Video and Image Generation”.


Remove enable_multiple_frontends flag and pin sglang container to compatible version.

The configuration has two critical issues:

  1. enable_multiple_frontends: true is not a valid SGLang server argument—it will cause the server to fail with an unrecognized flag error. This flag does not exist in SGLang's documented server arguments.

  2. Container version mismatch: Dynamo 0.7.0 officially pins SGLang to v0.5.3.post4, but the configuration specifies lmsysorg/sglang:v0.5.5.post2. This version mismatch can cause compatibility issues. Use the pinned version or verify that v0.5.5 is compatible with your Dynamo 0.7.0 deployment.
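Under the reviewer's recommendation, the affected block would look roughly like this. This is a sketch, not a verified drop-in: the container tag follows the Dynamo 0.7.0 support matrix cited above, and the key names mirror the diff hunk in this comment:

```yaml
dynamo:
  version: 0.7.0

frontend:
  type: dynamo
  # enable_multiple_frontends / num_additional_frontends removed:
  # they are not recognized SGLang server arguments. For multi-frontend
  # serving, route traffic through the SGLang Model Gateway (Router).

model:
  path: "dsr1"
  container: "lmsysorg/sglang:v0.5.3.post4"  # version pinned by Dynamo 0.7.0
```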


Comment on lines +3 to +13
dynamo:
  version: 0.7.0

frontend:
  type: dynamo
  enable_multiple_frontends: true
  num_additional_frontends: 9

model:
  path: "dsr1"
-  container: "lmsysorg/sglang:nightly-dev-cu13-20260121-1e309030"
+  container: "lmsysorg/sglang:v0.5.5.post2"

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

sglang v0.5.5.post2 release notes

💡 Result:

SGLang v0.5.5.post2 was tagged/released on Nov 12, 2025 and is primarily a patch release to “include the critical fix” from PR #12915. [1]

What changed (the “critical fix”)

PR #12915 modifies FlashInfer communication fusion defaults in python/sglang/srt/layers/flashinfer_comm_fusion.py: [2]

  • Lowered max_token_num default from 16384 → 2048 in:
    • ensure_workspace_initialized(...)
    • flashinfer_allreduce_residual_rmsnorm(...) [2]
  • Removed the fallback that disabled the fused path when input_tensor.shape[0] > max_token_num (it no longer returns None, None in that case). [2]

This PR is titled “Fix spec decoding acc length for dpsk-r1-fp4 tp8”, indicating the patch targets speculative decoding accuracy/length behavior for that configuration. [2]

Sources: GitHub release tag page for v0.5.5.post2 [1] and the referenced fix PR diff [2].


🌐 Web query:

Dynamo 0.7.0 sglang compatibility

💡 Result:

For Dynamo v0.7.0, the officially tested / build dependency SGLang version is 0.5.3.post4. [1]

Practical guidance from Dynamo’s SGLang backend docs:

  • Install via ai-dynamo[sglang] (recommended; Dynamo notes it “comes bundled with a stable sglang version”). [2]
  • If you install SGLang separately for development, Dynamo’s SGLang backend docs indicate you should use SGLang >= 0.5.3.post2 (but Dynamo’s 0.7.0 support matrix specifies 0.5.3.post4 as the known-compatible build version). [2] [1]

Feature notes for the Dynamo↔SGLang integration:

  • Core features like disaggregated serving and KV-aware routing are supported. [2]
  • KV Block Manager (KVBM) is not supported for SGLang in this release (planned). [3]

Sources: Dynamo v0.7.0 support matrix; Dynamo SGLang backend docs; Dynamo v0.7.0 release notes. [1] [2] [3]


🌐 Web query:

sglang enable_multiple_frontends support version

💡 Result:

enable_multiple_frontends is not a supported SGLang server argument in any currently documented SGLang release (it does not appear in the official “Server Arguments” reference / sglang.launch_server --help surface). [1]

If what you want is “multiple frontends / multiple entrypoints / multi-tenant traffic”, SGLang’s documented solution is to run workers behind the SGLang Model Gateway (Router) rather than a single launch_server flag. [2]

Sources: [1] [2]


Update container to use sglang 0.5.3.post4 and configure multi-frontend serving correctly.

Dynamo 0.7.0 officially supports sglang 0.5.3.post4, not v0.5.5.post2. Additionally, enable_multiple_frontends and num_additional_frontends are not supported SGLang server arguments. For multi-tenant serving, use the SGLang Model Gateway (Router) instead of these flags.


@ishandhanani ishandhanani merged commit 8e347cf into main Jan 27, 2026
4 of 5 checks passed
karen-sy pushed a commit that referenced this pull request Jan 28, 2026
@coderabbitai coderabbitai bot mentioned this pull request Mar 18, 2026

2 participants