Update GB200-FP4 1k/8k configs#103

Merged
ishandhanani merged 1 commit into main from kylliang/update_gb200_fp4_1k8k_config
Jan 27, 2026

Conversation

Collaborator

@kyleliang-nv kyleliang-nv commented Jan 27, 2026

Summary by CodeRabbit

  • New Features

    • Added support for multiple frontend instances to enhance system scalability.
  • Improvements

    • Increased context window capacity from 9,200 to 10,000 tokens.
    • Updated inference engine to a newer stable release for improved performance and reliability.
    • Optimized backend processing configuration.


Contributor

coderabbitai bot commented Jan 27, 2026

📝 Walkthrough

Three GB200 FP4 recipe configuration files are updated with dynamo frontend settings, upgraded container image to v0.5.5.post2, increased context-length to 10000, and modified backend configurations across prefill and decode sections.

Changes

Cohort: GB200 FP4 1k8k Recipe Configuration Updates
Files: recipies/gb200-fp4/1k8k/low-latency.yaml, recipies/gb200-fp4/1k8k/max-tpt.yaml, recipies/gb200-fp4/1k8k/mid-curve.yaml
Summary: Added dynamo frontend block (v0.7.0) with multiple frontend support; upgraded model container from the nightly-dev variant to lmsysorg/sglang:v0.5.5.post2; increased context-length from 9200 to 10000 in prefill and decode sections; removed disaggregation-transfer-backend: nixl settings; removed or modified fp4-gemm-backend configurations; added the SGLANG_FLASHINFER_FP4_GEMM_BACKEND environment variable in applicable sections.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • Add GB200 DSR1-FP4 1k/8k recipies #85: Modifies the same GB200 FP4 1k8k recipe files with overlapping changes to container image, prefill/decode configuration blocks, and backend settings.
  • Fix config for 1k/8k #94: Updates the same gb200-fp4/1k8k YAML files with consistent modifications to model container image and FP4 GEMM backend environment settings.
  • files for 8k/1k-fp4 #54: Applies similar config-level changes across gb200-fp4 recipes including container upgrade, dynamo/frontend block additions, context-length increases, and disaggregation backend flag removals.

Suggested reviewers

  • ishandhanani

Poem

🐰 Hops along with configs bright,
Dynamo frontends now in sight,
Containers upgraded, contexts grow,
Backend tweaks make pipelines flow!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately summarizes the main change: updating GB200-FP4 configuration files for 1k/8k token contexts with specific version upgrades and parameter adjustments.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; skipping the check.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@recipies/gb200-fp4/1k8k/low-latency.yaml`:
- Around lines 71-74: The config sets context-length: 10000 but max-total-tokens: 8192, which can truncate or reject prefill. Raise max-total-tokens to at least 10000 (or lower context-length) so the two limits align.

In `@recipies/gb200-fp4/1k8k/max-tpt.yaml`:
- Around line 3-13: Remove the invalid SGLang server flag
enable_multiple_frontends (and related num_additional_frontends usage) from the
frontend block so the dynamo frontend invocation doesn't pass unrecognized
arguments, and change the model.container value from
"lmsysorg/sglang:v0.5.5.post2" to the Dynamo 0.7.0–pinned version
"lmsysorg/sglang:v0.5.3.post4" (or another version you have verified compatible)
so the model container matches dynamo.version 0.7.0.

In `@recipies/gb200-fp4/1k8k/mid-curve.yaml`:
- Around lines 3-13: Update the model container and remove unsupported multi-frontend flags: change model.container from "lmsysorg/sglang:v0.5.5.post2" to "lmsysorg/sglang:v0.5.3.post4", and remove the unsupported frontend keys enable_multiple_frontends and num_additional_frontends; for multi-tenant/multi-frontend serving, configure and route traffic through the SGLang Model Gateway (Router) instead.
🧹 Nitpick comments (1)
recipies/gb200-fp4/1k8k/mid-curve.yaml (1)

65-65: Use the --fp4-gemm-backend CLI flag instead of the deprecated environment variable.

The SGLANG_FLASHINFER_FP4_GEMM_BACKEND environment variable exists in v0.5.5.post2 but is deprecated and replaced by the --fp4-gemm-backend server CLI flag. Note that mm_fp4 is a GEMM backend for quantized linear operations and applies to both prefill and decode phases (not decode-only).
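A sketch of what that swap might look like in the recipe. The surrounding section layout is an assumption (not taken from the file); the flag and backend names are the ones this comment cites:

```yaml
# Before: backend selected via the (deprecated) environment variable
# env:
#   SGLANG_FLASHINFER_FP4_GEMM_BACKEND: mm_fp4
# After: pass the server CLI flag instead. Note it applies to both
# prefill and decode, since it selects the GEMM backend used for
# quantized linear operations.
fp4-gemm-backend: mm_fp4
```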

Comment on lines +71 to 74
      context-length: 10000
      mem-fraction-static: 0.95
      max-total-tokens: 8192
      chunked-prefill-size: 8192

⚠️ Potential issue | 🟠 Major

Align max-total-tokens with the new 10k context length.

context-length: 10000 exceeds max-total-tokens: 8192, which can truncate or reject longer contexts in prefill. Please set max-total-tokens ≥ 10000 (or reduce context-length).

🛠️ Proposed fix
-      max-total-tokens: 8192
+      max-total-tokens: 10000
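Mismatches like this are easy to lint for mechanically. A minimal stdlib-only sketch, assuming each recipe has been loaded into plain dicts/lists (e.g. via yaml.safe_load) and uses the context-length / max-total-tokens key names shown in this comment:

```python
def find_mismatches(node, path=""):
    """Recursively collect config sections where context-length
    exceeds max-total-tokens, returning (section-path, ctx, tot) tuples."""
    issues = []
    if isinstance(node, dict):
        ctx = node.get("context-length")
        tot = node.get("max-total-tokens")
        if isinstance(ctx, int) and isinstance(tot, int) and ctx > tot:
            issues.append((path or "<root>", ctx, tot))
        for key, child in node.items():
            issues += find_mismatches(child, f"{path}/{key}" if path else str(key))
    elif isinstance(node, list):
        for i, child in enumerate(node):
            issues += find_mismatches(child, f"{path}[{i}]")
    return issues
```

For example, find_mismatches({"prefill": {"context-length": 10000, "max-total-tokens": 8192}}) returns [("prefill", 10000, 8192)].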

Comment on lines +3 to +13
dynamo:
  version: 0.7.0

frontend:
  type: dynamo
  enable_multiple_frontends: true
  num_additional_frontends: 9

model:
  path: "dsr1"
-  container: "lmsysorg/sglang:nightly-dev-cu13-20260121-1e309030"
+  container: "lmsysorg/sglang:v0.5.5.post2"

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

sglang v0.5.5 release notes features documentation

💡 Result:

SGLang v0.5.5 — release notes (highlights / features)

From the official GitHub release “Release v0.5.5” (published Nov 6, 2025), the listed highlights are:

  • Day-0 support for Kimi-K2-Thinking and MiniMax-M2 models. [1]
  • Video + image generation (diffusion) support, linked to the LMSYS post announcing SGLang Diffusion. [1][2]
  • Blackwell kernel optimizations and MoE runner backend refactor. [1]
  • Overlap spec and prefill CUDA graph support expanded to more models. [1]
  • Gateway section: “Gateway Changes (48 commits)” and changelog range gateway-v0.2.1...gateway-v0.2.2. [1]

Documentation (official)

  • Main docs home: https://docs.sglang.io/ (also mirrored at https://docs.sglang.ai/). [3]
  • Install guide: https://docs.sglang.io/get_started/install.html. [4]
  • Diffusion models (SGLang Diffusion) docs: https://docs.sglang.io/supported_models/diffusion_models.html. [5]
  • Model Gateway / Router docs: https://docs.sglang.io/advanced_features/router.html and https://docs.sglang.io/advanced_features/sgl_model_gateway.html. [6][7]

Primary release-notes link

  • GitHub releases page (contains Release v0.5.5): https://github.com/sgl-project/sglang/releases. [1]

Sources: [1] GitHub Releases (v0.5.5) (github.com), [2] LMSYS diffusion announcement (lmsys.org), [3] Docs home (docs.sglang.io), [4] Install guide (docs.sglang.io), [5] Diffusion models docs (docs.sglang.io), [6] Model Gateway/Router docs (docs.sglang.io), [7] Model Gateway page (docs.sglang.io).


🌐 Web query:

Dynamo 0.7.0 sglang compatibility enable_multiple_frontends

💡 Result:

For NVIDIA Dynamo v0.7.0, the supported/pinned SGLang version is 0.5.3.post4 (per the Dynamo 0.7.0 support matrix). If you’re hitting an error around enable_multiple_frontends, it’s almost always a version/flag mismatch (i.e., a flag expected by one version isn’t recognized by the other). [1]

Key points:

  • Dynamo’s SGLang worker (python -m dynamo.sglang) uses SGLang’s native argument parser, so you should only pass valid SGLang server args for the SGLang version Dynamo is using. [2]
  • In the published SGLang server arguments documentation, there is no documented enable_multiple_frontends / --enable-multiple-frontends server flag, so if something is trying to pass it, it’s likely coming from a mismatched wrapper/script or a different SGLang fork/version. [3]
  • Fix: ensure you install and run matching Dynamo/SGLang versions (e.g., install ai-dynamo==0.7.0 with the sglang extra, which is intended to align dependencies), and remove/stop passing enable_multiple_frontends. [1][2]

Sources
[1] Dynamo 0.7.0 Support Matrix (SGLang build dependency version) https://docs.nvidia.com/dynamo/archive/0.7.0/reference/support-matrix.html
[2] “Running SGLang with Dynamo” (native SGLang args passthrough) https://docs.nvidia.com/dynamo/dev/backends/sglang/README.html
[3] SGLang “Server Arguments” reference https://docs.sglang.io/advanced_features/server_arguments.html


🌐 Web query:

sglang lmsysorg container v0.5.5 supported features

💡 Result:

lmsysorg/sglang:v0.5.5 (SGLang v0.5.5) supports, per the v0.5.5 release highlights/changes:

  • Day-0 model support: Kimi-K2-Thinking and MiniMax-M2. [1]
  • Diffusion (image + video generation): “SGLang Diffusion” acceleration, with OpenAI-compatible API/CLI/Python entry points and support for major open-source video/image diffusion models (e.g., Wan, Hunyuan, Qwen-Image/Qwen-Image-Edit, Flux). [1][2]
  • NVIDIA Blackwell-focused improvements: kernel optimizations plus MoE runner backend refactor. [1]
  • Broader overlap/prefill improvements: “overlap spec” and prefill CUDA graph support for more models. [1]
  • Notable items in “What’s Changed”: decoupled quantization from vLLM dependency (GGUF SRT), direct video inference, Llama 4 + TensorRT-LLM MHA enablement, and multiple MoE runner/perf updates (incl. flashinfer_trtllm backend). [1]

Sources: [1] SGLang GitHub release v0.5.5, [2] LMSYS blog “SGLang Diffusion: Accelerating Video and Image Generation”.


Remove enable_multiple_frontends flag and pin sglang container to compatible version.

The configuration has two critical issues:

  1. enable_multiple_frontends: true is not a valid SGLang server argument—it will cause the server to fail with an unrecognized flag error. This flag does not exist in SGLang's documented server arguments.

  2. Container version mismatch: Dynamo 0.7.0 officially pins SGLang to v0.5.3.post4, but the configuration specifies lmsysorg/sglang:v0.5.5.post2. This version mismatch can cause compatibility issues. Use the pinned version or verify that v0.5.5 is compatible with your Dynamo 0.7.0 deployment.
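Under the reviewer's recommendation, the affected block would look roughly like this. This is a sketch, not a verified drop-in: the container tag follows the Dynamo 0.7.0 support matrix cited above, and the key names mirror the diff hunk in this comment:

```yaml
dynamo:
  version: 0.7.0

frontend:
  type: dynamo
  # enable_multiple_frontends / num_additional_frontends removed:
  # they are not recognized SGLang server arguments. For multi-frontend
  # serving, route traffic through the SGLang Model Gateway (Router).

model:
  path: "dsr1"
  container: "lmsysorg/sglang:v0.5.3.post4"  # version pinned by Dynamo 0.7.0
```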


Comment on lines +3 to +13
dynamo:
  version: 0.7.0

frontend:
  type: dynamo
  enable_multiple_frontends: true
  num_additional_frontends: 9

model:
  path: "dsr1"
-  container: "lmsysorg/sglang:nightly-dev-cu13-20260121-1e309030"
+  container: "lmsysorg/sglang:v0.5.5.post2"

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

sglang v0.5.5.post2 release notes

💡 Result:

SGLang v0.5.5.post2 was tagged/released on Nov 12, 2025 and is primarily a patch release to “include the critical fix” from PR #12915. [1]

What changed (the “critical fix”)

PR #12915 modifies FlashInfer communication fusion defaults in python/sglang/srt/layers/flashinfer_comm_fusion.py: [2]

  • Lowered max_token_num default from 16384 → 2048 in:
    • ensure_workspace_initialized(...)
    • flashinfer_allreduce_residual_rmsnorm(...) [2]
  • Removed the fallback that disabled the fused path when input_tensor.shape[0] > max_token_num (it no longer returns None, None in that case). [2]

This PR is titled “Fix spec decoding acc length for dpsk-r1-fp4 tp8”, indicating the patch targets speculative decoding accuracy/length behavior for that configuration. [2]

Sources: GitHub release tag page for v0.5.5.post2 [1] and the referenced fix PR diff [2].


🌐 Web query:

Dynamo 0.7.0 sglang compatibility

💡 Result:

For Dynamo v0.7.0, the officially tested / build dependency SGLang version is 0.5.3.post4. [1]

Practical guidance from Dynamo’s SGLang backend docs:

  • Install via ai-dynamo[sglang] (recommended; Dynamo notes it “comes bundled with a stable sglang version”). [2]
  • If you install SGLang separately for development, Dynamo’s SGLang backend docs indicate you should use SGLang >= 0.5.3.post2 (but Dynamo’s 0.7.0 support matrix specifies 0.5.3.post4 as the known-compatible build version). [2] [1]

Feature notes for the Dynamo↔SGLang integration:

  • Core features like disaggregated serving and KV-aware routing are supported. [2]
  • KV Block Manager (KVBM) is not supported for SGLang in this release (planned). [3]

Sources: Dynamo v0.7.0 support matrix; Dynamo SGLang backend docs; Dynamo v0.7.0 release notes. [1] [2] [3]


🌐 Web query:

sglang enable_multiple_frontends support version

💡 Result:

enable_multiple_frontends is not a supported SGLang server argument in any currently documented SGLang release (it does not appear in the official “Server Arguments” reference / sglang.launch_server --help surface). [1]

If what you want is “multiple frontends / multiple entrypoints / multi-tenant traffic”, SGLang’s documented solution is to run workers behind the SGLang Model Gateway (Router) rather than a single launch_server flag. [2]

Sources: [1] [2]


Update container to use sglang 0.5.3.post4 and configure multi-frontend serving correctly.

Dynamo 0.7.0 officially supports sglang 0.5.3.post4, not v0.5.5.post2. Additionally, enable_multiple_frontends and num_additional_frontends are not supported SGLang server arguments. For multi-tenant serving, use the SGLang Model Gateway (Router) instead of these flags.


@ishandhanani ishandhanani merged commit 8e347cf into main Jan 27, 2026
4 of 5 checks passed
karen-sy pushed a commit that referenced this pull request Jan 28, 2026
@coderabbitai coderabbitai bot mentioned this pull request Mar 18, 2026

2 participants