Update GB200-FP4 1k/8k configs #103
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,10 +1,16 @@ | ||
| # 4P1D, with 12 Decode Nodes. Uses single batch overlap | ||
|
|
||
| name: "gb200-fp4-max-tpt" | ||
|
|
||
| dynamo: | ||
| version: 0.7.0 | ||
|
|
||
| frontend: | ||
| type: dynamo | ||
| enable_multiple_frontends: true | ||
| num_additional_frontends: 9 | ||
|
|
||
| model: | ||
| path: "dsr1" | ||
| container: "lmsysorg/sglang:nightly-dev-cu13-20260121-1e309030" | ||
| container: "lmsysorg/sglang:v0.5.5.post2" | ||
|
Comment on lines +3 to +13 (Contributor):
🧩 Analysis chain
🌐 Web query:
💡 Result: SGLang v0.5.5 release notes (highlights / features). From the official GitHub release “Release v0.5.5” (published Nov 6, 2025), the listed highlights are:
Sources: [1] GitHub Releases (v0.5.5) (github.com), [2] LMSYS diffusion announcement (lmsys.org), [3] Docs home (docs.sglang.io), [4] Install guide (docs.sglang.io), [5] Diffusion models docs (docs.sglang.io), [6] Model Gateway/Router docs (docs.sglang.io), [7] Model Gateway page (docs.sglang.io).
🌐 Web query:
💡 Result: For NVIDIA Dynamo v0.7.0, the supported/pinned SGLang version is
Key points:
Sources
🌐 Web query:
💡 Result:
Sources: [1] SGLang GitHub release v0.5.5, [2] LMSYS blog “SGLang Diffusion: Accelerating Video and Image Generation”.
Remove
The configuration has two critical issues:
|
||
| precision: "fp4" | ||
|
|
||
| resources: | ||
|
|
@@ -56,13 +62,13 @@ backend: | |
| SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "1024" | ||
| SGLANG_MOE_NVFP4_DISPATCH: "1" | ||
| SGLANG_CUTEDSL_MOE_NVFP4_DISPATCH: "1" # Used in older sglang versions | ||
| SGLANG_FLASHINFER_FP4_GEMM_BACKEND: "cutlass" | ||
|
|
||
| sglang_config: | ||
| prefill: | ||
| # Model configuration | ||
| served-model-name: "deepseek-ai/DeepSeek-R1" | ||
| trust-remote-code: true | ||
| disaggregation-transfer-backend: nixl | ||
|
|
||
| # KV cache and attention | ||
| kv-cache-dtype: "fp8_e4m3" | ||
|
|
@@ -80,7 +86,7 @@ backend: | |
| stream-interval: 50 | ||
| decode-log-interval: 1000 | ||
| watchdog-timeout: 1000000 | ||
| context-length: 9200 | ||
| context-length: 10000 | ||
| disable-shared-experts-fusion: true | ||
| eplb-algorithm: "deepseek" | ||
| disaggregation-bootstrap-port: 30001 | ||
|
|
@@ -112,7 +118,6 @@ backend: | |
| # Model configuration | ||
| served-model-name: "deepseek-ai/DeepSeek-R1" | ||
| trust-remote-code: true | ||
| disaggregation-transfer-backend: nixl | ||
|
|
||
| # KV cache and attention | ||
| kv-cache-dtype: "fp8_e4m3" | ||
|
|
@@ -130,7 +135,7 @@ backend: | |
| stream-interval: 50 | ||
| decode-log-interval: 1000 | ||
| watchdog-timeout: 1000000 | ||
| context-length: 9200 | ||
| context-length: 10000 | ||
| disable-shared-experts-fusion: true | ||
| eplb-algorithm: "deepseek" | ||
| disaggregation-bootstrap-port: 30001 | ||
|
|
@@ -228,7 +233,6 @@ backend: | |
| enable-dp-lm-head: true | ||
| prefill-round-robin-balance: true | ||
| enable-dp-attention: true | ||
| fp4-gemm-backend: "flashinfer_cutlass" | ||
|
|
||
| # Parallelism | ||
| tp-size: 48 | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,11 +1,16 @@ | ||
| # 4P1D, with 8 Decode Nodes. Does not use single batch overlap but allows us to currently drive higher | ||
| # per gpu throughput | ||
| name: "gb200-fp4-mid-curve" | ||
|
|
||
| name: "gb200-fp4-max-tpt-2" | ||
| dynamo: | ||
| version: 0.7.0 | ||
|
|
||
| frontend: | ||
| type: dynamo | ||
| enable_multiple_frontends: true | ||
| num_additional_frontends: 9 | ||
|
|
||
| model: | ||
| path: "dsr1" | ||
| container: "lmsysorg/sglang:nightly-dev-cu13-20260121-1e309030" | ||
| container: "lmsysorg/sglang:v0.5.5.post2" | ||
|
Comment on lines +3 to +13 (Contributor):
🧩 Analysis chain
🌐 Web query:
💡 Result: SGLang v0.5.5.post2 was tagged/released on Nov 12, 2025 and is primarily a patch release to “include the critical fix” from PR
What changed (the “critical fix”)
PR #12915 modifies FlashInfer communication fusion defaults in
This PR is titled “Fix spec decoding acc length for dpsk-r1-fp4 tp8”, indicating the patch targets speculative decoding accuracy/length behavior for that configuration. [2]
Sources: GitHub release tag page for v0.5.5.post2 [1] and the referenced fix PR diff [2].
🌐 Web query:
💡 Result: For Dynamo v0.7.0, the officially tested / build dependency SGLang version is
Practical guidance from Dynamo’s SGLang backend docs:
Feature notes for the Dynamo↔SGLang integration:
Sources: Dynamo v0.7.0 support matrix; Dynamo SGLang backend docs; Dynamo v0.7.0 release notes. [1] [2] [3]
🌐 Web query:
💡 Result:
If what you want is “multiple frontends / multiple entrypoints / multi-tenant traffic”, SGLang’s documented solution is to run workers behind the SGLang Model Gateway (Router) rather than a single
Sources: [1] [2]
Update container to use sglang 0.5.3.post4 and configure multi-frontend serving correctly.
Dynamo 0.7.0 officially supports sglang 0.5.3.post4, not v0.5.5.post2. Additionally,
|
||
| precision: "fp4" | ||
|
|
||
| resources: | ||
|
|
@@ -57,6 +62,7 @@ backend: | |
| SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "1024" | ||
| SGLANG_MOE_NVFP4_DISPATCH: "1" | ||
| SGLANG_CUTEDSL_MOE_NVFP4_DISPATCH: "1" # Used in older sglang versions | ||
| SGLANG_FLASHINFER_FP4_GEMM_BACKEND: "cutlass" | ||
|
|
||
| sglang_config: | ||
| prefill: | ||
|
|
@@ -67,7 +73,6 @@ backend: | |
| # KV cache and attention | ||
| kv-cache-dtype: "fp8_e4m3" | ||
| attention-backend: "trtllm_mla" | ||
| disaggregation-transfer-backend: nixl | ||
|
|
||
| # Quantization | ||
| quantization: "modelopt_fp4" | ||
|
|
@@ -81,7 +86,7 @@ backend: | |
| stream-interval: 50 | ||
| decode-log-interval: 1000 | ||
| watchdog-timeout: 1000000 | ||
| context-length: 9200 | ||
| context-length: 10000 | ||
| disable-shared-experts-fusion: true | ||
| eplb-algorithm: "deepseek" | ||
| disaggregation-bootstrap-port: 30001 | ||
|
|
@@ -117,7 +122,6 @@ backend: | |
| # KV cache and attention | ||
| kv-cache-dtype: "fp8_e4m3" | ||
| attention-backend: "trtllm_mla" | ||
| disaggregation-transfer-backend: nixl | ||
|
|
||
| # Quantization | ||
| quantization: "modelopt_fp4" | ||
|
|
@@ -131,7 +135,7 @@ backend: | |
| stream-interval: 50 | ||
| decode-log-interval: 1000 | ||
| watchdog-timeout: 1000000 | ||
| context-length: 9200 | ||
| context-length: 10000 | ||
| disable-shared-experts-fusion: true | ||
| eplb-algorithm: "deepseek" | ||
| disaggregation-bootstrap-port: 30001 | ||
|
|
@@ -228,7 +232,6 @@ backend: | |
| enable-dp-lm-head: true | ||
| prefill-round-robin-balance: true | ||
| enable-dp-attention: true | ||
| fp4-gemm-backend: "flashinfer_cutlass" | ||
|
|
||
| # Parallelism | ||
| tp-size: 32 | ||
|
|
||
Align `max-total-tokens` with the new 10k context length.
`context-length: 10000` exceeds `max-total-tokens: 8192`, which can truncate or reject longer contexts in prefill. Please set `max-total-tokens` ≥ 10000 (or reduce `context-length`).
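The reviewer's proposed fix did not survive on this page. A minimal sketch of what the request amounts to is given here, assuming `sglang_config` is nested under `backend` as the hunk headers suggest, and that `max-total-tokens` sits in the `prefill` section next to `context-length`. The value 10000 is illustrative only, chosen as the smallest value that satisfies the reviewer's condition; it is not a tuned setting.

```yaml
backend:
  sglang_config:
    prefill:
      # New context length introduced by this PR.
      context-length: 10000
      # Assumption: raised from the 8192 cited by the reviewer so that
      # max-total-tokens >= context-length. Pick the real value based on
      # available KV cache memory.
      max-total-tokens: 10000
```

The alternative the reviewer mentions is to leave `max-total-tokens` unchanged and lower `context-length` until it no longer exceeds it.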