Update gb200 recipes #130
Changes from all commits: 0c90137, acb398b, 6fa4eb6, 55e80e9, 6ff475f, 41385b5
```diff
@@ -1,16 +1,17 @@
-name: "gb200-fp4-1p2d"
+name: "gb200-fp4-1k8k-low-latency"
 
 dynamo:
-  version: 0.7.0
+  version: 0.8.1
```
**Contributor comment on lines +1 to +4**

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Check if dynamo 0.8.1 exists on PyPI
curl -s "https://pypi.org/pypi/dynamo/0.8.1/json" | head -c 200
```

Repository: ishandhanani/srt-slurm — Length of output: 90

🏁 Script executed:

```bash
# Search for other dynamo version references in the repository
rg "dynamo:" --type yaml -A 1 | head -50
```

Length of output: 3082

🏁 Script executed:

```bash
cat recipes/gb200-fp4/1k8k/low-latency.yaml | head -20
```

Length of output: 405

🏁 Script executed:

```bash
# Search for all dynamo version specifications with "version:" pattern
rg "dynamo:" -A 2 -B 0 | rg "version:|hash:|install:" | sort -u
```

Length of output: 28548

**Dynamo version 0.8.1 does not exist on PyPI and must be corrected.** While the experiment name follows the standardized convention, the dynamo configuration is invalid: the pinned version should be changed to one that is actually published.
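The check the script above performs can be sketched offline: the validity of a version pin reduces to membership in the set of published releases returned by the PyPI JSON API. A minimal sketch — the `published` list below is illustrative, not the real dynamo index:

```shell
#!/bin/sh
# Offline sketch of the pin check. "published" stands in for the release
# keys of https://pypi.org/pypi/<pkg>/json; values here are illustrative,
# NOT the real dynamo release index.
published="0.6.0 0.7.0"
candidate="0.8.1"
case " $published " in
  *" $candidate "*) echo "ok: $candidate is published" ;;
  *)                echo "invalid pin: $candidate is not published" ;;
esac
```

Running the same check with the real JSON response (e.g. piping `curl` output through a JSON parser) would make the verdict authoritative.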
```diff
 frontend:
   type: dynamo
   enable_multiple_frontends: true
-  num_additional_frontends: 4
+  num_additional_frontends: 3
   nginx_container: nginx
 
 model:
-  path: "dsr1"
-  container: "lmsysorg/sglang:v0.5.5.post2"
+  path: "dsfp4"
+  container: "lmsysorg/sglang:v0.5.8-cu130-runtime"
+  precision: "fp4"
 
 resources:
```
@@ -37,7 +38,6 @@ backend: | |
| NCCL_CUMEM_ENABLE: "1" | ||
| SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" | ||
| SGLANG_ENABLE_JIT_DEEPGEMM: "false" | ||
| SGLANG_ENABLE_FLASHINFER_GEMM: "true" | ||
|
|
||
| decode_environment: | ||
| TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800" | ||
|
|
@@ -54,12 +54,11 @@ backend: | |
| NCCL_CUMEM_ENABLE: "1" | ||
| SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" | ||
| SGLANG_ENABLE_JIT_DEEPGEMM: "false" | ||
| SGLANG_ENABLE_FLASHINFER_GEMM: "true" | ||
|
|
||
| sglang_config: | ||
| prefill: | ||
| disaggregation-mode: "prefill" | ||
| served-model-name: "deepseek-ai/DeepSeek-R1" | ||
| disaggregation-mode: "prefill" | ||
| trust-remote-code: true | ||
| disable-radix-cache: true | ||
| kv-cache-dtype: "fp8_e4m3" | ||
|
|
@@ -81,10 +80,12 @@ backend: | |
| data-parallel-size: 1 | ||
| tensor-parallel-size: 4 | ||
| expert-parallel-size: 1 | ||
| fp4-gemm-backend: "flashinfer_trtllm" | ||
| disaggregation-transfer-backend: nixl | ||
|
|
||
| decode: | ||
| disaggregation-mode: "decode" | ||
| served-model-name: "deepseek-ai/DeepSeek-R1" | ||
| disaggregation-mode: "decode" | ||
| prefill-round-robin-balance: true | ||
| trust-remote-code: true | ||
| disable-radix-cache: true | ||
|
|
@@ -103,6 +104,8 @@ backend: | |
| moe-dense-tp-size: 1 | ||
| tensor-parallel-size: 4 | ||
| expert-parallel-size: 1 | ||
| fp4-gemm-backend: "flashinfer_trtllm" | ||
| disaggregation-transfer-backend: nixl | ||
|
|
||
| benchmark: | ||
| type: "sa-bench" | ||
|
|
||
```diff
@@ -1,16 +1,17 @@
-name: "gb200-fp4-max-tpt"
+name: "gb200-fp4-1k8k-max-tpt"
 
 dynamo:
-  version: 0.7.0
+  version: 0.8.1
 
 frontend:
   type: dynamo
   enable_multiple_frontends: true
   num_additional_frontends: 9
   nginx_container: nginx
 
 model:
-  path: "dsr1"
-  container: "lmsysorg/sglang:v0.5.5.post2"
+  path: "dsfp4"
+  container: "lmsysorg/sglang:v0.5.8-cu130-runtime"
+  precision: "fp4"
 
 resources:
```
@@ -32,7 +33,6 @@ backend: | |
| SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" | ||
| SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" | ||
| SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" | ||
| SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1" | ||
| MC_TE_METRIC: "true" | ||
| MC_FORCE_MNNVL: "1" | ||
| NCCL_MNNVL_ENABLE: "1" | ||
|
|
@@ -51,7 +51,6 @@ backend: | |
| SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000" | ||
| SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" | ||
| SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" | ||
| SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1" | ||
| MC_TE_METRIC: "true" | ||
| MC_FORCE_MNNVL: "1" | ||
| NCCL_MNNVL_ENABLE: "1" | ||
|
|
@@ -61,14 +60,14 @@ backend: | |
| SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1" | ||
| SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "1024" | ||
| SGLANG_MOE_NVFP4_DISPATCH: "1" | ||
| SGLANG_CUTEDSL_MOE_NVFP4_DISPATCH: "1" # Used in older sglang versions | ||
| SGLANG_FLASHINFER_FP4_GEMM_BACKEND: "cutlass" | ||
|
**Contributor comment on lines 61 to 63**

Potential redundancy: the `decode_environment` still contains `SGLANG_FLASHINFER_FP4_GEMM_BACKEND: "cutlass"` while the decode `sglang_config` also selects an FP4 GEMM backend via a CLI-level option. This creates two places that configure the same setting. If the environment variable is still honored by this sglang version, the two values may conflict or silently override each other, so one of them should be removed.
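One way to resolve the flagged redundancy, assuming the CLI-level flag is the setting that should win (an untested sketch, not the author's confirmed fix, and the surrounding keys are abbreviated): drop the overlapping environment variable so the backend is selected in exactly one place.

```yaml
# Hypothetical deduplicated fragment: the FP4 GEMM backend is chosen only
# via the sglang_config flag; the overlapping env var is dropped.
decode_environment:
  SGLANG_MOE_NVFP4_DISPATCH: "1"
  # SGLANG_FLASHINFER_FP4_GEMM_BACKEND removed (was "cutlass")
sglang_config:
  decode:
    fp4-gemm-backend: "flashinfer_cutlass"
```

Keeping a single source of truth also makes it harder for prefill and decode to drift apart when the backend is changed later.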
```diff
 sglang_config:
   prefill:
     # Model configuration
     served-model-name: "deepseek-ai/DeepSeek-R1"
     trust-remote-code: true
+    disaggregation-transfer-backend: nixl
 
     # KV cache and attention
     kv-cache-dtype: "fp8_e4m3"
```
@@ -108,6 +107,7 @@ backend: | |
| # Performance optimizations | ||
| disable-cuda-graph: true | ||
| enable-dp-attention: true | ||
| fp4-gemm-backend: "flashinfer_cutlass" | ||
|
|
||
| # Parallelism | ||
| tp-size: 4 | ||
|
|
@@ -118,6 +118,7 @@ backend: | |
| # Model configuration | ||
| served-model-name: "deepseek-ai/DeepSeek-R1" | ||
| trust-remote-code: true | ||
| disaggregation-transfer-backend: nixl | ||
|
|
||
| # KV cache and attention | ||
| kv-cache-dtype: "fp8_e4m3" | ||
|
|
@@ -233,6 +234,7 @@ backend: | |
| enable-dp-lm-head: true | ||
| prefill-round-robin-balance: true | ||
| enable-dp-attention: true | ||
| fp4-gemm-backend: "flashinfer_cutlass" | ||
|
|
||
| # Parallelism | ||
| tp-size: 48 | ||
|
|
@@ -243,5 +245,5 @@ benchmark: | |
| type: "sa-bench" | ||
| isl: 1024 | ||
| osl: 8192 | ||
| concurrencies: "256x512x1024x2048x8192" | ||
| concurrencies: "256x512x1024x2048" | ||
| req_rate: "inf" | ||
**Contributor comment**

🧩 Analysis chain

🌐 Web query: `sglang v0.5.8 flashinfer_trtllm backend configuration`

💡 Result: In SGLang v0.5.8, `flashinfer_trtllm` is configured as the MoE runner backend via `--moe-runner-backend` (it is not the attention backend flag); the valid options list includes `flashinfer_trtllm`. [1] The docs' CLI example is a case of "hybrid attention" (different prefill vs decode backends) while using `flashinfer_trtllm` for the MoE runner. [2] A YAML config equivalent is supported: config-file support is documented, and CLI args override YAML. [1]

Sources: Server arguments (MoE + config support) [1]; attention backend doc showing `flashinfer_trtllm` usage in an example command [2].

🌐 Web query: `sglang v0.5.8 disaggregation-transfer-backend nixl options`

💡 Result: In SGLang v0.5.8, the NIXL "options" for PD disaggregation are mainly (a) selecting NIXL as the transfer backend, plus (b) the general PD-disaggregation knobs; there are no extra NIXL-specific SGLang CLI flags in the docs.

1. Enable the NIXL transfer backend (PD disaggregation): use `--disaggregation-transfer-backend nixl` (supported values: `mooncake`, `nixl`, `ascend`, `fake`) [1] and run separate servers with `--disaggregation-mode prefill` and `--disaggregation-mode decode`. [1][2] The docs' single-node example uses exactly these flags with NIXL. [2]

2. PD-disaggregation knobs that still apply with NIXL (regardless of backend selection): [1]
- `--disaggregation-bootstrap-port` (prefill bootstrap server port)
- `--disaggregation-decode-tp` (set on the prefill server; decode TP size)
- `--disaggregation-decode-dp` (set on the prefill server; decode DP size)
- `--disaggregation-prefill-pp` (set on the decode server; prefill PP size)
- `--disaggregation-decode-enable-offload-kvcache`
- `--disaggregation-decode-enable-fake-auto`
- `--num-reserved-decode-tokens`
- `--disaggregation-decode-polling-interval`
- `--disaggregation-ib-device` exists, but its description is tied to auto-detection for the mooncake backend (not described as a NIXL control). [1]

3. Install NIXL as a dependency via pip (or build from source, e.g. if you already have UCX installed). [2]

Sources: [1] SGLang "Server Arguments" (PD disaggregation section) (docs.sglang.io); [2] SGLang "PD Disaggregation" (NIXL requirements + usage examples) (docs.sglang.io).

🌐 Web query: `sglang 0.5.8 release notes backend options`

💡 Result: SGLang v0.5.8 was released Jan 23, 2026 (PyPI upload date). [1] Backend-related items in the v0.5.8 release notes include a `diffusers` backend (plus multi-LoRA inference, "SLA attention backends", a CLI warmup switch, a ComfyUI plugin, and broad perf improvements). [2] The main backend-selector knobs from the server CLI (`python -m sglang.launch_server --help`) are: [3]
- `--model-impl`: `auto` / `sglang` / `transformers` (model implementation choice)
- `--attention-backend`: attention kernel backend
- `prefill_attention_backend` and `decode_attention_backend` (experimental; override the attention backend separately for prefill vs decode)
- `--sampling-backend`: sampling kernel backend
- `--grammar-backend`: backend for grammar-guided decoding
- `--mm-attention-backend`: multimodal attention backend

Sources: PyPI sglang 0.5.8 release entry [1]; GitHub release notes for v0.5.8 [2]; SGLang "Server Arguments" docs (kernel backend flags) [3].

**Use `moe-runner-backend: flashinfer_trtllm` instead of `fp4-gemm-backend: flashinfer_trtllm`.** In SGLang v0.5.8, `flashinfer_trtllm` is a valid backend value, but it must be configured via the `moe-runner-backend` option, not `fp4-gemm-backend`; the config key `fp4-gemm-backend` does not exist in the official SGLang configuration schema. The `disaggregation-transfer-backend: nixl` setting is valid and correct.