Add new LHS datapoint for SGL-GB200-FP8-1k1k #148
New file `recipes/gb200-fp8/1k1k/ultra-tpt.yaml` (+176 lines; indentation reconstructed from the flattened diff):

```yaml
name: "gb200-fp8-1k1k-ultra-tpt"

dynamo:
  version: 0.8.1

frontend:
  type: dynamo
  enable_multiple_frontends: true
  num_additional_frontends: 3
  nginx_container: nginx

model:
  path: "dsr1-fp8"
  container: "lmsysorg/sglang:v0.5.8-cu130"
  precision: "fp8"

resources:
  gpu_type: "gb200"
  prefill_nodes: 2
  prefill_workers: 1
  decode_nodes: 2
  decode_workers: 1
  gpus_per_node: 4

backend:
  # Prefill-specific environment variables
  prefill_environment:
    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
    MC_TE_METRIC: "true"
    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
    MC_FORCE_MNNVL: "1"
    NCCL_MNNVL_ENABLE: "1"
    NCCL_CUMEM_ENABLE: "1"
    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
    PYTHONUNBUFFERED: "1"

  # Decode-specific environment variables
  decode_environment:
    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
    SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
    SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "640"
    MC_TE_METRIC: "true"
    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
    SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
    SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1"
    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
    MC_FORCE_MNNVL: "1"
    NCCL_MNNVL_ENABLE: "1"
    NCCL_CUMEM_ENABLE: "1"
    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
    PYTHONUNBUFFERED: "1"

sglang_config:
  prefill:
    # Model configuration
    served-model-name: "deepseek-ai/DeepSeek-R1"
    skip-tokenizer-init: true
    trust-remote-code: true

    # Parallelism
    tp-size: 8
    dp-size: 8
    ep-size: 8
    enable-dp-attention: true

    # KV cache and attention
    attention-backend: "trtllm_mla"
    kv-cache-dtype: "fp8_e4m3"

    # Radix cache disabled
    disable-radix-cache: true

    # Other flags
    stream-interval: 50
    max-running-requests: 8192
    context-length: 2200
    watchdog-timeout: 1000000
    disable-shared-experts-fusion: true
    eplb-algorithm: "deepseek"
    disaggregation-bootstrap-port: 30001

    # Prefill-specific mode
    disaggregation-mode: "prefill"

    # Memory and token limits
    mem-fraction-static: 0.75
    max-total-tokens: 524288
    chunked-prefill-size: 131072

    # Request handling
    load-balance-method: "round_robin"

    # Performance optimizations
    disable-cuda-graph: true

    # DeepEP configuration
    moe-a2a-backend: "deepep"
    deepep-mode: "normal"
    ep-dispatch-algorithm: "dynamic"
    moe-dense-tp-size: 1
    enable-dp-lm-head: true
    ep-num-redundant-experts: 32
    deepep-config: "/configs/deepep_config.json"

    disaggregation-transfer-backend: nixl

  decode:
    # Model configuration
    served-model-name: "deepseek-ai/DeepSeek-R1"
    skip-tokenizer-init: true
    trust-remote-code: true

    # Parallelism
    tp-size: 8
    dp-size: 8
    ep-size: 8
    enable-dp-attention: true

    # KV cache and attention
    attention-backend: "trtllm_mla"
    kv-cache-dtype: "fp8_e4m3"

    # Radix cache disabled
    disable-radix-cache: true

    # Other flags
    stream-interval: 50
    decode-log-interval: 1000
    max-running-requests: 5120
    context-length: 2200
    watchdog-timeout: 1000000
    disable-shared-experts-fusion: true
    eplb-algorithm: "deepseek"
    disaggregation-bootstrap-port: 30001

    # Decode-specific mode
    disaggregation-mode: "decode"

    # Memory and token limits
    mem-fraction-static: 0.82
    chunked-prefill-size: 36864

    # DeepEP configuration
    moe-a2a-backend: "deepep"
    deepep-mode: "low_latency"
    ep-dispatch-algorithm: "static"
    moe-dense-tp-size: 1
    enable-dp-lm-head: true
    prefill-round-robin-balance: true
```
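As a sanity check on the topology in this recipe: with `enable-dp-attention`, SGLang runs the data-parallel attention groups inside the tensor-parallel group, so each worker's world size is `tp-size` GPUs rather than `tp-size * dp-size`. A minimal sketch, with the field names and values copied from the YAML above:

```python
# Sanity-check the recipe's GPU topology (values copied from the recipe YAML).
resources = {"prefill_nodes": 2, "decode_nodes": 2, "gpus_per_node": 4}
parallelism = {"tp_size": 8, "dp_size": 8, "ep_size": 8}

prefill_gpus = resources["prefill_nodes"] * resources["gpus_per_node"]
decode_gpus = resources["decode_nodes"] * resources["gpus_per_node"]

# With enable-dp-attention, the DP attention groups live inside the TP group,
# so each worker occupies tp-size GPUs (not tp-size * dp-size).
assert prefill_gpus == parallelism["tp_size"] == 8  # 2 nodes x 4 GB200
assert decode_gpus == parallelism["tp_size"] == 8
```

So one prefill worker spans both prefill nodes and one decode worker spans both decode nodes, matching `prefill_workers: 1` and `decode_workers: 1`.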
Contributor

Replace the deprecated `prefill-round-robin-balance` flag.

The flag `prefill-round-robin-balance` is set in `recipes/gb200-fp8/1k1k/ultra-tpt.yaml`, but per current SGLang guidance, round-robin balancing of prefill requests under Prefill/Decode (PD) disaggregation is configured through the router's load-balancing policy (e.g. a `round_robin` policy) rather than through this per-worker flag. Consider dropping the flag and setting the policy on the router instead.
```yaml
    ep-num-redundant-experts: 32
    deepep-config: "/configs/deepep_config.json"

    # CUDA graphs
    cuda-graph-bs: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 416, 448, 480, 512, 544, 576, 608, 640]
    cuda-graph-max-bs: 640

    disaggregation-transfer-backend: nixl

benchmark:
  type: "sa-bench"
  isl: 1024
  osl: 1024
  concurrencies: "4096"
  req_rate: "inf"
```
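The long `cuda-graph-bs` list above follows a simple pattern and could be regenerated rather than maintained by hand. A sketch (the pattern is an observation about this particular list, not an SGLang requirement):

```python
# Regenerate the decode cuda-graph-bs list: powers of two up to 8,
# then every 8 up to 384, then every 32 up to 640.
cuda_graph_bs = [1, 2, 4, 8] + list(range(16, 385, 8)) + list(range(416, 641, 32))

assert cuda_graph_bs[:6] == [1, 2, 4, 8, 16, 24]
assert cuda_graph_bs[-3:] == [576, 608, 640]
assert len(cuda_graph_bs) == 59
assert max(cuda_graph_bs) == 640  # matches cuda-graph-max-bs: 640
```

Note that the 640 cap also matches `SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "640"` in `decode_environment`, so the three limits stay in sync.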
Contributor
Use the official NVIDIA Dynamo container for version alignment.

The container image `lmsysorg/sglang:v0.5.8-cu130` exists and is active on Docker Hub, and Dynamo v0.8.1 (released Jan 23, 2026) includes explicit GB200 support. However, consider using the official NVIDIA container `nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.1` instead, which aligns with Dynamo v0.8.1 and bundles the tested SGLang v0.5.6.post2. Using a third-party SGLang image may introduce version skew between the Dynamo and SGLang installations.
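If this suggestion is adopted, the change would be a one-line swap in the recipe's `model` block. A sketch (the replacement image is the one named in the review; the surrounding keys are copied from this recipe):

```yaml
model:
  path: "dsr1-fp8"
  # Official NVIDIA Dynamo runtime image (bundles SGLang v0.5.6.post2),
  # replacing the third-party lmsysorg/sglang:v0.5.8-cu130 image:
  container: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.1"
  precision: "fp8"
```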