(radixark sgl maintainer submission): Add DSV4 FP4 GB300 dynamo-sglang MTP disagg benchmarks #1297
Merged
Changes from 2 commits
30 commits
ce53cf1 add mtp configs (ch-wan)
debb7b9 Add sbatch_directives to MTP recipes (root-cause fix) (ch-wan)
59f899d Change deepgemm flags (Fridge003)
5f02885 Move MTP recipes up to 8k1k/ with -mtp filename suffix (ch-wan)
50c6c59 fix (ch-wan)
4bceea3 Drop custom_tokenizer from MTP recipes — incompatible with sa-bench (ch-wan)
8738f93 Merge branch 'main' into sglang-disagg-gb300-mtp-0507 (ch-wan)
e1a5081 Pin srt-slurm to fork w/ SGLangDeepseekV4Tokenizer callable + restore… (ch-wan)
530762b Bump sglang container to nightly-dev-cu13-20260508-2cf1a4ab (latest m… (ch-wan)
6db9e2c Restore base dsv4-fp4-gb300-dynamo-sglang image to staging tag (ch-wan)
164f5a2 Pin MTP recipes to dynamo 81d0555e (matches working base recipes) (ch-wan)
6d28994 Explicitly disable CAR_V2 in multi-node decode MTP recipes (ch-wan)
9c4c244 Explicitly disable CAR_V2 in 8k1k base decode recipes too (ch-wan)
9814b42 Set both old and new sglang thinking/reasoning env vars in MTP recipes (ch-wan)
3e049e8 Set tool-call-parser=deepseekv4 to enable DSV4 chat encoding (gsm8k r… (ch-wan)
255e7fb Revert CAR_V2 explicit-disable in non-MTP base 8k1k recipes (ch-wan)
cb59807 Trim verbose comments and drop deprecated env var names in MTP recipes (ch-wan)
9ff03f2 Revert MTP recipes to staging-dev container (gsm8k accuracy fix) (ch-wan)
9b06113 Bump dynamo hash to 34d55a5 to fix DSV4 chat-template formatter (ch-wan)
36bf040 Bump sglang container to nightly-dev-cu13-20260509-9ee83034 (ch-wan)
3275282 Switch DSV4 MTP recipes to nixl KV transfer backend (ch-wan)
1ffcab9 Merge remote-tracking branch 'origin/main' into sglang-disagg-gb300-m… (ch-wan)
daa6785 Revert "Switch DSV4 MTP recipes to nixl KV transfer backend" (ch-wan)
072e2ee Bump MTP recipes to sglang nightly with mooncake DSv4 fix (ch-wan)
eae8d32 Merge remote-tracking branch 'origin/main' into sglang-disagg-gb300-m… (ch-wan)
6ff6545 gb300-cw: switch srt-slurm pin to NVIDIA/srt-slurm main (#144 merged) (ch-wan)
cb45485 gb300-cw: track NVIDIA/srt-slurm main instead of pinning a commit (ch-wan)
79d2cb6 Bump MTP recipes to sglang nightly 20260510-2473659e (ch-wan)
32e623d Merge remote-tracking branch 'origin/main' into sglang-disagg-gb300-m… (ch-wan)
35a2f9a fix: use shared gb300 dsv4 model path (Oseltamivir)
125 changes: 125 additions & 0 deletions
...i_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/mtp/disagg-low-latency-1p1d-tp4-tp4.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,125 @@ | ||
| name: "dsv4-pro-gb300-disagg-8k1k-low-latency-1p1d-tp4-tp4-mtp" | ||
|
|
||
| frontend: | ||
| type: dynamo | ||
| enable_multiple_frontends: true | ||
| num_additional_frontends: 8 | ||
|
|
||
| dynamo: | ||
| hash: "9d3c913d300eb368cda28b3f98a23a5762621e0d" | ||
| install: true | ||
|
|
||
| model: | ||
| path: "deepseek-v4-pro" | ||
| container: "lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev" | ||
| precision: "mxfp4" | ||
|
|
||
| sbatch_directives: | ||
| cpus-per-task: "144" | ||
| mem: "0" | ||
|
|
||
| resources: | ||
| gpu_type: "gb300" | ||
| gpus_per_node: 4 | ||
| prefill_nodes: 1 | ||
| prefill_workers: 1 | ||
| decode_nodes: 1 | ||
| decode_workers: 1 | ||
|
|
||
| backend: | ||
| type: sglang | ||
|
|
||
| prefill_environment: | ||
| PYTHONUNBUFFERED: "1" | ||
| SGLANG_RADIX_DISABLE_REUSE: "1" | ||
| SGLANG_JIT_DEEPGEMM_PRECOMPILE: "0" | ||
| SGLANG_ENABLE_THINKING: "1" | ||
| SGLANG_REASONING_EFFORT: "max" | ||
| SGLANG_OPT_SWA_SPLIT_LEAF_ON_INSERT: "1" | ||
| SGLANG_OPT_USE_JIT_NORM: "1" | ||
| SGLANG_OPT_USE_JIT_INDEXER_METADATA: "1" | ||
| SGLANG_OPT_USE_TOPK_V2: "1" | ||
| NCCL_MNNVL_ENABLE: "1" | ||
| NCCL_CUMEM_ENABLE: "1" | ||
| SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" | ||
| MC_FORCE_MNNVL: "1" | ||
| SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" | ||
| SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" | ||
| SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW: "1" | ||
|
|
||
| decode_environment: | ||
| PYTHONUNBUFFERED: "1" | ||
| SGLANG_RADIX_DISABLE_REUSE: "1" | ||
| SGLANG_JIT_DEEPGEMM_PRECOMPILE: "0" | ||
| SGLANG_ENABLE_THINKING: "1" | ||
| SGLANG_REASONING_EFFORT: "max" | ||
| SGLANG_OPT_SWA_SPLIT_LEAF_ON_INSERT: "1" | ||
| SGLANG_OPT_USE_JIT_NORM: "1" | ||
| SGLANG_OPT_USE_JIT_INDEXER_METADATA: "1" | ||
| SGLANG_OPT_USE_TOPK_V2: "1" | ||
| NCCL_MNNVL_ENABLE: "1" | ||
| NCCL_CUMEM_ENABLE: "1" | ||
| SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" | ||
| MC_FORCE_MNNVL: "1" | ||
| SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" | ||
| SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" | ||
| SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW: "1" | ||
| # SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 intentionally NOT set: CAR_V2 | ||
| # is single-node only and corrupts results in 2-node decode setups. | ||
|
|
||
| sglang_config: | ||
| prefill: | ||
| served-model-name: "deepseek-ai/DeepSeek-V4-Pro" | ||
| model-path: "/model/" | ||
| trust-remote-code: true | ||
|
|
||
| disaggregation-mode: "prefill" | ||
| disaggregation-transfer-backend: mooncake | ||
|
|
||
| tensor-parallel-size: 4 | ||
| data-parallel-size: 1 | ||
| expert-parallel-size: 1 | ||
|
|
||
| moe-runner-backend: "flashinfer_mxfp4" | ||
| disable-flashinfer-autotune: true | ||
|
|
||
| mem-fraction-static: 0.9 | ||
| max-running-requests: 8 | ||
| cuda-graph-max-bs: 8 | ||
| chunked-prefill-size: 32768 | ||
|
|
||
| decode: | ||
| served-model-name: "deepseek-ai/DeepSeek-V4-Pro" | ||
| model-path: "/model/" | ||
| trust-remote-code: true | ||
|
|
||
| disaggregation-mode: "decode" | ||
| disaggregation-transfer-backend: mooncake | ||
|
|
||
| tensor-parallel-size: 4 | ||
| data-parallel-size: 1 | ||
| expert-parallel-size: 1 | ||
|
|
||
| moe-runner-backend: "flashinfer_mxfp4" | ||
| disable-flashinfer-autotune: true | ||
|
|
||
| speculative-algo: "EAGLE" | ||
| speculative-num-steps: 3 | ||
| speculative-eagle-topk: 1 | ||
| speculative-num-draft-tokens: 4 | ||
|
|
||
| mem-fraction-static: 0.9 | ||
| max-running-requests: 8 | ||
| cuda-graph-max-bs: 8 | ||
| swa-full-tokens-ratio: 0.1 | ||
| context-length: 16384 | ||
|
|
||
| benchmark: | ||
| type: "sa-bench" | ||
| isl: 8192 | ||
| osl: 1024 | ||
| random_range_ratio: 0.8 | ||
| concurrencies: "1" | ||
| req_rate: "inf" | ||
| use_chat_template: true | ||
| custom_tokenizer: "sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer" | ||
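
The decode block in this recipe pairs `speculative-num-steps: 3` and `speculative-eagle-topk: 1` with `speculative-num-draft-tokens: 4`. The following is a minimal sketch of the relation those three values appear to follow, assuming a chain-style draft (top-k of 1, so no branching) in which the draft-token budget is the step count plus one. The function name and the rule itself are illustrative assumptions, not part of SGLang's API or of this PR.

```python
def expected_chain_draft_tokens(num_steps: int, topk: int) -> int:
    """Draft-token budget for a chain draft (assumed rule, topk == 1 only)."""
    if topk != 1:
        raise ValueError("this sketch only covers topk == 1 (no branching)")
    # Assumption: one proposed token per draft step, plus one extra position.
    return num_steps + 1


# Values copied from the recipe's decode block.
decode_cfg = {
    "speculative-num-steps": 3,
    "speculative-eagle-topk": 1,
    "speculative-num-draft-tokens": 4,
}

expected = expected_chain_draft_tokens(
    decode_cfg["speculative-num-steps"],
    decode_cfg["speculative-eagle-topk"],
)
is_consistent = expected == decode_cfg["speculative-num-draft-tokens"]
print(f"expected={expected} consistent={is_consistent}")
```

A check like this could run over every MTP recipe before submission; it deliberately rejects any top-k other than 1 rather than guessing a rule for tree drafts.
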
139 changes: 139 additions & 0 deletions
..._node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/mtp/disagg-low-latency-1p6d-dep4-tp4.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,139 @@ | ||
| name: "dsv4-pro-gb300-disagg-8k1k-low-latency-1p6d-dep4-tp4-mtp" | ||
|
|
||
| frontend: | ||
| type: dynamo | ||
| enable_multiple_frontends: true | ||
| num_additional_frontends: 8 | ||
|
|
||
| dynamo: | ||
| hash: "9d3c913d300eb368cda28b3f98a23a5762621e0d" | ||
| install: true | ||
|
|
||
| model: | ||
| path: "deepseek-v4-pro" | ||
| container: "lmsysorg/sglang-staging:deepseek-v4-grace-blackwell-dev" | ||
| precision: "mxfp4" | ||
|
|
||
| sbatch_directives: | ||
| cpus-per-task: "144" | ||
| mem: "0" | ||
|
|
||
| resources: | ||
| gpu_type: "gb300" | ||
| gpus_per_node: 4 | ||
| prefill_nodes: 1 | ||
| prefill_workers: 1 | ||
| decode_nodes: 6 | ||
| decode_workers: 6 | ||
|
|
||
| backend: | ||
| type: sglang | ||
|
|
||
| prefill_environment: | ||
| PYTHONUNBUFFERED: "1" | ||
| SGLANG_RADIX_DISABLE_REUSE: "1" | ||
| SGLANG_JIT_DEEPGEMM_PRECOMPILE: "0" | ||
| SGLANG_ENABLE_THINKING: "1" | ||
| SGLANG_REASONING_EFFORT: "max" | ||
| SGLANG_OPT_SWA_SPLIT_LEAF_ON_INSERT: "1" | ||
| SGLANG_OPT_USE_JIT_NORM: "1" | ||
| SGLANG_OPT_USE_JIT_INDEXER_METADATA: "1" | ||
| SGLANG_OPT_USE_TOPK_V2: "1" | ||
|
|
||
| SGLANG_OPT_SWA_EVICT_DROP_PAGE_MARGIN: "1" | ||
| SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "1" | ||
| SGLANG_OPT_USE_FAST_MASK_EP: "1" | ||
| SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE: "1" | ||
| SGLANG_OPT_FIX_HASH_MEGA_MOE: "1" | ||
| SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK: "9216" | ||
| SGLANG_OPT_FIX_MEGA_MOE_MEMORY: "1" | ||
| SGLANG_OPT_FIX_NEXTN_MEGA_MOE: "1" | ||
| SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "0" | ||
|
|
||
| NCCL_MNNVL_ENABLE: "1" | ||
| NCCL_CUMEM_ENABLE: "1" | ||
| SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" | ||
| MC_FORCE_MNNVL: "1" | ||
| SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" | ||
| SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" | ||
| SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW: "1" | ||
|
|
||
| decode_environment: | ||
| PYTHONUNBUFFERED: "1" | ||
| SGLANG_RADIX_DISABLE_REUSE: "1" | ||
| SGLANG_JIT_DEEPGEMM_PRECOMPILE: "0" | ||
| SGLANG_ENABLE_THINKING: "1" | ||
| SGLANG_REASONING_EFFORT: "max" | ||
| SGLANG_OPT_SWA_SPLIT_LEAF_ON_INSERT: "1" | ||
| SGLANG_OPT_USE_JIT_NORM: "1" | ||
| SGLANG_OPT_USE_JIT_INDEXER_METADATA: "1" | ||
| SGLANG_OPT_USE_TOPK_V2: "1" | ||
| NCCL_MNNVL_ENABLE: "1" | ||
| NCCL_CUMEM_ENABLE: "1" | ||
| SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True" | ||
| MC_FORCE_MNNVL: "1" | ||
| SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000" | ||
| SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000" | ||
| SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW: "1" | ||
| # SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 intentionally NOT set: CAR_V2 | ||
| # is single-node only and corrupts results in 2-node decode setups. | ||
|
|
||
| sglang_config: | ||
| prefill: | ||
| served-model-name: "deepseek-ai/DeepSeek-V4-Pro" | ||
| model-path: "/model/" | ||
| trust-remote-code: true | ||
|
|
||
| disaggregation-mode: "prefill" | ||
| disaggregation-transfer-backend: mooncake | ||
|
|
||
| tensor-parallel-size: 4 | ||
| data-parallel-size: 4 | ||
| expert-parallel-size: 4 | ||
|
|
||
| enable-dp-attention: true | ||
| enable-dp-lm-head: true | ||
|
|
||
| moe-a2a-backend: "deepep" | ||
| deepep-config: '{"normal_dispatch":{"num_sms":96},"normal_combine":{"num_sms":96}}' | ||
|
|
||
| mem-fraction-static: 0.9 | ||
| max-running-requests: 128 | ||
| cuda-graph-max-bs: 128 | ||
| chunked-prefill-size: 32768 | ||
|
|
||
| decode: | ||
| served-model-name: "deepseek-ai/DeepSeek-V4-Pro" | ||
| model-path: "/model/" | ||
| trust-remote-code: true | ||
|
|
||
| disaggregation-mode: "decode" | ||
| disaggregation-transfer-backend: mooncake | ||
|
|
||
| tensor-parallel-size: 4 | ||
| data-parallel-size: 1 | ||
| expert-parallel-size: 1 | ||
|
|
||
| moe-runner-backend: "flashinfer_mxfp4" | ||
| disable-flashinfer-autotune: true | ||
|
|
||
| speculative-algo: "EAGLE" | ||
| speculative-num-steps: 3 | ||
| speculative-eagle-topk: 1 | ||
| speculative-num-draft-tokens: 4 | ||
|
|
||
| mem-fraction-static: 0.9 | ||
| max-running-requests: 128 | ||
| cuda-graph-max-bs: 128 | ||
| swa-full-tokens-ratio: 0.1 | ||
| context-length: 16384 | ||
|
|
||
| benchmark: | ||
| type: "sa-bench" | ||
| isl: 8192 | ||
| osl: 1024 | ||
| random_range_ratio: 0.8 | ||
| concurrencies: "8x32x64" | ||
| req_rate: "inf" | ||
| use_chat_template: true | ||
| custom_tokenizer: "sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer" | ||
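
Both recipes declare node and worker counts under `resources` and a `tensor-parallel-size` per role. The sketch below checks that each worker's share of GPUs equals its tensor-parallel size, using the values from the two recipes shown here. That this equality should hold is an assumption made for illustration; how srt-slurm actually maps workers to GPUs is not stated in this PR, and the `Role` class and helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    nodes: int
    workers: int
    tensor_parallel_size: int


def gpus_per_worker(role: Role, gpus_per_node: int) -> int:
    """Split a role's GPUs evenly across its workers."""
    total = role.nodes * gpus_per_node
    if role.workers <= 0 or total % role.workers != 0:
        raise ValueError("GPUs do not divide evenly across workers")
    return total // role.workers


def total_gpus(roles: dict, gpus_per_node: int) -> int:
    return sum(r.nodes for r in roles.values()) * gpus_per_node


GPUS_PER_NODE = 4  # gpus_per_node in both recipes

# Values copied from the two recipes' resources and sglang_config sections.
recipes = {
    "1p1d-tp4-tp4": {"prefill": Role(1, 1, 4), "decode": Role(1, 1, 4)},
    "1p6d-dep4-tp4": {"prefill": Role(1, 1, 4), "decode": Role(6, 6, 4)},
}

report = {}
for name, roles in recipes.items():
    # Assumed invariant: per-worker GPU share == tensor-parallel-size.
    consistent = all(
        gpus_per_worker(role, GPUS_PER_NODE) == role.tensor_parallel_size
        for role in roles.values()
    )
    report[name] = (total_gpus(roles, GPUS_PER_NODE), consistent)

for name, (gpus, consistent) in report.items():
    print(f"{name}: total_gpus={gpus} consistent={consistent}")
```

Under this assumption both recipes are self-consistent; the check would flag, for example, a recipe that raised `decode_workers` without raising `decode_nodes`.
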
🟡 All 6 new MTP recipes set `model.precision: "mxfp4"`, but every existing sibling `dsv4` SGLang recipe in `benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/` uses `precision: "fp4"` (even though they share the same `moe-runner-backend: flashinfer_mxfp4`), and the matrix entry `dsv4-fp4-gb300-dynamo-sglang-mtp` itself has `precision: fp4`. Nit: align all 6 MTP recipes to `precision: "fp4"` to match the established convention; this is metadata-only (InferenceX aggregation keys off the matrix-level precision, not the recipe yaml), so runtime impact is minimal.

Extended reasoning...

What the inconsistency is

Each of the 6 new files at `benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/mtp/*.yaml` has `precision: "mxfp4"` under `model:`, whereas all 6 pre-existing sibling recipes at `benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-gb300-*.yaml` use `precision: "fp4"` (line 37 of each), despite carrying the same `moe-runner-backend: "flashinfer_mxfp4"` setting in their `sglang_config`. The matrix entry added in `.github/configs/nvidia-master.yaml` for these MTP recipes also uses `precision: fp4`, and `AGENTS.md` lists only `fp4` and `fp8` as recognized precisions in the project.

Step-by-step proof of the divergence

1. `benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/mtp/disagg-low-latency-1p1d-tp4-tp4.yaml` line 15: `precision: "mxfp4"`.
2. `benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-gb300-1p1d-dep4-dep8-3-c256.yaml` (or any of the 6 sibling recipes added in "Update DeepSeek V4 Pro FP4 GB300 disaggregated SGLang benchmarks" #1295) around line 37: `precision: "fp4"`.
3. Both carry `moe-runner-backend: "flashinfer_mxfp4"` in their `sglang_config.decode` blocks.
4. `.github/configs/nvidia-master.yaml` at the new `dsv4-fp4-gb300-dynamo-sglang-mtp:` block: `precision: fp4`.

So within the same PR, the matrix says `fp4` and the recipe yamls say `mxfp4`, while the equivalent non-MTP sibling recipes that share the same MoE backend say `fp4` at the recipe level too. That is a copy-paste inconsistency with the established convention.

Addressing the refutation: what the runtime impact actually is

The refutation correctly notes that InferenceX's own aggregation pipelines (`utils/summarize.py`, `utils/collect_eval_results.py`, `utils/matrix_logic/generate_sweep_configs.py`, `launch_gb300-cw.sh`) key off the matrix-level `precision` field from `nvidia-master.yaml`, not the recipe yaml's `model.precision`. Since the matrix entry is correctly `fp4`, in-repo aggregation/labeling is unaffected; the original framing of "confusing labels in eval/result aggregation pipelines" overstates the impact. The recipe-level field is consumed externally by srt-slurm/srtctl, and the upstream source (`elvischenv/srt-slurm@dsv4-gb300-disagg-8k1k-mtp`) presumably accepts `mxfp4`. So this is not a runtime breakage.

Why it's still worth fixing

It is purely a cross-recipe metadata uniformity nit: every sibling `dsv4` SGLang recipe in the same directory tree, even ones using the identical `flashinfer_mxfp4` MoE backend, declares `precision: "fp4"` at the recipe level. The `mxfp4` label here will trip up future grep-based audits and contradicts the project-wide enum in `AGENTS.md`. The fix is to replace `precision: "mxfp4"` with `precision: "fp4"` on line 15 of all 6 new MTP recipes; no other change is required.
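
The review mentions grep-based audits of the recipe-level `precision` field. A minimal sketch of such an audit follows, assuming the allowed set is `fp4` and `fp8` as the comment says `AGENTS.md` lists. It scans recipe text with a regular expression instead of a YAML parser so that it needs only the standard library. The file names and sample contents are hypothetical stand-ins, not files from this PR.

```python
import re

# Matches a line such as:   precision: "mxfp4"
PRECISION_RE = re.compile(
    r'^\s*precision:\s*["\']?([A-Za-z0-9_]+)["\']?\s*$', re.MULTILINE
)
ALLOWED = {"fp4", "fp8"}  # assumed enum, per the review's reading of AGENTS.md


def recipe_precisions(text: str) -> list:
    """Return every precision value declared in one recipe's text."""
    return PRECISION_RE.findall(text)


def audit(recipes: dict, allowed=ALLOWED) -> dict:
    """Map recipe name -> precision values that are outside the allowed set."""
    offenders = {}
    for name, text in recipes.items():
        bad = [p for p in recipe_precisions(text) if p not in allowed]
        if bad:
            offenders[name] = bad
    return offenders


# Hypothetical in-memory samples mirroring the two styles under discussion.
samples = {
    "mtp-recipe.yaml": 'model:\n  path: "deepseek-v4-pro"\n  precision: "mxfp4"\n',
    "sibling-recipe.yaml": 'model:\n  path: "deepseek-v4-pro"\n  precision: "fp4"\n',
}

offenders = audit(samples)
print(offenders)
```

In a real checkout the `samples` dictionary would be built by reading each recipe file under the recipes directory; the audit logic itself would stay the same.
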