This repository was archived by the owner on Apr 20, 2026. It is now read-only.
Add MTP (Multi-Token Prediction) recipe variants for H200 configurations #120
Merged
1ac6faa Add MTP recipe variants for H200 8k1k configurations (xutizhou)
09e3cf1 Add MTP recipe variants for H200 1k1k configurations (xutizhou)
3c5f200 Reduce batch sizes for MTP configs based on TRTLLM patterns (xutizhou)
339d084 Expand concurrency ranges and tune MTP memory settings for 8k1k recipes (xutizhou)
d820abf Revert non mtp (ishandhanani)
9c36c45 Merge remote-tracking branch 'origin/main' into xutizhou/main (ishandhanani)
ee525dd Update h200 mtp recipes (ishandhanani)
2e72bef Update h200 mtp recipes (ishandhanani)
@@ -0,0 +1,66 @@

    name: "agg-tp-h200-fp8-mtp"

    model:
      path: "dsfp8"
      container: "lmsysorg/sglang:v0.5.8-cu130-runtime"
      precision: "fp8"

    resources:
      gpu_type: "h200"
      agg_nodes: 1
      agg_workers: 1
      gpus_per_node: 8

    backend:

      # Aggregated environment variables
      aggregated_environment:
        SGLANG_ENABLE_SPEC_V2: "1"
        SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
        SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
        SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
        SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"

      sglang_config:
        aggregated:
          # Model configuration
          served-model-name: "deepseek-ai/DeepSeek-R1"
          model-path: "/model/"
          skip-tokenizer-init: true
          trust-remote-code: true
          watchdog-timeout: 1000000

          # Parallelism
          tp-size: 8
          dp-size: 1

          # KV cache and attention
          attention-backend: "flashinfer"

          # Radix cache disabled
          disable-radix-cache: true

          # Other flags
          stream-interval: 10
          max-running-requests: 128 # sum of all dp

          # Memory and token limits
          mem-fraction-static: 0.75
          max-prefill-tokens: 32768
          chunked-prefill-size: 32768

          # CUDA graphs
          cuda-graph-max-bs: 128

          # MTP settings
          speculative-algorithm: "EAGLE"
          speculative-num-steps: 2
          speculative-eagle-topk: 1
          speculative-num-draft-tokens: 3

    benchmark:
      type: "sa-bench"
      isl: 1024
      osl: 1024
      concurrencies: "1x4x16x32x64x128x256x512"
      req_rate: "inf"
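The MTP settings in the recipe above pair `speculative-num-steps: 2` with `speculative-num-draft-tokens: 3` at `speculative-eagle-topk: 1`. A minimal sketch of a sanity check for that relationship follows. It assumes that with a top-k of 1 the draft is a single chain, so the verified window is the number of draft steps plus one token; that rule is an assumption here, not something the PR states. The helper name `check_mtp_chain` is hypothetical and is not part of SGLang or this repository.

```python
def check_mtp_chain(cfg: dict) -> bool:
    """Return True when a top-k = 1 draft config uses num_steps + 1 draft tokens.

    Assumption (not stated in the PR): with top-k 1 the draft is a linear
    chain, so the verify window is the draft steps plus one token.
    """
    steps = cfg["speculative-num-steps"]
    topk = cfg["speculative-eagle-topk"]
    draft = cfg["speculative-num-draft-tokens"]
    if topk != 1:
        # Tree-shaped drafts follow different sizing rules; this sketch
        # does not attempt to validate them.
        raise ValueError("only top-k = 1 is handled by this sketch")
    return draft == steps + 1


# Values copied from the MTP settings block of the recipe.
mtp = {
    "speculative-algorithm": "EAGLE",
    "speculative-num-steps": 2,
    "speculative-eagle-topk": 1,
    "speculative-num-draft-tokens": 3,
}
print(check_mtp_chain(mtp))
```

A check like this could run before job submission so that a mismatched pair of values is caught in the recipe rather than at server start.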
@@ -0,0 +1,115 @@

    name: "bs256-1p6d-h200-fp8-mtp"

    model:
      path: "dsfp8"
      container: "lmsysorg/sglang:v0.5.8-cu130-runtime"
      precision: "fp8"

    resources:
      gpu_type: "h200"
      prefill_nodes: 1
      prefill_workers: 1
      decode_nodes: 6
      decode_workers: 6
      gpus_per_node: 8

    backend:

      # Prefill-specific environment variables
      prefill_environment:
        SGLANG_ENABLE_SPEC_V2: "1"
        SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
        SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
        SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
        SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"

      # Decode-specific environment variables
      decode_environment:
        SGLANG_ENABLE_SPEC_V2: "1"
        SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
        SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
        SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
        SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"

      sglang_config:
        prefill:
          # Model configuration
          served-model-name: "deepseek-ai/DeepSeek-R1"
          model-path: "/model/"
          skip-tokenizer-init: true
          trust-remote-code: true
          watchdog-timeout: 1000000

          # Parallelism
          tp-size: 8
          dp-size: 8
          ep-size: 8
          enable-dp-attention: true
          # KV cache and attention
          attention-backend: "flashinfer"

          # Radix cache disabled
          disable-radix-cache: true

          # Other flags
          # stream-interval: 50
          max-running-requests: 512


          # Prefill-specific mode
          disaggregation-bootstrap-port: 30001
          disaggregation-mode: "prefill"
          disaggregation-transfer-backend: nixl

          # Memory and token limits
          mem-fraction-static: 0.75
          max-prefill-tokens: 65536
          chunked-prefill-size: 262144

          # Request handling
          load-balance-method: "round_robin"


        decode:
          # Model configuration
          served-model-name: "deepseek-ai/DeepSeek-R1"
          model-path: "/model/"
          skip-tokenizer-init: true
          trust-remote-code: true
          watchdog-timeout: 1000000

          # Parallelism
          tp-size: 8
          dp-size: 8
          ep-size: 8
          enable-dp-attention: true

          # KV cache and attention
          attention-backend: "flashinfer"

          # Other flags
          disable-radix-cache: true
          stream-interval: 10

          # Disagg
          disaggregation-bootstrap-port: 30001
          disaggregation-mode: "decode"
          disaggregation-transfer-backend: nixl

          # Memory and token limits
          mem-fraction-static: 0.75
          max-running-requests: 128
          cuda-graph-max-bs: 128

          # MTP settings
          speculative-algorithm: "EAGLE"
          speculative-num-steps: 2
          speculative-eagle-topk: 1
          speculative-num-draft-tokens: 3

    benchmark:
      type: "sa-bench"
      isl: 1024
      osl: 1024
      concurrencies: "128x256x512x1024x2048"
      req_rate: "inf"
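The `resources` block of this 1p6d recipe implies some simple arithmetic: one prefill node plus six decode nodes at 8 GPUs each, with one worker per node, which lines up with `tp-size: 8` in both roles. The sketch below works through that arithmetic. The `Resources` class is a hypothetical illustration, not the repository's schema, and it assumes that workers of a role split that role's nodes evenly.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    """Hypothetical mirror of the recipe's `resources` block."""

    prefill_nodes: int
    prefill_workers: int
    decode_nodes: int
    decode_workers: int
    gpus_per_node: int

    def total_gpus(self) -> int:
        # Every node contributes the same number of GPUs.
        return (self.prefill_nodes + self.decode_nodes) * self.gpus_per_node

    def gpus_per_worker(self, role: str) -> int:
        # Assumes the workers of a role share that role's nodes evenly.
        nodes = getattr(self, f"{role}_nodes")
        workers = getattr(self, f"{role}_workers")
        return nodes * self.gpus_per_node // workers


# Values copied from the recipe.
res = Resources(prefill_nodes=1, prefill_workers=1,
                decode_nodes=6, decode_workers=6, gpus_per_node=8)
tp_size = 8

# Each worker should own exactly as many GPUs as its tensor-parallel size.
assert res.gpus_per_worker("prefill") == tp_size
assert res.gpus_per_worker("decode") == tp_size
print(res.total_gpus())  # 56
```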
@@ -0,0 +1,115 @@

    name: "bs256-1p6d-h200-fp8-mtp"

    model:
      path: "dsfp8"
      container: "lmsysorg/sglang:v0.5.8-cu130-runtime"
      precision: "fp8"

    resources:
      gpu_type: "h200"
      prefill_nodes: 1
      prefill_workers: 1
      decode_nodes: 6
      decode_workers: 6
      gpus_per_node: 8

    backend:

      # Prefill-specific environment variables
      prefill_environment:
        SGLANG_ENABLE_SPEC_V2: "1"
        SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
        SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
        SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
        SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"

      # Decode-specific environment variables
      decode_environment:
        SGLANG_ENABLE_SPEC_V2: "1"
        SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1"
        SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
        SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
        SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"

      sglang_config:
        prefill:
          # Model configuration
          served-model-name: "deepseek-ai/DeepSeek-R1"
          model-path: "/model/"
          skip-tokenizer-init: true
          trust-remote-code: true
          watchdog-timeout: 1000000

          # Parallelism
          tp-size: 8
          dp-size: 1
          ep-size: 1

          # KV cache and attention
          attention-backend: "flashinfer"

          # Radix cache disabled
          disable-radix-cache: true

          # Other flags
          # stream-interval: 50
          max-running-requests: 512


          # Prefill-specific mode
          disaggregation-bootstrap-port: 30001
          disaggregation-mode: "prefill"
          disaggregation-transfer-backend: nixl

          # Memory and token limits
          mem-fraction-static: 0.7
          max-prefill-tokens: 163840
          chunked-prefill-size: 163840

          # Request handling
          load-balance-method: "round_robin"


        decode:
          # Model configuration
          served-model-name: "deepseek-ai/DeepSeek-R1"
          model-path: "/model/"
          skip-tokenizer-init: true
          trust-remote-code: true
          watchdog-timeout: 1000000

          # Parallelism
          tp-size: 8
          dp-size: 1
          ep-size: 1

          # KV cache and attention
          attention-backend: "flashinfer"

          # Other flags
          disable-radix-cache: true
          stream-interval: 10

          # Disagg
          disaggregation-bootstrap-port: 30001
          disaggregation-mode: "decode"
          disaggregation-transfer-backend: nixl

          # Memory and token limits
          mem-fraction-static: 0.75
          max-running-requests: 128
          cuda-graph-max-bs: 128

          # MTP settings
          speculative-algorithm: "EAGLE"
          speculative-num-steps: 2
          speculative-eagle-topk: 1
          speculative-num-draft-tokens: 3

    benchmark:
      type: "sa-bench"
      isl: 1024
      osl: 1024
      # concurrencies: "128x256x512"
      concurrencies: "512x1024x2048"
      req_rate: "inf"
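The two disaggregated recipes in this PR share a `name` and most of their settings, so the differences are easy to miss when reading the diffs side by side. A rough sketch of how the prefill sections could be compared is shown below. The helper `diff_settings` and the dictionary names are hypothetical; the values are copied from the two diffs, and only the parallelism and memory keys are listed.

```python
def diff_settings(a: dict, b: dict) -> dict:
    """Map each key whose value differs to an (a_value, b_value) pair.

    Keys missing on one side are reported with None for that side.
    """
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


# Prefill values from the variant that enables DP attention.
prefill_dp8 = {
    "tp-size": 8, "dp-size": 8, "ep-size": 8,
    "enable-dp-attention": True,
    "mem-fraction-static": 0.75,
    "max-prefill-tokens": 65536,
    "chunked-prefill-size": 262144,
}
# Prefill values from the variant with dp-size 1.
prefill_dp1 = {
    "tp-size": 8, "dp-size": 1, "ep-size": 1,
    "mem-fraction-static": 0.7,
    "max-prefill-tokens": 163840,
    "chunked-prefill-size": 163840,
}

for key, (left, right) in diff_settings(prefill_dp8, prefill_dp1).items():
    print(f"{key}: {left} -> {right}")
```

Under these inputs the shared `tp-size` is omitted and six keys are reported, including `enable-dp-attention`, which is present only in the first variant.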
🧩 Analysis chain: 5 scripts executed against repository ishandhanani/srt-slurm.
Benchmark concurrencies exceed `max-running-requests` (128).

Line 106 targets 512/1024/2048 while the decode configuration caps at 128 (lines 92–93), so the server queues the excess requests and the benchmark cannot fully exercise the intended concurrency levels. Raise `max-running-requests`/`cuda-graph-max-bs` to support higher concurrencies, or revert to the 128/256/512 set to match current limits.

🔧 Option: align concurrencies with current runtime caps
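The comparison this comment describes can be sketched as a small script. This is an illustration only, not part of the PR or of the review's suggestion. The helpers `parse_concurrencies` and `over_cap` are hypothetical names, and the `workers` parameter rests on the assumption that the cap applies per decode worker and that load spreads evenly across workers.

```python
def parse_concurrencies(spec: str) -> list[int]:
    """Split an 'x'-separated concurrency string such as '512x1024x2048'."""
    return [int(part) for part in spec.split("x")]


def over_cap(spec: str, max_running_requests: int, workers: int = 1) -> list[int]:
    """Return the concurrency levels above the aggregate running-request cap.

    Assumption: the cap applies per worker and load is spread evenly, so the
    aggregate cap is max_running_requests * workers.
    """
    cap = max_running_requests * workers
    return [c for c in parse_concurrencies(spec) if c > cap]


# Values from the recipe under review: decode max-running-requests is 128.
print(over_cap("512x1024x2048", 128))             # [512, 1024, 2048]
print(over_cap("512x1024x2048", 128, workers=6))  # [1024, 2048]
```

Under the same assumption, `over_cap("128x256x512", 128)` still reports 256 and 512 against a single worker's cap, so whether the smaller set "matches current limits" depends on whether the cap is read per worker or across all six decode workers.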