diff --git a/.github/workflows/ci.yaml b/.github/workflows/ci.yaml
index eba897bb..dccdba05 100644
--- a/.github/workflows/ci.yaml
+++ b/.github/workflows/ci.yaml
@@ -4,7 +4,7 @@ on:
push:
branches: [main, master]
pull_request:
- branches: [main, master]
+ branches: [main, master, sa-submission-q2-2026]
jobs:
lint:
@@ -119,3 +119,4 @@ jobs:
exit(1)
print(f'\nAll {len(recipes)} recipes valid')
"
+
diff --git a/docs/accuracy.md b/docs/accuracy.md
index f5588c9f..98b69b46 100644
--- a/docs/accuracy.md
+++ b/docs/accuracy.md
@@ -1,6 +1,6 @@
# Accuracy Benchmarks
-In srt-slurm, users can run different accuracy benchmarks by setting the benchmark section in the config yaml file. Supported benchmarks include `mmlu`, `gpqa` and `longbenchv2`.
+In srt-slurm, users can run different accuracy benchmarks by setting the `benchmark` section in the config YAML file. Supported benchmarks include `mmlu`, `gpqa`, `longbenchv2`, and `lm-eval`.
## Table of Contents
@@ -14,6 +14,7 @@ In srt-slurm, users can run different accuracy benchmarks by setting the benchma
- [Example: Quick Validation](#example-quick-validation)
- [Output](#output)
- [Important Notes](#important-notes)
+- [lm-eval (InferenceX)](#lm-eval-inferencex)
---
@@ -191,3 +192,84 @@ The output includes per-category scores and aggregate metrics:
4. **Categories**: Running specific categories is useful for targeted validation (e.g., just testing summarization capabilities)
+
+## lm-eval (InferenceX)
+
+The `lm-eval` benchmark runner integrates [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) via InferenceX's `benchmark_lib.sh`. Unlike the built-in benchmarks above, this runner sources evaluation logic from an external InferenceX workspace mounted at `/infmax-workspace`.
+
+This is used by InferenceX CI to run evals such as GSM8K and GPQA against NVIDIA multi-node disaggregated deployments on GB200, GB300, B200, B300, H100, and H200. AMD MI355X multi-node evals are handled by InferenceX's upstreamed AMD Slurm path, not by this srt-slurm runner.
+
+In InferenceX CI, recipes normally keep their throughput benchmark configuration. `do_sweep.py` invokes the registered `lm-eval` runner as a post-step when `RUN_EVAL=true`, or as the only benchmark-like step when `EVAL_ONLY=true`. There is no separate `infmax-eval` benchmark type.
+
+### How it works
+
+1. `RuntimeContext` mounts the host path from `INFMAX_WORKSPACE` at `/infmax-workspace` inside the Slurm container.
+2. `do_sweep.py` starts infrastructure, workers, and the frontend for the normal recipe topology.
+3. For `EVAL_ONLY=true`, `do_sweep.py` skips the throughput benchmark stage and runs `_run_post_eval()` directly after frontend startup.
+4. `_run_post_eval()` waits for the OpenAI-compatible endpoint on port 8000 and, in eval-only mode, performs the full `wait_for_model()` health check for the configured prefill/decode or aggregated topology.
+5. `_run_post_eval()` launches the registered `lm-eval` runner on the head node and passes through InferenceX metadata such as framework, precision, sequence length, prefill/decode topology, and eval concurrency.
+6. The runner script (`benchmarks/scripts/lm-eval/bench.sh`) uses `MODEL_NAME` from `do_sweep.py`, or auto-discovers the served model from `/v1/models` as a fallback (see the sketch after this list).
+7. The runner sources `/infmax-workspace/benchmarks/benchmark_lib.sh`, runs `run_eval --framework lm-eval`, and calls `append_lm_eval_summary`.
+8. Eval artifacts are copied to `/logs/eval_results/` for InferenceX launcher-side artifact pickup.
+
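+For step 6, a minimal sketch of the fallback model discovery, assuming an OpenAI-compatible frontend on `localhost:8000` (the real logic lives in `benchmarks/scripts/lm-eval/bench.sh` and may differ):
+
+```bash
+# Fallback: if do_sweep.py did not pass MODEL_NAME, take the first model id
+# advertised by the frontend's OpenAI-compatible /v1/models endpoint.
+if [ -z "${MODEL_NAME:-}" ]; then
+  MODEL_NAME=$(curl -s http://localhost:8000/v1/models | jq -r '.data[0].id')
+fi
+export MODEL_NAME
+```
+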
+### EVAL_ONLY mode
+
+srt-slurm supports an `EVAL_ONLY` mode for CI jobs that should only validate accuracy. This is controlled by environment variables from the InferenceX workflow:
+
+| Env var | Description |
+|---------|-------------|
+| `EVAL_ONLY` | Set to `true` to skip the throughput benchmark stage and run eval only |
+| `RUN_EVAL` | Set to `true` to run eval after the throughput benchmark completes |
+| `EVAL_CONC` | Concurrent requests for lm-eval, normally set by InferenceX from the generated `eval-conc` value |
+| `INFMAX_WORKSPACE` | Host path to the InferenceX checkout that should be mounted at `/infmax-workspace` |
+| `MODEL_NAME` | Served model alias for OpenAI-compatible requests; set by `do_sweep.py` from `config.served_model_name` |
+
+When `EVAL_ONLY=true`:
+- Stage 4 skips the throughput benchmark entirely. No throughput result JSON is expected from srt-slurm.
+- The eval path uses the full `wait_for_model()` health check before starting lm-eval.
+- `_run_post_eval()` launches the `lm-eval` runner and returns its exit code.
+- Eval failure is fatal because eval is the only purpose of the job.
+
+When `RUN_EVAL=true` (without `EVAL_ONLY`):
+- The throughput benchmark runs normally.
+- After the benchmark completes successfully, eval runs as a post-step.
+- Eval failure is non-fatal; the benchmark job still succeeds if throughput passed. Both modes are sketched below.
+
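+A minimal sketch of how the two modes combine, using hypothetical helper names (`run_throughput_benchmark`, `run_post_eval`) for the stages `do_sweep.py` orchestrates:
+
+```bash
+# Hypothetical stage helpers; the real orchestration lives in do_sweep.py.
+if [ "${EVAL_ONLY:-false}" = "true" ]; then
+  # Eval is the only purpose of the job, so its failure is fatal.
+  run_post_eval || exit 1
+elif [ "${RUN_EVAL:-false}" = "true" ]; then
+  # Throughput must pass; eval is a best-effort post-step.
+  run_throughput_benchmark || exit 1
+  run_post_eval || echo "eval failed (non-fatal)"
+else
+  run_throughput_benchmark
+fi
+```
+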
+### Environment variables
+
+The following env vars are passed through to the lm-eval runner container:
+
+| Env var | Purpose |
+|---------|---------|
+| `RUN_EVAL`, `EVAL_ONLY`, `IS_MULTINODE` | Control whether eval runs and how InferenceX classifies the artifact |
+| `FRAMEWORK`, `PRECISION`, `MODEL_PREFIX`, `RUNNER_TYPE`, `SPEC_DECODING` | Benchmark identity metadata for `meta_env.json` |
+| `ISL`, `OSL`, `RESULT_FILENAME` | Sequence length and result-file metadata |
+| `MODEL`, `MODEL_PATH`, `MODEL_NAME` | Model metadata and the served model alias used for requests |
+| `MAX_MODEL_LEN`, `EVAL_MAX_MODEL_LEN` | Context-length metadata used by InferenceX eval helpers when available |
+| `PREFILL_TP`, `PREFILL_EP`, `PREFILL_NUM_WORKERS`, `PREFILL_DP_ATTN` | Prefill-side topology metadata |
+| `DECODE_TP`, `DECODE_EP`, `DECODE_NUM_WORKERS`, `DECODE_DP_ATTN` | Decode-side topology metadata |
+| `EVAL_CONC`, `EVAL_CONCURRENT_REQUESTS` | Eval concurrency controls |
+
+The runner maps srt-slurm's `PREFILL_DP_ATTN` and `DECODE_DP_ATTN` names to InferenceX's `PREFILL_DP_ATTENTION` and `DECODE_DP_ATTENTION` names before calling `append_lm_eval_summary`. This is required for multi-node summary tables to preserve prefill/decode DP-attention (DPA) state.
+
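+A sketch of that mapping, assuming a simple pass-through of the values (the runner's exact handling may differ):
+
+```bash
+# Rename srt-slurm topology vars to the names benchmark_lib.sh expects
+# before calling append_lm_eval_summary.
+export PREFILL_DP_ATTENTION="${PREFILL_DP_ATTN:-}"
+export DECODE_DP_ATTENTION="${DECODE_DP_ATTN:-}"
+```
+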
+### Concurrency
+
+Eval concurrency is ultimately read by InferenceX's `benchmark_lib.sh` from `EVAL_CONCURRENT_REQUESTS`. The runner script sets that value from `EVAL_CONC` when present, preserves an existing `EVAL_CONCURRENT_REQUESTS` otherwise, and falls back to `256` only if neither variable is set:
+
+```bash
+export EVAL_CONCURRENT_REQUESTS="${EVAL_CONC:-${EVAL_CONCURRENT_REQUESTS:-256}}"
+```
+
+The InferenceX workflow sets `EVAL_CONC` from the generated `eval-conc` value. For multi-node configs, InferenceX selects the `8k1k` entry with the highest max eligible concurrency for each `(model, runner, framework, precision, spec-decoding, prefill-dp-attn, decode-dp-attn)` group, then sets `eval-conc` to the upper median of that config's eligible concurrency list. If `EVAL_CONC` is not set in the environment, `do_sweep.py` falls back to the max of the recipe benchmark concurrency list.
+
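+For reference, a sketch of the "upper median" selection on an eligible concurrency list; `upper_median` is a hypothetical helper, not part of srt-slurm or InferenceX:
+
+```bash
+upper_median() {
+  # Sort numerically, then take the element at 0-based index n/2: the plain
+  # median for odd-length lists, the upper of the two middle values for
+  # even-length lists.
+  local sorted=($(printf '%s\n' "$@" | sort -n))
+  echo "${sorted[$((${#sorted[@]} / 2))]}"
+}
+
+upper_median 8 44 192       # -> 44
+upper_median 4 180 360 616  # -> 360
+```
+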
+### Output
+
+Eval artifacts are written to `/logs/eval_results/` inside the container:
+- `meta_env.json` - metadata used by InferenceX aggregation and summary tables
+- `results*.json` - lm-eval scores per task
+- `sample*.jsonl` - per-sample outputs
+
+These are collected by the InferenceX NVIDIA launch scripts and uploaded as workflow artifacts. In eval-only mode the InferenceX workflow expects eval artifacts, not throughput benchmark artifacts.
+
+### Intricacies
+
+1. Eval concurrency floor of 16
+   - One sweep config uses `conc: [1]`, which causes evals to take more than 4 hours to complete; eval concurrency is therefore floored at 16.
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep16_batch32_eplb0_mtp2.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep16_batch32_eplb0_mtp2.yaml
new file mode 100644
index 00000000..21edc148
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep16_batch32_eplb0_mtp2.yaml
@@ -0,0 +1,135 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep16_batch32_eplb0_mtp2"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=32
+# concurrency: 666
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 4
+ gpus_per_decode: 16
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 2
+
+ decode:
+ tensor_parallel_size: 16
+ moe_expert_parallel_size: 16
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 32
+ max_num_tokens: 96
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.7
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 2
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "666"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep16_batch64_eplb0_mtp1.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep16_batch64_eplb0_mtp1.yaml
new file mode 100644
index 00000000..ebcd45d1
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep16_batch64_eplb0_mtp1.yaml
@@ -0,0 +1,139 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep16_batch64_eplb0_mtp1"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=64
+# concurrency: 1229
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 4
+ gpus_per_decode: 16
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 1
+
+ decode:
+ tensor_parallel_size: 16
+ moe_expert_parallel_size: 16
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 64
+ max_num_tokens: 128
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.7
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 1
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "1229"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep32_batch16_allconc_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep32_batch16_allconc_eplb0_mtp3.yaml
new file mode 100644
index 00000000..68af65ee
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep32_batch16_allconc_eplb0_mtp3.yaml
@@ -0,0 +1,133 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep32_batch16_allconc_eplb0_mtp3"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=16
+# concurrencies: 333 (batch8), 666 (batch16)
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 8
+ gpus_per_decode: 32
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+ decode:
+ tensor_parallel_size: 32
+ moe_expert_parallel_size: 32
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 64
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "333x666"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen4tep8_batch16_eplb0_mtp2.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen4tep8_batch16_eplb0_mtp2.yaml
new file mode 100644
index 00000000..d6d3dcf1
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen4tep8_batch16_eplb0_mtp2.yaml
@@ -0,0 +1,134 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep4_gen4tep8_batch16_eplb0_mtp2"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 4 decode workers, TP8/EP8, enable_attention_dp=false, max_batch=16
+# concurrency: 96
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+
+ decode_workers: 4
+ decode_nodes: 8
+ gpus_per_decode: 8
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 2
+
+ decode:
+ allreduce_strategy: MNNVL
+ tensor_parallel_size: 8
+ moe_expert_parallel_size: 8
+ pipeline_parallel_size: 1
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 48
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.9
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 2
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "96"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen4tep8_batch32_allconc_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen4tep8_batch32_allconc_eplb0_mtp3.yaml
new file mode 100644
index 00000000..da187faf
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen4tep8_batch32_allconc_eplb0_mtp3.yaml
@@ -0,0 +1,136 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep4_gen4tep8_batch32_allconc_eplb0_mtp3"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 4 decode workers, TP8/EP8, enable_attention_dp=false, max_batch=32
+# concurrencies: 8 (batch1), 44 (batch8), 192 (batch32)
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+
+ decode_workers: 4
+ decode_nodes: 8
+ gpus_per_decode: 8
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+ decode:
+ allreduce_strategy: MNNVL
+ tensor_parallel_size: 8
+ moe_expert_parallel_size: 8
+ pipeline_parallel_size: 1
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 32
+ max_num_tokens: 128
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.9
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "8x44x192"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen5tep4_batch1_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen5tep4_batch1_eplb0_mtp3.yaml
new file mode 100644
index 00000000..a6121cd0
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen5tep4_batch1_eplb0_mtp3.yaml
@@ -0,0 +1,131 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep4_gen5tep4_batch1_eplb0_mtp3"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 5 decode workers, TP4/EP4, enable_attention_dp=false, max_batch=1
+# concurrency: 10
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+
+ decode_workers: 5
+ decode_nodes: 5
+ gpus_per_decode: 4
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+ decode:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 1
+ max_num_tokens: 4
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.85
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "10"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen1dep16_batch256_eplb256_mtp1.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen1dep16_batch256_eplb256_mtp1.yaml
new file mode 100644
index 00000000..dc176b2d
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen1dep16_batch256_eplb256_mtp1.yaml
@@ -0,0 +1,167 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx2dep4_gen1dep16_batch256_eplb256_mtp1"
+
+# ctx: 2 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=256
+# EPLB: num_slots=256
+# concurrency: 4301
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 2
+ prefill_workers: 2
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 4
+ gpus_per_decode: 16
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 1
+
+ decode:
+ tensor_parallel_size: 16
+ moe_expert_parallel_size: 16
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 256
+ max_num_tokens: 512
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ - 72
+ - 80
+ - 88
+ - 96
+ - 104
+ - 112
+ - 120
+ - 128
+ - 136
+ - 144
+ - 152
+ - 160
+ - 168
+ - 176
+ - 184
+ - 192
+ - 200
+ - 208
+ - 216
+ - 224
+ - 232
+ - 240
+ - 248
+ - 256
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ load_balancer:
+ layer_updates_per_iter: 1
+ num_slots: 256
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.7
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 1
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "4301"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx3dep4_gen1dep32_batch128_eplb288_mtp1.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx3dep4_gen1dep32_batch128_eplb288_mtp1.yaml
new file mode 100644
index 00000000..a7a1c790
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx3dep4_gen1dep32_batch128_eplb288_mtp1.yaml
@@ -0,0 +1,151 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx3dep4_gen1dep32_batch128_eplb288_mtp1"
+
+# ctx: 3 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=128
+# EPLB: num_slots=288
+# concurrency: 4301
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 3
+ prefill_workers: 3
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 8
+ gpus_per_decode: 32
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 1
+
+ decode:
+ tensor_parallel_size: 32
+ moe_expert_parallel_size: 32
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 128
+ max_num_tokens: 256
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ - 72
+ - 80
+ - 88
+ - 96
+ - 104
+ - 112
+ - 120
+ - 128
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ load_balancer:
+ layer_updates_per_iter: 1
+ num_slots: 288
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 1
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "4301"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep32_batch32_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep32_batch32_eplb0_mtp0.yaml
new file mode 100644
index 00000000..7412a109
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep32_batch32_eplb0_mtp0.yaml
@@ -0,0 +1,129 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep32_batch32_eplb0_mtp0"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=32
+# concurrency: 1229
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 8
+ gpus_per_decode: 32
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 32
+ moe_expert_parallel_size: 32
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 32
+ max_num_tokens: 32
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.7
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "1229"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0.yaml
new file mode 100644
index 00000000..e969c07d
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0.yaml
@@ -0,0 +1,142 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 4 decode workers, TP8/EP8, enable_attention_dp=false, max_batch=128
+# Merged concurrencies: batch1(4), batch32(180), batch64(360), batch128(616)
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+
+ decode_workers: 4
+ decode_nodes: 8
+ gpus_per_decode: 8
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ allreduce_strategy: MNNVL
+ tensor_parallel_size: 8
+ moe_expert_parallel_size: 8
+ pipeline_parallel_size: 1
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 128
+ max_num_tokens: 128
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ - 72
+ - 80
+ - 88
+ - 96
+ - 104
+ - 112
+ - 120
+ - 128
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.9
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "4x180x360x616"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml
new file mode 100644
index 00000000..fb583747
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml
@@ -0,0 +1,126 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 5 decode workers, TP4/EP4, enable_attention_dp=false, max_batch=8
+# Merged concurrencies: batch1(5), batch2(15), batch4(30), batch8(50)
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+
+ decode_workers: 5
+ decode_nodes: 5
+ gpus_per_decode: 4
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 8
+ max_num_tokens: 8
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.9
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "5x15x30x50"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep16_batch128_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep16_batch128_eplb0_mtp0.yaml
new file mode 100644
index 00000000..e057ce05
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep16_batch128_eplb0_mtp0.yaml
@@ -0,0 +1,141 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx2dep4_gen1dep16_batch128_eplb0_mtp0"
+
+# ctx: 2 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=128
+# concurrency: 2253
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 2
+ prefill_workers: 2
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 4
+ gpus_per_decode: 16
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 16
+ moe_expert_parallel_size: 16
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 128
+ max_num_tokens: 128
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ - 72
+ - 80
+ - 88
+ - 96
+ - 104
+ - 112
+ - 120
+ - 128
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.75
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "2253"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep16_batch512_eplb256_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep16_batch512_eplb256_mtp0.yaml
new file mode 100644
index 00000000..d221dde2
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep16_batch512_eplb256_mtp0.yaml
@@ -0,0 +1,193 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx2dep4_gen1dep16_batch512_eplb256_mtp0"
+
+# ctx: 2 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=512
+# EPLB: num_slots=256
+# concurrency: 8192
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 2
+ prefill_workers: 2
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 4
+ gpus_per_decode: 16
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 16
+ moe_expert_parallel_size: 16
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 512
+ max_num_tokens: 512
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ - 72
+ - 80
+ - 88
+ - 96
+ - 104
+ - 112
+ - 120
+ - 128
+ - 136
+ - 144
+ - 152
+ - 160
+ - 168
+ - 176
+ - 184
+ - 192
+ - 200
+ - 208
+ - 216
+ - 224
+ - 232
+ - 240
+ - 248
+ - 256
+ - 264
+ - 272
+ - 280
+ - 288
+ - 296
+ - 304
+ - 312
+ - 320
+ - 328
+ - 336
+ - 344
+ - 352
+ - 360
+ - 368
+ - 376
+ - 384
+ - 392
+ - 400
+ - 408
+ - 416
+ - 424
+ - 432
+ - 440
+ - 448
+ - 456
+ - 464
+ - 472
+ - 480
+ - 488
+ - 496
+ - 504
+ - 512
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ load_balancer:
+ layer_updates_per_iter: 1
+ num_slots: 256
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.75
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "8192"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep32_batch64_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep32_batch64_eplb0_mtp0.yaml
new file mode 100644
index 00000000..bbad79c1
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep32_batch64_eplb0_mtp0.yaml
@@ -0,0 +1,133 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx2dep4_gen1dep32_batch64_eplb0_mtp0"
+
+# ctx: 2 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=64
+# concurrency: 2253
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 2
+ prefill_workers: 2
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 8
+ gpus_per_decode: 32
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 32
+ moe_expert_parallel_size: 32
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 64
+ max_num_tokens: 64
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.7
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "2253"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx4dep4_gen1dep32_batch256_eplb288_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx4dep4_gen1dep32_batch256_eplb288_mtp0.yaml
new file mode 100644
index 00000000..26d2d29e
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx4dep4_gen1dep32_batch256_eplb288_mtp0.yaml
@@ -0,0 +1,161 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx4dep4_gen1dep32_batch256_eplb288_mtp0"
+
+# ctx: 4 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=256
+# EPLB: num_slots=288
+# concurrency: 8192
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 4
+ prefill_workers: 4
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 8
+ gpus_per_decode: 32
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 32
+ moe_expert_parallel_size: 32
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 256
+ max_num_tokens: 256
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ - 72
+ - 80
+ - 88
+ - 96
+ - 104
+ - 112
+ - 120
+ - 128
+ - 136
+ - 144
+ - 152
+ - 160
+ - 168
+ - 176
+ - 184
+ - 192
+ - 200
+ - 208
+ - 216
+ - 224
+ - 232
+ - 240
+ - 248
+ - 256
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ load_balancer:
+ layer_updates_per_iter: 1
+ num_slots: 288
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.7
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "8192"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx10dep4_gen1dep16_batch64_eplb0_mtp1.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx10dep4_gen1dep16_batch64_eplb0_mtp1.yaml
new file mode 100644
index 00000000..420192c2
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx10dep4_gen1dep16_batch64_eplb0_mtp1.yaml
@@ -0,0 +1,139 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx10dep4_gen1dep16_batch64_eplb0_mtp1"
+
+# ctx: 10 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=64
+# concurrency: 1229
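+# Sizing note: decode max_num_tokens 128 = max_batch 64 x 2, i.e. one target
+#   token plus one MTP draft token (num_nextn_predict_layers: 1) per request
+#   per scheduler step; the same batch x (1 + MTP depth) pattern holds across
+#   the MTP recipes.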
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 10
+ prefill_workers: 10
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 4
+ gpus_per_decode: 16
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 1
+
+ decode:
+ tensor_parallel_size: 16
+ moe_expert_parallel_size: 16
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 64
+ max_num_tokens: 128
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.7
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 1
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "1229"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen2tep8_batch16_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen2tep8_batch16_eplb0_mtp3.yaml
new file mode 100644
index 00000000..da3186e5
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen2tep8_batch16_eplb0_mtp3.yaml
@@ -0,0 +1,133 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep4_gen2tep8_batch16_eplb0_mtp3"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 2 decode workers, TP8/EP8, enable_attention_dp=false, max_batch=16, concurrency: 46
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+
+ decode_workers: 2
+ decode_nodes: 4
+ gpus_per_decode: 8
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+ decode:
+ allreduce_strategy: MNNVL
+ tensor_parallel_size: 8
+ moe_expert_parallel_size: 8
+ pipeline_parallel_size: 1
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 64
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.9
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "46"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen4tep8_batch8_allconc_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen4tep8_batch8_allconc_eplb0_mtp3.yaml
new file mode 100644
index 00000000..fb94a549
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen4tep8_batch8_allconc_eplb0_mtp3.yaml
@@ -0,0 +1,133 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep4_gen4tep8_batch8_allconc_eplb0_mtp3"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 4 decode workers, TP8/EP8, enable_attention_dp=false, max_batch=8
+# concurrencies: 4 (batch1), 48 (batch8)
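+# The x-separated concurrencies string ("4x48" below) runs each listed value
+#   as a separate concurrency point in a single benchmark sweep.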
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+
+ decode_workers: 4
+ decode_nodes: 8
+ gpus_per_decode: 8
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+ decode:
+ allreduce_strategy: MNNVL
+ tensor_parallel_size: 8
+ moe_expert_parallel_size: 8
+ pipeline_parallel_size: 1
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 8
+ max_num_tokens: 32
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.9
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "4x48"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen5tep4_batch1_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen5tep4_batch1_eplb0_mtp3.yaml
new file mode 100644
index 00000000..0a13cce4
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen5tep4_batch1_eplb0_mtp3.yaml
@@ -0,0 +1,130 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep4_gen5tep4_batch1_eplb0_mtp3"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 5 decode workers, TP4/EP4, enable_attention_dp=false, max_batch=1, concurrency: 5
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+
+ decode_workers: 5
+ decode_nodes: 5
+ gpus_per_decode: 4
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+ decode:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 1
+ max_num_tokens: 4
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.85
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "5"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx3dep4_gen1dep32_batch4_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx3dep4_gen1dep32_batch4_eplb0_mtp3.yaml
new file mode 100644
index 00000000..440a4f73
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx3dep4_gen1dep32_batch4_eplb0_mtp3.yaml
@@ -0,0 +1,130 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx3dep4_gen1dep32_batch4_eplb0_mtp3"
+
+# ctx: 3 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=4, concurrency: 167
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 3
+ prefill_workers: 3
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 8
+ gpus_per_decode: 32
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+ decode:
+ tensor_parallel_size: 32
+ moe_expert_parallel_size: 32
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 4
+ max_num_tokens: 16
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "167"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx5dep4_gen1dep32_batch8_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx5dep4_gen1dep32_batch8_eplb0_mtp3.yaml
new file mode 100644
index 00000000..492f1b4c
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx5dep4_gen1dep32_batch8_eplb0_mtp3.yaml
@@ -0,0 +1,132 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx5dep4_gen1dep32_batch8_eplb0_mtp3"
+
+# ctx: 5 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=8
+# concurrency: 333
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 5
+ prefill_workers: 5
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 8
+ gpus_per_decode: 32
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+ decode:
+ tensor_parallel_size: 32
+ moe_expert_parallel_size: 32
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 8
+ max_num_tokens: 32
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "333"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx7dep4_gen1dep16_batch32_eplb0_mtp2.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx7dep4_gen1dep16_batch32_eplb0_mtp2.yaml
new file mode 100644
index 00000000..d22fbcf1
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx7dep4_gen1dep16_batch32_eplb0_mtp2.yaml
@@ -0,0 +1,135 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx7dep4_gen1dep16_batch32_eplb0_mtp2"
+
+# ctx: 7 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=32
+# concurrency: 615
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 7
+ prefill_workers: 7
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 4
+ gpus_per_decode: 16
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 2
+
+ decode:
+ tensor_parallel_size: 16
+ moe_expert_parallel_size: 16
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 32
+ max_num_tokens: 96
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.7
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 2
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "615"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx7dep4_gen1dep8_batch128_eplb0_mtp1.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx7dep4_gen1dep8_batch128_eplb0_mtp1.yaml
new file mode 100644
index 00000000..804e89b5
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx7dep4_gen1dep8_batch128_eplb0_mtp1.yaml
@@ -0,0 +1,147 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx7dep4_gen1dep8_batch128_eplb0_mtp1"
+
+# ctx: 7 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP8/EP8, enable_attention_dp=true, max_batch=128
+# concurrency: 1076
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 7
+ prefill_workers: 7
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 2
+ gpus_per_decode: 8
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 1
+
+ decode:
+ tensor_parallel_size: 8
+ moe_expert_parallel_size: 8
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 128
+ max_num_tokens: 256
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ - 72
+ - 80
+ - 88
+ - 96
+ - 104
+ - 112
+ - 120
+ - 128
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.8
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 1
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "1076"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx10dep4_gen1dep16_batch128_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx10dep4_gen1dep16_batch128_eplb0_mtp0.yaml
new file mode 100644
index 00000000..0fa8566d
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx10dep4_gen1dep16_batch128_eplb0_mtp0.yaml
@@ -0,0 +1,141 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx10dep4_gen1dep16_batch128_eplb0_mtp0"
+
+# ctx: 10 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=128
+# concurrency: 2253
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 10
+ prefill_workers: 10
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 4
+ gpus_per_decode: 16
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 16
+ moe_expert_parallel_size: 16
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 128
+ max_num_tokens: 128
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ - 72
+ - 80
+ - 88
+ - 96
+ - 104
+ - 112
+ - 120
+ - 128
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.8
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "2253"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen2tep8_batch32_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen2tep8_batch32_eplb0_mtp0.yaml
new file mode 100644
index 00000000..478f6203
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen2tep8_batch32_eplb0_mtp0.yaml
@@ -0,0 +1,130 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep4_gen2tep8_batch32_eplb0_mtp0"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 2 decode workers, TP8/EP8, enable_attention_dp=false, max_batch=32
+# concurrency: 84
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+
+ decode_workers: 2
+ decode_nodes: 4
+ gpus_per_decode: 8
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ allreduce_strategy: MNNVL
+ tensor_parallel_size: 8
+ moe_expert_parallel_size: 8
+ pipeline_parallel_size: 1
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 32
+ max_num_tokens: 32
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.9
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "84"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen3tep4_batch32_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen3tep4_batch32_eplb0_mtp0.yaml
new file mode 100644
index 00000000..462401b6
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen3tep4_batch32_eplb0_mtp0.yaml
@@ -0,0 +1,129 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep4_gen3tep4_batch32_eplb0_mtp0"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 3 decode workers, TP4/EP4, enable_attention_dp=false, max_batch=32
+# concurrency: 117
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+
+ decode_workers: 3
+ decode_nodes: 3
+ gpus_per_decode: 4
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 32
+ max_num_tokens: 32
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.9
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "117"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml
new file mode 100644
index 00000000..90e62af3
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml
@@ -0,0 +1,126 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 5 decode workers, TP4/EP4, enable_attention_dp=false, max_batch=8
+# Merged concurrencies: 5 (batch1), 10 (batch2), 25 (batch4), 50 (batch8)
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+
+ decode_workers: 5
+ decode_nodes: 5
+ gpus_per_decode: 4
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 8
+ max_num_tokens: 8
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.9
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "5x10x25x50"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx5dep4_gen1dep32_batch16_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx5dep4_gen1dep32_batch16_eplb0_mtp0.yaml
new file mode 100644
index 00000000..7a6ece31
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx5dep4_gen1dep32_batch16_eplb0_mtp0.yaml
@@ -0,0 +1,127 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx5dep4_gen1dep32_batch16_eplb0_mtp0"
+
+# ctx: 5 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=16
+# concurrency: 615
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 5
+ prefill_workers: 5
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 8
+ gpus_per_decode: 32
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 32
+ moe_expert_parallel_size: 32
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.75
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "615"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx8dep4_gen1dep32_batch32_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx8dep4_gen1dep32_batch32_eplb0_mtp0.yaml
new file mode 100644
index 00000000..7e34b6d9
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx8dep4_gen1dep32_batch32_eplb0_mtp0.yaml
@@ -0,0 +1,129 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx8dep4_gen1dep32_batch32_eplb0_mtp0"
+
+# ctx: 8 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=32
+# concurrency: 1229
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 8
+ prefill_workers: 8
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 8
+ gpus_per_decode: 32
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 32
+ moe_expert_parallel_size: 32
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 32
+ max_num_tokens: 32
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.75
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "1229"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen1dep32_batch8_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen1dep32_batch8_eplb0_mtp3.yaml
new file mode 100644
index 00000000..80aacc6a
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen1dep32_batch8_eplb0_mtp3.yaml
@@ -0,0 +1,132 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep2_gen1dep32_batch8_eplb0_mtp3"
+
+# ctx: 1 prefill worker, TP2/EP2
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=8
+# concurrency: 333
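+# Note: the gb300 recipes run prefill at TP2/EP2 (vs TP4/EP4 on gb200),
+#   presumably enabled by the larger per-GPU memory on GB300.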
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 2
+
+ decode_workers: 1
+ decode_nodes: 8
+ gpus_per_decode: 32
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+ decode:
+ tensor_parallel_size: 32
+ moe_expert_parallel_size: 32
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 8
+ max_num_tokens: 32
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "333"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen4tep8_batch16_allconc_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen4tep8_batch16_allconc_eplb0_mtp3.yaml
new file mode 100644
index 00000000..648ec949
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen4tep8_batch16_allconc_eplb0_mtp3.yaml
@@ -0,0 +1,134 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep2_gen4tep8_batch16_allconc_eplb0_mtp3"
+
+# ctx: 1 prefill worker, TP2/EP2
+# gen: 4 decode workers, TP8/EP8, enable_attention_dp=false, max_batch=16
+# concurrencies: 24 (batch4), 44 (batch8), 92 (batch16)
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 2
+
+ decode_workers: 4
+ decode_nodes: 8
+ gpus_per_decode: 8
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+ decode:
+ allreduce_strategy: MNNVL
+ tensor_parallel_size: 8
+ moe_expert_parallel_size: 8
+ pipeline_parallel_size: 1
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 64
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.9
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "24x44x92"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen4tep8_batch32_eplb0_mtp2.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen4tep8_batch32_eplb0_mtp2.yaml
new file mode 100644
index 00000000..823624ac
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen4tep8_batch32_eplb0_mtp2.yaml
@@ -0,0 +1,136 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep2_gen4tep8_batch32_eplb0_mtp2"
+
+# ctx: 1 prefill worker, TP2/EP2
+# gen: 4 decode workers, TP8/EP8, enable_attention_dp=false, max_batch=32
+# concurrency: 180
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 2
+
+ decode_workers: 4
+ decode_nodes: 8
+ gpus_per_decode: 8
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 2
+
+ decode:
+ allreduce_strategy: MNNVL
+ tensor_parallel_size: 8
+ moe_expert_parallel_size: 8
+ pipeline_parallel_size: 1
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 32
+ max_num_tokens: 96
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.9
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 2
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "180"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen5tep4_batch1_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen5tep4_batch1_eplb0_mtp3.yaml
new file mode 100644
index 00000000..64b61b9f
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen5tep4_batch1_eplb0_mtp3.yaml
@@ -0,0 +1,131 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep2_gen5tep4_batch1_eplb0_mtp3"
+
+# ctx: 1 prefill worker, TP2/EP2
+# gen: 5 decode workers, TP4/EP4, enable_attention_dp=false, max_batch=1
+# concurrency: 10
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 2
+
+ decode_workers: 5
+ decode_nodes: 5
+ gpus_per_decode: 4
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+ decode:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 1
+ max_num_tokens: 4
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.85
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "10"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx2dep2_gen1dep16_batch64_eplb0_mtp2.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx2dep2_gen1dep16_batch64_eplb0_mtp2.yaml
new file mode 100644
index 00000000..66d211aa
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx2dep2_gen1dep16_batch64_eplb0_mtp2.yaml
@@ -0,0 +1,139 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx2dep2_gen1dep16_batch64_eplb0_mtp2"
+
+# ctx: 2 prefill workers, TP2/EP2
+# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=64
+# concurrency: 1229
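+#
+# Decode token budget (derived): with MTP, each request can emit
+# 1 + num_nextn_predict_layers tokens per step, so the decode section sets
+# max_num_tokens = 64 x (1 + 2) = 192.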
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 1
+ prefill_workers: 2
+ gpus_per_prefill: 2
+
+ decode_workers: 1
+ decode_nodes: 4
+ gpus_per_decode: 16
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 2
+
+ decode:
+ tensor_parallel_size: 16
+ moe_expert_parallel_size: 16
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 64
+ max_num_tokens: 192
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.7
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 2
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "1229"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx2dep2_gen1dep32_batch16_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx2dep2_gen1dep32_batch16_eplb0_mtp3.yaml
new file mode 100644
index 00000000..fe754372
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx2dep2_gen1dep32_batch16_eplb0_mtp3.yaml
@@ -0,0 +1,137 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx2dep2_gen1dep32_batch16_eplb0_mtp3"
+
+# ctx: 2 prefill workers, TP2/EP2
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=16
+# concurrency: 666
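+#
+# Sequence-length sizing (observed convention across these recipes): prefill
+# max_seq_len 1064 = ISL 1024 + 40-token margin; decode max_seq_len 2088 =
+# ISL 1024 + OSL 1024 + 40.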
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 1
+ prefill_workers: 2
+ gpus_per_prefill: 2
+
+ decode_workers: 1
+ decode_nodes: 8
+ gpus_per_decode: 32
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+ decode:
+ tensor_parallel_size: 32
+ moe_expert_parallel_size: 32
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 64
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "666"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx3dep2_gen1dep32_batch32_eplb0_mtp2.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx3dep2_gen1dep32_batch32_eplb0_mtp2.yaml
new file mode 100644
index 00000000..70821f3e
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx3dep2_gen1dep32_batch32_eplb0_mtp2.yaml
@@ -0,0 +1,135 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx3dep2_gen1dep32_batch32_eplb0_mtp2"
+
+# ctx: 3 prefill workers, TP2/EP2
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=32
+# concurrency: 1229
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 2
+ prefill_workers: 3
+ gpus_per_prefill: 2
+
+ decode_workers: 1
+ decode_nodes: 8
+ gpus_per_decode: 32
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 2
+
+ decode:
+ tensor_parallel_size: 32
+ moe_expert_parallel_size: 32
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 32
+ max_num_tokens: 96
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 2
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "1229"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx4dep2_gen1dep16_batch256_eplb256_mtp1.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx4dep2_gen1dep16_batch256_eplb256_mtp1.yaml
new file mode 100644
index 00000000..bf3183b7
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx4dep2_gen1dep16_batch256_eplb256_mtp1.yaml
@@ -0,0 +1,170 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx4dep2_gen1dep16_batch256_eplb256_mtp1"
+
+# ctx: 4 prefill workers, TP2/EP2
+# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, EPLB: num_slots=256, max_batch=256
+# concurrency: 4301
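+#
+# EPLB sizing (derived): load_balancer num_slots 256 over EP16 gives
+# 256 / 16 = 16 expert slots per rank; layer_updates_per_iter: 1 presumably
+# rebalances one MoE layer per iteration.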
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 2
+ prefill_workers: 4
+ gpus_per_prefill: 2
+
+ decode_workers: 1
+ decode_nodes: 4
+ gpus_per_decode: 16
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 1
+
+ decode:
+ tensor_parallel_size: 16
+ moe_expert_parallel_size: 16
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 256
+ max_num_tokens: 512
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ - 72
+ - 80
+ - 88
+ - 96
+ - 104
+ - 112
+ - 120
+ - 128
+ - 136
+ - 144
+ - 152
+ - 160
+ - 168
+ - 176
+ - 184
+ - 192
+ - 200
+ - 208
+ - 216
+ - 224
+ - 232
+ - 240
+ - 248
+ - 256
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ load_balancer:
+ layer_updates_per_iter: 1
+ num_slots: 256
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.7
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 1
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "4301"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx5dep2_gen2dep8_batch512_eplb0_mtp1.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx5dep2_gen2dep8_batch512_eplb0_mtp1.yaml
new file mode 100644
index 00000000..1d9f4f10
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx5dep2_gen2dep8_batch512_eplb0_mtp1.yaml
@@ -0,0 +1,199 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx5dep2_gen2dep8_batch512_eplb0_mtp1"
+
+# ctx: 5 prefill workers, TP2/EP2
+# gen: 2 decode workers, TP8/EP8, enable_attention_dp=true, max_batch=512
+# concurrency: 8602
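+#
+# CUDA graph coverage (derived): decode batch_sizes enumerate 1, 2, 4, then
+# every multiple of 8 up to max_batch_size 512, so with enable_padding a
+# runtime batch should pad to a captured size at most 7 slots larger.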
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 3
+ prefill_workers: 5
+ gpus_per_prefill: 2
+
+ decode_workers: 2
+ decode_nodes: 4
+ gpus_per_decode: 8
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 1
+
+ decode:
+ tensor_parallel_size: 8
+ moe_expert_parallel_size: 8
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 512
+ max_num_tokens: 1024
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ - 72
+ - 80
+ - 88
+ - 96
+ - 104
+ - 112
+ - 120
+ - 128
+ - 136
+ - 144
+ - 152
+ - 160
+ - 168
+ - 176
+ - 184
+ - 192
+ - 200
+ - 208
+ - 216
+ - 224
+ - 232
+ - 240
+ - 248
+ - 256
+ - 264
+ - 272
+ - 280
+ - 288
+ - 296
+ - 304
+ - 312
+ - 320
+ - 328
+ - 336
+ - 344
+ - 352
+ - 360
+ - 368
+ - 376
+ - 384
+ - 392
+ - 400
+ - 408
+ - 416
+ - 424
+ - 432
+ - 440
+ - 448
+ - 456
+ - 464
+ - 472
+ - 480
+ - 488
+ - 496
+ - 504
+ - 512
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.8
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 1
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "8602"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx6dep2_gen1dep32_batch128_eplb288_mtp1.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx6dep2_gen1dep32_batch128_eplb288_mtp1.yaml
new file mode 100644
index 00000000..44b81b3c
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx6dep2_gen1dep32_batch128_eplb288_mtp1.yaml
@@ -0,0 +1,153 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx6dep2_gen1dep32_batch128_eplb288_mtp1"
+
+# ctx: 6 prefill workers, TP2/EP2
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, EPLB: num_slots=288, max_batch=128
+# concurrency: 4301
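+#
+# EPLB sizing (derived): num_slots 288 over EP32 gives 288 / 32 = 9 expert
+# slots per rank.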
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 3
+ prefill_workers: 6
+ gpus_per_prefill: 2
+
+ decode_workers: 1
+ decode_nodes: 8
+ gpus_per_decode: 32
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 1
+
+ decode:
+ tensor_parallel_size: 32
+ moe_expert_parallel_size: 32
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 128
+ max_num_tokens: 256
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ - 72
+ - 80
+ - 88
+ - 96
+ - 104
+ - 112
+ - 120
+ - 128
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ load_balancer:
+ layer_updates_per_iter: 1
+ num_slots: 288
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 1
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "4301"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx1dep2_gen1dep32_batch16_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx1dep2_gen1dep32_batch16_eplb0_mtp0.yaml
new file mode 100644
index 00000000..0410623b
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx1dep2_gen1dep32_batch16_eplb0_mtp0.yaml
@@ -0,0 +1,131 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep2_gen1dep32_batch16_eplb0_mtp0"
+
+# ctx: 1 prefill worker, TP2/EP2
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=16
+# concurrency: 615
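+#
+# STP recipes (mtp0) omit speculative_config, so the decode token budget is
+# one token per request per step: max_num_tokens = max_batch_size = 16 in
+# the decode section below.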
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 2
+
+ decode_workers: 1
+ decode_nodes: 8
+ gpus_per_decode: 32
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 32
+ moe_expert_parallel_size: 32
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.7
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "615"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx1dep2_gen4tep8_batch64_allconc_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx1dep2_gen4tep8_batch64_allconc_eplb0_mtp0.yaml
new file mode 100644
index 00000000..d967e3b2
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx1dep2_gen4tep8_batch64_allconc_eplb0_mtp0.yaml
@@ -0,0 +1,138 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep2_gen4tep8_batch64_allconc_eplb0_mtp0"
+
+# ctx: 1 prefill worker, TP2/EP2
+# gen: 4 decode workers, TP8/EP8 (MNNVL), enable_attention_dp=false, max_batch=64
+# Merged concurrencies: 84 (batch 16), 180 (batch 32), 336 (batch 64)
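+#
+# The "x"-separated concurrencies string below ("84x180x336") appears to ask
+# sa-bench to sweep all three concurrency levels in a single run instead of
+# three separate recipes.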
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 2
+
+ decode_workers: 4
+ decode_nodes: 8
+ gpus_per_decode: 8
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ allreduce_strategy: MNNVL
+ tensor_parallel_size: 8
+ moe_expert_parallel_size: 8
+ pipeline_parallel_size: 1
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 64
+ max_num_tokens: 64
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.9
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "84x180x336"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx1dep2_gen5tep4_batch4_allconc_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx1dep2_gen5tep4_batch4_allconc_eplb0_mtp0.yaml
new file mode 100644
index 00000000..d9f9ea2f
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx1dep2_gen5tep4_batch4_allconc_eplb0_mtp0.yaml
@@ -0,0 +1,125 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep2_gen5tep4_batch4_allconc_eplb0_mtp0"
+
+# ctx: 1 prefill worker, TP2/EP2
+# gen: 5 decode workers, TP4/EP4, enable_attention_dp=false, max_batch=4
+# Merged concurrencies: 5 (batch 1), 10 (batch 2), 25 (batch 4)
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 2
+
+ decode_workers: 5
+ decode_nodes: 5
+ gpus_per_decode: 4
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 4
+ max_num_tokens: 4
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.9
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "5x10x25"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx2dep2_gen1dep32_batch32_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx2dep2_gen1dep32_batch32_eplb0_mtp0.yaml
new file mode 100644
index 00000000..26ddd7b1
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx2dep2_gen1dep32_batch32_eplb0_mtp0.yaml
@@ -0,0 +1,129 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx2dep2_gen1dep32_batch32_eplb0_mtp0"
+
+# ctx: 2 prefill workers, TP2/EP2
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=32
+# concurrency: 1229
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 1
+ prefill_workers: 2
+ gpus_per_prefill: 2
+
+ decode_workers: 1
+ decode_nodes: 8
+ gpus_per_decode: 32
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 32
+ moe_expert_parallel_size: 32
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 32
+ max_num_tokens: 32
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.7
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "1229"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx3dep2_gen1dep32_batch64_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx3dep2_gen1dep32_batch64_eplb0_mtp0.yaml
new file mode 100644
index 00000000..081e96da
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx3dep2_gen1dep32_batch64_eplb0_mtp0.yaml
@@ -0,0 +1,133 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx3dep2_gen1dep32_batch64_eplb0_mtp0"
+
+# ctx: 3 prefill workers, TP2/EP2
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=64
+# concurrency: 2253
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 2
+ prefill_workers: 3
+ gpus_per_prefill: 2
+
+ decode_workers: 1
+ decode_nodes: 8
+ gpus_per_decode: 32
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 32
+ moe_expert_parallel_size: 32
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 64
+ max_num_tokens: 64
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.7
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "2253"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx4dep2_gen1dep16_batch512_eplb256_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx4dep2_gen1dep16_batch512_eplb256_mtp0.yaml
new file mode 100644
index 00000000..dbca4fd5
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx4dep2_gen1dep16_batch512_eplb256_mtp0.yaml
@@ -0,0 +1,192 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx4dep2_gen1dep16_batch512_eplb256_mtp0"
+
+# ctx: 4 prefill workers, TP2/EP2
+# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, EPLB: num_slots=256, max_batch=512
+# concurrency: 8192
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 2
+ prefill_workers: 4
+ gpus_per_prefill: 2
+
+ decode_workers: 1
+ decode_nodes: 4
+ gpus_per_decode: 16
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 16
+ moe_expert_parallel_size: 16
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 512
+ max_num_tokens: 512
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ - 72
+ - 80
+ - 88
+ - 96
+ - 104
+ - 112
+ - 120
+ - 128
+ - 136
+ - 144
+ - 152
+ - 160
+ - 168
+ - 176
+ - 184
+ - 192
+ - 200
+ - 208
+ - 216
+ - 224
+ - 232
+ - 240
+ - 248
+ - 256
+ - 264
+ - 272
+ - 280
+ - 288
+ - 296
+ - 304
+ - 312
+ - 320
+ - 328
+ - 336
+ - 344
+ - 352
+ - 360
+ - 368
+ - 376
+ - 384
+ - 392
+ - 400
+ - 408
+ - 416
+ - 424
+ - 432
+ - 440
+ - 448
+ - 456
+ - 464
+ - 472
+ - 480
+ - 488
+ - 496
+ - 504
+ - 512
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ load_balancer:
+ layer_updates_per_iter: 1
+ num_slots: 256
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.75
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "8192"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx4dep2_gen1dep32_batch128_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx4dep2_gen1dep32_batch128_eplb0_mtp0.yaml
new file mode 100644
index 00000000..1c8d2d78
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx4dep2_gen1dep32_batch128_eplb0_mtp0.yaml
@@ -0,0 +1,141 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx4dep2_gen1dep32_batch128_eplb0_mtp0"
+
+# ctx: 4 prefill workers, TP2/EP2
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=128
+# concurrency: 4301
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 2
+ prefill_workers: 4
+ gpus_per_prefill: 2
+
+ decode_workers: 1
+ decode_nodes: 8
+ gpus_per_decode: 32
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 32
+ moe_expert_parallel_size: 32
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 128
+ max_num_tokens: 128
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ - 72
+ - 80
+ - 88
+ - 96
+ - 104
+ - 112
+ - 120
+ - 128
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.7
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "4301"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx6dep2_gen1dep32_batch256_eplb288_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx6dep2_gen1dep32_batch256_eplb288_mtp0.yaml
new file mode 100644
index 00000000..0d6870ff
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx6dep2_gen1dep32_batch256_eplb288_mtp0.yaml
@@ -0,0 +1,160 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx6dep2_gen1dep32_batch256_eplb288_mtp0"
+
+# ctx: 6 prefill workers, TP2/EP2
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, EPLB: num_slots=288, max_batch=256
+# concurrency: 8192
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 3
+ prefill_workers: 6
+ gpus_per_prefill: 2
+
+ decode_workers: 1
+ decode_nodes: 8
+ gpus_per_decode: 32
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 32
+ moe_expert_parallel_size: 32
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 256
+ max_num_tokens: 256
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ - 72
+ - 80
+ - 88
+ - 96
+ - 104
+ - 112
+ - 120
+ - 128
+ - 136
+ - 144
+ - 152
+ - 160
+ - 168
+ - 176
+ - 184
+ - 192
+ - 200
+ - 208
+ - 216
+ - 224
+ - 232
+ - 240
+ - 248
+ - 256
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ load_balancer:
+ layer_updates_per_iter: 1
+ num_slots: 288
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.7
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "8192"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx12dep2_gen1dep16_batch32_eplb0_mtp2.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx12dep2_gen1dep16_batch32_eplb0_mtp2.yaml
new file mode 100644
index 00000000..8940ea72
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx12dep2_gen1dep16_batch32_eplb0_mtp2.yaml
@@ -0,0 +1,139 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx12dep2_gen1dep16_batch32_eplb0_mtp2"
+
+# ctx: 12 prefill workers, TP2/EP2
+# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=32
+# concurrency: 666
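+#
+# ISL8K prefill sizing (derived): max_seq_len 8232 = ISL 8192 + 40-token
+# margin; max_num_tokens 16640 covers max_batch_size 2 full-length prompts
+# (2 x 8232 = 16464 <= 16640).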
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 6
+ prefill_workers: 12
+ gpus_per_prefill: 2
+
+ decode_workers: 1
+ decode_nodes: 4
+ gpus_per_decode: 16
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 2
+
+ decode:
+ tensor_parallel_size: 16
+ moe_expert_parallel_size: 16
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 32
+ max_num_tokens: 96
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.7
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 2
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "666"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx13dep2_gen1dep8_batch128_eplb0_mtp1.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx13dep2_gen1dep8_batch128_eplb0_mtp1.yaml
new file mode 100644
index 00000000..29eba0b3
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx13dep2_gen1dep8_batch128_eplb0_mtp1.yaml
@@ -0,0 +1,151 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx13dep2_gen1dep8_batch128_eplb0_mtp1"
+
+# ctx: 13 prefill workers, TP2/EP2
+# gen: 1 decode worker, TP8/EP8, enable_attention_dp=true, max_batch=128
+# concurrency: 1076
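+#
+# Node math (derived): 13 prefill workers x 2 GPUs = 26 GPUs, so
+# ceil(26 / 4) = 7 prefill nodes; the odd worker count leaves 2 GPUs idle
+# on one node.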
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 7
+ prefill_workers: 13
+ gpus_per_prefill: 2
+
+ decode_workers: 1
+ decode_nodes: 2
+ gpus_per_decode: 8
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 1
+
+ decode:
+ tensor_parallel_size: 8
+ moe_expert_parallel_size: 8
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 128
+ max_num_tokens: 256
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ - 72
+ - 80
+ - 88
+ - 96
+ - 104
+ - 112
+ - 120
+ - 128
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.8
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 1
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "1076"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx15dep2_gen1dep32_batch16_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx15dep2_gen1dep32_batch16_eplb0_mtp3.yaml
new file mode 100644
index 00000000..f8fcdac9
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx15dep2_gen1dep32_batch16_eplb0_mtp3.yaml
@@ -0,0 +1,133 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx15dep2_gen1dep32_batch16_eplb0_mtp3"
+
+# ctx: 15 prefill workers, TP2/EP2
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=16
+# concurrency: 666
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 8
+ prefill_workers: 15
+ gpus_per_prefill: 2
+
+ decode_workers: 1
+ decode_nodes: 8
+ gpus_per_decode: 32
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+ decode:
+ tensor_parallel_size: 32
+ moe_expert_parallel_size: 32
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 64
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "666"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx18dep2_gen1dep16_batch64_eplb0_mtp1.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx18dep2_gen1dep16_batch64_eplb0_mtp1.yaml
new file mode 100644
index 00000000..775fa68f
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx18dep2_gen1dep16_batch64_eplb0_mtp1.yaml
@@ -0,0 +1,139 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx18dep2_gen1dep16_batch64_eplb0_mtp1"
+
+# ctx: 18 prefill workers, TP2/EP2
+# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=64
+# concurrency: 1229
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 9
+ prefill_workers: 18
+ gpus_per_prefill: 2
+
+ decode_workers: 1
+ decode_nodes: 4
+ gpus_per_decode: 16
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 1
+
+ decode:
+ tensor_parallel_size: 16
+ moe_expert_parallel_size: 16
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 64
+ max_num_tokens: 128
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.7
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 1
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "1229"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen1tep8_batch16_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen1tep8_batch16_eplb0_mtp3.yaml
new file mode 100644
index 00000000..c457cce0
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen1tep8_batch16_eplb0_mtp3.yaml
@@ -0,0 +1,136 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep2_gen1tep8_batch16_eplb0_mtp3"
+
+# ctx: 1 prefill worker, TP2/EP2
+# gen: 1 decode worker, TP8/EP8 (MNNVL), max_batch=16
+# concurrency: 24
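+# TEP layout: decode disables attention DP and uses the TRTLLM MoE backend with
+# MNNVL allreduce across its 2 nodes (the DEP recipes use CUTEDSL with DP attention).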
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 2
+
+ decode_workers: 1
+ decode_nodes: 2
+ gpus_per_decode: 8
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+ decode:
+ allreduce_strategy: MNNVL
+ tensor_parallel_size: 8
+ moe_expert_parallel_size: 8
+ pipeline_parallel_size: 1
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 64
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.9
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "24"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen2tep8_batch8_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen2tep8_batch8_eplb0_mtp3.yaml
new file mode 100644
index 00000000..517cf361
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen2tep8_batch8_eplb0_mtp3.yaml
@@ -0,0 +1,133 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep2_gen2tep8_batch8_eplb0_mtp3"
+
+# ctx: 1 prefill worker, TP2/EP2
+# gen: 2 decode workers, TP8/EP8 (MNNVL), max_batch=8
+# concurrency: 22
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 2
+
+ decode_workers: 2
+ decode_nodes: 4
+ gpus_per_decode: 8
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+ decode:
+ allreduce_strategy: MNNVL
+ tensor_parallel_size: 8
+ moe_expert_parallel_size: 8
+ pipeline_parallel_size: 1
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 8
+ max_num_tokens: 32
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.9
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "22"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen4tep8_batch4_allconc_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen4tep8_batch4_allconc_eplb0_mtp3.yaml
new file mode 100644
index 00000000..20599c3f
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen4tep8_batch4_allconc_eplb0_mtp3.yaml
@@ -0,0 +1,134 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep2_gen4tep8_batch4_allconc_eplb0_mtp3"
+
+# ctx: 1 prefill worker, TP2/EP2
+# gen: 4 decode workers, TP8/EP8 (MNNVL), max_batch=4
+# concurrencies: 4 (batch1), 24 (batch4)
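+# The "4x24" string under benchmark.concurrencies appears to encode both sweep
+# points in one deployment, with "x" separating the concurrency values.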
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 2
+
+ decode_workers: 4
+ decode_nodes: 8
+ gpus_per_decode: 8
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+ decode:
+ allreduce_strategy: MNNVL
+ tensor_parallel_size: 8
+ moe_expert_parallel_size: 8
+ pipeline_parallel_size: 1
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 4
+ max_num_tokens: 16
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.9
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "4x24"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen5tep4_batch1_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen5tep4_batch1_eplb0_mtp3.yaml
new file mode 100644
index 00000000..0037f722
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen5tep4_batch1_eplb0_mtp3.yaml
@@ -0,0 +1,133 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep2_gen5tep4_batch1_eplb0_mtp3"
+
+# ctx: 1 prefill worker, TP2/EP2
+# gen: 5 decode workers, TP4/EP4, max_batch=1
+# concurrency: 5
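+# With max_batch_size=1 per decode worker, concurrency 5 keeps exactly one
+# request in flight on each of the 5 decode workers.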
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 2
+
+ decode_workers: 5
+ decode_nodes: 5
+ gpus_per_decode: 4
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+ decode:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 1
+ max_num_tokens: 4
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.85
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "5"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx5dep2_gen1dep32_batch4_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx5dep2_gen1dep32_batch4_eplb0_mtp3.yaml
new file mode 100644
index 00000000..6e233408
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx5dep2_gen1dep32_batch4_eplb0_mtp3.yaml
@@ -0,0 +1,133 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx5dep2_gen1dep32_batch4_eplb0_mtp3"
+
+# ctx: 5 prefill workers, TP2/EP2
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, enable_lm_head_tp_in_adp=true, max_batch=4
+# concurrency: 180
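+# Worker GPU math: 5 prefill workers x 2 GPUs = 10 GPUs packed onto 3 nodes
+# (4 GPUs/node, so one node is only partially used); decode spans 8 nodes.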
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 3
+ prefill_workers: 5
+ gpus_per_prefill: 2
+
+ decode_workers: 1
+ decode_nodes: 8
+ gpus_per_decode: 32
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+ decode:
+ tensor_parallel_size: 32
+ moe_expert_parallel_size: 32
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 4
+ max_num_tokens: 16
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "180"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx9dep2_gen1dep32_batch8_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx9dep2_gen1dep32_batch8_eplb0_mtp3.yaml
new file mode 100644
index 00000000..bd1cb583
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx9dep2_gen1dep32_batch8_eplb0_mtp3.yaml
@@ -0,0 +1,132 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx9dep2_gen1dep32_batch8_eplb0_mtp3"
+
+# ctx: 9 prefill workers, TP2/EP2
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, enable_lm_head_tp_in_adp=true, max_batch=8
+# concurrency: 333
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 5
+ prefill_workers: 9
+ gpus_per_prefill: 2
+
+ decode_workers: 1
+ decode_nodes: 8
+ gpus_per_decode: 32
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+ decode:
+ tensor_parallel_size: 32
+ moe_expert_parallel_size: 32
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 8
+ max_num_tokens: 32
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: MTP
+ num_nextn_predict_layers: 3
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "333"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx12dep2_gen1dep16_batch64_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx12dep2_gen1dep16_batch64_eplb0_mtp0.yaml
new file mode 100644
index 00000000..611aebb6
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx12dep2_gen1dep16_batch64_eplb0_mtp0.yaml
@@ -0,0 +1,135 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx12dep2_gen1dep16_batch64_eplb0_mtp0"
+
+# ctx: 12 prefill workers, TP2/EP2
+# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=64
+# concurrency: 1127
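+# STP variant (mtp0): unlike the MTP recipes, no speculative_config is set,
+# so decode emits one token per step (max_num_tokens equals max_batch_size).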
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 6
+ prefill_workers: 12
+ gpus_per_prefill: 2
+
+ decode_workers: 1
+ decode_nodes: 4
+ gpus_per_decode: 16
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 16
+ moe_expert_parallel_size: 16
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 64
+ max_num_tokens: 64
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.8
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "1127"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx15dep2_gen1dep32_batch32_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx15dep2_gen1dep32_batch32_eplb0_mtp0.yaml
new file mode 100644
index 00000000..831e703d
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx15dep2_gen1dep32_batch32_eplb0_mtp0.yaml
@@ -0,0 +1,131 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx15dep2_gen1dep32_batch32_eplb0_mtp0"
+
+# ctx: 15 prefill workers, TP2/EP2
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=32
+# concurrency: 1229
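+# Worker GPU math: 15 prefill workers x 2 GPUs = 30 GPUs on 8 nodes (so at least
+# one node is only partially filled); 1 decode worker x 32 GPUs fills 8 more nodes.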
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 8
+ prefill_workers: 15
+ gpus_per_prefill: 2
+
+ decode_workers: 1
+ decode_nodes: 8
+ gpus_per_decode: 32
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 32
+ moe_expert_parallel_size: 32
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 32
+ max_num_tokens: 32
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.75
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "1229"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx1dep2_gen2tep8_batch16_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx1dep2_gen2tep8_batch16_eplb0_mtp0.yaml
new file mode 100644
index 00000000..8ff2f420
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx1dep2_gen2tep8_batch16_eplb0_mtp0.yaml
@@ -0,0 +1,128 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep2_gen2tep8_batch16_eplb0_mtp0"
+
+# ctx: 1 prefill worker, TP2/EP2
+# gen: 2 decode workers, TP8/EP8 (MNNVL), max_batch=16
+# concurrency: 42
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 2
+
+ decode_workers: 2
+ decode_nodes: 4
+ gpus_per_decode: 8
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ allreduce_strategy: MNNVL
+ tensor_parallel_size: 8
+ moe_expert_parallel_size: 8
+ pipeline_parallel_size: 1
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.9
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "42"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx1dep2_gen4tep8_batch1_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx1dep2_gen4tep8_batch1_eplb0_mtp0.yaml
new file mode 100644
index 00000000..cc8faa11
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx1dep2_gen4tep8_batch1_eplb0_mtp0.yaml
@@ -0,0 +1,128 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep2_gen4tep8_batch1_eplb0_mtp0"
+
+# ctx: 1 prefill worker, TP2/EP2
+# gen: 4 decode workers, TP8/EP8 (MNNVL), max_batch=1
+# concurrency: 4
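+# Note: cuda_graph batch_sizes 2 and 4 below exceed max_batch_size=1, so only
+# the size-1 graph can actually be exercised at runtime.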
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 2
+
+ decode_workers: 4
+ decode_nodes: 8
+ gpus_per_decode: 8
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ allreduce_strategy: MNNVL
+ tensor_parallel_size: 8
+ moe_expert_parallel_size: 8
+ pipeline_parallel_size: 1
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 1
+ max_num_tokens: 1
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.9
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "4"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx1dep2_gen5tep4_batch4_allconc_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx1dep2_gen5tep4_batch4_allconc_eplb0_mtp0.yaml
new file mode 100644
index 00000000..06d02024
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx1dep2_gen5tep4_batch4_allconc_eplb0_mtp0.yaml
@@ -0,0 +1,125 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep2_gen5tep4_batch4_allconc_eplb0_mtp0"
+
+# ctx: 1 prefill worker, TP2/EP2
+# gen: 5 decode workers, TP4/EP4, max_batch=4
+# concurrencies: 5 (batch1), 10 (batch2), 25 (batch4), merged as 5x10x25
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 2
+
+ decode_workers: 5
+ decode_nodes: 5
+ gpus_per_decode: 4
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 4
+ max_num_tokens: 4
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.9
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "5x10x25"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx20dep2_gen1dep16_batch128_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx20dep2_gen1dep16_batch128_eplb0_mtp0.yaml
new file mode 100644
index 00000000..ead937c9
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx20dep2_gen1dep16_batch128_eplb0_mtp0.yaml
@@ -0,0 +1,143 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx20dep2_gen1dep16_batch128_eplb0_mtp0"
+
+# ctx: 20 prefill workers, TP2/EP2
+# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=128
+# concurrency: 2151
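+# Worker GPU math: 20 prefill workers x 2 GPUs = 40 GPUs (10 nodes at
+# 4 GPUs/node); 1 decode worker x 16 GPUs = 4 nodes; 14 worker nodes total.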
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 10
+ prefill_workers: 20
+ gpus_per_prefill: 2
+
+ decode_workers: 1
+ decode_nodes: 4
+ gpus_per_decode: 16
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 16
+ moe_expert_parallel_size: 16
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 128
+ max_num_tokens: 128
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ - 72
+ - 80
+ - 88
+ - 96
+ - 104
+ - 112
+ - 120
+ - 128
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.8
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "2151"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx2dep2_gen3tep8_batch32_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx2dep2_gen3tep8_batch32_eplb0_mtp0.yaml
new file mode 100644
index 00000000..e06ea268
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx2dep2_gen3tep8_batch32_eplb0_mtp0.yaml
@@ -0,0 +1,130 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx2dep2_gen3tep8_batch32_eplb0_mtp0"
+
+# ctx: 2 prefill workers, TP2/EP2
+# gen: 3 decode workers, TP8/EP8 (MNNVL), max_batch=32
+# concurrency: 117
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 1
+ prefill_workers: 2
+ gpus_per_prefill: 2
+
+ decode_workers: 3
+ decode_nodes: 6
+ gpus_per_decode: 8
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ allreduce_strategy: MNNVL
+ tensor_parallel_size: 8
+ moe_expert_parallel_size: 8
+ pipeline_parallel_size: 1
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 32
+ max_num_tokens: 32
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.9
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "117"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx4dep2_gen3tep8_batch64_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx4dep2_gen3tep8_batch64_eplb0_mtp0.yaml
new file mode 100644
index 00000000..f4b3cc09
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx4dep2_gen3tep8_batch64_eplb0_mtp0.yaml
@@ -0,0 +1,134 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx4dep2_gen3tep8_batch64_eplb0_mtp0"
+
+# ctx: 4 prefill workers, TP2/EP2
+# gen: 3 decode workers, TP8/EP8 (MNNVL), max_batch=64
+# concurrency: 231
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 2
+ prefill_workers: 4
+ gpus_per_prefill: 2
+
+ decode_workers: 3
+ decode_nodes: 6
+ gpus_per_decode: 8
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ allreduce_strategy: MNNVL
+ tensor_parallel_size: 8
+ moe_expert_parallel_size: 8
+ pipeline_parallel_size: 1
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 64
+ max_num_tokens: 64
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.9
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "231"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx9dep2_gen1dep32_batch16_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx9dep2_gen1dep32_batch16_eplb0_mtp0.yaml
new file mode 100644
index 00000000..75f56785
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx9dep2_gen1dep32_batch16_eplb0_mtp0.yaml
@@ -0,0 +1,127 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx9dep2_gen1dep32_batch16_eplb0_mtp0"
+
+# ctx: 9 prefill workers, TP2/EP2
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=16
+# concurrency: 615
+
+model:
+ path: "nvidia/GLM5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb300"
+
+ prefill_nodes: 5
+ prefill_workers: 9
+ gpus_per_prefill: 2
+
+ decode_workers: 1
+ decode_nodes: 8
+ gpus_per_decode: 32
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ MIMALLOC_PURGE_DELAY: "0"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 2
+ moe_expert_parallel_size: 2
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 2
+ max_num_tokens: 16640
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: CUTEDSL
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 32
+ moe_expert_parallel_size: 32
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ custom_tokenizer: "glm_moe_dsa"
+ max_batch_size: 16
+ max_num_tokens: 16
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ moe_config:
+ backend: CUTEDSL
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.75
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "615"
+ req_rate: "inf"
+ custom_tokenizer: "glm_moe_dsa"
+ use_chat_template: false
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep16_batch32_eplb0_mtp3.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep16_batch32_eplb0_mtp3.yaml
new file mode 100644
index 00000000..03462b07
--- /dev/null
+++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep16_batch32_eplb0_mtp3.yaml
@@ -0,0 +1,138 @@
+name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep16_batch32_eplb0_mtp3"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=32
+# MTP (Eagle speculative decoding, max_draft_len=3)
+# concurrency: 666
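+# The Eagle3 draft weights come from the extra_mount entry below, which maps
+# nvidia/Kimi-K2.5-Thinking-Eagle3 to /eagle-model (the speculative_model_dir).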
+
+model:
+ path: "nvidia/Kimi-K2.5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 4
+ gpus_per_decode: 16
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: TRTLLM
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.4
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: Eagle
+ max_draft_len: 3
+ speculative_model_dir: "/eagle-model"
+
+ decode:
+ tensor_parallel_size: 16
+ moe_expert_parallel_size: 16
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ max_batch_size: 32
+ max_num_tokens: 128
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.7
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: Eagle
+ max_draft_len: 3
+ speculative_model_dir: "/eagle-model"
+
+extra_mount:
+ - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model"
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "666"
+ req_rate: "inf"
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep32_batch16_eplb0_mtp3.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep32_batch16_eplb0_mtp3.yaml
new file mode 100644
index 00000000..6a29059c
--- /dev/null
+++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep32_batch16_eplb0_mtp3.yaml
@@ -0,0 +1,134 @@
+name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep32_batch16_eplb0_mtp3"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=16
+# MTP (Eagle speculative decoding, max_draft_len=3)
+# concurrency: 666
+
+model:
+ path: "nvidia/Kimi-K2.5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 8
+ gpus_per_decode: 32
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: TRTLLM
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.4
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: Eagle
+ max_draft_len: 3
+ speculative_model_dir: "/eagle-model"
+
+ decode:
+ tensor_parallel_size: 32
+ moe_expert_parallel_size: 32
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ max_batch_size: 16
+ max_num_tokens: 64
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: Eagle
+ max_draft_len: 3
+ speculative_model_dir: "/eagle-model"
+
+extra_mount:
+ - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model"
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "666"
+ req_rate: "inf"
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep8_batch512_eplb0_mtp1.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep8_batch512_eplb0_mtp1.yaml
new file mode 100644
index 00000000..739bd487
--- /dev/null
+++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep8_batch512_eplb0_mtp1.yaml
@@ -0,0 +1,198 @@
+name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep8_batch512_eplb0_mtp1"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 1 decode worker, TP8/EP8, enable_attention_dp=true, max_batch=512
+# MTP (Eagle speculative decoding, max_draft_len=1)
+# concurrency: 4301
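+# Decode token budget: max_num_tokens 1024 = max_batch_size 512 x (1 target +
+# 1 draft token per request per step with max_draft_len=1).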
+
+model:
+ path: "nvidia/Kimi-K2.5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 2
+ gpus_per_decode: 8
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: TRTLLM
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.4
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: Eagle
+ max_draft_len: 1
+ speculative_model_dir: "/eagle-model"
+
+ decode:
+ tensor_parallel_size: 8
+ moe_expert_parallel_size: 8
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ max_batch_size: 512
+ max_num_tokens: 1024
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ - 72
+ - 80
+ - 88
+ - 96
+ - 104
+ - 112
+ - 120
+ - 128
+ - 136
+ - 144
+ - 152
+ - 160
+ - 168
+ - 176
+ - 184
+ - 192
+ - 200
+ - 208
+ - 216
+ - 224
+ - 232
+ - 240
+ - 248
+ - 256
+ - 264
+ - 272
+ - 280
+ - 288
+ - 296
+ - 304
+ - 312
+ - 320
+ - 328
+ - 336
+ - 344
+ - 352
+ - 360
+ - 368
+ - 376
+ - 384
+ - 392
+ - 400
+ - 408
+ - 416
+ - 424
+ - 432
+ - 440
+ - 448
+ - 456
+ - 464
+ - 472
+ - 480
+ - 488
+ - 496
+ - 504
+ - 512
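+ # Graphs are captured at 1, 2, 4, then every multiple of 8 up to max_batch_size (512); with
+ # enable_padding, in-flight batches are padded up to the nearest captured size.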
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.8
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
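+ # Decode reserves more memory for KV cache than prefill (0.8 vs 0.4): decode holds every
+ # in-flight sequence's cache, while prefill hands blocks off through the UCX cache
+ # transceiver, whose 16384-token buffer matches prefill's max_num_tokens.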
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: Eagle
+ max_draft_len: 1
+ speculative_model_dir: "/eagle-model"
+
+extra_mount:
+ - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model"
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "4301"
+ req_rate: "inf"
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
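+ # 360 attempts x 10 s: allow up to ~1 hour for all workers and the frontend to report healthy.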
+
+dynamo:
+ install: false
diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen4tep8_batch64_allconc_eplb0_mtp3.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen4tep8_batch64_allconc_eplb0_mtp3.yaml
new file mode 100644
index 00000000..a768bec4
--- /dev/null
+++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen4tep8_batch64_allconc_eplb0_mtp3.yaml
@@ -0,0 +1,141 @@
+name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen4tep8_batch64_allconc_eplb0_mtp3"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 4 decode workers, TP8/EP8, allreduce_strategy=MNNVL, max_batch=64
+# MTP (Eagle speculative decoding, max_draft_len=3)
+# Covers all gen4tep8 concurrencies: 8, 48, 92, 192, 336
+
+model:
+ path: "nvidia/Kimi-K2.5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+
+ decode_workers: 4
+ decode_nodes: 8
+ gpus_per_decode: 8
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: TRTLLM
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.4
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: Eagle
+ max_draft_len: 3
+ speculative_model_dir: "/eagle-model"
+
+ decode:
+ tensor_parallel_size: 8
+ moe_expert_parallel_size: 8
+ pipeline_parallel_size: 1
+ allreduce_strategy: MNNVL
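+ # MNNVL = multi-node NVLink allreduce; each TP8 decode worker spans two GB200 nodes
+ # (8 GPUs at 4 GPUs-per-node), so the allreduce crosses a node boundary.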
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ max_batch_size: 64
+ max_num_tokens: 256
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.9
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: Eagle
+ max_draft_len: 3
+ speculative_model_dir: "/eagle-model"
+
+extra_mount:
+ - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model"
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "8x48x92x192x336"
+ req_rate: "inf"
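+ # concurrencies is an "x"-separated sweep: 8, 48, 92, 192, and 336 run back to back in one job;
+ # req_rate "inf" leaves the client unthrottled, so load is bounded only by each concurrency cap.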
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen5tep4_batch2_allconc_eplb0_mtp3.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen5tep4_batch2_allconc_eplb0_mtp3.yaml
new file mode 100644
index 00000000..c2e24b41
--- /dev/null
+++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen5tep4_batch2_allconc_eplb0_mtp3.yaml
@@ -0,0 +1,132 @@
+name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen5tep4_batch2_allconc_eplb0_mtp3"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 5 decode workers, TP4/EP4, max_batch=2
+# MTP (Eagle speculative decoding, max_draft_len=3)
+# Covers all gen5tep4 concurrencies: 10, 15
+
+model:
+ path: "nvidia/Kimi-K2.5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+
+ decode_workers: 5
+ decode_nodes: 5
+ gpus_per_decode: 4
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: TRTLLM
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.4
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: Eagle
+ max_draft_len: 3
+ speculative_model_dir: "/eagle-model"
+
+ decode:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ max_batch_size: 2
+ max_num_tokens: 8
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.85
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: Eagle
+ max_draft_len: 3
+ speculative_model_dir: "/eagle-model"
+
+extra_mount:
+ - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model"
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "10x15"
+ req_rate: "inf"
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen1dep16_batch128_eplb0_mtp1.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen1dep16_batch128_eplb0_mtp1.yaml
new file mode 100644
index 00000000..68d7dd06
--- /dev/null
+++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen1dep16_batch128_eplb0_mtp1.yaml
@@ -0,0 +1,148 @@
+name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx2dep4_gen1dep16_batch128_eplb0_mtp1"
+
+# ctx: 2 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=128
+# MTP (Eagle speculative decoding, max_draft_len=1)
+# concurrency: 2253
+
+model:
+ path: "nvidia/Kimi-K2.5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 2
+ prefill_workers: 2
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 4
+ gpus_per_decode: 16
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: TRTLLM
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.4
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: Eagle
+ max_draft_len: 1
+ speculative_model_dir: "/eagle-model"
+
+ decode:
+ tensor_parallel_size: 16
+ moe_expert_parallel_size: 16
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ max_batch_size: 128
+ max_num_tokens: 256
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ - 72
+ - 80
+ - 88
+ - 96
+ - 104
+ - 112
+ - 120
+ - 128
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.7
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: Eagle
+ max_draft_len: 1
+ speculative_model_dir: "/eagle-model"
+
+extra_mount:
+ - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model"
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "2253"
+ req_rate: "inf"
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen1dep32_batch64_eplb0_mtp1.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen1dep32_batch64_eplb0_mtp1.yaml
new file mode 100644
index 00000000..1cb17478
--- /dev/null
+++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen1dep32_batch64_eplb0_mtp1.yaml
@@ -0,0 +1,140 @@
+name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx2dep4_gen1dep32_batch64_eplb0_mtp1"
+
+# ctx: 2 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=64
+# MTP (Eagle speculative decoding, max_draft_len=1)
+# concurrency: 2253
+
+model:
+ path: "nvidia/Kimi-K2.5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 2
+ prefill_workers: 2
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 8
+ gpus_per_decode: 32
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: TRTLLM
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.4
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: Eagle
+ max_draft_len: 1
+ speculative_model_dir: "/eagle-model"
+
+ decode:
+ tensor_parallel_size: 32
+ moe_expert_parallel_size: 32
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ max_batch_size: 64
+ max_num_tokens: 128
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.6
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: Eagle
+ max_draft_len: 1
+ speculative_model_dir: "/eagle-model"
+
+extra_mount:
+ - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model"
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "2253"
+ req_rate: "inf"
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen3dep8_batch256_eplb0_mtp1.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen3dep8_batch256_eplb0_mtp1.yaml
new file mode 100644
index 00000000..eb43aab7
--- /dev/null
+++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen3dep8_batch256_eplb0_mtp1.yaml
@@ -0,0 +1,164 @@
+name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx2dep4_gen3dep8_batch256_eplb0_mtp1"
+
+# ctx: 2 prefill workers, TP4/EP4
+# gen: 3 decode workers, TP8/EP8, enable_attention_dp=true, max_batch=256
+# MTP (Eagle speculative decoding, max_draft_len=1)
+# concurrency: 6759
+
+model:
+ path: "nvidia/Kimi-K2.5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 2
+ prefill_workers: 2
+ gpus_per_prefill: 4
+
+ decode_workers: 3
+ decode_nodes: 6
+ gpus_per_decode: 8
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: TRTLLM
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.4
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: Eagle
+ max_draft_len: 1
+ speculative_model_dir: "/eagle-model"
+
+ decode:
+ tensor_parallel_size: 8
+ moe_expert_parallel_size: 8
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ max_batch_size: 256
+ max_num_tokens: 512
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ - 72
+ - 80
+ - 88
+ - 96
+ - 104
+ - 112
+ - 120
+ - 128
+ - 136
+ - 144
+ - 152
+ - 160
+ - 168
+ - 176
+ - 184
+ - 192
+ - 200
+ - 208
+ - 216
+ - 224
+ - 232
+ - 240
+ - 248
+ - 256
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.8
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: Eagle
+ max_draft_len: 1
+ speculative_model_dir: "/eagle-model"
+
+extra_mount:
+ - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model"
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "6759"
+ req_rate: "inf"
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep16_batch32_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep16_batch32_eplb0_mtp0.yaml
new file mode 100644
index 00000000..ce3eff43
--- /dev/null
+++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep16_batch32_eplb0_mtp0.yaml
@@ -0,0 +1,125 @@
+name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep16_batch32_eplb0_mtp0"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=32
+# STP (no speculative decoding)
+# concurrency: 666
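+# Unlike the MTP recipes, STP needs no speculative_config and no /eagle-model extra_mount.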
+
+model:
+ path: "nvidia/Kimi-K2.5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 4
+ gpus_per_decode: 16
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: TRTLLM
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.4
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 16
+ moe_expert_parallel_size: 16
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ max_batch_size: 32
+ max_num_tokens: 32
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.75
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "666"
+ req_rate: "inf"
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep32_batch64_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep32_batch64_eplb0_mtp0.yaml
new file mode 100644
index 00000000..105b84bf
--- /dev/null
+++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep32_batch64_eplb0_mtp0.yaml
@@ -0,0 +1,129 @@
+name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep32_batch64_eplb0_mtp0"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=64
+# STP (no speculative decoding)
+# concurrency: 2253
+
+model:
+ path: "nvidia/Kimi-K2.5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 8
+ gpus_per_decode: 32
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: TRTLLM
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.4
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 32
+ moe_expert_parallel_size: 32
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ max_batch_size: 64
+ max_num_tokens: 64
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.7
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "2253"
+ req_rate: "inf"
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep8_batch768_allconc_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep8_batch768_allconc_eplb0_mtp0.yaml
new file mode 100644
index 00000000..9fb194dd
--- /dev/null
+++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep8_batch768_allconc_eplb0_mtp0.yaml
@@ -0,0 +1,217 @@
+name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep8_batch768_allconc_eplb0_mtp0"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 1 decode worker, TP8/EP8, enable_attention_dp=true, max_batch=768
+# STP (no speculative decoding)
+# Covers all dep8 concurrencies: 4301, 6452
+
+model:
+ path: "nvidia/Kimi-K2.5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 2
+ gpus_per_decode: 8
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: TRTLLM
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.4
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 8
+ moe_expert_parallel_size: 8
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ max_batch_size: 768
+ max_num_tokens: 768
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ - 72
+ - 80
+ - 88
+ - 96
+ - 104
+ - 112
+ - 120
+ - 128
+ - 136
+ - 144
+ - 152
+ - 160
+ - 168
+ - 176
+ - 184
+ - 192
+ - 200
+ - 208
+ - 216
+ - 224
+ - 232
+ - 240
+ - 248
+ - 256
+ - 264
+ - 272
+ - 280
+ - 288
+ - 296
+ - 304
+ - 312
+ - 320
+ - 328
+ - 336
+ - 344
+ - 352
+ - 360
+ - 368
+ - 376
+ - 384
+ - 392
+ - 400
+ - 408
+ - 416
+ - 424
+ - 432
+ - 440
+ - 448
+ - 456
+ - 464
+ - 472
+ - 480
+ - 488
+ - 496
+ - 504
+ - 512
+ - 520
+ - 528
+ - 536
+ - 544
+ - 552
+ - 560
+ - 568
+ - 576
+ - 584
+ - 592
+ - 600
+ - 608
+ - 616
+ - 624
+ - 632
+ - 640
+ - 648
+ - 656
+ - 664
+ - 672
+ - 680
+ - 688
+ - 696
+ - 704
+ - 712
+ - 720
+ - 728
+ - 736
+ - 744
+ - 752
+ - 760
+ - 768
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.8
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "4301x6452"
+ req_rate: "inf"
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0.yaml
new file mode 100644
index 00000000..5639da41
--- /dev/null
+++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0.yaml
@@ -0,0 +1,138 @@
+name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 4 decode workers, TP8/EP8, allreduce_strategy=MNNVL, max_batch=128
+# STP (no speculative decoding)
+# Covers all gen4tep8 concurrencies: 4, 192, 360, 668
+
+model:
+ path: "nvidia/Kimi-K2.5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+
+ decode_workers: 4
+ decode_nodes: 8
+ gpus_per_decode: 8
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: TRTLLM
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.4
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 8
+ moe_expert_parallel_size: 8
+ pipeline_parallel_size: 1
+ allreduce_strategy: MNNVL
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ max_batch_size: 128
+ max_num_tokens: 128
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ - 72
+ - 80
+ - 88
+ - 96
+ - 104
+ - 112
+ - 120
+ - 128
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.9
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "4x192x360x668"
+ req_rate: "inf"
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml
new file mode 100644
index 00000000..f9496feb
--- /dev/null
+++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml
@@ -0,0 +1,122 @@
+name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 5 decode workers, TP4/EP4, max_batch=8
+# STP (no speculative decoding)
+# Covers all gen5tep4 concurrencies: 5, 15, 30, 55
+
+model:
+ path: "nvidia/Kimi-K2.5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+
+ decode_workers: 5
+ decode_nodes: 5
+ gpus_per_decode: 4
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: TRTLLM
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.4
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ max_batch_size: 8
+ max_num_tokens: 8
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.9
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "5x15x30x55"
+ req_rate: "inf"
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep16_batch256_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep16_batch256_eplb0_mtp0.yaml
new file mode 100644
index 00000000..71b016c4
--- /dev/null
+++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep16_batch256_eplb0_mtp0.yaml
@@ -0,0 +1,153 @@
+name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx2dep4_gen1dep16_batch256_eplb0_mtp0"
+
+# ctx: 2 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=256
+# STP (no speculative decoding)
+# concurrency: 4301
+
+model:
+ path: "nvidia/Kimi-K2.5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 2
+ prefill_workers: 2
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 4
+ gpus_per_decode: 16
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: TRTLLM
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.4
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 16
+ moe_expert_parallel_size: 16
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ max_batch_size: 256
+ max_num_tokens: 256
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ - 72
+ - 80
+ - 88
+ - 96
+ - 104
+ - 112
+ - 120
+ - 128
+ - 136
+ - 144
+ - 152
+ - 160
+ - 168
+ - 176
+ - 184
+ - 192
+ - 200
+ - 208
+ - 216
+ - 224
+ - 232
+ - 240
+ - 248
+ - 256
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.75
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "4301"
+ req_rate: "inf"
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep32_batch128_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep32_batch128_eplb0_mtp0.yaml
new file mode 100644
index 00000000..52b75bb4
--- /dev/null
+++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep32_batch128_eplb0_mtp0.yaml
@@ -0,0 +1,137 @@
+name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx2dep4_gen1dep32_batch128_eplb0_mtp0"
+
+# ctx: 2 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=128
+# STP (no speculative decoding)
+# concurrency: 4301
+
+model:
+ path: "nvidia/Kimi-K2.5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 2
+ prefill_workers: 2
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 8
+ gpus_per_decode: 32
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ max_batch_size: 16
+ max_num_tokens: 16384
+ max_seq_len: 1064
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: TRTLLM
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.4
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 32
+ moe_expert_parallel_size: 32
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ max_batch_size: 128
+ max_num_tokens: 128
+ max_seq_len: 2088
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ - 72
+ - 80
+ - 88
+ - 96
+ - 104
+ - 112
+ - 120
+ - 128
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.7
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "4301"
+ req_rate: "inf"
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen2tep8_batch32_eplb0_mtp3.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen2tep8_batch32_eplb0_mtp3.yaml
new file mode 100644
index 00000000..bb3f8d1e
--- /dev/null
+++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen2tep8_batch32_eplb0_mtp3.yaml
@@ -0,0 +1,137 @@
+name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx1dep4_gen2tep8_batch32_eplb0_mtp3"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 2 decode workers, TP8/EP8, allreduce_strategy=MNNVL, max_batch=32
+# MTP (Eagle speculative decoding, max_draft_len=3)
+# concurrency: 90
+
+model:
+ path: "nvidia/Kimi-K2.5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+
+ decode_workers: 2
+ decode_nodes: 4
+ gpus_per_decode: 8
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ max_batch_size: 2
+ max_num_tokens: 16384
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: TRTLLM
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.4
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: Eagle
+ max_draft_len: 3
+ speculative_model_dir: "/eagle-model"
+
+ decode:
+ tensor_parallel_size: 8
+ moe_expert_parallel_size: 8
+ pipeline_parallel_size: 1
+ allreduce_strategy: MNNVL
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ max_batch_size: 32
+ max_num_tokens: 128
+ max_seq_len: 9256
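+ # Same headroom rule as the ISL1K recipes: prefill max_seq_len = 8192 + 40 = 8232 and
+ # decode max_seq_len = 8192 + 1024 + 40 = 9256.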
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.9
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: Eagle
+ max_draft_len: 3
+ speculative_model_dir: "/eagle-model"
+
+extra_mount:
+ - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model"
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "90"
+ req_rate: "inf"
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen4tep8_batch1_eplb0_mtp3.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen4tep8_batch1_eplb0_mtp3.yaml
new file mode 100644
index 00000000..8b7f02d6
--- /dev/null
+++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen4tep8_batch1_eplb0_mtp3.yaml
@@ -0,0 +1,133 @@
+name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx1dep4_gen4tep8_batch1_eplb0_mtp3"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 4 decode workers, TP8/EP8, allreduce_strategy=MNNVL, max_batch=1
+# MTP (Eagle speculative decoding, max_draft_len=3)
+# concurrency: 8
+
+model:
+ path: "nvidia/Kimi-K2.5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+
+ decode_workers: 4
+ decode_nodes: 8
+ gpus_per_decode: 8
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ max_batch_size: 2
+ max_num_tokens: 16384
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: TRTLLM
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.4
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: Eagle
+ max_draft_len: 3
+ speculative_model_dir: "/eagle-model"
+
+ decode:
+ tensor_parallel_size: 8
+ moe_expert_parallel_size: 8
+ pipeline_parallel_size: 1
+ allreduce_strategy: MNNVL
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ max_batch_size: 1
+ max_num_tokens: 4
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.9
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: Eagle
+ max_draft_len: 3
+ speculative_model_dir: "/eagle-model"
+
+extra_mount:
+ - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model"
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "8"
+ req_rate: "inf"
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp3.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp3.yaml
new file mode 100644
index 00000000..1883e739
--- /dev/null
+++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp3.yaml
@@ -0,0 +1,133 @@
+name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp3"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 5 decode workers, TP4/EP4, max_batch=8
+# MTP (Eagle speculative decoding, max_draft_len=3)
+# Covers all gen5tep4 concurrencies: 10, 15, 60
+
+model:
+ path: "nvidia/Kimi-K2.5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+
+ decode_workers: 5
+ decode_nodes: 5
+ gpus_per_decode: 4
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ max_batch_size: 2
+ max_num_tokens: 16384
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: TRTLLM
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.4
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: Eagle
+ max_draft_len: 3
+ speculative_model_dir: "/eagle-model"
+
+ decode:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ max_batch_size: 8
+ max_num_tokens: 32
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.85
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: Eagle
+ max_draft_len: 3
+ speculative_model_dir: "/eagle-model"
+
+extra_mount:
+ - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model"
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "10x15x60"
+ req_rate: "inf"
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx2dep4_gen1dep16_batch8_eplb0_mtp3.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx2dep4_gen1dep16_batch8_eplb0_mtp3.yaml
new file mode 100644
index 00000000..5aced422
--- /dev/null
+++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx2dep4_gen1dep16_batch8_eplb0_mtp3.yaml
@@ -0,0 +1,133 @@
+name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx2dep4_gen1dep16_batch8_eplb0_mtp3"
+
+# ctx: 2 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=8
+# MTP (Eagle speculative decoding, max_draft_len=3)
+# concurrency: 180
+
+model:
+ path: "nvidia/Kimi-K2.5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 2
+ prefill_workers: 2
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 4
+ gpus_per_decode: 16
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ max_batch_size: 2
+ max_num_tokens: 16384
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: TRTLLM
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.4
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: Eagle
+ max_draft_len: 3
+ speculative_model_dir: "/eagle-model"
+
+ decode:
+ tensor_parallel_size: 16
+ moe_expert_parallel_size: 16
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ max_batch_size: 8
+ max_num_tokens: 32
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.8
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: Eagle
+ max_draft_len: 3
+ speculative_model_dir: "/eagle-model"
+
+extra_mount:
+ - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model"
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "180"
+ req_rate: "inf"
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx5dep4_gen1dep32_batch16_eplb0_mtp3.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx5dep4_gen1dep32_batch16_eplb0_mtp3.yaml
new file mode 100644
index 00000000..764f2d46
--- /dev/null
+++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx5dep4_gen1dep32_batch16_eplb0_mtp3.yaml
@@ -0,0 +1,134 @@
+name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx5dep4_gen1dep32_batch16_eplb0_mtp3"
+
+# ctx: 5 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=16
+# MTP (Eagle speculative decoding, max_draft_len=3)
+# concurrency: 666
+
+model:
+ path: "nvidia/Kimi-K2.5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 5
+ prefill_workers: 5
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 8
+ gpus_per_decode: 32
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ max_batch_size: 2
+ max_num_tokens: 16384
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: TRTLLM
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.4
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: Eagle
+ max_draft_len: 3
+ speculative_model_dir: "/eagle-model"
+
+ decode:
+ tensor_parallel_size: 32
+ moe_expert_parallel_size: 32
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ max_batch_size: 16
+ max_num_tokens: 64
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.75
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: Eagle
+ max_draft_len: 3
+ speculative_model_dir: "/eagle-model"
+
+extra_mount:
+ - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model"
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "666"
+ req_rate: "inf"
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp1.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp1.yaml
new file mode 100644
index 00000000..31308fe6
--- /dev/null
+++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp1.yaml
@@ -0,0 +1,164 @@
+name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp1"
+
+# ctx: 5 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP8/EP8, enable_attention_dp=true, max_batch=256
+# MTP Eagle speculative decoding, max_draft_len=1
+# Covers all dep8 mtp1 concurrencies: 1229, 2253
+
+model:
+ path: "nvidia/Kimi-K2.5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 5
+ prefill_workers: 5
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 2
+ gpus_per_decode: 8
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ max_batch_size: 2
+ max_num_tokens: 16384
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: TRTLLM
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.4
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: Eagle
+ max_draft_len: 1
+ speculative_model_dir: "/eagle-model"
+
+ decode:
+ tensor_parallel_size: 8
+ moe_expert_parallel_size: 8
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ max_batch_size: 256
+ max_num_tokens: 512
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ - 72
+ - 80
+ - 88
+ - 96
+ - 104
+ - 112
+ - 120
+ - 128
+ - 136
+ - 144
+ - 152
+ - 160
+ - 168
+ - 176
+ - 184
+ - 192
+ - 200
+ - 208
+ - 216
+ - 224
+ - 232
+ - 240
+ - 248
+ - 256
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.9
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: Eagle
+ max_draft_len: 1
+ speculative_model_dir: "/eagle-model"
+
+extra_mount:
+ - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model"
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "1229x2253"
+ req_rate: "inf"
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
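
The long `cuda_graph_config.batch_sizes` list above follows a fixed ladder: powers of two up to 8, then every multiple of 8 up to `max_batch_size`. The recipes spell the list out literally, so this generator is only a convenience sketch of the pattern (the batch-1 MNNVL recipe further down deviates from it):

```python
# Sketch: reproduce the CUDA-graph capture ladder used by most recipes
# in this sweep: 1, 2, 4, 8, then multiples of 8 up to max_batch_size.
def capture_ladder(max_batch: int) -> list[int]:
    sizes = [b for b in (1, 2, 4, 8) if b <= max_batch]
    sizes += list(range(16, max_batch + 1, 8))
    return sizes

assert capture_ladder(16) == [1, 2, 4, 8, 16]
assert capture_ladder(256)[-3:] == [240, 248, 256]
assert len(capture_ladder(256)) == 35   # matches the 35-entry list above
```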
diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx8dep4_gen1dep32_batch32_eplb0_mtp3.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx8dep4_gen1dep32_batch32_eplb0_mtp3.yaml
new file mode 100644
index 00000000..9bd03c05
--- /dev/null
+++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx8dep4_gen1dep32_batch32_eplb0_mtp3.yaml
@@ -0,0 +1,136 @@
+name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx8dep4_gen1dep32_batch32_eplb0_mtp3"
+
+# ctx: 8 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=32
+# MTP Eagle speculative decoding, max_draft_len=3
+# concurrency: 1229
+
+model:
+ path: "nvidia/Kimi-K2.5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ prefill_nodes: 8
+ prefill_workers: 8
+ gpus_per_prefill: 4
+
+ decode_workers: 1
+ decode_nodes: 8
+ gpus_per_decode: 32
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ max_batch_size: 2
+ max_num_tokens: 16384
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: TRTLLM
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.4
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ speculative_config:
+ decoding_type: Eagle
+ max_draft_len: 3
+ speculative_model_dir: "/eagle-model"
+
+ decode:
+ tensor_parallel_size: 32
+ moe_expert_parallel_size: 32
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: true
+ trust_remote_code: true
+ max_batch_size: 32
+ max_num_tokens: 128
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.75
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+ speculative_config:
+ decoding_type: Eagle
+ max_draft_len: 3
+ speculative_model_dir: "/eagle-model"
+
+extra_mount:
+ - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model"
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "1229"
+ req_rate: "inf"
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
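
Across the three MTP recipes, decode `max_num_tokens` tracks the speculative depth: each scheduled sequence needs one target token plus `max_draft_len` draft tokens per step. A sketch of that arithmetic, read off the recipe values above:

```python
# Sketch: decode token budget for the MTP/Eagle recipes, as implied by
# the values above: max_num_tokens = max_batch_size * (max_draft_len + 1).
def decode_token_budget(max_batch_size: int, max_draft_len: int) -> int:
    return max_batch_size * (max_draft_len + 1)

assert decode_token_budget(16, 3) == 64    # ctx5 / batch16 / mtp3
assert decode_token_budget(256, 1) == 512  # ctx5 / batch256 / mtp1
assert decode_token_budget(32, 3) == 128   # ctx8 / batch32 / mtp3
```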
diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen4tep4_batch32_allconc_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen4tep4_batch32_allconc_eplb0_mtp0.yaml
new file mode 100644
index 00000000..8c1f0aa8
--- /dev/null
+++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen4tep4_batch32_allconc_eplb0_mtp0.yaml
@@ -0,0 +1,126 @@
+name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx1dep4_gen4tep4_batch32_allconc_eplb0_mtp0"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 4 decode workers, TP4/EP4, max_batch=32
+# Single concurrency point: 156
+
+model:
+ path: "nvidia/Kimi-K2.5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ # Prefill: 1 worker x TP4 = 4 GPUs = 1 node
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+
+ # Decode: 4 workers x TP4 = 16 GPUs = 4 nodes
+ decode_workers: 4
+ decode_nodes: 4
+ gpus_per_decode: 4
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ max_batch_size: 2
+ max_num_tokens: 16384
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: TRTLLM
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.4
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ max_batch_size: 32
+ max_num_tokens: 32
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.9
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "156"
+ req_rate: "inf"
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen4tep8_batch1_allconc_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen4tep8_batch1_allconc_eplb0_mtp0.yaml
new file mode 100644
index 00000000..d4c5086b
--- /dev/null
+++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen4tep8_batch1_allconc_eplb0_mtp0.yaml
@@ -0,0 +1,123 @@
+name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx1dep4_gen4tep8_batch1_allconc_eplb0_mtp0"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 4 decode workers, TP8/EP8, allreduce_strategy=MNNVL, max_batch=1
+# Single concurrency point: 4
+
+model:
+ path: "nvidia/Kimi-K2.5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ # Prefill: 1 worker x TP4 = 4 GPUs = 1 node
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+
+ # Decode: 4 workers x TP8 = 32 GPUs = 8 nodes
+ decode_workers: 4
+ decode_nodes: 8
+ gpus_per_decode: 8
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ max_batch_size: 2
+ max_num_tokens: 16384
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: TRTLLM
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.4
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 8
+ moe_expert_parallel_size: 8
+ pipeline_parallel_size: 1
+ allreduce_strategy: MNNVL
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ max_batch_size: 1
+ max_num_tokens: 1
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.9
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "4"
+ req_rate: "inf"
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen5tep4_batch16_allconc_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen5tep4_batch16_allconc_eplb0_mtp0.yaml
new file mode 100644
index 00000000..8f6ea063
--- /dev/null
+++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen5tep4_batch16_allconc_eplb0_mtp0.yaml
@@ -0,0 +1,126 @@
+name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx1dep4_gen5tep4_batch16_allconc_eplb0_mtp0"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 5 decode workers, TP4/EP4, max_batch=16
+# Covers all concurrencies: 5, 15, 30, 60, 105
+
+model:
+ path: "nvidia/Kimi-K2.5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ # Prefill: 1 worker x TP4 = 4 GPUs = 1 node
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+
+ # Decode: 5 workers x TP4 = 20 GPUs = 5 nodes
+ decode_workers: 5
+ decode_nodes: 5
+ gpus_per_decode: 4
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ max_batch_size: 2
+ max_num_tokens: 16384
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: TRTLLM
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.4
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: false
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+      # max_batch_size=16 covers all concurrencies: 5, 15, 30, 60, 105
+      # CUDA graphs are captured for each listed batch size up to the max; padding covers sizes in between
+ max_batch_size: 16
+ max_num_tokens: 16
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.9
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "5x15x30x60x105"
+ req_rate: "inf"
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
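
The `concurrencies` field packs the sweep points into a single 'x'-separated string ("5x15x30x60x105" above, "1229x2253" earlier). The separator is evident from the recipes; how sa-bench consumes the parsed list is an assumption. A minimal parsing sketch:

```python
# Sketch: parse an sa-bench concurrencies spec. The 'x' separator is
# read off the recipes; the consuming side is assumed, not confirmed.
def parse_concurrencies(spec: str) -> list[int]:
    return [int(tok) for tok in spec.split("x")]

assert parse_concurrencies("5x15x30x60x105") == [5, 15, 30, 60, 105]
assert parse_concurrencies("666") == [666]
```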
diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx2dep4_gen1dep16_batch16_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx2dep4_gen1dep16_batch16_eplb0_mtp0.yaml
new file mode 100644
index 00000000..4bfaa0e2
--- /dev/null
+++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx2dep4_gen1dep16_batch16_eplb0_mtp0.yaml
@@ -0,0 +1,124 @@
+name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx2dep4_gen1dep16_batch16_eplb0_mtp0"
+
+# ctx: 2 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=16
+# concurrency: 333
+
+model:
+ path: "nvidia/Kimi-K2.5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ # Prefill: 2 workers x TP4 = 8 GPUs = 2 nodes
+ prefill_nodes: 2
+ prefill_workers: 2
+ gpus_per_prefill: 4
+
+ # Decode: 1 worker x TP16 = 16 GPUs = 4 nodes
+ decode_workers: 1
+ decode_nodes: 4
+ gpus_per_decode: 16
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ max_batch_size: 2
+ max_num_tokens: 16384
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: TRTLLM
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.4
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 16
+ moe_expert_parallel_size: 16
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ max_batch_size: 16
+ max_num_tokens: 16
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.8
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "333"
+ req_rate: "inf"
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
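
All of the ISL8K_OSL1K recipes size sequence lengths the same way: prefill `max_seq_len` is ISL plus a 40-token margin, and decode `max_seq_len` is ISL + OSL plus the same margin. The margin is read off the values (8232 - 8192 = 9256 - 9216 = 40); what the 40 tokens are reserved for is not stated in the recipes.

```python
# Sketch: sequence-length sizing shared by the ISL8K/OSL1K recipes.
# The 40-token margin is observed from the values, not documented.
ISL, OSL, MARGIN = 8192, 1024, 40
assert ISL + MARGIN == 8232        # prefill max_seq_len
assert ISL + OSL + MARGIN == 9256  # decode max_seq_len
```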
diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx3dep4_gen1dep16_batch32_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx3dep4_gen1dep16_batch32_eplb0_mtp0.yaml
new file mode 100644
index 00000000..d7d51627
--- /dev/null
+++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx3dep4_gen1dep16_batch32_eplb0_mtp0.yaml
@@ -0,0 +1,126 @@
+name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx3dep4_gen1dep16_batch32_eplb0_mtp0"
+
+# ctx: 3 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=32
+# concurrency: 615
+
+model:
+ path: "nvidia/Kimi-K2.5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ # Prefill: 3 workers x TP4 = 12 GPUs = 3 nodes
+ prefill_nodes: 3
+ prefill_workers: 3
+ gpus_per_prefill: 4
+
+ # Decode: 1 worker x TP16 = 16 GPUs = 4 nodes
+ decode_workers: 1
+ decode_nodes: 4
+ gpus_per_decode: 16
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ max_batch_size: 2
+ max_num_tokens: 16384
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: TRTLLM
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.4
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 16
+ moe_expert_parallel_size: 16
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ max_batch_size: 32
+ max_num_tokens: 32
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.8
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "615"
+ req_rate: "inf"
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp0.yaml
new file mode 100644
index 00000000..e8df1179
--- /dev/null
+++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp0.yaml
@@ -0,0 +1,155 @@
+name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp0"
+
+# ctx: 5 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP8/EP8, enable_attention_dp=true, max_batch=256
+# Single concurrency point: 2151
+
+model:
+ path: "nvidia/Kimi-K2.5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ # Prefill: 5 workers x TP4 = 20 GPUs = 5 nodes
+ prefill_nodes: 5
+ prefill_workers: 5
+ gpus_per_prefill: 4
+
+ # Decode: 1 worker x TP8 = 8 GPUs = 2 nodes
+ decode_workers: 1
+ decode_nodes: 2
+ gpus_per_decode: 8
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ max_batch_size: 2
+ max_num_tokens: 16384
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: TRTLLM
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.4
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 8
+ moe_expert_parallel_size: 8
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+      # max_batch_size=256; CUDA graphs are captured for each listed batch size up to 256, with padding covering sizes in between
+ max_batch_size: 256
+ max_num_tokens: 256
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ - 72
+ - 80
+ - 88
+ - 96
+ - 104
+ - 112
+ - 120
+ - 128
+ - 136
+ - 144
+ - 152
+ - 160
+ - 168
+ - 176
+ - 184
+ - 192
+ - 200
+ - 208
+ - 216
+ - 224
+ - 232
+ - 240
+ - 248
+ - 256
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.8
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "2151"
+ req_rate: "inf"
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx7dep4_gen1dep16_batch128_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx7dep4_gen1dep16_batch128_eplb0_mtp0.yaml
new file mode 100644
index 00000000..db177892
--- /dev/null
+++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx7dep4_gen1dep16_batch128_eplb0_mtp0.yaml
@@ -0,0 +1,138 @@
+name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx7dep4_gen1dep16_batch128_eplb0_mtp0"
+
+# ctx: 7 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=128
+# concurrency: 2253
+
+model:
+ path: "nvidia/Kimi-K2.5-NVFP4"
+ container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2"
+ precision: "fp4"
+
+resources:
+ gpu_type: "gb200"
+
+ # Prefill: 7 workers x TP4 = 28 GPUs = 7 nodes
+ prefill_nodes: 7
+ prefill_workers: 7
+ gpus_per_prefill: 4
+
+ # Decode: 1 worker x TP16 = 16 GPUs = 4 nodes
+ decode_workers: 1
+ decode_nodes: 4
+ gpus_per_decode: 16
+
+ gpus_per_node: 4
+
+backend:
+ type: trtllm
+
+ prefill_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ decode_environment:
+ ENROOT_ALLOW_DEV: "yes"
+ NCCL_GRAPH_MIXING_SUPPORT: "0"
+ TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1"
+ TLLM_LOG_LEVEL: "INFO"
+ TRTLLM_ENABLE_PDL: "1"
+ TRTLLM_SERVER_DISABLE_GC: "1"
+ TRTLLM_WORKER_DISABLE_GC: "1"
+
+ trtllm_config:
+ prefill:
+ tensor_parallel_size: 4
+ moe_expert_parallel_size: 4
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ disable_overlap_scheduler: true
+ trust_remote_code: true
+ max_batch_size: 2
+ max_num_tokens: 16384
+ max_seq_len: 8232
+ print_iter_log: true
+ cuda_graph_config: null
+ moe_config:
+ backend: TRTLLM
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.4
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+
+ decode:
+ tensor_parallel_size: 16
+ moe_expert_parallel_size: 16
+ pipeline_parallel_size: 1
+ enable_attention_dp: true
+ enable_lm_head_tp_in_adp: false
+ trust_remote_code: true
+ max_batch_size: 128
+ max_num_tokens: 128
+ max_seq_len: 9256
+ print_iter_log: true
+ stream_interval: 100
+ num_postprocess_workers: 4
+ cuda_graph_config:
+ enable_padding: true
+ batch_sizes:
+ - 1
+ - 2
+ - 4
+ - 8
+ - 16
+ - 24
+ - 32
+ - 40
+ - 48
+ - 56
+ - 64
+ - 72
+ - 80
+ - 88
+ - 96
+ - 104
+ - 112
+ - 120
+ - 128
+ moe_config:
+ backend: TRTLLM
+ use_low_precision_moe_combine: true
+ kv_cache_config:
+ dtype: fp8
+ enable_block_reuse: false
+ free_gpu_memory_fraction: 0.8
+ cache_transceiver_config:
+ backend: UCX
+ max_tokens_in_buffer: 16384
+ nvfp4_gemm_config:
+ allowed_backends:
+ - cutlass
+ - cublaslt
+ - cutedsl
+ - cuda_core
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "2253"
+ req_rate: "inf"
+
+frontend:
+ type: "dynamo"
+ enable_multiple_frontends: false
+
+health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+dynamo:
+ install: false
diff --git a/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-1p1d-dep8-dep8.yaml b/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-1p1d-dep8-dep8.yaml
new file mode 100644
index 00000000..10d038a5
--- /dev/null
+++ b/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-1p1d-dep8-dep8.yaml
@@ -0,0 +1,88 @@
+name: "svf-vllm-disagg-gb200-1p1d-dep8-dep8"
+model:
+ path: "deepseekv4-fp4"
+ container: "vllm/vllm-openai:deepseekv4-cu130"
+ precision: "fp4"
+dynamo:
+ hash: 6a159fedd8e4a1563aa647c31f622aedbf254b5b
+setup_script: vllm-container-deps.sh
+resources:
+ gpu_type: "gb200"
+ gpus_per_node: 4
+ prefill_nodes: 2
+ decode_nodes: 2
+ prefill_workers: 1
+ decode_workers: 1
+ gpus_per_prefill: 8
+ gpus_per_decode: 8
+frontend:
+ type: dynamo
+ enable_multiple_frontends: false
+backend:
+ type: vllm
+ connector: null
+ prefill_environment:
+ TILELANG_CLEANUP_TEMP_FILES: "1"
+ VLLM_USE_NCCL_SYMM_MEM: "1"
+ NCCL_CUMEM_ENABLE: "1"
+ NCCL_MNNVL_ENABLE: "1"
+ NCCL_NVLS_ENABLE: "1"
+ VLLM_SERVER_DEV_MODE: "1"
+ decode_environment:
+ TILELANG_CLEANUP_TEMP_FILES: "1"
+ VLLM_USE_NCCL_SYMM_MEM: "1"
+ NCCL_CUMEM_ENABLE: "1"
+ NCCL_MNNVL_ENABLE: "1"
+ NCCL_NVLS_ENABLE: "1"
+ VLLM_SERVER_DEV_MODE: "1"
+ vllm_config:
+ prefill:
+ kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
+ served-model-name: "deepseek-ai/DeepSeek-V4-Pro"
+ kv-cache-dtype: "fp8"
+ tensor-parallel-size: 1
+ pipeline-parallel-size: 1
+ data-parallel-size: 8
+ data-parallel-rpc-port: 13345
+ enable-expert-parallel: true
+ enforce-eager: true
+ max-model-len: auto
+ max-num-seqs: 4
+ max-num-batched-tokens: 16384
+ trust-remote-code: true
+ no-enable-prefix-caching: true
+ no-enable-flashinfer-autotune: true
+ no-async-scheduling: true
+ block-size: 256
+ gpu-memory-utilization: 0.9
+ no-disable-hybrid-kv-cache-manager: true
+ enable-sleep-mode: true
+ decode:
+ kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
+ served-model-name: "deepseek-ai/DeepSeek-V4-Pro"
+ kv-cache-dtype: "fp8"
+ tensor-parallel-size: 1
+ pipeline-parallel-size: 1
+ data-parallel-size: 8
+ data-parallel-rpc-port: 13345
+ enable-expert-parallel: true
+ max-model-len: auto
+ max-num-seqs: 64
+ max-cudagraph-capture-size: 64
+ max-num-batched-tokens: 64
+ trust-remote-code: true
+ no-enable-prefix-caching: true
+ block-size: 256
+ compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","mode":0}'
+ gpu-memory-utilization: 0.9
+ stream-interval: 50
+ no-disable-hybrid-kv-cache-manager: true
+ enable-sleep-mode: true
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "4x8x16x32x64x256"
+ req_rate: "inf"
+ custom_tokenizer: "deepseek_v4"
+ use_chat_template: false
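
The `vllm_config` keys mirror `vllm serve` CLI options (tensor-parallel-size, kv-cache-dtype, the no-* boolean negations). A sketch of the rendering this naming implies; the one-flag-per-key contract is inferred from the key names, not confirmed against srt-slurm's launcher:

```python
# Sketch: render a vllm_config section into CLI arguments. Assumption
# (not confirmed from srt-slurm sources): each key becomes "--<key>",
# true booleans become bare flags, false/None entries are dropped.
def to_cli_args(cfg: dict) -> list[str]:
    args: list[str] = []
    for key, value in cfg.items():
        if value is True:
            args.append(f"--{key}")            # e.g. --enable-expert-parallel
        elif value is False or value is None:
            continue                           # omitted entirely
        else:
            args.extend([f"--{key}", str(value)])
    return args

prefill = {"data-parallel-size": 8, "kv-cache-dtype": "fp8",
           "enable-expert-parallel": True, "no-enable-prefix-caching": True}
print("vllm serve deepseekv4-fp4 " + " ".join(to_cli_args(prefill)))
```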
diff --git a/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-2p1d-dep8-dep16.yaml b/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-2p1d-dep8-dep16.yaml
new file mode 100644
index 00000000..a46d9bf7
--- /dev/null
+++ b/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-2p1d-dep8-dep16.yaml
@@ -0,0 +1,88 @@
+name: "svf-vllm-disagg-gb200-2p1d-dep8-dep16"
+model:
+ path: "deepseekv4-fp4"
+ container: "vllm/vllm-openai:deepseekv4-cu130"
+ precision: "fp4"
+dynamo:
+ hash: 6a159fedd8e4a1563aa647c31f622aedbf254b5b
+setup_script: vllm-container-deps.sh
+resources:
+ gpu_type: "gb200"
+ gpus_per_node: 4
+ prefill_nodes: 4
+ decode_nodes: 4
+ prefill_workers: 2
+ decode_workers: 1
+ gpus_per_prefill: 8
+ gpus_per_decode: 16
+frontend:
+ type: dynamo
+ enable_multiple_frontends: false
+backend:
+ type: vllm
+ connector: null
+ prefill_environment:
+ TILELANG_CLEANUP_TEMP_FILES: "1"
+ VLLM_USE_NCCL_SYMM_MEM: "1"
+ NCCL_CUMEM_ENABLE: "1"
+ NCCL_MNNVL_ENABLE: "1"
+ NCCL_NVLS_ENABLE: "1"
+ VLLM_SERVER_DEV_MODE: "1"
+ decode_environment:
+ TILELANG_CLEANUP_TEMP_FILES: "1"
+ VLLM_USE_NCCL_SYMM_MEM: "1"
+ NCCL_CUMEM_ENABLE: "1"
+ NCCL_MNNVL_ENABLE: "1"
+ NCCL_NVLS_ENABLE: "1"
+ VLLM_SERVER_DEV_MODE: "1"
+ vllm_config:
+ prefill:
+ kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
+ served-model-name: "deepseek-ai/DeepSeek-V4-Pro"
+ kv-cache-dtype: "fp8"
+ tensor-parallel-size: 1
+ pipeline-parallel-size: 1
+ data-parallel-size: 8
+ data-parallel-rpc-port: 13345
+ enable-expert-parallel: true
+ enforce-eager: true
+ max-model-len: auto
+ max-num-seqs: 4
+ max-num-batched-tokens: 16384
+ trust-remote-code: true
+ no-enable-prefix-caching: true
+ no-enable-flashinfer-autotune: true
+ no-async-scheduling: true
+ block-size: 256
+ gpu-memory-utilization: 0.9
+ no-disable-hybrid-kv-cache-manager: true
+ enable-sleep-mode: true
+ decode:
+ kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
+ served-model-name: "deepseek-ai/DeepSeek-V4-Pro"
+ kv-cache-dtype: "fp8"
+ tensor-parallel-size: 1
+ pipeline-parallel-size: 1
+ data-parallel-size: 16
+ data-parallel-rpc-port: 13345
+ enable-expert-parallel: true
+ max-model-len: auto
+ max-num-seqs: 64
+ max-cudagraph-capture-size: 64
+ max-num-batched-tokens: 64
+ trust-remote-code: true
+ no-enable-prefix-caching: true
+ block-size: 256
+ compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","mode":0}'
+ gpu-memory-utilization: 0.9
+ stream-interval: 50
+ no-disable-hybrid-kv-cache-manager: true
+ enable-sleep-mode: true
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "1024"
+ req_rate: "inf"
+ custom_tokenizer: "deepseek_v4"
+ use_chat_template: false
diff --git a/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-4p1d-dep8-dep16.yaml b/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-4p1d-dep8-dep16.yaml
new file mode 100644
index 00000000..32089c84
--- /dev/null
+++ b/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-4p1d-dep8-dep16.yaml
@@ -0,0 +1,88 @@
+name: "svf-vllm-disagg-gb200-4p1d-dep8-dep16"
+model:
+ path: "deepseekv4-fp4"
+ container: "vllm/vllm-openai:deepseekv4-cu130"
+ precision: "fp4"
+dynamo:
+ hash: 6a159fedd8e4a1563aa647c31f622aedbf254b5b
+setup_script: vllm-container-deps.sh
+resources:
+ gpu_type: "gb200"
+ gpus_per_node: 4
+ prefill_nodes: 8
+ decode_nodes: 4
+ prefill_workers: 4
+ decode_workers: 1
+ gpus_per_prefill: 8
+ gpus_per_decode: 16
+frontend:
+ type: dynamo
+ enable_multiple_frontends: false
+backend:
+ type: vllm
+ connector: null
+ prefill_environment:
+ TILELANG_CLEANUP_TEMP_FILES: "1"
+ VLLM_USE_NCCL_SYMM_MEM: "1"
+ NCCL_CUMEM_ENABLE: "1"
+ NCCL_MNNVL_ENABLE: "1"
+ NCCL_NVLS_ENABLE: "1"
+ VLLM_SERVER_DEV_MODE: "1"
+ decode_environment:
+ TILELANG_CLEANUP_TEMP_FILES: "1"
+ VLLM_USE_NCCL_SYMM_MEM: "1"
+ NCCL_CUMEM_ENABLE: "1"
+ NCCL_MNNVL_ENABLE: "1"
+ NCCL_NVLS_ENABLE: "1"
+ VLLM_SERVER_DEV_MODE: "1"
+ vllm_config:
+ prefill:
+ kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
+ served-model-name: "deepseek-ai/DeepSeek-V4-Pro"
+ kv-cache-dtype: "fp8"
+ tensor-parallel-size: 1
+ pipeline-parallel-size: 1
+ data-parallel-size: 8
+ data-parallel-rpc-port: 13345
+ enable-expert-parallel: true
+ enforce-eager: true
+ max-model-len: auto
+ max-num-seqs: 4
+ max-num-batched-tokens: 16384
+ trust-remote-code: true
+ no-enable-prefix-caching: true
+ no-enable-flashinfer-autotune: true
+ no-async-scheduling: true
+ block-size: 256
+ gpu-memory-utilization: 0.9
+ no-disable-hybrid-kv-cache-manager: true
+ enable-sleep-mode: true
+ decode:
+ kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
+ served-model-name: "deepseek-ai/DeepSeek-V4-Pro"
+ kv-cache-dtype: "fp8"
+ tensor-parallel-size: 1
+ pipeline-parallel-size: 1
+ data-parallel-size: 16
+ data-parallel-rpc-port: 13345
+ enable-expert-parallel: true
+ max-model-len: auto
+ max-num-seqs: 256
+ max-cudagraph-capture-size: 256
+ max-num-batched-tokens: 256
+ trust-remote-code: true
+ no-enable-prefix-caching: true
+ block-size: 256
+ compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","mode":0}'
+ gpu-memory-utilization: 0.9
+ stream-interval: 50
+ no-disable-hybrid-kv-cache-manager: true
+ enable-sleep-mode: true
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "2048"
+ req_rate: "inf"
+ custom_tokenizer: "deepseek_v4"
+ use_chat_template: false
diff --git a/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-7p1d-dep8-dep16.yaml b/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-7p1d-dep8-dep16.yaml
new file mode 100644
index 00000000..1568e492
--- /dev/null
+++ b/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-7p1d-dep8-dep16.yaml
@@ -0,0 +1,87 @@
+name: "svf-vllm-disagg-gb200-7p1d-dep8-dep16"
+model:
+ path: "deepseekv4-fp4"
+ container: "vllm/vllm-openai:deepseekv4-cu130"
+ precision: "fp4"
+dynamo:
+ hash: 6a159fedd8e4a1563aa647c31f622aedbf254b5b
+setup_script: vllm-container-deps.sh
+resources:
+ gpu_type: "gb200"
+ gpus_per_node: 4
+ prefill_nodes: 14
+ decode_nodes: 4
+ prefill_workers: 7
+ decode_workers: 1
+ gpus_per_prefill: 8
+ gpus_per_decode: 16
+frontend:
+ type: dynamo
+ enable_multiple_frontends: false
+backend:
+ type: vllm
+ connector: null
+ prefill_environment:
+ TILELANG_CLEANUP_TEMP_FILES: "1"
+ VLLM_USE_NCCL_SYMM_MEM: "1"
+ NCCL_CUMEM_ENABLE: "1"
+ NCCL_MNNVL_ENABLE: "1"
+ NCCL_NVLS_ENABLE: "1"
+ VLLM_SERVER_DEV_MODE: "1"
+ decode_environment:
+ TILELANG_CLEANUP_TEMP_FILES: "1"
+ VLLM_USE_NCCL_SYMM_MEM: "1"
+ NCCL_CUMEM_ENABLE: "1"
+ NCCL_MNNVL_ENABLE: "1"
+ NCCL_NVLS_ENABLE: "1"
+ VLLM_SERVER_DEV_MODE: "1"
+ vllm_config:
+ prefill:
+ kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
+ served-model-name: "deepseek-ai/DeepSeek-V4-Pro"
+ kv-cache-dtype: "fp8"
+ tensor-parallel-size: 1
+ pipeline-parallel-size: 1
+ data-parallel-size: 8
+ data-parallel-rpc-port: 13345
+ enable-expert-parallel: true
+ enforce-eager: true
+ max-model-len: auto
+ max-num-seqs: 2
+ max-num-batched-tokens: 16384
+ trust-remote-code: true
+ no-enable-prefix-caching: true
+ no-enable-flashinfer-autotune: true
+ block-size: 256
+ gpu-memory-utilization: 0.88
+ no-disable-hybrid-kv-cache-manager: true
+ enable-sleep-mode: true
+ decode:
+ kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
+ served-model-name: "deepseek-ai/DeepSeek-V4-Pro"
+ kv-cache-dtype: "fp8"
+ tensor-parallel-size: 1
+ pipeline-parallel-size: 1
+ data-parallel-size: 16
+ data-parallel-rpc-port: 13345
+ enable-expert-parallel: true
+ max-model-len: auto
+ max-num-seqs: 256
+ max-cudagraph-capture-size: 256
+ max-num-batched-tokens: 256
+ trust-remote-code: true
+ no-enable-prefix-caching: true
+ block-size: 256
+ compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","mode":0}'
+ gpu-memory-utilization: 0.9
+ stream-interval: 50
+ no-disable-hybrid-kv-cache-manager: true
+ enable-sleep-mode: true
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "4096"
+ req_rate: "inf"
+ custom_tokenizer: "deepseek_v4"
+ use_chat_template: false
diff --git a/recipes/vllm/kimi-k2.5/1k1k/disagg-gb200-1p1d-dep4-dep16.yaml b/recipes/vllm/kimi-k2.5/1k1k/disagg-gb200-1p1d-dep4-dep16.yaml
new file mode 100644
index 00000000..ecdc9233
--- /dev/null
+++ b/recipes/vllm/kimi-k2.5/1k1k/disagg-gb200-1p1d-dep4-dep16.yaml
@@ -0,0 +1,101 @@
+name: "kimi-vllm-disagg-gb200-1p1d-dep4-dep16"
+
+model:
+ path: "kimi-k2.5-nvfp4"
+ container: "vllm/vllm-openai:v0.18.0-cu130"
+ precision: "fp4"
+
+dynamo:
+ version: 1.0.1
+ install: true
+
+setup_script: vllm-container-deps.sh
+
+resources:
+ gpu_type: "gb200"
+ gpus_per_node: 4
+ prefill_nodes: 1
+ decode_nodes: 4
+ prefill_workers: 1
+ decode_workers: 1
+ gpus_per_prefill: 4
+ gpus_per_decode: 16
+
+frontend:
+ type: dynamo
+ enable_multiple_frontends: false
+
+backend:
+ type: vllm
+ connector: null
+
+ prefill_environment:
+ VLLM_USE_FLASHINFER_MOE_FP4: "1"
+ VLLM_USE_NCCL_SYMM_MEM: "1"
+ NCCL_CUMEM_ENABLE: "1"
+ NCCL_MNNVL_ENABLE: "1"
+ NCCL_NVLS_ENABLE: "1"
+
+ decode_environment:
+ VLLM_USE_FLASHINFER_MOE_FP4: "1"
+ VLLM_USE_NCCL_SYMM_MEM: "1"
+ NCCL_CUMEM_ENABLE: "1"
+ NCCL_MNNVL_ENABLE: "1"
+ NCCL_NVLS_ENABLE: "1"
+
+ vllm_config:
+ prefill:
+ kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
+ served-model-name: "nvidia/Kimi-K2.5-NVFP4"
+ kv-cache-dtype: "fp8"
+ tensor-parallel-size: 1
+ pipeline-parallel-size: 1
+ data-parallel-size: 4
+ data-parallel-rpc-port: 13345
+ enable-expert-parallel: true
+ max-model-len: 3072
+ max-num-seqs: 4096
+ enforce-eager: true
+ compilation-config: '{"custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}'
+ max-num-batched-tokens: 16384
+ safetensors-load-strategy: "prefetch"
+ trust-remote-code: true
+ no-enable-prefix-caching: true
+ no-enable-chunked-prefill: true
+ attention-backend: "FLASHINFER_MLA"
+ block-size: 64
+ attention-config: '{"use_trtllm_ragged_deepseek_prefill": true}'
+ all2all-backend: "flashinfer_nvlink_one_sided"
+ gpu-memory-utilization: 0.9
+
+ decode:
+ kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
+ served-model-name: "nvidia/Kimi-K2.5-NVFP4"
+ kv-cache-dtype: "fp8"
+ tensor-parallel-size: 1
+ pipeline-parallel-size: 1
+ data-parallel-size: 16
+ data-parallel-rpc-port: 13345
+ enable-expert-parallel: true
+ max-model-len: 3072
+ max-num-seqs: 4096
+ max-num-batched-tokens: 10240
+ safetensors-load-strategy: "prefetch"
+ trust-remote-code: true
+ no-enable-prefix-caching: true
+ no-enable-chunked-prefill: true
+ async-scheduling: true
+ attention-backend: "FLASHINFER_MLA"
+ block-size: 64
+ all2all-backend: "flashinfer_nvlink_one_sided"
+ compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}'
+ gpu-memory-utilization: 0.9
+ stream-interval: 50
+ max-cudagraph-capture-size: 512
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "256x512x1024x2048x3072x4096"
+ req_rate: "inf"
diff --git a/recipes/vllm/kimi-k2.5/1k1k/disagg-gb200-1p4d-dep4-tep4.yaml b/recipes/vllm/kimi-k2.5/1k1k/disagg-gb200-1p4d-dep4-tep4.yaml
new file mode 100644
index 00000000..43167b5f
--- /dev/null
+++ b/recipes/vllm/kimi-k2.5/1k1k/disagg-gb200-1p4d-dep4-tep4.yaml
@@ -0,0 +1,98 @@
+name: "kimi-vllm-disagg-gb200-1p4d-dep4-tep4"
+
+model:
+ path: "kimi-k2.5-nvfp4"
+ container: "vllm/vllm-openai:v0.18.0-cu130"
+ precision: "fp4"
+
+dynamo:
+ version: 1.0.1
+ install: true
+
+setup_script: vllm-container-deps.sh
+
+resources:
+ gpu_type: "gb200"
+ gpus_per_node: 4
+ prefill_nodes: 1
+ decode_nodes: 4
+ prefill_workers: 1
+ decode_workers: 4
+ gpus_per_prefill: 4
+ gpus_per_decode: 4
+
+frontend:
+ type: dynamo
+ enable_multiple_frontends: false
+
+backend:
+ type: vllm
+ connector: null
+
+ prefill_environment:
+ VLLM_USE_FLASHINFER_MOE_FP4: "1"
+ VLLM_USE_NCCL_SYMM_MEM: "1"
+ NCCL_CUMEM_ENABLE: "1"
+ NCCL_MNNVL_ENABLE: "1"
+ NCCL_NVLS_ENABLE: "1"
+
+ decode_environment:
+ VLLM_USE_FLASHINFER_MOE_FP4: "1"
+ VLLM_USE_NCCL_SYMM_MEM: "1"
+ NCCL_CUMEM_ENABLE: "1"
+ NCCL_MNNVL_ENABLE: "1"
+ NCCL_NVLS_ENABLE: "1"
+
+ vllm_config:
+ prefill:
+ kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
+ served-model-name: "nvidia/Kimi-K2.5-NVFP4"
+ kv-cache-dtype: "fp8"
+ tensor-parallel-size: 1
+ pipeline-parallel-size: 1
+ data-parallel-size: 4
+ data-parallel-rpc-port: 13345
+ enable-expert-parallel: true
+ max-model-len: 3072
+ max-num-seqs: 1024
+ enforce-eager: true
+ compilation-config: '{"custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}'
+ max-num-batched-tokens: 16384
+ safetensors-load-strategy: "prefetch"
+ trust-remote-code: true
+ no-enable-prefix-caching: true
+ no-enable-chunked-prefill: true
+ attention-backend: "FLASHINFER_MLA"
+ block-size: 64
+ attention-config: '{"use_trtllm_ragged_deepseek_prefill": true}'
+ all2all-backend: "flashinfer_nvlink_one_sided"
+ gpu-memory-utilization: 0.9
+
+ decode:
+ kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
+ served-model-name: "nvidia/Kimi-K2.5-NVFP4"
+ kv-cache-dtype: "fp8"
+ tensor-parallel-size: 4
+ pipeline-parallel-size: 1
+ enable-expert-parallel: true
+ max-model-len: 3072
+ max-num-seqs: 1024
+ max-num-batched-tokens: 10240
+ safetensors-load-strategy: "prefetch"
+ trust-remote-code: true
+ no-enable-prefix-caching: true
+ no-enable-chunked-prefill: true
+ async-scheduling: true
+ attention-backend: "FLASHINFER_MLA"
+ block-size: 64
+ compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}'
+ gpu-memory-utilization: 0.9
+ stream-interval: 50
+ max-cudagraph-capture-size: 1024
+
+benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ concurrencies: "4x8x16x32x64x128"
+ req_rate: "inf"
diff --git a/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-1p4d-dep4-tep4.yaml b/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-1p4d-dep4-tep4.yaml
new file mode 100644
index 00000000..1ab6ca27
--- /dev/null
+++ b/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-1p4d-dep4-tep4.yaml
@@ -0,0 +1,98 @@
+name: "kimi-vllm-disagg-gb200-1p4d-dep4-tep4"
+
+model:
+ path: "kimi-k2.5-nvfp4"
+ container: "vllm/vllm-openai:v0.18.0-cu130"
+ precision: "fp4"
+
+dynamo:
+ version: 1.0.1
+ install: true
+
+setup_script: vllm-container-deps.sh
+
+resources:
+ gpu_type: "gb200"
+ gpus_per_node: 4
+ prefill_nodes: 1
+ decode_nodes: 4
+ prefill_workers: 1
+ decode_workers: 4
+ gpus_per_prefill: 4
+ gpus_per_decode: 4
+
+frontend:
+ type: dynamo
+ enable_multiple_frontends: false
+
+backend:
+ type: vllm
+ connector: null
+
+ prefill_environment:
+ VLLM_USE_FLASHINFER_MOE_FP4: "1"
+ VLLM_USE_NCCL_SYMM_MEM: "1"
+ NCCL_CUMEM_ENABLE: "1"
+ NCCL_MNNVL_ENABLE: "1"
+ NCCL_NVLS_ENABLE: "1"
+
+ decode_environment:
+ VLLM_USE_FLASHINFER_MOE_FP4: "1"
+ VLLM_USE_NCCL_SYMM_MEM: "1"
+ NCCL_CUMEM_ENABLE: "1"
+ NCCL_MNNVL_ENABLE: "1"
+ NCCL_NVLS_ENABLE: "1"
+
+ vllm_config:
+ prefill:
+ kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
+ served-model-name: "nvidia/Kimi-K2.5-NVFP4"
+ kv-cache-dtype: "fp8"
+ tensor-parallel-size: 1
+ pipeline-parallel-size: 1
+ data-parallel-size: 4
+ data-parallel-rpc-port: 13345
+ enable-expert-parallel: true
+ max-model-len: 10240
+ max-num-seqs: 64
+ enforce-eager: true
+ compilation-config: '{"custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}'
+ max-num-batched-tokens: 16384
+ safetensors-load-strategy: "prefetch"
+ trust-remote-code: true
+ no-enable-prefix-caching: true
+ no-enable-chunked-prefill: true
+ attention-backend: "FLASHINFER_MLA"
+ block-size: 64
+ attention-config: '{"use_trtllm_ragged_deepseek_prefill": true}'
+ all2all-backend: "flashinfer_nvlink_one_sided"
+ gpu-memory-utilization: 0.9
+
+ decode:
+ kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
+ served-model-name: "nvidia/Kimi-K2.5-NVFP4"
+ kv-cache-dtype: "fp8"
+ tensor-parallel-size: 4
+ pipeline-parallel-size: 1
+ enable-expert-parallel: true
+ max-model-len: 10240
+ max-num-seqs: 16
+ max-num-batched-tokens: 10240
+ safetensors-load-strategy: "prefetch"
+ trust-remote-code: true
+ no-enable-prefix-caching: true
+ no-enable-chunked-prefill: true
+ async-scheduling: true
+ attention-backend: "FLASHINFER_MLA"
+ block-size: 64
+ compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}'
+ gpu-memory-utilization: 0.9
+ stream-interval: 50
+ max-cudagraph-capture-size: 16
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "4x8x16x32x128"
+ req_rate: "inf"
diff --git a/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-3p1d-dep4-dep16.yaml b/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-3p1d-dep4-dep16.yaml
new file mode 100644
index 00000000..ca4e9813
--- /dev/null
+++ b/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-3p1d-dep4-dep16.yaml
@@ -0,0 +1,101 @@
+name: "kimi-vllm-disagg-gb200-3p1d-dep4-dep16"
+
+model:
+ path: "kimi-k2.5-nvfp4"
+ container: "vllm/vllm-openai:v0.18.0-cu130"
+ precision: "fp4"
+
+dynamo:
+ version: 1.0.1
+ install: true
+
+setup_script: vllm-container-deps.sh
+
+resources:
+ gpu_type: "gb200"
+ gpus_per_node: 4
+ prefill_nodes: 3
+ decode_nodes: 4
+ prefill_workers: 3
+ decode_workers: 1
+ gpus_per_prefill: 4
+ gpus_per_decode: 16
+
+frontend:
+ type: dynamo
+ enable_multiple_frontends: false
+
+backend:
+ type: vllm
+ connector: null
+
+ prefill_environment:
+ VLLM_USE_FLASHINFER_MOE_FP4: "1"
+ VLLM_USE_NCCL_SYMM_MEM: "1"
+ NCCL_CUMEM_ENABLE: "1"
+ NCCL_MNNVL_ENABLE: "1"
+ NCCL_NVLS_ENABLE: "1"
+
+ decode_environment:
+ VLLM_USE_FLASHINFER_MOE_FP4: "1"
+ VLLM_USE_NCCL_SYMM_MEM: "1"
+ NCCL_CUMEM_ENABLE: "1"
+ NCCL_MNNVL_ENABLE: "1"
+ NCCL_NVLS_ENABLE: "1"
+
+ vllm_config:
+ prefill:
+ kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
+ served-model-name: "nvidia/Kimi-K2.5-NVFP4"
+ kv-cache-dtype: "fp8"
+ tensor-parallel-size: 1
+ pipeline-parallel-size: 1
+ data-parallel-size: 4
+ data-parallel-rpc-port: 13345
+ enable-expert-parallel: true
+ max-model-len: 10240
+ max-num-seqs: 64
+ enforce-eager: true
+ compilation-config: '{"custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}'
+ max-num-batched-tokens: 16384
+ safetensors-load-strategy: "prefetch"
+ trust-remote-code: true
+ no-enable-prefix-caching: true
+ no-enable-chunked-prefill: true
+ attention-backend: "FLASHINFER_MLA"
+ block-size: 64
+ attention-config: '{"use_trtllm_ragged_deepseek_prefill": true}'
+ all2all-backend: "flashinfer_nvlink_one_sided"
+ gpu-memory-utilization: 0.9
+
+ decode:
+ kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
+ served-model-name: "nvidia/Kimi-K2.5-NVFP4"
+ kv-cache-dtype: "fp8"
+ tensor-parallel-size: 1
+ pipeline-parallel-size: 1
+ data-parallel-size: 16
+ data-parallel-rpc-port: 13345
+ enable-expert-parallel: true
+ max-model-len: 10240
+ max-num-seqs: 256
+ max-num-batched-tokens: 10240
+ safetensors-load-strategy: "prefetch"
+ trust-remote-code: true
+ no-enable-prefix-caching: true
+ no-enable-chunked-prefill: true
+ async-scheduling: true
+ attention-backend: "FLASHINFER_MLA"
+ block-size: 64
+ all2all-backend: "flashinfer_nvlink_one_sided"
+ compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}'
+ gpu-memory-utilization: 0.9
+ stream-interval: 50
+ max-cudagraph-capture-size: 256
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "512x1024"
+ req_rate: "inf"
diff --git a/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-5p1d-dep4-dep8.yaml b/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-5p1d-dep4-dep8.yaml
new file mode 100644
index 00000000..cd9f94a9
--- /dev/null
+++ b/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-5p1d-dep4-dep8.yaml
@@ -0,0 +1,101 @@
+name: "kimi-vllm-disagg-gb200-5p1d-dep4-dep8"
+
+model:
+ path: "kimi-k2.5-nvfp4"
+ container: "vllm/vllm-openai:v0.18.0-cu130"
+ precision: "fp4"
+
+dynamo:
+ version: 1.0.1
+ install: true
+
+setup_script: vllm-container-deps.sh
+
+resources:
+ gpu_type: "gb200"
+ gpus_per_node: 4
+ prefill_nodes: 5
+ decode_nodes: 2
+ prefill_workers: 5
+ decode_workers: 1
+ gpus_per_prefill: 4
+ gpus_per_decode: 8
+
+frontend:
+ type: dynamo
+ enable_multiple_frontends: false
+
+backend:
+ type: vllm
+ connector: null
+
+ prefill_environment:
+ VLLM_USE_FLASHINFER_MOE_FP4: "1"
+ VLLM_USE_NCCL_SYMM_MEM: "1"
+ NCCL_CUMEM_ENABLE: "1"
+ NCCL_MNNVL_ENABLE: "1"
+ NCCL_NVLS_ENABLE: "1"
+
+ decode_environment:
+ VLLM_USE_FLASHINFER_MOE_FP4: "1"
+ VLLM_USE_NCCL_SYMM_MEM: "1"
+ NCCL_CUMEM_ENABLE: "1"
+ NCCL_MNNVL_ENABLE: "1"
+ NCCL_NVLS_ENABLE: "1"
+
+ vllm_config:
+ prefill:
+ kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
+ served-model-name: "nvidia/Kimi-K2.5-NVFP4"
+ kv-cache-dtype: "fp8"
+ tensor-parallel-size: 1
+ pipeline-parallel-size: 1
+ data-parallel-size: 4
+ data-parallel-rpc-port: 13345
+ enable-expert-parallel: true
+ max-model-len: 10240
+ max-num-seqs: 64
+ enforce-eager: true
+ compilation-config: '{"custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}'
+ max-num-batched-tokens: 16384
+ safetensors-load-strategy: "prefetch"
+ trust-remote-code: true
+ no-enable-prefix-caching: true
+ no-enable-chunked-prefill: true
+ attention-backend: "FLASHINFER_MLA"
+ block-size: 64
+ attention-config: '{"use_trtllm_ragged_deepseek_prefill": true}'
+ all2all-backend: "flashinfer_nvlink_one_sided"
+ gpu-memory-utilization: 0.9
+
+ decode:
+ kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
+ served-model-name: "nvidia/Kimi-K2.5-NVFP4"
+ kv-cache-dtype: "fp8"
+ tensor-parallel-size: 1
+ pipeline-parallel-size: 1
+ data-parallel-size: 8
+ data-parallel-rpc-port: 13345
+ enable-expert-parallel: true
+ max-model-len: 10240
+ max-num-seqs: 512
+ max-num-batched-tokens: 10240
+ safetensors-load-strategy: "prefetch"
+ trust-remote-code: true
+ no-enable-prefix-caching: true
+ no-enable-chunked-prefill: true
+ async-scheduling: true
+ attention-backend: "FLASHINFER_MLA"
+ block-size: 64
+ all2all-backend: "flashinfer_nvlink_one_sided"
+ compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}'
+ gpu-memory-utilization: 0.9
+ stream-interval: 50
+ max-cudagraph-capture-size: 512
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "2048"
+ req_rate: "inf"
diff --git a/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-6p1d-dep4-dep16.yaml b/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-6p1d-dep4-dep16.yaml
new file mode 100644
index 00000000..47d3d7ee
--- /dev/null
+++ b/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-6p1d-dep4-dep16.yaml
@@ -0,0 +1,101 @@
+name: "kimi-vllm-disagg-gb200-6p1d-dep4-dep16"
+
+model:
+ path: "kimi-k2.5-nvfp4"
+ container: "vllm/vllm-openai:v0.18.0-cu130"
+ precision: "fp4"
+
+dynamo:
+ version: 1.0.1
+ install: true
+
+setup_script: vllm-container-deps.sh
+
+resources:
+ gpu_type: "gb200"
+ gpus_per_node: 4
+ prefill_nodes: 6
+ decode_nodes: 4
+ prefill_workers: 6
+ decode_workers: 1
+ gpus_per_prefill: 4
+ gpus_per_decode: 16
+
+frontend:
+ type: dynamo
+ enable_multiple_frontends: false
+
+backend:
+ type: vllm
+ connector: null
+
+ prefill_environment:
+ VLLM_USE_FLASHINFER_MOE_FP4: "1"
+ VLLM_USE_NCCL_SYMM_MEM: "1"
+ NCCL_CUMEM_ENABLE: "1"
+ NCCL_MNNVL_ENABLE: "1"
+ NCCL_NVLS_ENABLE: "1"
+
+ decode_environment:
+ VLLM_USE_FLASHINFER_MOE_FP4: "1"
+ VLLM_USE_NCCL_SYMM_MEM: "1"
+ NCCL_CUMEM_ENABLE: "1"
+ NCCL_MNNVL_ENABLE: "1"
+ NCCL_NVLS_ENABLE: "1"
+
+ vllm_config:
+ prefill:
+ kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
+ served-model-name: "nvidia/Kimi-K2.5-NVFP4"
+ kv-cache-dtype: "fp8"
+ tensor-parallel-size: 1
+ pipeline-parallel-size: 1
+ data-parallel-size: 4
+ data-parallel-rpc-port: 13345
+ enable-expert-parallel: true
+ max-model-len: 10240
+ max-num-seqs: 64
+ enforce-eager: true
+ compilation-config: '{"custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}'
+ max-num-batched-tokens: 16384
+ safetensors-load-strategy: "prefetch"
+ trust-remote-code: true
+ no-enable-prefix-caching: true
+ no-enable-chunked-prefill: true
+ attention-backend: "FLASHINFER_MLA"
+ block-size: 64
+ attention-config: '{"use_trtllm_ragged_deepseek_prefill": true}'
+ all2all-backend: "flashinfer_nvlink_one_sided"
+ gpu-memory-utilization: 0.9
+
+ decode:
+ kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
+ served-model-name: "nvidia/Kimi-K2.5-NVFP4"
+ kv-cache-dtype: "fp8"
+ tensor-parallel-size: 1
+ pipeline-parallel-size: 1
+ data-parallel-size: 16
+ data-parallel-rpc-port: 13345
+ enable-expert-parallel: true
+ max-model-len: 10240
+ max-num-seqs: 512
+ max-num-batched-tokens: 10240
+ safetensors-load-strategy: "prefetch"
+ trust-remote-code: true
+ no-enable-prefix-caching: true
+ no-enable-chunked-prefill: true
+ async-scheduling: true
+ attention-backend: "FLASHINFER_MLA"
+ block-size: 64
+ all2all-backend: "flashinfer_nvlink_one_sided"
+ compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}'
+ gpu-memory-utilization: 0.9
+ stream-interval: 50
+ max-cudagraph-capture-size: 512
+
+benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "3072x4096"
+ req_rate: "inf"
diff --git a/recipes/vllm/minimax-m2.5/b200-fp4/1k1k.yaml b/recipes/vllm/minimax-m2.5/b200-fp4/1k1k.yaml
new file mode 100644
index 00000000..daef7b0d
--- /dev/null
+++ b/recipes/vllm/minimax-m2.5/b200-fp4/1k1k.yaml
@@ -0,0 +1,103 @@
+# MiniMax-M2.5 NVFP4 B200 — 1K/1K ISL/OSL
+# Aggregated vLLM, single-node
+# requires github.com/NVIDIA/srt-slurm, branch sa-submission-q2-2026
+# usage examples:
+# srtctl apply -f 1k1k.yaml # run all variants
+# srtctl apply -f 1k1k.yaml:zip_override_lowlat # full lowlat sweep
+# srtctl apply -f 1k1k.yaml:zip_override_lowlat[2] # lowlat, tep2 variant only
+# srtctl apply -f 1k1k.yaml:override_maxtput # maxtput, dep2 variant only
+# srtctl apply -f 1k1k.yaml:zip_override_hightput # full high tput sweep
+# srtctl dry-run -f 1k1k.yaml # preview the variants
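+#
+# zip_override_* sections zip their lists index-wise, and the [N] selector is a
+# 0-based index into those lists: zip_override_lowlat[2] pairs gpus_per_agg=2
+# with tensor-parallel-size=2 and enable-expert-parallel=true (the tep2 variant).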
+
+base:
+ name: "minimax-m2.5-nvfp4-b200-1k1k"
+
+ model:
+ path: "minimax_m2.5_fp4"
+ container: "vllm/vllm-openai:v0.19.0-cu130"
+ precision: "fp4"
+
+ resources:
+ gpu_type: "b200"
+ gpus_per_node: 8
+ agg_nodes: 1
+ agg_workers: 1
+ gpus_per_agg: 1
+
+ frontend:
+ type: dynamo
+ enable_multiple_frontends: false
+
+ dynamo:
+ install: true
+ top_of_tree: true # currently need ToT for vllm 0.19.0
+
+ setup_script: vllm-container-deps.sh
+
+ backend:
+ type: vllm
+
+ aggregated_environment:
+ DYN_HEALTH_CHECK_ENABLED: "false"
+ PYTHONUNBUFFERED: "1"
+
+ vllm_config:
+ aggregated:
+ tensor-parallel-size: 1
+ gpu-memory-utilization: 0.90
+ max-model-len: 2248
+ max-num-batched-tokens: 2048
+ kv-cache-dtype: fp8
+ max-cudagraph-capture-size: 2048
+ stream-interval: 20
+ no-enable-prefix-caching: true
+ trust-remote-code: true
+
+ benchmark:
+ type: "sa-bench"
+ isl: 1024
+ osl: 1024
+ req_rate: "inf"
+
+
+zip_override_lowlat:
+ name:
+ - "minimax-m2.5-nvfp4-b200-1k1k-lowlat-tp1"
+ - "minimax-m2.5-nvfp4-b200-1k1k-lowlat-tp2"
+ - "minimax-m2.5-nvfp4-b200-1k1k-lowlat-tep2"
+ resources:
+ gpus_per_agg: [1, 2, 2]
+ backend:
+ vllm_config:
+ aggregated:
+ tensor-parallel-size: [1, 2, 2]
+ enable-expert-parallel: [false, false, true]
+ benchmark:
+ concurrencies: ["4","4x8x16x32x64x128x256x512","128x256"]
+
+override_maxtput:
+ name: "minimax-m2.5-nvfp4-b200-1k1k-maxtput-dep2"
+ resources:
+ gpus_per_agg: 2
+ backend:
+ vllm_config:
+ aggregated:
+ tensor-parallel-size: 1
+ enable-expert-parallel: true
+ data-parallel-size: 2
+ benchmark:
+ concurrencies: "512"
+
+zip_override_hightput:
+ name:
+ - "minimax-m2.5-nvfp4-b200-1k1k-hightput-tp4"
+ - "minimax-m2.5-nvfp4-b200-1k1k-hightput-tep4"
+ - "minimax-m2.5-nvfp4-b200-1k1k-hightput-tp8"
+ resources:
+ gpus_per_agg: [4, 4, 8]
+ backend:
+ vllm_config:
+ aggregated:
+ tensor-parallel-size: [4, 4, 8]
+ enable-expert-parallel: [false, true, false]
+ benchmark:
+ concurrencies: ["4x8x16x32x64x128x256x512", "32x64x128", "4"]
diff --git a/recipes/vllm/minimax-m2.5/b200-fp4/8k1k.yaml b/recipes/vllm/minimax-m2.5/b200-fp4/8k1k.yaml
new file mode 100644
index 00000000..7d817e73
--- /dev/null
+++ b/recipes/vllm/minimax-m2.5/b200-fp4/8k1k.yaml
@@ -0,0 +1,88 @@
+# MiniMax-M2.5 NVFP4 B200 — 8K/1K ISL/OSL
+# Aggregated vLLM, single-node
+# requires github.com/NVIDIA/srt-slurm, branch sa-submission-q2-2026
+# usage examples:
+# srtctl apply -f 8k1k.yaml # run all variants
+# srtctl apply -f 8k1k.yaml:zip_override_lowlat # full lowlat sweep
+# srtctl apply -f 8k1k.yaml:zip_override_lowlat[2] # lowlat, tep2 variant only
+# srtctl apply -f 8k1k.yaml:zip_override_maxtput # full max tput sweep
+# srtctl dry-run -f 8k1k.yaml # preview the variants
+
+base:
+ name: "minimax-m2.5-nvfp4-b200-8k1k"
+
+ model:
+ path: "minimax_m2.5_fp4"
+ container: "vllm/vllm-openai:v0.19.0-cu130"
+ precision: "fp4"
+
+ resources:
+ gpu_type: "b200"
+ gpus_per_node: 8
+ agg_nodes: 1
+ agg_workers: 1
+ gpus_per_agg: 1
+
+ frontend:
+ type: dynamo
+ enable_multiple_frontends: false
+
+ dynamo:
+ install: true
+ top_of_tree: true # currently need ToT for vllm 0.19.0
+
+ setup_script: vllm-container-deps.sh
+
+ backend:
+ type: vllm
+
+ aggregated_environment:
+ DYN_HEALTH_CHECK_ENABLED: "false"
+ PYTHONUNBUFFERED: "1"
+
+ vllm_config:
+ aggregated:
+ tensor-parallel-size: 1
+ gpu-memory-utilization: 0.90
+ max-model-len: 9416
+ max-num-batched-tokens: 16384
+ kv-cache-dtype: fp8
+ max-cudagraph-capture-size: 2048
+ stream-interval: 20
+ no-enable-prefix-caching: true
+ trust-remote-code: true
+
+ benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ req_rate: "inf"
+
+zip_override_lowlat:
+ name:
+ - "minimax-m2.5-nvfp4-b200-8k1k-lowlat-tp1"
+ - "minimax-m2.5-nvfp4-b200-8k1k-lowlat-tp2"
+ - "minimax-m2.5-nvfp4-b200-8k1k-lowlat-tep2"
+ resources:
+ gpus_per_agg: [1, 2, 2]
+ backend:
+ vllm_config:
+ aggregated:
+ tensor-parallel-size: [1, 2, 2]
+ enable-expert-parallel: [false, false, true]
+ benchmark:
+ concurrencies: ["4x8x16x32x256x512", "4x8x16x32x64x128x256x512", "128x256x512"]
+
+zip_override_maxtput:
+ name:
+ - "minimax-m2.5-nvfp4-b200-8k1k-maxtput-tp4"
+ - "minimax-m2.5-nvfp4-b200-8k1k-maxtput-tp8"
+ resources:
+ gpus_per_agg: [4, 8]
+ backend:
+ vllm_config:
+ aggregated:
+ tensor-parallel-size: [4, 8]
+ enable-expert-parallel: false
+ benchmark:
+ concurrencies: ["4x8x16x32x64x128x256x512", "4"]
diff --git a/src/srtctl/backends/vllm.py b/src/srtctl/backends/vllm.py
index ff20cb40..1acbd50c 100644
--- a/src/srtctl/backends/vllm.py
+++ b/src/srtctl/backends/vllm.py
@@ -132,12 +132,16 @@ def get_process_environment(self, process: Process) -> dict[str, str]:
vLLM with dynamo requires unique ports for each worker:
- DYN_VLLM_KV_EVENT_PORT: ZMQ port for KV events publishing
- VLLM_NIXL_SIDE_CHANNEL_PORT: Port for NIXL side channel transfers
+ - VLLM_NIXL_SIDE_CHANNEL_HOST: Routable IP for NIXL side channel (not 0.0.0.0/localhost)
"""
+ from srtctl.core.slurm import get_hostname_ip
+
env: dict[str, str] = {}
if process.kv_events_port is not None:
env["DYN_VLLM_KV_EVENT_PORT"] = str(process.kv_events_port)
if process.nixl_port is not None:
env["VLLM_NIXL_SIDE_CHANNEL_PORT"] = str(process.nixl_port)
+ env["VLLM_NIXL_SIDE_CHANNEL_HOST"] = get_hostname_ip(process.node)
return env
def get_served_model_name(self, default: str) -> str:
diff --git a/src/srtctl/benchmarks/__init__.py b/src/srtctl/benchmarks/__init__.py
index 3a2d6449..088617a6 100644
--- a/src/srtctl/benchmarks/__init__.py
+++ b/src/srtctl/benchmarks/__init__.py
@@ -4,7 +4,7 @@
"""Benchmark runners for srtctl."""
# Import runners to trigger registration
-from srtctl.benchmarks import gpqa, gsm8k, longbenchv2, mmlu, mooncake_router, router, sa_bench, sglang_bench
+from srtctl.benchmarks import gpqa, gsm8k, lm_eval, longbenchv2, mmlu, mooncake_router, router, sa_bench, sglang_bench
from srtctl.benchmarks.base import (
BenchmarkRunner,
get_runner,
@@ -18,6 +18,7 @@
"list_benchmarks",
"register_benchmark",
# Runners
+ "lm_eval",
"sa_bench",
"sglang_bench",
"mmlu",
diff --git a/src/srtctl/benchmarks/lm_eval.py b/src/srtctl/benchmarks/lm_eval.py
new file mode 100644
index 00000000..c63ec097
--- /dev/null
+++ b/src/srtctl/benchmarks/lm_eval.py
@@ -0,0 +1,58 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2026 SemiAnalysis LLC. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""lm-eval benchmark runner for InferenceX evals."""
+
+from __future__ import annotations
+
+from typing import TYPE_CHECKING
+
+from srtctl.benchmarks.base import SCRIPTS_DIR, BenchmarkRunner, register_benchmark
+
+if TYPE_CHECKING:
+ from srtctl.core.runtime import RuntimeContext
+ from srtctl.core.schema import SrtConfig
+
+
+@register_benchmark("lm-eval")
+class LMEvalRunner(BenchmarkRunner):
+ """lm-eval accuracy evaluation using InferenceX benchmark_lib.
+
+ Runs lm-eval via the InferenceX benchmark_lib.sh harness,
+ which handles task selection, result collection, and summary generation.
+ """
+
+ @property
+ def name(self) -> str:
+ return "lm-eval"
+
+ @property
+ def script_path(self) -> str:
+ return "/srtctl-benchmarks/lm-eval/bench.sh"
+
+ @property
+ def local_script_dir(self) -> str:
+ return str(SCRIPTS_DIR / "lm-eval")
+
+ def validate_config(self, config: SrtConfig) -> list[str]:
+ # lm-eval has sensible defaults
+ return []
+
+ def build_command(
+ self,
+ config: SrtConfig,
+ runtime: RuntimeContext,
+ ) -> list[str]:
+ endpoint = f"http://localhost:{runtime.frontend_port}"
+ # Always use the container mount path, not the host path.
+ # INFMAX_WORKSPACE env var contains the host path (used for mount setup
+ # in runtime.py), but inside the container it's at /infmax-workspace.
+ infmax_workspace = "/infmax-workspace"
+
+ return [
+ "bash",
+ self.script_path,
+ endpoint,
+ infmax_workspace,
+ ]
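+
+
+# Illustrative: with the default frontend port (8000) build_command resolves to
+# bash /srtctl-benchmarks/lm-eval/bench.sh http://localhost:8000 /infmax-workspace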
diff --git a/src/srtctl/benchmarks/sa_bench.py b/src/srtctl/benchmarks/sa_bench.py
index 9adc6678..5f220393 100644
--- a/src/srtctl/benchmarks/sa_bench.py
+++ b/src/srtctl/benchmarks/sa_bench.py
@@ -97,5 +97,9 @@ def build_command(
str(prefill_gpus),
str(decode_gpus),
str(b.random_range_ratio) if b.random_range_ratio is not None else "0.8",
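+ # Args 13-16 below are read by bench.sh as NUM_PROMPTS_MULT, NUM_WARMUP_MULT,
+ # CUSTOM_TOKENIZER, and USE_CHAT_TEMPLATE; the fallbacks here mirror its
+ # ${13:-10} / ${14:-2} / ${15:-} / ${16:-true} defaults.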
+ str(b.num_prompts_mult) if b.num_prompts_mult is not None else "10",
+ str(b.num_warmup_mult) if b.num_warmup_mult is not None else "2",
+ b.custom_tokenizer or "",
+ str(b.use_chat_template).lower(),
]
return cmd
diff --git a/src/srtctl/benchmarks/scripts/lm-eval/bench.sh b/src/srtctl/benchmarks/scripts/lm-eval/bench.sh
new file mode 100755
index 00000000..a10e4e7d
--- /dev/null
+++ b/src/srtctl/benchmarks/scripts/lm-eval/bench.sh
@@ -0,0 +1,77 @@
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2026 SemiAnalysis LLC. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+# lm-eval accuracy evaluation using InferenceX benchmark_lib
+# Expects: endpoint [infmax_workspace]
+
+set -e
+
+ENDPOINT=$1
+INFMAX_WORKSPACE=${2:-/infmax-workspace}
+
+# Extract HOST and PORT from endpoint (e.g., http://localhost:8000)
+HOST=$(echo "$ENDPOINT" | sed -E 's|https?://||; s|:.*||')
+PORT=$(echo "$ENDPOINT" | sed -E 's|.*:([0-9]+).*|\1|')
+
+echo "lm-eval Config: endpoint=${ENDPOINT}; host=${HOST}; port=${PORT}; workspace=${INFMAX_WORKSPACE}"
+
+# Auto-discover the served model name from /v1/models if MODEL_NAME is not set.
+# This ensures we use the exact name the server recognizes, regardless of what
+# $MODEL (the HuggingFace ID from the workflow) is set to.
+if [[ -z "${MODEL_NAME:-}" ]]; then
+ DISCOVERED_MODEL=$(curl -sf "${ENDPOINT}/v1/models" 2>/dev/null \
+ | python3 -c "import sys,json; d=json.load(sys.stdin); print(d['data'][0]['id'])" 2>/dev/null || true)
+ if [[ -n "$DISCOVERED_MODEL" ]]; then
+ export MODEL_NAME="$DISCOVERED_MODEL"
+ echo "Auto-discovered MODEL_NAME from /v1/models: ${MODEL_NAME}"
+ else
+ echo "WARNING: Could not discover model name from /v1/models, using MODEL_NAME=${MODEL_NAME:-$MODEL}"
+ fi
+else
+ echo "Using MODEL_NAME from environment: ${MODEL_NAME}"
+fi
+
+# cd to workspace so that relative paths (e.g., utils/evals/*.yaml) resolve
+cd "${INFMAX_WORKSPACE}"
+
+# Source the InferenceX benchmark library
+source "${INFMAX_WORKSPACE}/benchmarks/benchmark_lib.sh"
+
+# Run lm-eval via benchmark_lib
+# EVAL_CONC is set by the InferenceX workflow (median of conc list).
+# benchmark_lib reads concurrency from EVAL_CONCURRENT_REQUESTS env var.
+export EVAL_CONCURRENT_REQUESTS="${EVAL_CONC:-${EVAL_CONCURRENT_REQUESTS:-256}}"
+echo "Running lm-eval with concurrent-requests=${EVAL_CONCURRENT_REQUESTS}..."
+eval_rc=0
+run_eval --framework lm-eval --port "$PORT" || eval_rc=$?
+
+# Derive metadata env vars that append_lm_eval_summary needs but do_sweep.py
+# does not pass directly (it passes PREFILL_TP/EP/etc, not TP/EP_SIZE/CONC).
+export IS_MULTINODE="${IS_MULTINODE:-true}"
+export TP="${TP:-${PREFILL_TP:-1}}"
+export CONC="${CONC:-${EVAL_CONC:-${EVAL_CONCURRENT_REQUESTS:-1}}}"
+export EP_SIZE="${EP_SIZE:-${PREFILL_EP:-1}}"
+export DP_ATTENTION="${DP_ATTENTION:-${PREFILL_DP_ATTN:-false}}"
+# Remap srt-slurm's DP_ATTN names to InferenceX's DP_ATTENTION names
+export PREFILL_DP_ATTENTION="${PREFILL_DP_ATTENTION:-${PREFILL_DP_ATTN:-${DP_ATTENTION:-false}}}"
+export DECODE_DP_ATTENTION="${DECODE_DP_ATTENTION:-${DECODE_DP_ATTN:-${DP_ATTENTION:-false}}}"
+
+# Generate the lm-eval summary
+echo "Generating lm-eval summary..."
+append_lm_eval_summary || true
+
+# Copy eval artifacts to /logs/eval_results/
+mkdir -p /logs/eval_results
+echo "Copying eval artifacts to /logs/eval_results/..."
+cp -v meta_env.json /logs/eval_results/ 2>/dev/null || true
+cp -v results*.json /logs/eval_results/ 2>/dev/null || true
+cp -v sample*.jsonl /logs/eval_results/ 2>/dev/null || true
+
+if [[ "$eval_rc" -ne 0 ]]; then
+ echo "lm-eval evaluation failed with exit code ${eval_rc}"
+ exit "$eval_rc"
+fi
+
+echo "lm-eval evaluation complete"
diff --git a/src/srtctl/benchmarks/scripts/sa-bench/backend_request_func.py b/src/srtctl/benchmarks/scripts/sa-bench/backend_request_func.py
index dd2cac44..ded56a80 100644
--- a/src/srtctl/benchmarks/scripts/sa-bench/backend_request_func.py
+++ b/src/srtctl/benchmarks/scripts/sa-bench/backend_request_func.py
@@ -511,10 +511,107 @@ def get_model(pretrained_model_name_or_path: str) -> str:
return pretrained_model_name_or_path
+def _resolve_tokenizer_file(model_name_or_path):
+ """Resolve tokenizer.json from a local directory or HF hub cache."""
+ from pathlib import Path
+
+ local_path = Path(model_name_or_path) / "tokenizer.json"
+ if local_path.is_file():
+ return str(local_path)
+ try:
+ from huggingface_hub import hf_hub_download
+
+ return hf_hub_download(model_name_or_path, "tokenizer.json", local_files_only=True)
+ except Exception:
+ return None
+
+
+def _fix_v5_tokenizer_components(tokenizer, model_name_or_path):
+ """Fix pre_tokenizer/decoder when transformers v5 LlamaTokenizerFast overwrites them.
+
+ In transformers v5, LlamaTokenizerFast.__init__ rebuilds the pre_tokenizer
+ and decoder from scratch, discarding the originals from tokenizer.json.
+ This breaks models like DeepSeek-R1 that declare LlamaTokenizerFast but
+ actually use a ByteLevel pre_tokenizer.
+
+ Ported from sglang/python/sglang/srt/utils/hf_transformers_utils.py.
+ """
+ backend = getattr(tokenizer, "_tokenizer", None)
+ if backend is None:
+ return
+
+ try:
+ from tokenizers import Tokenizer as RawTokenizer
+
+ tok_file = _resolve_tokenizer_file(model_name_or_path)
+ if tok_file is None:
+ return
+ raw = RawTokenizer.from_file(tok_file)
+ except Exception:
+ return
+
+ raw_pre = type(raw.pre_tokenizer).__name__ if raw.pre_tokenizer else None
+ loaded_pre = type(backend.pre_tokenizer).__name__ if backend.pre_tokenizer else None
+
+ if raw_pre and loaded_pre and raw_pre != loaded_pre:
+ print(
+ f"[sa-bench] Fixing v5 tokenizer component mismatch for {model_name_or_path}: "
+ f"pre_tokenizer {loaded_pre} -> {raw_pre}, "
+ f"decoder {type(backend.decoder).__name__ if backend.decoder else None} "
+ f"-> {type(raw.decoder).__name__ if raw.decoder else None}",
+ flush=True,
+ )
+ backend.pre_tokenizer = raw.pre_tokenizer
+ backend.decoder = raw.decoder
+
+
+def _load_glm_moe_dsa_tokenizer(pretrained_model_name_or_path: str) -> "PreTrainedTokenizerFast":
+ """Load GLM-Moe-Dsa / GLM-5 tokenizer directly from tokenizer.json.
+
+ Works around incompatibilities when the checkpoint was saved with
+ transformers 5.x (TokenizersBackend / list-style extra_special_tokens).
+ """
+ import json
+ from pathlib import Path
+
+ from tokenizers import Tokenizer as RustTokenizer
+ from transformers import PreTrainedTokenizerFast
+
+ _SAFE_CONFIG_KEYS = (
+ "pad_token", "pad_token_id", "eos_token", "eos_token_id",
+ "bos_token", "bos_token_id", "unk_token", "unk_token_id",
+ "model_max_length", "padding_side", "truncation_side",
+ )
+
+ path = Path(pretrained_model_name_or_path)
+ tokenizer_json = path / "tokenizer.json"
+ if not tokenizer_json.exists():
+ raise FileNotFoundError(
+ f"Expected tokenizer.json at {tokenizer_json}. "
+ "GlmMoeDsaTokenizer loads from tokenizer.json only."
+ )
+
+ rust_tok = RustTokenizer.from_file(str(tokenizer_json))
+ init_kwargs = {}
+ config_path = path / "tokenizer_config.json"
+ if config_path.exists():
+ with open(config_path, encoding="utf-8") as f:
+ config = json.load(f)
+ for key in _SAFE_CONFIG_KEYS:
+ if key in config:
+ init_kwargs[key] = config[key]
+ if "extra_special_tokens" in config:
+ init_kwargs["additional_special_tokens"] = config["extra_special_tokens"]
+
+ return PreTrainedTokenizerFast(tokenizer_object=rust_tok, **init_kwargs)
+
+
def get_tokenizer(
pretrained_model_name_or_path: str,
tokenizer_mode: str = "auto",
trust_remote_code: bool = False,
+ custom_tokenizer: str | None = None,
+ backend: str | None = None,
**kwargs,
) -> PreTrainedTokenizer | PreTrainedTokenizerFast:
if pretrained_model_name_or_path is not None and not os.path.exists(pretrained_model_name_or_path):
@@ -533,12 +630,60 @@ def get_tokenizer(
"to use mistral tokenizer mode."
) from e
return MistralTokenizer.from_pretrained(str(pretrained_model_name_or_path))
- else:
- return AutoTokenizer.from_pretrained(
- pretrained_model_name_or_path,
- trust_remote_code=trust_remote_code,
- **kwargs,
- )
+
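+ # Illustrative call patterns for the custom-tokenizer hook below
+ # (my_pkg.MyTokenizer is a hypothetical dotted import path):
+ # get_tokenizer(path, custom_tokenizer="glm_moe_dsa")
+ # get_tokenizer(path, custom_tokenizer="deepseek_v4", backend="sglang")
+ # get_tokenizer(path, custom_tokenizer="my_pkg.MyTokenizer")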
+ if custom_tokenizer:
+ if custom_tokenizer == "glm_moe_dsa":
+ return _load_glm_moe_dsa_tokenizer(pretrained_model_name_or_path)
+ if custom_tokenizer == "deepseek_v4":
+ if backend == "sglang":
+ # SGLang has no client-side DeepseekV4Tokenizer package; we
+ # vendor sglang's own server-side encoder (encoding_dsv4.py)
+ # under ./tokenizers/ so the sa-bench client renders the
+ # exact same DSML prompt the sglang server builds.
+ from tokenizers.sglang_deepseek_v4 import (
+ SGLangDeepseekV4Tokenizer,
+ )
+ return SGLangDeepseekV4Tokenizer.from_pretrained(
+ str(pretrained_model_name_or_path)
+ )
+ if backend in (None, "vllm"):
+ try:
+ from vllm.tokenizers.deepseek_v4 import DeepseekV4Tokenizer
+ except ImportError as e:
+ raise ImportError(
+ "DeepseekV4Tokenizer requires vllm package.\n"
+ "Please install it with `pip install vllm` "
+ "to use deepseek_v4 tokenizer."
+ ) from e
+ return DeepseekV4Tokenizer.from_pretrained(
+ str(pretrained_model_name_or_path)
+ )
+ raise ValueError(
+ f"custom_tokenizer='deepseek_v4' does not support backend={backend!r}; "
+ "expected 'vllm' or 'sglang'."
+ )
+ from importlib import import_module
+ try:
+ module_path, class_name = custom_tokenizer.rsplit('.', 1)
+ module = import_module(module_path)
+ tokenizer_class = getattr(module, class_name)
+ return tokenizer_class.from_pretrained(
+ pretrained_model_name_or_path,
+ trust_remote_code=trust_remote_code,
+ **kwargs,
+ )
+ except (ValueError, ImportError, AttributeError) as e:
+ raise ValueError(
+ f"Failed to load custom_tokenizer '{custom_tokenizer}'. "
+ "Expected 'glm_moe_dsa', 'deepseek_v4', or 'module.path.ClassName'.") from e
+
+ tokenizer = AutoTokenizer.from_pretrained(
+ pretrained_model_name_or_path,
+ trust_remote_code=trust_remote_code,
+ **kwargs,
+ )
+ _fix_v5_tokenizer_components(tokenizer, pretrained_model_name_or_path)
+ return tokenizer
ASYNC_REQUEST_FUNCS = {
diff --git a/src/srtctl/benchmarks/scripts/sa-bench/bench.sh b/src/srtctl/benchmarks/scripts/sa-bench/bench.sh
index ed907308..acddf754 100644
--- a/src/srtctl/benchmarks/scripts/sa-bench/bench.sh
+++ b/src/srtctl/benchmarks/scripts/sa-bench/bench.sh
@@ -60,6 +60,22 @@ TOTAL_GPUS=${9:-0}
PREFILL_GPUS=${10:-0}
DECODE_GPUS=${11:-0}
RANDOM_RANGE_RATIO=${12:-0.8}
+NUM_PROMPTS_MULT=${13:-10}
+NUM_WARMUP_MULT=${14:-2}
+CUSTOM_TOKENIZER=${15:-}
+USE_CHAT_TEMPLATE=${16:-true}
+
+# Build optional custom tokenizer args
+CUSTOM_TOKENIZER_ARGS=()
+if [ -n "$CUSTOM_TOKENIZER" ]; then
+ CUSTOM_TOKENIZER_ARGS=(--custom-tokenizer "$CUSTOM_TOKENIZER")
+fi
+
+# Build optional chat template args
+CHAT_TEMPLATE_ARGS=()
+if [ "$USE_CHAT_TEMPLATE" = "true" ]; then
+ CHAT_TEMPLATE_ARGS=(--use-chat-template)
+fi
# Parse endpoint into host:port
HOST=$(echo "$ENDPOINT" | sed 's|http://||' | cut -d: -f1)
@@ -119,7 +135,8 @@ for concurrency in "${CONCURRENCY_LIST[@]}"; do
--request-rate 250 \
--percentile-metrics ttft,tpot,itl,e2el \
--max-concurrency "$concurrency" \
- --trust-remote-code
+ --trust-remote-code \
+ "${CUSTOM_TOKENIZER_ARGS[@]}"
- num_prompts=$((concurrency * 10))
+ num_prompts=$((concurrency * NUM_PROMPTS_MULT))
@@ -149,7 +166,8 @@ for concurrency in "${CONCURRENCY_LIST[@]}"; do
--percentile-metrics ttft,tpot,itl,e2el \
--max-concurrency "$concurrency" \
--trust-remote-code \
- --use-chat-template \
+ "${CHAT_TEMPLATE_ARGS[@]}" \
+ "${CUSTOM_TOKENIZER_ARGS[@]}" \
--save-result --result-dir "$result_dir" --result-filename "$result_filename"
set +x
diff --git a/src/srtctl/benchmarks/scripts/sa-bench/benchmark_serving.py b/src/srtctl/benchmarks/scripts/sa-bench/benchmark_serving.py
index 4363ef6e..75b3a97f 100644
--- a/src/srtctl/benchmarks/scripts/sa-bench/benchmark_serving.py
+++ b/src/srtctl/benchmarks/scripts/sa-bench/benchmark_serving.py
@@ -837,6 +837,8 @@ def main(args: argparse.Namespace):
tokenizer_id,
tokenizer_mode=tokenizer_mode,
trust_remote_code=args.trust_remote_code,
+ custom_tokenizer=args.custom_tokenizer,
+ backend=backend,
)
if args.dataset is not None:
@@ -1279,6 +1281,14 @@ def main(args: argparse.Namespace):
'"custom" will use --tokenizer to select the preregistered tokenizer.',
)
+ parser.add_argument(
+ "--custom-tokenizer",
+ type=str,
+ default=None,
+ help="Custom tokenizer to use (e.g., 'glm_moe_dsa' or 'module.path.ClassName'). "
+ "When set, overrides the default tokenizer loading.",
+ )
+
parser.add_argument(
"--served-model-name",
type=str,
diff --git a/src/srtctl/benchmarks/scripts/sa-bench/tokenizers/__init__.py b/src/srtctl/benchmarks/scripts/sa-bench/tokenizers/__init__.py
new file mode 100644
index 00000000..42d334ba
--- /dev/null
+++ b/src/srtctl/benchmarks/scripts/sa-bench/tokenizers/__init__.py
@@ -0,0 +1 @@
+"""Custom tokenizers bundled with sa-bench."""
diff --git a/src/srtctl/benchmarks/scripts/sa-bench/tokenizers/_sglang_encoding_dsv4.py b/src/srtctl/benchmarks/scripts/sa-bench/tokenizers/_sglang_encoding_dsv4.py
new file mode 100644
index 00000000..2212e090
--- /dev/null
+++ b/src/srtctl/benchmarks/scripts/sa-bench/tokenizers/_sglang_encoding_dsv4.py
@@ -0,0 +1,856 @@
+# SPDX-License-Identifier: Apache-2.0
+#
+# Vendored from sgl-project/sglang PR #23600 (currently unmerged).
+# Source: https://github.com/sgl-project/sglang/blob/f5d03db853862c8fb0e805df591bed883a71868b/python/sglang/srt/entrypoints/openai/encoding_dsv4.py
+# Upstream SHA-256: 106b471e559153d93c4af34a4865b2a68b205b72ddd688dbed93dfd86e4b92cb
+#
+# This file is vendored because sglang does not ship a client-side
+# tokenizer package equivalent to vllm.tokenizers.deepseek_v4. Keeping
+# a byte-identical copy here lets the sa-bench client render the exact
+# DeepSeek-V4 DSML prompt that sglang server builds internally, so
+# input_tokens reported by the client match the server's #new-token.
+#
+# When sglang upstream merges an official client-side tokenizer package,
+# this vendored copy can be removed in favor of that import.
+#
+# -------------------- Original sglang file begins below --------------------
+# Adapted from the DeepSeek-V4 release reference implementation.
+"""
+DeepSeek-V4 Encoding
+
+A self-contained implementation for encoding/decoding DeepSeek-V4 chat messages
+with tool calling, thinking mode, and quick instruction task support.
+"""
+
+import copy
+import json
+import re
+from typing import Any, Dict, List, Optional, Tuple, Union
+
+# ============================================================
+# Special Tokens
+# ============================================================
+
+bos_token: str = "<|begin▁of▁sentence|>"
+eos_token: str = "<|end▁of▁sentence|>"
+thinking_start_token: str = "<think>"
+thinking_end_token: str = "</think>"
+dsml_token: str = "|DSML|"
+
+USER_SP_TOKEN = "<|User|>"
+ASSISTANT_SP_TOKEN = "<|Assistant|>"
+LATEST_REMINDER_SP_TOKEN = "<|latest_reminder|>"
+
+# Task special tokens for internal classification tasks
+DS_TASK_SP_TOKENS = {
+ "action": "<|action|>",
+ "query": "<|query|>",
+ "authority": "<|authority|>",
+ "domain": "<|domain|>",
+ "title": "<|title|>",
+ "read_url": "<|read_url|>",
+}
+VALID_TASKS = set(DS_TASK_SP_TOKENS.keys())
+
+# ============================================================
+# Templates
+# ============================================================
+
+system_msg_template: str = "{content}"
+user_msg_template: str = "{content}"
+latest_reminder_msg_template: str = "{content}"
+assistant_msg_template: str = "{reasoning}{content}{tool_calls}" + eos_token
+assistant_msg_wo_eos_template: str = "{reasoning}{content}{tool_calls}"
+thinking_template: str = "{reasoning_content}"
+
+response_format_template: str = (
+ "## Response Format:\n\nYou MUST strictly adhere to the following schema to reply:\n{schema}"
+)
+tool_call_template: str = (
+ '<{dsml_token}invoke name="{name}">\n{arguments}\n{dsml_token}invoke>'
+)
+tool_calls_template = (
+ "<{dsml_token}{tc_block_name}>\n{tool_calls}\n{dsml_token}{tc_block_name}>"
+)
+tool_calls_block_name: str = "tool_calls"
+
+tool_output_template: str = "{content}"
+
+REASONING_EFFORT_MAX = (
+ "Reasoning Effort: Absolute maximum with no shortcuts permitted.\n"
+ "You MUST be very thorough in your thinking and comprehensively decompose the problem to resolve the root cause, rigorously stress-testing your logic against all potential paths, edge cases, and adversarial scenarios.\n"
+ "Explicitly write out your entire deliberation process, documenting every intermediate step, considered alternative, and rejected hypothesis to ensure absolutely no assumption is left unchecked.\n\n"
+)
+
+TOOLS_TEMPLATE = """## Tools
+
+You have access to a set of tools to help answer the user's question. You can invoke tools by writing a "<{dsml_token}tool_calls>" block like the following:
+
+<{dsml_token}tool_calls>
+<{dsml_token}invoke name="$TOOL_NAME">
+<{dsml_token}parameter name="$PARAMETER_NAME" string="true|false">$PARAMETER_VALUE</{dsml_token}parameter>
+...
+{dsml_token}invoke>
+<{dsml_token}invoke name="$TOOL_NAME2">
+...
+{dsml_token}invoke>
+{dsml_token}tool_calls>
+
+String parameters should be specified as is and set `string="true"`. For all other types (numbers, booleans, arrays, objects), pass the value in JSON format and set `string="false"`.
+
+If thinking_mode is enabled (triggered by {thinking_start_token}), you MUST output your complete reasoning inside {thinking_start_token}...{thinking_end_token} BEFORE any tool calls or final response.
+
+Otherwise, output directly after {thinking_end_token} with tool calls or final response.
+
+### Available Tool Schemas
+
+{tool_schemas}
+
+You MUST strictly follow the above defined tool name and parameter schemas to invoke tool calls.
+"""
+
+# ============================================================
+# Utility Functions
+# ============================================================
+
+
+def to_json(value: Any) -> str:
+ """Serialize a value to JSON string."""
+ try:
+ return json.dumps(value, ensure_ascii=False)
+ except:
+ return json.dumps(value, ensure_ascii=True)
+
+
+def tools_from_openai_format(tools):
+ """Extract function definitions from OpenAI-format tool list."""
+ return [tool["function"] for tool in tools]
+
+
+def tool_calls_from_openai_format(tool_calls):
+ """Convert OpenAI-format tool calls to internal format."""
+ return [
+ {
+ "name": tool_call["function"]["name"],
+ "arguments": tool_call["function"]["arguments"],
+ }
+ for tool_call in tool_calls
+ ]
+
+
+def tool_calls_to_openai_format(tool_calls):
+ """Convert internal tool calls to OpenAI format."""
+ return [
+ {
+ "type": "function",
+ "function": {
+ "name": tool_call["name"],
+ "arguments": tool_call["arguments"],
+ },
+ }
+ for tool_call in tool_calls
+ ]
+
+
+def encode_arguments_to_dsml(tool_call: Dict[str, str]) -> str:
+ """
+ Encode tool call arguments into DSML parameter format.
+
+ Args:
+ tool_call: Dict with "name" and "arguments" (JSON string) keys.
+
+ Returns:
+ DSML-formatted parameter string.
+ """
+ p_dsml_template = '<{dsml_token}parameter name="{key}" string="{is_str}">{value}</{dsml_token}parameter>'
+ P_dsml_strs = []
+
+ try:
+ arguments = json.loads(tool_call["arguments"])
+ except Exception as err:
+ arguments = {"arguments": tool_call["arguments"]}
+
+ for k, v in arguments.items():
+ p_dsml_str = p_dsml_template.format(
+ dsml_token=dsml_token,
+ key=k,
+ is_str="true" if isinstance(v, str) else "false",
+ value=v if isinstance(v, str) else to_json(v),
+ )
+ P_dsml_strs.append(p_dsml_str)
+
+ return "\n".join(P_dsml_strs)
+
+
+def decode_dsml_to_arguments(
+ tool_name: str, tool_args: Dict[str, Tuple[str, str]]
+) -> Dict[str, str]:
+ """
+ Decode DSML parameters back to a tool call dict.
+
+ Args:
+ tool_name: Name of the tool.
+ tool_args: Dict mapping param_name -> (value, is_string_flag).
+
+ Returns:
+ Dict with "name" and "arguments" (JSON string) keys.
+ """
+
+ def _decode_value(key: str, value: str, string: str):
+ if string == "true":
+ value = to_json(value)
+ return f"{to_json(key)}: {value}"
+
+ tool_args_json = (
+ "{"
+ + ", ".join(
+ [_decode_value(k, v, string=is_str) for k, (v, is_str) in tool_args.items()]
+ )
+ + "}"
+ )
+ return dict(name=tool_name, arguments=tool_args_json)
+
+
+def render_tools(tools: List[Dict[str, Union[str, Dict[str, Any]]]]) -> str:
+ """
+ Render tool schemas into the system prompt format.
+
+ Args:
+ tools: List of tool schema dicts (each with name, description, parameters).
+
+ Returns:
+ Formatted tools section string.
+ """
+ tools_json = [to_json(t) for t in tools]
+
+ return TOOLS_TEMPLATE.format(
+ tool_schemas="\n".join(tools_json),
+ dsml_token=dsml_token,
+ thinking_start_token=thinking_start_token,
+ thinking_end_token=thinking_end_token,
+ )
+
+
+def find_last_user_index(messages: List[Dict[str, Any]]) -> int:
+ """Find the index of the last user/developer message."""
+ last_user_index = -1
+ for idx in range(len(messages) - 1, -1, -1):
+ if messages[idx].get("role") in ["user", "developer"]:
+ last_user_index = idx
+ break
+ return last_user_index
+
+
+# ============================================================
+# Message Rendering
+# ============================================================
+
+
+def render_message(
+ index: int,
+ messages: List[Dict[str, Any]],
+ thinking_mode: str,
+ drop_thinking: bool = True,
+ reasoning_effort: Optional[str] = None,
+) -> str:
+ """
+ Render a single message at the given index into its encoded string form.
+
+ This is the core function that converts each message in the conversation
+ into the DeepSeek-V4 format.
+
+ Args:
+ index: Index of the message to render.
+ messages: Full list of messages in the conversation.
+ thinking_mode: Either "chat" or "thinking".
+ drop_thinking: Whether to drop reasoning content from earlier turns.
+ reasoning_effort: Optional reasoning effort level ("max", "high", or None).
+
+ Returns:
+ Encoded string for this message.
+ """
+ assert 0 <= index < len(messages)
+ assert thinking_mode in [
+ "chat",
+ "thinking",
+ ], f"Invalid thinking_mode `{thinking_mode}`"
+
+ prompt = ""
+ msg = messages[index]
+ last_user_idx = find_last_user_index(messages)
+
+ role = msg.get("role")
+ content = msg.get("content")
+ tools = msg.get("tools")
+ response_format = msg.get("response_format")
+ tool_calls = msg.get("tool_calls")
+ reasoning_content = msg.get("reasoning_content")
+ wo_eos = msg.get("wo_eos", False)
+
+ if tools:
+ tools = tools_from_openai_format(tools)
+ if tool_calls:
+ tool_calls = tool_calls_from_openai_format(tool_calls)
+
+ # Reasoning effort prefix (only at index 0 in thinking mode with max effort)
+ assert reasoning_effort in [
+ "max",
+ None,
+ "high",
+ ], f"Invalid reasoning effort: {reasoning_effort}"
+ if index == 0 and thinking_mode == "thinking" and reasoning_effort == "max":
+ prompt += REASONING_EFFORT_MAX
+
+ if role == "system":
+ prompt += system_msg_template.format(content=content or "")
+ if tools:
+ prompt += "\n\n" + render_tools(tools)
+ if response_format:
+ prompt += "\n\n" + response_format_template.format(
+ schema=to_json(response_format)
+ )
+
+ elif role == "developer":
+ assert content, f"Invalid message for role `{role}`: {msg}"
+
+ content_developer = USER_SP_TOKEN
+ content_developer += content
+
+ if tools:
+ content_developer += "\n\n" + render_tools(tools)
+ if response_format:
+ content_developer += "\n\n" + response_format_template.format(
+ schema=to_json(response_format)
+ )
+
+ prompt += user_msg_template.format(content=content_developer)
+
+ elif role == "user":
+ prompt += USER_SP_TOKEN
+
+ # Handle content blocks (tool results mixed with text)
+ content_blocks = msg.get("content_blocks")
+ if content_blocks:
+ parts = []
+ for block in content_blocks:
+ block_type = block.get("type")
+ if block_type == "text":
+ parts.append(block.get("text", ""))
+ elif block_type == "tool_result":
+ tool_content = block.get("content", "")
+ if isinstance(tool_content, list):
+ text_parts = []
+ for b in tool_content:
+ if b.get("type") == "text":
+ text_parts.append(b.get("text", ""))
+ else:
+ text_parts.append(f"[Unsupported {b.get('type')}]")
+ tool_content = "\n\n".join(text_parts)
+ parts.append(tool_output_template.format(content=tool_content))
+ else:
+ parts.append(f"[Unsupported {block_type}]")
+ prompt += "\n\n".join(parts)
+ else:
+ prompt += content or ""
+
+ elif role == "latest_reminder":
+ prompt += LATEST_REMINDER_SP_TOKEN + latest_reminder_msg_template.format(
+ content=content
+ )
+
+ elif role == "tool":
+ raise NotImplementedError(
+ "deepseek_v4 merges tool messages into user; please preprocess with merge_tool_messages()"
+ )
+
+ elif role == "assistant":
+ thinking_part = ""
+ tc_content = ""
+
+ if tool_calls:
+ tc_list = [
+ tool_call_template.format(
+ dsml_token=dsml_token,
+ name=tc.get("name"),
+ arguments=encode_arguments_to_dsml(tc),
+ )
+ for tc in tool_calls
+ ]
+ tc_content += "\n\n" + tool_calls_template.format(
+ dsml_token=dsml_token,
+ tool_calls="\n".join(tc_list),
+ tc_block_name=tool_calls_block_name,
+ )
+
+ summary_content = content or ""
+ rc = reasoning_content or ""
+
+ # Check if previous message has a task - if so, this is a task output (no thinking)
+ prev_has_task = index - 1 >= 0 and messages[index - 1].get("task") is not None
+
+ if thinking_mode == "thinking" and not prev_has_task:
+ if not drop_thinking or index > last_user_idx:
+ thinking_part = (
+ thinking_template.format(reasoning_content=rc) + thinking_end_token
+ )
+ else:
+ thinking_part = ""
+
+ if wo_eos:
+ prompt += assistant_msg_wo_eos_template.format(
+ reasoning=thinking_part,
+ content=summary_content,
+ tool_calls=tc_content,
+ )
+ else:
+ prompt += assistant_msg_template.format(
+ reasoning=thinking_part,
+ content=summary_content,
+ tool_calls=tc_content,
+ )
+ else:
+ raise NotImplementedError(f"Unknown role: {role}")
+
+ # Append transition tokens based on what follows
+ if index + 1 < len(messages) and messages[index + 1].get("role") not in [
+ "assistant",
+ "latest_reminder",
+ ]:
+ return prompt
+
+ task = messages[index].get("task")
+ if task is not None:
+ # Task special token for internal classification tasks
+ assert (
+ task in VALID_TASKS
+ ), f"Invalid task: '{task}'. Valid tasks are: {list(VALID_TASKS)}"
+ task_sp_token = DS_TASK_SP_TOKENS[task]
+
+ if task != "action":
+ # Non-action tasks: append task sp token directly after the message
+ prompt += task_sp_token
+ else:
+ # Action task: append Assistant + thinking token + action sp token
+ prompt += ASSISTANT_SP_TOKEN
+ prompt += (
+ thinking_end_token
+ if thinking_mode != "thinking"
+ else thinking_start_token
+ )
+ prompt += task_sp_token
+
+ elif messages[index].get("role") in ["user", "developer"]:
+ # Normal generation: append Assistant + thinking token
+ prompt += ASSISTANT_SP_TOKEN
+ if not drop_thinking and thinking_mode == "thinking":
+ prompt += thinking_start_token
+ elif drop_thinking and thinking_mode == "thinking" and index >= last_user_idx:
+ prompt += thinking_start_token
+ else:
+ prompt += thinking_end_token
+
+ return prompt
+
+
+# ============================================================
+# Preprocessing
+# ============================================================
+
+
+def merge_tool_messages(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
+ """
+ Merge tool messages into the preceding user message using content_blocks format.
+
+ DeepSeek-V4 does not have a standalone "tool" role; instead, tool results
+ are encoded as blocks within user messages.
+
+ This function converts a standard OpenAI-format conversation (with separate
+ "tool" role messages) into V4 format where tool results are merged into
+ user messages.
+
+ Args:
+ messages: List of message dicts in OpenAI format.
+
+ Returns:
+ Processed message list with tool messages merged into user messages.
+ """
+ merged: List[Dict[str, Any]] = []
+
+ for msg in messages:
+ msg = copy.deepcopy(msg)
+ role = msg.get("role")
+
+ if role == "tool":
+ # Convert tool message to a user message with tool_result block
+ tool_block = {
+ "type": "tool_result",
+ "tool_use_id": msg.get("tool_call_id", ""),
+ "content": msg.get("content", ""),
+ }
+ # Merge into previous message if it's already a user (merged tool)
+ if (
+ merged
+ and merged[-1].get("role") == "user"
+ and "content_blocks" in merged[-1]
+ ):
+ merged[-1]["content_blocks"].append(tool_block)
+ else:
+ merged.append(
+ {
+ "role": "user",
+ "content_blocks": [tool_block],
+ }
+ )
+ elif role == "user":
+ text_block = {"type": "text", "text": msg.get("content", "")}
+ if (
+ merged
+ and merged[-1].get("role") == "user"
+ and "content_blocks" in merged[-1]
+ and merged[-1].get("task") is None
+ ):
+ merged[-1]["content_blocks"].append(text_block)
+ else:
+ new_msg = {
+ "role": "user",
+ "content": msg.get("content", ""),
+ "content_blocks": [text_block],
+ }
+ # Preserve extra fields (task, wo_eos, mask, etc.)
+ for key in ("task", "wo_eos", "mask"):
+ if key in msg:
+ new_msg[key] = msg[key]
+ merged.append(new_msg)
+ else:
+ merged.append(msg)
+
+ return merged
+
+
+def sort_tool_results_by_call_order(
+ messages: List[Dict[str, Any]]
+) -> List[Dict[str, Any]]:
+ """
+ Sort tool_result blocks within user messages by the order of tool_calls
+ in the preceding assistant message.
+
+ Args:
+ messages: Preprocessed message list (after merge_tool_messages).
+
+ Returns:
+ Message list with sorted tool result blocks.
+ """
+ last_tool_call_order: Dict[str, int] = {}
+
+ for msg in messages:
+ role = msg.get("role")
+ if role == "assistant" and msg.get("tool_calls"):
+ last_tool_call_order = {}
+ for idx, tc in enumerate(msg["tool_calls"]):
+ tc_id = tc.get("id") or tc.get("function", {}).get("id", "")
+ if tc_id:
+ last_tool_call_order[tc_id] = idx
+
+ elif role == "user" and msg.get("content_blocks"):
+ tool_blocks = [
+ b for b in msg["content_blocks"] if b.get("type") == "tool_result"
+ ]
+ if len(tool_blocks) > 1 and last_tool_call_order:
+ sorted_blocks = sorted(
+ tool_blocks,
+ key=lambda b: last_tool_call_order.get(b.get("tool_use_id", ""), 0),
+ )
+ sorted_idx = 0
+ new_blocks = []
+ for block in msg["content_blocks"]:
+ if block.get("type") == "tool_result":
+ new_blocks.append(sorted_blocks[sorted_idx])
+ sorted_idx += 1
+ else:
+ new_blocks.append(block)
+ msg["content_blocks"] = new_blocks
+
+ return messages
+
+
+# ============================================================
+# Main Encoding Function
+# ============================================================
+
+
+def encode_messages(
+ messages: List[Dict[str, Any]],
+ thinking_mode: str,
+ context: Optional[List[Dict[str, Any]]] = None,
+ drop_thinking: bool = True,
+ add_default_bos_token: bool = True,
+ reasoning_effort: Optional[str] = None,
+) -> str:
+ """
+ Encode a list of messages into the DeepSeek-V4 prompt format.
+
+ This is the main entry point for encoding conversations. It handles:
+ - BOS token insertion
+ - Thinking mode with optional reasoning content dropping
+ - Tool message merging into user messages
+ - Multi-turn conversation context
+
+ Args:
+ messages: List of message dicts to encode.
+ thinking_mode: Either "chat" or "thinking".
+ context: Optional preceding context messages (already encoded prefix).
+ drop_thinking: If True, drop reasoning_content from earlier assistant turns
+ (only keep reasoning for messages after the last user message).
+ add_default_bos_token: Whether to prepend BOS token at conversation start.
+ reasoning_effort: Optional reasoning effort level ("max", "high", or None).
+
+ Returns:
+ The encoded prompt string.
+ """
+ context = context if context else []
+
+ # Preprocess: merge tool messages and sort tool results
+ messages = merge_tool_messages(messages)
+ messages = sort_tool_results_by_call_order(context + messages)[len(context) :]
+ if context:
+ context = merge_tool_messages(context)
+ context = sort_tool_results_by_call_order(context)
+
+ full_messages = context + messages
+
+ prompt = bos_token if add_default_bos_token and len(context) == 0 else ""
+
+ # Resolve drop_thinking: if any message has tools defined, don't drop thinking
+ effective_drop_thinking = drop_thinking
+ if any(m.get("tools") for m in full_messages):
+ effective_drop_thinking = False
+
+ if thinking_mode == "thinking" and effective_drop_thinking:
+ full_messages = _drop_thinking_messages(full_messages)
+ # After dropping, recalculate how many messages to render
+ # (context may have shrunk too)
+ num_to_render = len(full_messages) - len(_drop_thinking_messages(context))
+ context_len = len(full_messages) - num_to_render
+ else:
+ num_to_render = len(messages)
+ context_len = len(context)
+
+ for idx in range(num_to_render):
+ prompt += render_message(
+ idx + context_len,
+ full_messages,
+ thinking_mode=thinking_mode,
+ drop_thinking=effective_drop_thinking,
+ reasoning_effort=reasoning_effort,
+ )
+
+ return prompt
+
+
+def _drop_thinking_messages(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
+ """
+ Drop reasoning_content and non-essential messages before the last user message.
+
+ Behavior:
+ - Messages with role in ["user", "system", "tool", "latest_reminder"] are always kept.
+ - Messages at or after the last user index are always kept.
+ - Assistant messages before the last user get reasoning_content removed.
+ - Developer messages before the last user are dropped entirely.
+ """
+ last_user_idx = find_last_user_index(messages)
+ result = []
+ keep_roles = {"user", "system", "tool", "latest_reminder", "direct_search_results"}
+
+ for idx, msg in enumerate(messages):
+ role = msg.get("role")
+ if role in keep_roles or idx >= last_user_idx:
+ result.append(msg)
+ elif role == "assistant":
+ msg = copy.copy(msg)
+ msg.pop("reasoning_content", None)
+ result.append(msg)
+ # developer and other roles before last_user_idx are dropped
+
+ return result
+
+
+# ============================================================
+# Parsing (Decoding model output)
+# ============================================================
+
+
+def _read_until_stop(
+ index: int, text: str, stop: List[str]
+) -> Tuple[int, str, Optional[str]]:
+ """
+ Read text from index until one of the stop strings is found.
+
+ Returns:
+ Tuple of (new_index, content_before_stop, matched_stop_string_or_None).
+ """
+ min_pos = len(text)
+ matched_stop = None
+
+ for s in stop:
+ pos = text.find(s, index)
+ if pos != -1 and pos < min_pos:
+ min_pos = pos
+ matched_stop = s
+
+ if matched_stop:
+ content = text[index:min_pos]
+ return min_pos + len(matched_stop), content, matched_stop
+ else:
+ content = text[index:]
+ return len(text), content, None
+
+
+def parse_tool_calls(
+ index: int, text: str
+) -> Tuple[int, Optional[str], List[Dict[str, str]]]:
+ """
+ Parse DSML tool calls from text starting at the given index.
+
+ Args:
+ index: Starting position in text.
+ text: The full text to parse.
+
+ Returns:
+ Tuple of (new_index, last_stop_token, list_of_tool_call_dicts).
+ Each tool call dict has "name" and "arguments" keys.
+ """
+ tool_calls: List[Dict[str, Any]] = []
+ stop_token = None
+ tool_calls_end_token = f"{dsml_token}{tool_calls_block_name}>"
+
+ while index < len(text):
+ index, _, stop_token = _read_until_stop(
+ index, text, [f"<{dsml_token}invoke", tool_calls_end_token]
+ )
+ if _ != ">\n":
+ raise ValueError(f"Tool call format error: expected '>\\n' but got '{_}'")
+
+ if stop_token == tool_calls_end_token:
+ break
+
+ if stop_token is None:
+ raise ValueError("Missing special token in tool calls")
+
+ index, tool_name_content, stop_token = _read_until_stop(
+ index, text, [f"<{dsml_token}parameter", f"{dsml_token}invoke"]
+ )
+
+ p_tool_name = re.findall(
+ r'^\s*name="(.*?)">\n$', tool_name_content, flags=re.DOTALL
+ )
+ if len(p_tool_name) != 1:
+ raise ValueError(f"Tool name format error: '{tool_name_content}'")
+ tool_name = p_tool_name[0]
+
+ tool_args: Dict[str, Tuple[str, str]] = {}
+ while stop_token == f"<{dsml_token}parameter":
+ index, param_content, stop_token = _read_until_stop(
+ index, text, [f"/{dsml_token}parameter"]
+ )
+
+ param_kv = re.findall(
+ r'^ name="(.*?)" string="(true|false)">(.*?)<$',
+ param_content,
+ flags=re.DOTALL,
+ )
+ if len(param_kv) != 1:
+ raise ValueError(f"Parameter format error: '{param_content}'")
+ param_name, string, param_value = param_kv[0]
+
+ if param_name in tool_args:
+ raise ValueError(f"Duplicate parameter name: '{param_name}'")
+ tool_args[param_name] = (param_value, string)
+
+ index, content, stop_token = _read_until_stop(
+ index, text, [f"<{dsml_token}parameter", f"{dsml_token}invoke"]
+ )
+ if content != ">\n":
+ raise ValueError(
+ f"Parameter format error: expected '>\\n' but got '{content}'"
+ )
+
+ tool_call = decode_dsml_to_arguments(tool_name=tool_name, tool_args=tool_args)
+ tool_calls.append(tool_call)
+
+ return index, stop_token, tool_calls
+
+
+def parse_message_from_completion_text(text: str, thinking_mode: str) -> Dict[str, Any]:
+ """
+ Parse a model completion text into a structured assistant message.
+
+ This function takes the raw text output from the model (a single assistant turn)
+ and extracts:
+ - reasoning_content (thinking block)
+ - content (summary/response)
+ - tool_calls (if any)
+
+ NOTE: This function is designed to parse only correctly formatted strings and
+ will raise ValueError for malformed output.
+
+ Args:
+ text: The raw completion text (including EOS token).
+ thinking_mode: Either "chat" or "thinking".
+
+ Returns:
+ Dict with keys: "role", "content", "reasoning_content", "tool_calls".
+ tool_calls are in OpenAI format.
+ """
+ summary_content, reasoning_content, tool_calls = "", "", []
+ index, stop_token = 0, None
+ tool_calls_start_token = f"\n\n<{dsml_token}{tool_calls_block_name}"
+
+ is_thinking = thinking_mode == "thinking"
+ is_tool_calling = False
+
+ if is_thinking:
+ index, content_delta, stop_token = _read_until_stop(
+ index, text, [thinking_end_token, tool_calls_start_token]
+ )
+ reasoning_content = content_delta
+ assert (
+ stop_token == thinking_end_token
+ ), "Invalid thinking format: missing "
+
+ index, content_delta, stop_token = _read_until_stop(
+ index, text, [eos_token, tool_calls_start_token]
+ )
+ summary_content = content_delta
+ if stop_token == tool_calls_start_token:
+ is_tool_calling = True
+ else:
+ assert stop_token == eos_token, "Invalid format: missing EOS token"
+
+ if is_tool_calling:
+ index, stop_token, tool_calls = parse_tool_calls(index, text)
+
+ index, tool_ends_text, stop_token = _read_until_stop(index, text, [eos_token])
+ assert not tool_ends_text, "Unexpected content after tool calls"
+
+ assert len(text) == index and stop_token in [
+ eos_token,
+ None,
+ ], "Unexpected content at end"
+
+ for sp_token in [
+ bos_token,
+ eos_token,
+ thinking_start_token,
+ thinking_end_token,
+ dsml_token,
+ ]:
+ assert (
+ sp_token not in summary_content and sp_token not in reasoning_content
+ ), f"Unexpected special token '{sp_token}' in content"
+
+ return {
+ "role": "assistant",
+ "content": summary_content,
+ "reasoning_content": reasoning_content,
+ "tool_calls": tool_calls_to_openai_format(tool_calls),
+ }
diff --git a/src/srtctl/benchmarks/scripts/sa-bench/tokenizers/sglang_deepseek_v4.py b/src/srtctl/benchmarks/scripts/sa-bench/tokenizers/sglang_deepseek_v4.py
new file mode 100644
index 00000000..595e7b2f
--- /dev/null
+++ b/src/srtctl/benchmarks/scripts/sa-bench/tokenizers/sglang_deepseek_v4.py
@@ -0,0 +1,125 @@
+# SPDX-License-Identifier: Apache-2.0
+"""
+SGLang-side DeepSeek-V4 tokenizer for sa-bench.
+
+Mirrors what sglang's ``serving_chat._apply_jinja_template`` does
+when ``chat_encoding_spec == "dsv4"`` (see
+sgl-project/sglang PR #23600), so that the tokens counted on the
+sa-bench client side match the tokens the sglang server actually
+feeds into the model.
+
+The vllm counterpart lives in ``vllm.tokenizers.deepseek_v4``; sglang
+has no equivalent client-side package, so we vendor the rendering
+logic from ``encoding_dsv4.py`` in ``_sglang_encoding_dsv4.py``.
+"""
+from __future__ import annotations
+
+from typing import Any, Dict, List, Optional
+
+from transformers import AutoTokenizer
+
+from ._sglang_encoding_dsv4 import encode_messages as _encode_messages
+
+
+class SGLangDeepseekV4Tokenizer:
+ """Client-side DeepSeek-V4 tokenizer matching sglang server behavior.
+
+ The server-side call chain (sglang PR #23600) is:
+
+ messages = request.messages # OpenAI-style
+ if messages[0]["role"] != "system":
+ messages.insert(0, {"role": "system", "content": ""})
+ real_input = encoding_dsv4.encode_messages(
+ messages,
+ thinking_mode="chat", # default
+ reasoning_effort=None, # "medium" dropped
+ )
+ prompt_ids = tokenizer.encode(real_input)
+
+ We reproduce the exact same steps here.
+ """
+
+ def __init__(self, hf_tokenizer):
+ self._hf = hf_tokenizer
+
+ @classmethod
+ def from_pretrained(cls, pretrained_model_name_or_path: str, **kwargs):
+ kwargs.setdefault("trust_remote_code", True)
+ hf = AutoTokenizer.from_pretrained(
+ pretrained_model_name_or_path, **kwargs
+ )
+ return cls(hf)
+
+ def _render_prompt(
+ self,
+ messages: List[Dict[str, Any]],
+ thinking_mode: str = "chat",
+ reasoning_effort: Optional[str] = None,
+ ) -> str:
+ msgs = [dict(m) for m in messages]
+ if not msgs or msgs[0].get("role") != "system":
+ msgs.insert(0, {"role": "system", "content": ""})
+
+ if reasoning_effort not in ("max", "high"):
+ reasoning_effort = None
+
+ return _encode_messages(
+ msgs,
+ thinking_mode=thinking_mode,
+ reasoning_effort=reasoning_effort,
+ )
+
+ def apply_chat_template(
+ self,
+ messages: List[Dict[str, Any]],
+ tokenize: bool = True,
+ add_generation_prompt: bool = True, # noqa: ARG002 (encoder always adds the <|Assistant|>... tail)
+ tools: Optional[List[Dict[str, Any]]] = None,
+ thinking: bool = False,
+ reasoning_effort: Optional[str] = None,
+ **_: Any,
+ ):
+ msgs = [dict(m) for m in messages]
+ if tools:
+ if not msgs or msgs[0].get("role") != "system":
+ msgs.insert(0, {"role": "system", "content": ""})
+ msgs[0]["tools"] = list(tools)
+
+ thinking_mode = "thinking" if thinking else "chat"
+ prompt = self._render_prompt(
+ msgs,
+ thinking_mode=thinking_mode,
+ reasoning_effort=reasoning_effort,
+ )
+ if not tokenize:
+ return prompt
+ return self._hf.encode(prompt, add_special_tokens=False)
+
+ def encode(self, text, **kwargs):
+ return self._hf.encode(text, **kwargs)
+
+ def decode(self, token_ids, **kwargs):
+ return self._hf.decode(token_ids, **kwargs)
+
+ def __len__(self):
+ return len(self._hf)
+
+ @property
+ def vocab_size(self):
+ return self._hf.vocab_size
+
+ @property
+ def eos_token_id(self):
+ return self._hf.eos_token_id
+
+ @property
+ def bos_token_id(self):
+ return self._hf.bos_token_id
+
+ @property
+ def pad_token_id(self):
+ return self._hf.pad_token_id
+
+ def __getattr__(self, name):
+ return getattr(self._hf, name)
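+
+
+# Illustrative usage (mirrors what sa-bench's get_tokenizer does for
+# --custom-tokenizer deepseek_v4 with backend=sglang; the model path is
+# hypothetical):
+# tok = SGLangDeepseekV4Tokenizer.from_pretrained("/models/deepseek-v4")
+# prompt = tok.apply_chat_template(
+# [{"role": "user", "content": "hi"}], tokenize=False, thinking=True)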
diff --git a/src/srtctl/cli/do_sweep.py b/src/srtctl/cli/do_sweep.py
index ff6eaa91..77b79ac5 100644
--- a/src/srtctl/cli/do_sweep.py
+++ b/src/srtctl/cli/do_sweep.py
@@ -18,6 +18,7 @@
import os
import sys
import threading
+import time
from dataclasses import dataclass
from pathlib import Path
@@ -179,6 +180,118 @@ def _print_connection_info(self) -> None:
logger.info("=" * 60)
logger.info("")
+ def _run_post_eval(self, stop_event: threading.Event) -> int:
+ """Run lm-eval after the main benchmark completes (or directly in eval-only mode)."""
+ from srtctl.benchmarks import get_runner
+ from srtctl.core.health import wait_for_model
+
+ # In eval-only mode the benchmark health check was skipped, so do the
+ # full model-ready wait here. In post-benchmark mode a quick port
+ # check is sufficient since the server already served traffic.
+ if os.environ.get("EVAL_ONLY", "false").lower() == "true":
+ r = self.config.resources
+ n_prefill = 0 if r.num_agg > 0 else r.num_prefill
+ n_decode = r.num_agg if r.num_agg > 0 else r.num_decode
+ hc = self.config.health_check
+ logger.info("EVAL_ONLY: Waiting for server health before eval...")
+ if not wait_for_model(
+ host=self.runtime.nodes.head,
+ port=8000,
+ n_prefill=n_prefill,
+ n_decode=n_decode,
+ poll_interval=float(hc.interval_seconds),
+ timeout=float(hc.max_attempts * hc.interval_seconds),
+ report_every=60.0,
+ frontend_type=self.config.frontend.type,
+ stop_event=stop_event,
+ ):
+ logger.error("Server did not become healthy for eval")
+ return 1
+ else:
+ if not wait_for_port(self.runtime.nodes.head, 8000, timeout=30):
+ logger.error("Server health check failed before eval - skipping")
+ return 1
+
+ try:
+ runner = get_runner("lm-eval")
+ except ValueError as e:
+ logger.error("lm-eval runner not available: %s", e)
+ return 1
+
+ eval_log = self.runtime.log_dir / "eval.out"
+ cmd = runner.build_command(self.config, self.runtime)
+
+ logger.info("Eval command: %s", " ".join(cmd))
+ logger.info("Eval log: %s", eval_log)
+
+ # Pass through eval-related env vars. InferenceX writes multi-node
+ # metadata from these variables in append_lm_eval_summary().
+ env_to_set = {}
+ for var in [
+ "RUN_EVAL",
+ "EVAL_ONLY",
+ "IS_MULTINODE",
+ "FRAMEWORK",
+ "PRECISION",
+ "MODEL_PREFIX",
+ "RUNNER_TYPE",
+ "RESULT_FILENAME",
+ "SPEC_DECODING",
+ "ISL",
+ "OSL",
+ "MODEL",
+ "MODEL_PATH",
+ "MAX_MODEL_LEN",
+ "EVAL_MAX_MODEL_LEN",
+ "PREFILL_TP",
+ "PREFILL_EP",
+ "PREFILL_DP_ATTN",
+ "PREFILL_NUM_WORKERS",
+ "DECODE_TP",
+ "DECODE_EP",
+ "DECODE_DP_ATTN",
+ "DECODE_NUM_WORKERS",
+ ]:
+ val = os.environ.get(var)
+ if val:
+ env_to_set[var] = val
+
+ # Set MODEL_NAME to the served model name so lm-eval uses the correct
+ # name for API requests. Without this, benchmark_lib.sh falls back to
+ # $MODEL (the HuggingFace ID) which the server doesn't recognize.
+ env_to_set["MODEL_NAME"] = self.config.served_model_name
+ logger.info("Eval MODEL_NAME: %s", env_to_set["MODEL_NAME"])
+
+ # Use EVAL_CONC from workflow (median chosen by InferenceX mark_eval_entries),
+ # falling back to max of benchmark concurrency list.
+ eval_conc = os.environ.get("EVAL_CONC")
+ if eval_conc:
+ env_to_set["EVAL_CONC"] = eval_conc
+ logger.info("Eval concurrency (from workflow): %s", eval_conc)
+ else:
+ conc_list = self.config.benchmark.get_concurrency_list()
+ if conc_list:
+ env_to_set["EVAL_CONC"] = str(max(conc_list))
+ logger.info("Eval concurrency (max of %s): %s", conc_list, env_to_set["EVAL_CONC"])
+
+ proc = start_srun_process(
+ command=cmd,
+ nodelist=[self.runtime.nodes.head],
+ output=str(eval_log),
+ container_image=str(self.runtime.container_image),
+ container_mounts=self.runtime.container_mounts,
+ env_to_set=env_to_set,
+ )
+
+ while proc.poll() is None:
+ if stop_event.is_set():
+ logger.info("Stop requested, terminating eval")
+ proc.terminate()
+ return 1
+ time.sleep(1)
+
+ return proc.returncode or 0
+
def run(self) -> int:
"""Run the complete sweep."""
# Create status reporter (fire-and-forget, no-op if not configured)
@@ -221,8 +334,27 @@ def run(self) -> int:
self._print_connection_info()
- # Stage 4: Benchmark (status reported AFTER health check passes)
- exit_code = self.run_benchmark(registry, stop_event, reporter)
+ if os.environ.get("EVAL_ONLY", "false").lower() == "true":
+ reporter.report(JobStatus.BENCHMARK, JobStage.BENCHMARK, "Running eval-only evaluation")
+ logger.info("EVAL_ONLY=true: Skipping benchmark stage and running lm-eval evaluation...")
+ exit_code = self._run_post_eval(stop_event)
+ if exit_code != 0:
+ logger.error("Eval-only evaluation failed with exit code %d", exit_code)
+ else:
+ logger.info("Eval-only evaluation completed successfully")
+ else:
+ # Stage 4: Benchmark (status reported AFTER health check passes)
+ exit_code = self.run_benchmark(registry, stop_event, reporter)
+
+ # Stage 5: Post-benchmark eval (optional, non-fatal)
+ if os.environ.get("RUN_EVAL", "false").lower() == "true" and exit_code == 0:
+ reporter.report(JobStatus.BENCHMARK, JobStage.BENCHMARK, "Running post-benchmark evaluation")
+ logger.info("RUN_EVAL=true: Running post-benchmark lm-eval evaluation...")
+ eval_exit = self._run_post_eval(stop_event)
+ if eval_exit != 0:
+ logger.warning("Eval failed with exit code %d (benchmark result is still valid)", eval_exit)
+ else:
+ logger.info("Post-benchmark eval completed successfully")
except Exception as e:
logger.exception("Error during sweep: %s", e)
diff --git a/src/srtctl/core/config.py b/src/srtctl/core/config.py
index 8cea4e17..f30fc7fc 100644
--- a/src/srtctl/core/config.py
+++ b/src/srtctl/core/config.py
@@ -141,6 +141,20 @@ def resolve_config_with_defaults(user_config: dict[str, Any], cluster_config: di
config["reporting"] = cluster_config["reporting"]
logger.debug("Applied cluster reporting config")
+ # Resolve extra_mount host path aliases through model_paths
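+ # Example (hypothetical alias): with model_paths = {"traces": "/lustre/traces"},
+ # an extra_mount entry "traces:/traces" resolves to "/lustre/traces:/traces";
+ # entries whose host path is not a known alias pass through unchanged.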
+ extra_mounts = config.get("extra_mount", [])
+ if model_paths and extra_mounts:
+ resolved_mounts = []
+ for mount_spec in extra_mounts:
+ host_path, container_path = mount_spec.split(":", 1)
+ if host_path in model_paths:
+ resolved_host = model_paths[host_path]
+ resolved_mounts.append(f"{resolved_host}:{container_path}")
+ logger.debug(f"Resolved extra_mount alias '{host_path}' -> '{resolved_host}'")
+ else:
+ resolved_mounts.append(mount_spec)
+ config["extra_mount"] = resolved_mounts
+
# Resolve frontend nginx_container alias
frontend = config.get("frontend", {})
nginx_container = frontend.get("nginx_container", "")
diff --git a/src/srtctl/core/runtime.py b/src/srtctl/core/runtime.py
index 3e68bdd5..31195ed3 100644
--- a/src/srtctl/core/runtime.py
+++ b/src/srtctl/core/runtime.py
@@ -231,6 +231,14 @@ def from_config(
host_path, container_path = mount_spec.split(":", 1)
container_mounts[Path(host_path).resolve()] = Path(container_path)
+ # Mount InferenceX workspace if available (for lm-eval support).
+ # Skip exists() check: the orchestrator runs on the SLURM head node
+ # where the GH Actions workspace path may not be directly accessible,
+ # but it is accessible from the compute nodes via the shared filesystem.
+ infmax_ws = os.environ.get("INFMAX_WORKSPACE")
+ if infmax_ws:
+ container_mounts[Path(infmax_ws)] = Path("/infmax-workspace")
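+ # e.g. INFMAX_WORKSPACE=/actions/runner/workspace (the path used in the
+ # tests) is mounted at /infmax-workspace inside the job containers.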
+
# Add FormattablePath mounts from config.container_mounts
# These need to be expanded with the runtime context, so we create a
# temporary context first and then update
diff --git a/src/srtctl/core/schema.py b/src/srtctl/core/schema.py
index 97547fec..c535be39 100644
--- a/src/srtctl/core/schema.py
+++ b/src/srtctl/core/schema.py
@@ -539,6 +539,12 @@ class BenchmarkConfig:
ttft_threshold_ms: int | None = None # Goodput TTFT threshold in ms (default: 2000)
itl_threshold_ms: int | None = None # Goodput ITL threshold in ms (default: 25)
random_range_ratio: float | None = None # Random input/output length range ratio (default: 0.8)
+ num_prompts_mult: int | None = None # Multiplier for num_prompts = concurrency * mult (default: 10)
+ num_warmup_mult: int | None = None # Multiplier for warmup prompts = concurrency * mult (default: 2)
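+ # e.g. concurrency 512 with num_prompts_mult=10 gives num_prompts=5120.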
+ # Trace replay benchmark fields (uses aiperf with mooncake_trace dataset type)
+ trace_file: str | None = None # Path to trace JSONL file (container path, e.g., /traces/dataset.jsonl)
+ custom_tokenizer: str | None = None # Custom tokenizer class (e.g., "module.path.ClassName")
+ use_chat_template: bool = True # Pass --use-chat-template to benchmark (default: true)
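+ # Example benchmark fields for trace replay (sketch; tokenizer class is hypothetical):
+ #   trace_file: /traces/dataset.jsonl
+ #   custom_tokenizer: my_pkg.tokenizers.MyTokenizer
+ #   use_chat_template: false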
def get_concurrency_list(self) -> list[int]:
if self.concurrencies is None:
@@ -711,7 +717,7 @@ def get_install_commands(self) -> str:
if self.version is not None:
return (
f"echo 'Installing dynamo {self.version}...' && "
- f"pip install --break-system-packages --quiet ai-dynamo-runtime=={self.version} ai-dynamo=={self.version} && "
+ f"pip install --break-system-packages --quiet --extra-index-url https://pypi.nvidia.com ai-dynamo-runtime=={self.version} ai-dynamo=={self.version} && "
f"echo 'Dynamo {self.version} installed'"
)
@@ -719,8 +725,8 @@ def get_install_commands(self) -> str:
git_ref = self.hash if self.hash else "HEAD"
checkout_cmd = f"git checkout {self.hash}" if self.hash else ""
- return (
- f"echo 'Installing dynamo from source ({git_ref})...' && "
+ # Install path for SGLang containers, unchanged from the original implementation
+ sglang = (
"apt-get update -qq && apt-get install -y -qq libclang-dev > /dev/null 2>&1 && "
"cd /sgl-workspace/ && "
"git clone https://github.com/ai-dynamo/dynamo.git && "
@@ -736,6 +742,34 @@ def get_install_commands(self) -> str:
f"echo 'Dynamo installed from source ({git_ref})'"
)
+ # Portable path for non-SGLang containers (vLLM, etc.)
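+ # Installs rust/maturin on demand, builds the ai-dynamo-runtime wheel under
+ # /tmp, and installs the cloned repo in editable mode.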
+ portable = (
+ "if ! command -v cargo &> /dev/null || ! command -v maturin &> /dev/null; then "
+ "apt-get update -qq && apt-get install -y -qq git curl libclang-dev protobuf-compiler > /dev/null 2>&1 && "
+ "if ! command -v cargo &> /dev/null; then "
+ "curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y && source $HOME/.cargo/env; fi && "
+ "if ! command -v maturin &> /dev/null; then "
+ "pip install --break-system-packages maturin; fi; fi && "
+ "ORIG_DIR=$(pwd) && rm -rf /tmp/dynamo_build && mkdir -p /tmp/dynamo_build && cd /tmp/dynamo_build && "
+ "git clone https://github.com/ai-dynamo/dynamo.git && "
+ "cd dynamo && "
+ f"{checkout_cmd + ' && ' if checkout_cmd else ''}"
+ "cd lib/bindings/python/ && "
+ 'export RUSTFLAGS="${RUSTFLAGS:-} -C target-cpu=native --cfg tokio_unstable" && '
+ "rm -f /tmp/ai_dynamo_runtime*.whl && "
+ "maturin build -o /tmp && "
+ "pip install --break-system-packages /tmp/ai_dynamo_runtime*.whl --force-reinstall && "
+ "cd /tmp/dynamo_build/dynamo/ && "
+ "pip install --break-system-packages -e . && "
+ "cd $ORIG_DIR && "
+ f"echo 'Dynamo installed from source ({git_ref})'"
+ )
+
+ return (
+ f"echo 'Installing dynamo from source ({git_ref})...' && "
+ f"if [ -d /sgl-workspace ]; then {sglang}; else {portable}; fi"
+ )
+
Schema: ClassVar[type[Schema]] = Schema
diff --git a/tests/test_benchmarks.py b/tests/test_benchmarks.py
index 261020c7..c15759b2 100644
--- a/tests/test_benchmarks.py
+++ b/tests/test_benchmarks.py
@@ -193,6 +193,62 @@ def test_build_command_includes_tokenizer_path(self):
assert cmd[7] == "/model" # tokenizer path
+class TestLMEvalRunner:
+ """Test LM-Eval runner."""
+
+ def test_registry_includes_lm_eval(self):
+ """lm-eval is in the benchmark registry."""
+ assert "lm-eval" in list_benchmarks()
+
+ def test_get_runner(self):
+ """Can get lm-eval runner."""
+ runner = get_runner("lm-eval")
+ assert runner.name == "lm-eval"
+
+ def test_script_path(self):
+ """Script path points to lm-eval bench.sh."""
+ runner = get_runner("lm-eval")
+ assert "lm-eval/bench.sh" in runner.script_path
+
+ def test_local_script_dir(self):
+ """Local script dir points to lm-eval scripts."""
+ runner = get_runner("lm-eval")
+ assert runner.local_script_dir.endswith("lm-eval")
+
+ def test_validate_config_always_valid(self):
+ """lm-eval accepts any config."""
+ from srtctl.benchmarks.lm_eval import LMEvalRunner
+ from srtctl.core.schema import BenchmarkConfig, ModelConfig, ResourceConfig, SrtConfig
+
+ runner = LMEvalRunner()
+ config = SrtConfig(
+ name="test",
+ model=ModelConfig(path="/model", container="/image", precision="fp4"),
+ resources=ResourceConfig(gpu_type="h100"),
+ benchmark=BenchmarkConfig(type="sa-bench"),
+ )
+ assert runner.validate_config(config) == []
+
+ def test_build_command(self):
+ """build_command returns correct bash command."""
+ from unittest.mock import MagicMock
+
+ from srtctl.benchmarks.lm_eval import LMEvalRunner
+
+ runner = LMEvalRunner()
+ runtime = MagicMock()
+ runtime.frontend_port = 8000
+
+ config = MagicMock()
+ cmd = runner.build_command(config, runtime)
+ assert cmd == [
+ "bash",
+ "/srtctl-benchmarks/lm-eval/bench.sh",
+ "http://localhost:8000",
+ "/infmax-workspace",
+ ]
+
+
class TestScriptsExist:
"""Test that benchmark scripts exist."""
@@ -209,3 +265,365 @@ def test_mmlu_script_exists(self):
"""MMLU script exists."""
script = SCRIPTS_DIR / "mmlu" / "bench.sh"
assert script.exists()
+
+
+class TestRunPostEval:
+ """Test SweepOrchestrator._run_post_eval method."""
+
+ @staticmethod
+ def _make_orchestrator():
+ """Create a SweepOrchestrator with mocked config/runtime."""
+ from pathlib import Path
+
+ from srtctl.cli.do_sweep import SweepOrchestrator
+ from srtctl.core.runtime import Nodes, RuntimeContext
+ from srtctl.core.schema import (
+ BenchmarkConfig,
+ FrontendConfig,
+ HealthCheckConfig,
+ ModelConfig,
+ ResourceConfig,
+ SrtConfig,
+ )
+
+ config = SrtConfig(
+ name="test",
+ model=ModelConfig(path="/model/test-model", container="/image", precision="fp4"),
+ resources=ResourceConfig(
+ gpu_type="h100",
+ gpus_per_node=8,
+ prefill_nodes=1,
+ decode_nodes=2,
+ prefill_workers=1,
+ decode_workers=2,
+ ),
+ benchmark=BenchmarkConfig(type="sa-bench", isl=1024, osl=1024, concurrencies="128x256x512"),
+ health_check=HealthCheckConfig(max_attempts=3, interval_seconds=1),
+ frontend=FrontendConfig(type="dynamo"),
+ )
+ runtime = RuntimeContext(
+ job_id="12345",
+ run_name="test-run",
+ nodes=Nodes(head="node0", bench="node0", infra="node0", worker=("node0", "node1", "node2")),
+ head_node_ip="10.0.0.1",
+ infra_node_ip="10.0.0.1",
+ log_dir=Path("/tmp/logs"),
+ model_path=Path("/model/test-model"),
+ container_image=Path("/path/to/container.sqsh"),
+ gpus_per_node=8,
+ network_interface=None,
+ container_mounts={},
+ environment={},
+ )
+ return SweepOrchestrator(config=config, runtime=runtime)
+
+ def test_post_benchmark_port_check_fails(self):
+ """Returns 1 when port check fails in post-benchmark mode."""
+ import os
+ import threading
+ from unittest.mock import patch
+
+ orch = self._make_orchestrator()
+ stop = threading.Event()
+ with patch.dict(os.environ, {"EVAL_ONLY": "false"}, clear=False):
+ with patch("srtctl.cli.do_sweep.wait_for_port", return_value=False):
+ result = orch._run_post_eval(stop)
+ assert result == 1
+
+ def test_eval_only_health_check_fails(self):
+ """Returns 1 when health check fails in eval-only mode."""
+ import os
+ import threading
+ from unittest.mock import patch
+
+ orch = self._make_orchestrator()
+ stop = threading.Event()
+ with patch.dict(os.environ, {"EVAL_ONLY": "true"}, clear=False):
+ with patch("srtctl.core.health.wait_for_model", return_value=False):
+ result = orch._run_post_eval(stop)
+ assert result == 1
+
+ def test_runner_not_available(self):
+ """Returns 1 when lm-eval runner is not registered."""
+ import os
+ import threading
+ from unittest.mock import patch
+
+ orch = self._make_orchestrator()
+ stop = threading.Event()
+ with patch.dict(os.environ, {"EVAL_ONLY": "false"}, clear=False):
+ with patch("srtctl.cli.do_sweep.wait_for_port", return_value=True):
+ with patch("srtctl.benchmarks.get_runner", side_effect=ValueError("not found")):
+ result = orch._run_post_eval(stop)
+ assert result == 1
+
+ def test_successful_eval(self):
+ """Returns 0 when eval completes successfully."""
+ import os
+ import threading
+ from unittest.mock import MagicMock, patch
+
+ orch = self._make_orchestrator()
+ stop = threading.Event()
+
+ mock_proc = MagicMock()
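+ # poll() returns None once (one loop iteration), then 0 (process finished).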
+ mock_proc.poll.side_effect = [None, 0]
+ mock_proc.returncode = 0
+
+ with patch.dict(os.environ, {"EVAL_ONLY": "false"}, clear=False):
+ with patch("srtctl.cli.do_sweep.wait_for_port", return_value=True):
+ with patch("srtctl.cli.do_sweep.start_srun_process", return_value=mock_proc):
+ result = orch._run_post_eval(stop)
+ assert result == 0
+
+ def test_eval_only_successful(self):
+ """Returns 0 in eval-only mode when health check and eval succeed."""
+ import os
+ import threading
+ from unittest.mock import MagicMock, patch
+
+ orch = self._make_orchestrator()
+ stop = threading.Event()
+
+ mock_proc = MagicMock()
+ mock_proc.poll.side_effect = [None, 0]
+ mock_proc.returncode = 0
+
+ with patch.dict(os.environ, {"EVAL_ONLY": "true"}, clear=False):
+ with patch("srtctl.core.health.wait_for_model", return_value=True):
+ with patch("srtctl.cli.do_sweep.start_srun_process", return_value=mock_proc):
+ result = orch._run_post_eval(stop)
+ assert result == 0
+
+ def test_env_var_passthrough(self):
+ """Eval env vars are passed through to srun."""
+ import os
+ import threading
+ from unittest.mock import MagicMock, patch
+
+ orch = self._make_orchestrator()
+ stop = threading.Event()
+
+ mock_proc = MagicMock()
+ mock_proc.poll.return_value = 0
+ mock_proc.returncode = 0
+
+ env_vars = {
+ "EVAL_ONLY": "false",
+ "RUN_EVAL": "true",
+ "FRAMEWORK": "sglang",
+ "PRECISION": "fp4",
+ "MODEL": "test-model",
+ }
+
+ captured_kwargs = {}
+
+ def capture_srun(**kwargs):
+ captured_kwargs.update(kwargs)
+ return mock_proc
+
+ with patch.dict(os.environ, env_vars, clear=False):
+ with patch("srtctl.cli.do_sweep.wait_for_port", return_value=True):
+ with patch("srtctl.cli.do_sweep.start_srun_process", side_effect=capture_srun):
+ orch._run_post_eval(stop)
+
+ env_to_set = captured_kwargs["env_to_set"]
+ assert env_to_set["RUN_EVAL"] == "true"
+ assert env_to_set["FRAMEWORK"] == "sglang"
+ assert env_to_set["PRECISION"] == "fp4"
+ assert env_to_set["MODEL"] == "test-model"
+ assert env_to_set["MODEL_NAME"] == "test-model"
+
+ def test_eval_conc_from_env(self):
+ """EVAL_CONC from env takes priority over benchmark concurrencies."""
+ import os
+ import threading
+ from unittest.mock import MagicMock, patch
+
+ orch = self._make_orchestrator()
+ stop = threading.Event()
+
+ mock_proc = MagicMock()
+ mock_proc.poll.return_value = 0
+ mock_proc.returncode = 0
+
+ captured_kwargs = {}
+
+ def capture_srun(**kwargs):
+ captured_kwargs.update(kwargs)
+ return mock_proc
+
+ with patch.dict(os.environ, {"EVAL_ONLY": "false", "EVAL_CONC": "64"}, clear=False):
+ with patch("srtctl.cli.do_sweep.wait_for_port", return_value=True):
+ with patch("srtctl.cli.do_sweep.start_srun_process", side_effect=capture_srun):
+ orch._run_post_eval(stop)
+
+ assert captured_kwargs["env_to_set"]["EVAL_CONC"] == "64"
+
+ def test_eval_conc_fallback_to_max_concurrency(self):
+ """EVAL_CONC falls back to max of benchmark concurrencies."""
+ import os
+ import threading
+ from unittest.mock import MagicMock, patch
+
+ orch = self._make_orchestrator()
+ stop = threading.Event()
+
+ mock_proc = MagicMock()
+ mock_proc.poll.return_value = 0
+ mock_proc.returncode = 0
+
+ captured_kwargs = {}
+
+ def capture_srun(**kwargs):
+ captured_kwargs.update(kwargs)
+ return mock_proc
+
+ env = {"EVAL_ONLY": "false"}
+ # Remove EVAL_CONC if present
+ with patch.dict(os.environ, env, clear=False):
+ os.environ.pop("EVAL_CONC", None)
+ with patch("srtctl.cli.do_sweep.wait_for_port", return_value=True):
+ with patch("srtctl.cli.do_sweep.start_srun_process", side_effect=capture_srun):
+ orch._run_post_eval(stop)
+
+ # concurrencies="128x256x512", max is 512
+ assert captured_kwargs["env_to_set"]["EVAL_CONC"] == "512"
+
+ def test_stop_event_terminates_eval(self):
+ """Stop event terminates the eval process."""
+ import os
+ import threading
+ from unittest.mock import MagicMock, patch
+
+ orch = self._make_orchestrator()
+ stop = threading.Event()
+ stop.set()
+
+ mock_proc = MagicMock()
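+ # poll() always returns None: the eval process only ends when terminated.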
+ mock_proc.poll.return_value = None
+
+ with patch.dict(os.environ, {"EVAL_ONLY": "false"}, clear=False):
+ with patch("srtctl.cli.do_sweep.wait_for_port", return_value=True):
+ with patch("srtctl.cli.do_sweep.start_srun_process", return_value=mock_proc):
+ result = orch._run_post_eval(stop)
+
+ assert result == 1
+ mock_proc.terminate.assert_called_once()
+
+
+class TestSweepRunEvalIntegration:
+ """Test eval-related branches in SweepOrchestrator.run()."""
+
+ @staticmethod
+ def _make_orchestrator():
+ return TestRunPostEval._make_orchestrator()
+
+ def test_run_eval_only_mode(self):
+ """EVAL_ONLY=true skips benchmark and runs _run_post_eval."""
+ import os
+ from unittest.mock import MagicMock, patch
+
+ orch = self._make_orchestrator()
+
+ with patch.dict(os.environ, {"EVAL_ONLY": "true"}, clear=False):
+ with patch.object(orch, "start_head_infrastructure") as mock_head:
+ mock_head.return_value = MagicMock()
+ with patch.object(orch, "start_all_workers", return_value={}):
+ with patch.object(orch, "start_frontend", return_value=[]):
+ with patch.object(orch, "_run_post_eval", return_value=0) as mock_eval:
+ with patch.object(orch, "run_benchmark") as mock_bench:
+ with patch.object(orch, "run_postprocess"):
+ with patch("srtctl.cli.do_sweep.StatusReporter") as mock_reporter_cls:
+ mock_reporter_cls.from_config.return_value = MagicMock()
+ exit_code = orch.run()
+
+ mock_eval.assert_called_once()
+ mock_bench.assert_not_called()
+ assert exit_code == 0
+
+ def test_run_with_post_benchmark_eval(self):
+ """RUN_EVAL=true runs benchmark then _run_post_eval."""
+ import os
+ from unittest.mock import MagicMock, patch
+
+ orch = self._make_orchestrator()
+
+ with patch.dict(os.environ, {"EVAL_ONLY": "false", "RUN_EVAL": "true"}, clear=False):
+ with patch.object(orch, "start_head_infrastructure") as mock_head:
+ mock_head.return_value = MagicMock()
+ with patch.object(orch, "start_all_workers", return_value={}):
+ with patch.object(orch, "start_frontend", return_value=[]):
+ with patch.object(orch, "run_benchmark", return_value=0) as mock_bench:
+ with patch.object(orch, "_run_post_eval", return_value=0) as mock_eval:
+ with patch.object(orch, "run_postprocess"):
+ with patch("srtctl.cli.do_sweep.StatusReporter") as mock_reporter_cls:
+ mock_reporter_cls.from_config.return_value = MagicMock()
+ exit_code = orch.run()
+
+ mock_bench.assert_called_once()
+ mock_eval.assert_called_once()
+ assert exit_code == 0
+
+ def test_run_eval_only_failure(self):
+ """EVAL_ONLY=true with eval failure returns non-zero exit code."""
+ import os
+ from unittest.mock import MagicMock, patch
+
+ orch = self._make_orchestrator()
+
+ with patch.dict(os.environ, {"EVAL_ONLY": "true"}, clear=False):
+ with patch.object(orch, "start_head_infrastructure") as mock_head:
+ mock_head.return_value = MagicMock()
+ with patch.object(orch, "start_all_workers", return_value={}):
+ with patch.object(orch, "start_frontend", return_value=[]):
+ with patch.object(orch, "_run_post_eval", return_value=1):
+ with patch.object(orch, "run_postprocess"):
+ with patch("srtctl.cli.do_sweep.StatusReporter") as mock_reporter_cls:
+ mock_reporter_cls.from_config.return_value = MagicMock()
+ exit_code = orch.run()
+
+ assert exit_code == 1
+
+ def test_run_post_benchmark_eval_failure_nonfatal(self):
+ """RUN_EVAL=true with eval failure still returns benchmark exit code 0."""
+ import os
+ from unittest.mock import MagicMock, patch
+
+ orch = self._make_orchestrator()
+
+ with patch.dict(os.environ, {"EVAL_ONLY": "false", "RUN_EVAL": "true"}, clear=False):
+ with patch.object(orch, "start_head_infrastructure") as mock_head:
+ mock_head.return_value = MagicMock()
+ with patch.object(orch, "start_all_workers", return_value={}):
+ with patch.object(orch, "start_frontend", return_value=[]):
+ with patch.object(orch, "run_benchmark", return_value=0):
+ with patch.object(orch, "_run_post_eval", return_value=1):
+ with patch.object(orch, "run_postprocess"):
+ with patch("srtctl.cli.do_sweep.StatusReporter") as mock_reporter_cls:
+ mock_reporter_cls.from_config.return_value = MagicMock()
+ exit_code = orch.run()
+
+ assert exit_code == 0
+
+ def test_run_eval_skipped_when_benchmark_fails(self):
+ """RUN_EVAL=true but benchmark fails: eval is skipped."""
+ import os
+ from unittest.mock import MagicMock, patch
+
+ orch = self._make_orchestrator()
+
+ with patch.dict(os.environ, {"EVAL_ONLY": "false", "RUN_EVAL": "true"}, clear=False):
+ with patch.object(orch, "start_head_infrastructure") as mock_head:
+ mock_head.return_value = MagicMock()
+ with patch.object(orch, "start_all_workers", return_value={}):
+ with patch.object(orch, "start_frontend", return_value=[]):
+ with patch.object(orch, "run_benchmark", return_value=1):
+ with patch.object(orch, "_run_post_eval") as mock_eval:
+ with patch.object(orch, "run_postprocess"):
+ with patch("srtctl.cli.do_sweep.StatusReporter") as mock_reporter_cls:
+ mock_reporter_cls.from_config.return_value = MagicMock()
+ exit_code = orch.run()
+
+ mock_eval.assert_not_called()
+ assert exit_code == 1
diff --git a/tests/test_configs.py b/tests/test_configs.py
index 1c23fb30..0b4138d5 100644
--- a/tests/test_configs.py
+++ b/tests/test_configs.py
@@ -127,7 +127,11 @@ def test_hash_install_command(self):
assert "git clone" in cmd
assert "git checkout abc123" in cmd
assert "maturin build" in cmd
- assert "pip install -e" in cmd
+ assert "if [ -d /sgl-workspace ]" in cmd
+ assert "/tmp/dynamo_build" in cmd
+ assert "protobuf-compiler" in cmd
+ assert "if ! command -v cargo" in cmd
+ assert "if ! command -v maturin" in cmd
def test_top_of_tree_install_command(self):
"""Top-of-tree config generates source install without checkout."""
@@ -140,6 +144,10 @@ def test_top_of_tree_install_command(self):
assert "git clone" in cmd
assert "git checkout" not in cmd
assert "maturin build" in cmd
+ assert "if [ -d /sgl-workspace ]" in cmd
+ assert "/tmp/dynamo_build" in cmd
+ assert "--break-system-packages" in cmd
+ assert "--force-reinstall" in cmd
def test_hash_and_top_of_tree_not_allowed(self):
"""Cannot specify both hash and top_of_tree."""
@@ -1072,6 +1080,8 @@ def test_standard_tp_mode_still_works(self):
def test_vllm_get_process_environment(self):
"""Test vLLM sets port environment variables from process."""
+ from unittest.mock import patch
+
from srtctl.backends import VLLMProtocol
from srtctl.core.topology import Process
@@ -1090,10 +1100,12 @@ def test_vllm_get_process_environment(self):
nixl_port=6550,
)
- env = backend.get_process_environment(process)
+ with patch("srtctl.core.slurm.get_hostname_ip", return_value="10.0.0.1"):
+ env = backend.get_process_environment(process)
assert env["DYN_VLLM_KV_EVENT_PORT"] == "5550"
assert env["VLLM_NIXL_SIDE_CHANNEL_PORT"] == "6550"
+ assert env["VLLM_NIXL_SIDE_CHANNEL_HOST"] == "10.0.0.1"
def test_vllm_get_process_environment_none_ports(self):
"""Test vLLM handles None ports gracefully."""
@@ -1370,3 +1382,113 @@ def test_agg_mode_no_disaggregation_flag(self):
assert "--disaggregation-mode" not in cmd
assert "--is-prefill-worker" not in cmd
assert "--is-decode-worker" not in cmd
+
+
+class TestInfmaxWorkspaceMount:
+ """Test that INFMAX_WORKSPACE env var creates a container mount."""
+
+ def test_infmax_workspace_mount_added(self, tmp_path):
+ """RuntimeContext includes /infmax-workspace mount when env var is set."""
+ import os
+ import subprocess
+ from pathlib import Path
+ from unittest.mock import MagicMock, patch
+
+ from srtctl.core.runtime import RuntimeContext
+ from srtctl.core.schema import ModelConfig, ResourceConfig, SrtConfig
+
+ model_path = tmp_path / "model"
+ model_path.mkdir()
+ container_path = tmp_path / "container.sqsh"
+ container_path.touch()
+
+ slurm_env = {
+ "SLURM_JOB_ID": "12345",
+ "SLURM_JOBID": "12345",
+ "SLURM_NODELIST": "gpu-[01-02]",
+ "SLURM_JOB_NUM_NODES": "2",
+ "SRTCTL_SOURCE_DIR": str(Path(__file__).parent.parent),
+ "INFMAX_WORKSPACE": "/actions/runner/workspace",
+ }
+
+ def mock_scontrol(cmd, **kwargs):
+ if cmd[0] == "scontrol" and "hostnames" in cmd:
+ result = MagicMock()
+ result.stdout = "gpu-01\ngpu-02"
+ result.returncode = 0
+ return result
+ raise subprocess.CalledProcessError(1, cmd)
+
+ with patch.dict(os.environ, slurm_env):
+ with patch("subprocess.run", mock_scontrol):
+ with patch("srtctl.core.slurm.get_hostname_ip", return_value="10.0.0.1"):
+ config = SrtConfig(
+ name="test",
+ model=ModelConfig(
+ path=str(model_path),
+ container=str(container_path),
+ precision="fp8",
+ ),
+ resources=ResourceConfig(
+ gpu_type="h100",
+ gpus_per_node=8,
+ prefill_nodes=1,
+ decode_nodes=1,
+ ),
+ )
+ runtime = RuntimeContext.from_config(config, job_id="12345")
+
+ assert Path("/infmax-workspace") in runtime.container_mounts.values()
+
+ def test_infmax_workspace_mount_not_added_without_env(self, tmp_path):
+ """RuntimeContext does not include /infmax-workspace without env var."""
+ import os
+ import subprocess
+ from pathlib import Path
+ from unittest.mock import MagicMock, patch
+
+ from srtctl.core.runtime import RuntimeContext
+ from srtctl.core.schema import ModelConfig, ResourceConfig, SrtConfig
+
+ model_path = tmp_path / "model"
+ model_path.mkdir()
+ container_path = tmp_path / "container.sqsh"
+ container_path.touch()
+
+ slurm_env = {
+ "SLURM_JOB_ID": "12345",
+ "SLURM_JOBID": "12345",
+ "SLURM_NODELIST": "gpu-[01-02]",
+ "SLURM_JOB_NUM_NODES": "2",
+ "SRTCTL_SOURCE_DIR": str(Path(__file__).parent.parent),
+ }
+
+ def mock_scontrol(cmd, **kwargs):
+ if cmd[0] == "scontrol" and "hostnames" in cmd:
+ result = MagicMock()
+ result.stdout = "gpu-01\ngpu-02"
+ result.returncode = 0
+ return result
+ raise subprocess.CalledProcessError(1, cmd)
+
+ with patch.dict(os.environ, slurm_env):
+ os.environ.pop("INFMAX_WORKSPACE", None)
+ with patch("subprocess.run", mock_scontrol):
+ with patch("srtctl.core.slurm.get_hostname_ip", return_value="10.0.0.1"):
+ config = SrtConfig(
+ name="test",
+ model=ModelConfig(
+ path=str(model_path),
+ container=str(container_path),
+ precision="fp8",
+ ),
+ resources=ResourceConfig(
+ gpu_type="h100",
+ gpus_per_node=8,
+ prefill_nodes=1,
+ decode_nodes=1,
+ ),
+ )
+ runtime = RuntimeContext.from_config(config, job_id="12345")
+
+ assert Path("/infmax-workspace") not in runtime.container_mounts.values()