diff --git a/.github/workflows/ci.yaml b/.github/workflows/ci.yaml
index eba897bb..dccdba05 100644
--- a/.github/workflows/ci.yaml
+++ b/.github/workflows/ci.yaml
@@ -4,7 +4,7 @@ on:
   push:
     branches: [main, master]
   pull_request:
-    branches: [main, master]
+    branches: [main, master, sa-submission-q2-2026]
 
 jobs:
   lint:
@@ -119,3 +119,4 @@ jobs:
             exit(1)
     print(f'\nAll {len(recipes)} recipes valid')
     "
+
diff --git a/docs/accuracy.md b/docs/accuracy.md
index f5588c9f..98b69b46 100644
--- a/docs/accuracy.md
+++ b/docs/accuracy.md
@@ -1,6 +1,6 @@
 # Accuracy Benchmarks
 
-In srt-slurm, users can run different accuracy benchmarks by setting the benchmark section in the config yaml file. Supported benchmarks include `mmlu`, `gpqa` and `longbenchv2`.
+In srt-slurm, users can run different accuracy benchmarks by setting the benchmark section in the config yaml file. Supported benchmarks include `mmlu`, `gpqa`, `longbenchv2`, and `lm-eval`.
 
 ## Table of Contents
 
@@ -14,6 +14,7 @@ In srt-slurm, users can run different accuracy benchma
   - [Example: Quick Validation](#example-quick-validation)
   - [Output](#output)
   - [Important Notes](#important-notes)
+- [lm-eval (InferenceX)](#lm-eval-inferencex)
 
 ---
 
@@ -191,3 +192,106 @@ The output includes per-category scores and aggregate metrics:
 
 4. **Categories**: Running specific categories is useful for targeted validation (e.g., just testing summarization capabilities)
 
+## lm-eval (InferenceX)
+
+The `lm-eval` benchmark runner integrates [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) via InferenceX's `benchmark_lib.sh`. Unlike the built-in benchmarks above, this runner sources evaluation logic from an external InferenceX workspace mounted at `/infmax-workspace`.
+
+This is used by InferenceX CI to run evals such as GSM8K and GPQA against NVIDIA multi-node disaggregated deployments on GB200, GB300, B200, B300, H100, and H200. AMD MI355X multi-node evals are handled by InferenceX's upstreamed AMD Slurm path, not by this srt-slurm runner.
+
+In InferenceX CI, recipes normally keep their throughput benchmark configuration. `do_sweep.py` invokes the registered `lm-eval` runner as a post-step when `RUN_EVAL=true`, or as the only benchmark-like step when `EVAL_ONLY=true`. There is no separate `infmax-eval` benchmark type.
+
+### How it works
+
+1. `RuntimeContext` mounts the host path from `INFMAX_WORKSPACE` at `/infmax-workspace` inside the Slurm container.
+2. `do_sweep.py` starts infrastructure, workers, and the frontend for the normal recipe topology.
+3. For `EVAL_ONLY=true`, `do_sweep.py` skips the throughput benchmark stage and runs `_run_post_eval()` directly after frontend startup.
+4. `_run_post_eval()` waits for the OpenAI-compatible endpoint on port 8000 and, in eval-only mode, performs the full `wait_for_model()` health check for the configured prefill/decode or aggregated topology.
+5. `_run_post_eval()` launches the registered `lm-eval` runner on the head node and passes through InferenceX metadata such as framework, precision, sequence length, prefill/decode topology, and eval concurrency.
+6. The runner script (`benchmarks/scripts/lm-eval/bench.sh`) uses `MODEL_NAME` from `do_sweep.py`, or auto-discovers the served model from `/v1/models` as a fallback (see the sketch after this list).
+7. The runner sources `/infmax-workspace/benchmarks/benchmark_lib.sh`, runs `run_eval --framework lm-eval`, and calls `append_lm_eval_summary`.
+8. Eval artifacts are copied to `/logs/eval_results/` for InferenceX launcher-side artifact pickup.
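+
+As a rough sketch of the step-6 fallback (hypothetical; the actual endpoint handling and JSON parsing in `bench.sh` may differ), the served-model alias can be recovered from the frontend when `do_sweep.py` does not provide one:
+
+```bash
+# Hypothetical sketch of the MODEL_NAME fallback described in step 6.
+# Assumes the frontend's OpenAI-compatible API on port 8000 and that jq is available.
+if [ -z "${MODEL_NAME:-}" ]; then
+  MODEL_NAME="$(curl -s http://localhost:8000/v1/models | jq -r '.data[0].id')"
+fi
+```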
+
+### EVAL_ONLY mode
+
+srt-slurm supports an `EVAL_ONLY` mode for CI jobs that should only validate accuracy. This is controlled by environment variables from the InferenceX workflow:
+
+| Env var | Description |
+|---------|-------------|
+| `EVAL_ONLY` | Set to `true` to skip the throughput benchmark stage and run eval only |
+| `RUN_EVAL` | Set to `true` to run eval after the throughput benchmark completes |
+| `EVAL_CONC` | Concurrent requests for lm-eval, normally set by InferenceX from the generated `eval-conc` value |
+| `INFMAX_WORKSPACE` | Host path to the InferenceX checkout that should be mounted at `/infmax-workspace` |
+| `MODEL_NAME` | Served model alias for OpenAI-compatible requests; set by `do_sweep.py` from `config.served_model_name` |
+
+When `EVAL_ONLY=true`:
+- Stage 4 skips the throughput benchmark entirely. No throughput result JSON is expected from srt-slurm.
+- The eval path uses the full `wait_for_model()` health check before starting lm-eval.
+- `_run_post_eval()` launches the `lm-eval` runner and returns its exit code.
+- Eval failure is fatal because eval is the only purpose of the job.
+
+When `RUN_EVAL=true` (without `EVAL_ONLY`):
+- The throughput benchmark runs normally.
+- After the benchmark completes successfully, eval runs as a post-step.
+- Eval failure is non-fatal; the benchmark job still succeeds if throughput passed.
+
+### Environment variables
+
+The following env vars are passed through to the lm-eval runner container:
+
+| Env var | Purpose |
+|---------|---------|
+| `RUN_EVAL`, `EVAL_ONLY`, `IS_MULTINODE` | Control whether eval runs and how InferenceX classifies the artifact |
+| `FRAMEWORK`, `PRECISION`, `MODEL_PREFIX`, `RUNNER_TYPE`, `SPEC_DECODING` | Benchmark identity metadata for `meta_env.json` |
+| `ISL`, `OSL`, `RESULT_FILENAME` | Sequence length and result-file metadata |
+| `MODEL`, `MODEL_PATH`, `MODEL_NAME` | Model metadata and the served model alias used for requests |
+| `MAX_MODEL_LEN`, `EVAL_MAX_MODEL_LEN` | Context-length metadata used by InferenceX eval helpers when available |
+| `PREFILL_TP`, `PREFILL_EP`, `PREFILL_NUM_WORKERS`, `PREFILL_DP_ATTN` | Prefill-side topology metadata |
+| `DECODE_TP`, `DECODE_EP`, `DECODE_NUM_WORKERS`, `DECODE_DP_ATTN` | Decode-side topology metadata |
+| `EVAL_CONC`, `EVAL_CONCURRENT_REQUESTS` | Eval concurrency controls |
+
+The runner maps srt-slurm's `PREFILL_DP_ATTN` and `DECODE_DP_ATTN` names to InferenceX's `PREFILL_DP_ATTENTION` and `DECODE_DP_ATTENTION` names before calling `append_lm_eval_summary`. This is required for multi-node summary tables to preserve prefill/decode DP-attention (DPA) state.
+
+### Concurrency
+
+Eval concurrency is ultimately read by InferenceX's `benchmark_lib.sh` from `EVAL_CONCURRENT_REQUESTS`. The runner script sets that value from `EVAL_CONC` when present, preserves an existing `EVAL_CONCURRENT_REQUESTS` otherwise, and falls back to `256` only if neither variable is set:
+
+```bash
+export EVAL_CONCURRENT_REQUESTS="${EVAL_CONC:-${EVAL_CONCURRENT_REQUESTS:-256}}"
+```
+
+The InferenceX workflow sets `EVAL_CONC` from the generated `eval-conc` value. For multi-node configs, InferenceX selects the `8k1k` entry with the highest max eligible concurrency for each `(model, runner, framework, precision, spec-decoding, prefill-dp-attn, decode-dp-attn)` group, then sets `eval-conc` to the upper median of that config's eligible concurrency list (illustrated below). If `EVAL_CONC` is not set in the environment, `do_sweep.py` falls back to the max of the recipe benchmark concurrency list.
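+
+As a concrete illustration of the "upper median" rule (a hypothetical sketch, not the actual InferenceX selection code), picking from an eligible concurrency list works like this:
+
+```bash
+# Hypothetical illustration of the upper-median pick described above.
+# Sorts the eligible concurrencies and takes the element at index len/2.
+upper_median() {
+  local sorted=($(printf '%s\n' "$@" | sort -n))
+  echo "${sorted[$(( ${#sorted[@]} / 2 ))]}"
+}
+upper_median 4 180 360 616   # prints 360
+```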
+
+### Output
+
+Eval artifacts are written to `/logs/eval_results/` inside the container:
+- `meta_env.json` - metadata used by InferenceX aggregation and summary tables
+- `results*.json` - lm-eval scores per task
+- `sample*.jsonl` - per-sample outputs
+
+These are collected by the InferenceX NVIDIA launch scripts and uploaded as workflow artifacts. In eval-only mode the InferenceX workflow expects eval artifacts, not throughput benchmark artifacts.
+
+### Intricacies
+1. Eval concurrency floor of 16
+   - One sweep config uses `conc: [1]`; running lm-eval at a concurrency of 1 takes more than 4 hours to complete, so eval concurrency is floored at 16.
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep16_batch32_eplb0_mtp2.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep16_batch32_eplb0_mtp2.yaml
new file mode 100644
index 00000000..21edc148
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep16_batch32_eplb0_mtp2.yaml
@@ -0,0 +1,135 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep16_batch32_eplb0_mtp2"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=32
+# concurrency: 666
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 4
+
+  decode_workers: 1
+  decode_nodes: 4
+  gpus_per_decode: 16
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 16
+      max_num_tokens: 16384
+      max_seq_len: 1064
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 2
+
+    decode:
+      tensor_parallel_size: 16
+      moe_expert_parallel_size: 16
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 32
+      max_num_tokens: 96
+      max_seq_len: 2088
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+      moe_config:
+        backend: CUTEDSL
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.7
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 2
+
+benchmark:
+  type: "sa-bench"
+  isl: 1024
+  osl: 1024
+  concurrencies: "666"
+
req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep16_batch64_eplb0_mtp1.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep16_batch64_eplb0_mtp1.yaml new file mode 100644 index 00000000..ebcd45d1 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep16_batch64_eplb0_mtp1.yaml @@ -0,0 +1,139 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep16_batch64_eplb0_mtp1" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=64 +# concurrency: 1229 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 64 + max_num_tokens: 128 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "1229" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git 
a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep32_batch16_allconc_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep32_batch16_allconc_eplb0_mtp3.yaml new file mode 100644 index 00000000..68af65ee --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep32_batch16_allconc_eplb0_mtp3.yaml @@ -0,0 +1,133 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep32_batch16_allconc_eplb0_mtp3" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=16 +# concurrencies: 333 (batch8), 666 (batch16) + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 64 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "333x666" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen4tep8_batch16_eplb0_mtp2.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen4tep8_batch16_eplb0_mtp2.yaml new file mode 100644 index 00000000..d6d3dcf1 --- /dev/null +++ 
b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen4tep8_batch16_eplb0_mtp2.yaml @@ -0,0 +1,134 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep4_gen4tep8_batch16_eplb0_mtp2" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 4 decode workers, TP8/EP8, enable_attention_dp=false, max_batch=16 +# concurrency: 96 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + + decode: + allreduce_strategy: MNNVL + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 48 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "96" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen4tep8_batch32_allconc_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen4tep8_batch32_allconc_eplb0_mtp3.yaml new file mode 100644 index 00000000..da187faf --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen4tep8_batch32_allconc_eplb0_mtp3.yaml @@ -0,0 +1,136 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep4_gen4tep8_batch32_allconc_eplb0_mtp3" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 4 decode workers, TP8/EP8, enable_attention_dp=false, max_batch=32 +# 
concurrencies: 8 (batch1), 44 (batch8), 192 (batch32) + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + allreduce_strategy: MNNVL + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "8x44x192" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen5tep4_batch1_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen5tep4_batch1_eplb0_mtp3.yaml new file mode 100644 index 00000000..a6121cd0 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen5tep4_batch1_eplb0_mtp3.yaml @@ -0,0 +1,131 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep4_gen5tep4_batch1_eplb0_mtp3" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 5 decode workers, TP4/EP4, enable_attention_dp=false, max_batch=1 +# concurrency: 10 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + 
gpus_per_decode: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 1 + max_num_tokens: 4 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "10" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen1dep16_batch256_eplb256_mtp1.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen1dep16_batch256_eplb256_mtp1.yaml new file mode 100644 index 00000000..dc176b2d --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen1dep16_batch256_eplb256_mtp1.yaml @@ -0,0 +1,167 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx2dep4_gen1dep16_batch256_eplb256_mtp1" + +# ctx: 2 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=256 +# EPLB: num_slots=256 +# concurrency: 4301 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + 
MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 256 + max_num_tokens: 512 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + layer_updates_per_iter: 1 + num_slots: 256 + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "4301" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx3dep4_gen1dep32_batch128_eplb288_mtp1.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx3dep4_gen1dep32_batch128_eplb288_mtp1.yaml new file mode 100644 index 00000000..a7a1c790 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx3dep4_gen1dep32_batch128_eplb288_mtp1.yaml @@ -0,0 +1,151 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx3dep4_gen1dep32_batch128_eplb288_mtp1" + +# ctx: 3 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=128 +# EPLB: num_slots=288 +# concurrency: 4301 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 3 + prefill_workers: 3 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + 
NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 128 + max_num_tokens: 256 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + layer_updates_per_iter: 1 + num_slots: 288 + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "4301" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep32_batch32_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep32_batch32_eplb0_mtp0.yaml new file mode 100644 index 00000000..7412a109 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep32_batch32_eplb0_mtp0.yaml @@ -0,0 +1,129 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep32_batch32_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=32 +# concurrency: 1229 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + 
moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "1229" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0.yaml new file mode 100644 index 00000000..e969c07d --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0.yaml @@ -0,0 +1,142 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 4 decode workers, TP8/EP8, enable_attention_dp=false, max_batch=128 +# Merged concurrencies: batch1(4), batch32(180), batch64(360), batch128(616) + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + allreduce_strategy: MNNVL + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "4x180x360x616" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml new file mode 100644 index 00000000..fb583747 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml @@ -0,0 +1,126 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 5 decode workers, TP4/EP4, enable_attention_dp=false, max_batch=8 +# Merged concurrencies: batch1(5), batch2(15), batch4(30), batch8(50) + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + gpus_per_decode: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + 
custom_tokenizer: "glm_moe_dsa" + max_batch_size: 8 + max_num_tokens: 8 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "5x15x30x50" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep16_batch128_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep16_batch128_eplb0_mtp0.yaml new file mode 100644 index 00000000..e057ce05 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep16_batch128_eplb0_mtp0.yaml @@ -0,0 +1,141 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx2dep4_gen1dep16_batch128_eplb0_mtp0" + +# ctx: 2 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=128 +# concurrency: 2253 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.75 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "2253" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep16_batch512_eplb256_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep16_batch512_eplb256_mtp0.yaml new file mode 100644 index 00000000..d221dde2 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep16_batch512_eplb256_mtp0.yaml @@ -0,0 +1,193 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx2dep4_gen1dep16_batch512_eplb256_mtp0" + +# ctx: 2 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=512 +# EPLB: num_slots=256 +# concurrency: 8192 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 512 + max_num_tokens: 512 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + layer_updates_per_iter: 1 + 
num_slots: 256 + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "8192" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep32_batch64_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep32_batch64_eplb0_mtp0.yaml new file mode 100644 index 00000000..bbad79c1 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep32_batch64_eplb0_mtp0.yaml @@ -0,0 +1,133 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx2dep4_gen1dep32_batch64_eplb0_mtp0" + +# ctx: 2 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=64 +# concurrency: 2253 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "2253" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + 
enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx4dep4_gen1dep32_batch256_eplb288_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx4dep4_gen1dep32_batch256_eplb288_mtp0.yaml new file mode 100644 index 00000000..26d2d29e --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx4dep4_gen1dep32_batch256_eplb288_mtp0.yaml @@ -0,0 +1,161 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx4dep4_gen1dep32_batch256_eplb288_mtp0" + +# ctx: 4 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=256 +# EPLB: num_slots=288 +# concurrency: 8192 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 4 + prefill_workers: 4 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + layer_updates_per_iter: 1 + num_slots: 288 + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "8192" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git 
a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx10dep4_gen1dep16_batch64_eplb0_mtp1.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx10dep4_gen1dep16_batch64_eplb0_mtp1.yaml new file mode 100644 index 00000000..420192c2 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx10dep4_gen1dep16_batch64_eplb0_mtp1.yaml @@ -0,0 +1,139 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx10dep4_gen1dep16_batch64_eplb0_mtp1" + +# ctx: 10 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=64 +# concurrency: 1229 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 10 + prefill_workers: 10 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 64 + max_num_tokens: 128 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "1229" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen2tep8_batch16_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen2tep8_batch16_eplb0_mtp3.yaml new file mode 100644 index 00000000..da3186e5 --- /dev/null +++ 
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen2tep8_batch16_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen2tep8_batch16_eplb0_mtp3.yaml
new file mode 100644
index 00000000..da3186e5
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen2tep8_batch16_eplb0_mtp3.yaml
@@ -0,0 +1,133 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep4_gen2tep8_batch16_eplb0_mtp3"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 2 decode workers, TP8/EP8, enable_attention_dp=false, max_batch=16, concurrency: 46
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 4
+
+  decode_workers: 2
+  decode_nodes: 4
+  gpus_per_decode: 8
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 2
+      max_num_tokens: 16640
+      max_seq_len: 8232
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+    decode:
+      allreduce_strategy: MNNVL
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 16
+      max_num_tokens: 64
+      max_seq_len: 9256
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+      moe_config:
+        backend: TRTLLM
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+benchmark:
+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  concurrencies: "46"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
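+
+# Note (reviewer annotation, not part of the tuned recipe): the "tep"/"dep"
+# fragments in recipe names appear to track the decode enable_attention_dp
+# flag: "gen2tep8" pairs with enable_attention_dp: false (plain TP/EP),
+# while "gen1dep32"-style recipes set enable_attention_dp: true (attention
+# data parallel). Treat this as an inferred naming convention, not a
+# guaranteed one.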
"nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + allreduce_strategy: MNNVL + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "4x48" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen5tep4_batch1_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen5tep4_batch1_eplb0_mtp3.yaml new file mode 100644 index 00000000..0a13cce4 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen5tep4_batch1_eplb0_mtp3.yaml @@ -0,0 +1,130 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep4_gen5tep4_batch1_eplb0_mtp3" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 5 decode workers, TP4/EP4, max_batch=1, concurrency: 5 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + gpus_per_decode: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: 
"0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 1 + max_num_tokens: 4 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "5" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx3dep4_gen1dep32_batch4_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx3dep4_gen1dep32_batch4_eplb0_mtp3.yaml new file mode 100644 index 00000000..440a4f73 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx3dep4_gen1dep32_batch4_eplb0_mtp3.yaml @@ -0,0 +1,130 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx3dep4_gen1dep32_batch4_eplb0_mtp3" + +# ctx: 3 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP32/EP32, max_batch=4, concurrency: 167 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 3 + prefill_workers: 3 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + 
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx3dep4_gen1dep32_batch4_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx3dep4_gen1dep32_batch4_eplb0_mtp3.yaml
new file mode 100644
index 00000000..440a4f73
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx3dep4_gen1dep32_batch4_eplb0_mtp3.yaml
@@ -0,0 +1,130 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx3dep4_gen1dep32_batch4_eplb0_mtp3"
+
+# ctx: 3 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=4, concurrency: 167
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+
+  prefill_nodes: 3
+  prefill_workers: 3
+  gpus_per_prefill: 4
+
+  decode_workers: 1
+  decode_nodes: 8
+  gpus_per_decode: 32
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 2
+      max_num_tokens: 16640
+      max_seq_len: 8232
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+    decode:
+      tensor_parallel_size: 32
+      moe_expert_parallel_size: 32
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 4
+      max_num_tokens: 16
+      max_seq_len: 9256
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+      moe_config:
+        backend: CUTEDSL
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+benchmark:
+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  concurrencies: "167"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
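+
+# Note (reviewer annotation, not part of the tuned recipe): the
+# sequence-length caps look derived from the benchmark shape with a small
+# safety margin, assuming roughly
+#   prefill max_seq_len = ISL + 40        (8192 + 40 = 8232)
+#   decode  max_seq_len = ISL + OSL + 40  (8192 + 1024 + 40 = 9256)
+# and similarly 1064 / 2088 in the ISL1K recipes. The 40-token margin is an
+# observed constant, not a documented requirement.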
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx5dep4_gen1dep32_batch8_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx5dep4_gen1dep32_batch8_eplb0_mtp3.yaml
new file mode 100644
index 00000000..492f1b4c
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx5dep4_gen1dep32_batch8_eplb0_mtp3.yaml
@@ -0,0 +1,132 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx5dep4_gen1dep32_batch8_eplb0_mtp3"
+
+# ctx: 5 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=8
+# concurrency: 333
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+
+  prefill_nodes: 5
+  prefill_workers: 5
+  gpus_per_prefill: 4
+
+  decode_workers: 1
+  decode_nodes: 8
+  gpus_per_decode: 32
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 2
+      max_num_tokens: 16640
+      max_seq_len: 8232
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+    decode:
+      tensor_parallel_size: 32
+      moe_expert_parallel_size: 32
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 8
+      max_num_tokens: 32
+      max_seq_len: 9256
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+      moe_config:
+        backend: CUTEDSL
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+benchmark:
+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  concurrencies: "333"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx7dep4_gen1dep16_batch32_eplb0_mtp2.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx7dep4_gen1dep16_batch32_eplb0_mtp2.yaml
new file mode 100644
index 00000000..d22fbcf1
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx7dep4_gen1dep16_batch32_eplb0_mtp2.yaml
@@ -0,0 +1,135 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx7dep4_gen1dep16_batch32_eplb0_mtp2"
+
+# ctx: 7 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=32
+# concurrency: 615
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+
+  prefill_nodes: 7
+  prefill_workers: 7
+  gpus_per_prefill: 4
+
+  decode_workers: 1
+  decode_nodes: 4
+  gpus_per_decode: 16
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 2
+      max_num_tokens: 16640
+      max_seq_len: 8232
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 2
+
+    decode:
+      tensor_parallel_size: 16
+      moe_expert_parallel_size: 16
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 32
+      max_num_tokens: 96
+      max_seq_len: 9256
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+      moe_config:
+        backend: CUTEDSL
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.7
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 2
+
+benchmark:
+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  concurrencies: "615"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
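+
+# Note (reviewer annotation, not part of the tuned recipe): prefill pins
+# free_gpu_memory_fraction at 0.6 in every recipe, while the decode fraction
+# varies (roughly 0.6-0.9) with the decode topology; the wide attention-DP
+# workers appear to reserve more headroom for activations, leaving the KV
+# pool a smaller share, while the small TP4/TP8 TEP workers push it up to
+# 0.85-0.9. This is an observed trend across the recipes, not a stated rule.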
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx7dep4_gen1dep8_batch128_eplb0_mtp1.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx7dep4_gen1dep8_batch128_eplb0_mtp1.yaml
new file mode 100644
index 00000000..804e89b5
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx7dep4_gen1dep8_batch128_eplb0_mtp1.yaml
@@ -0,0 +1,147 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx7dep4_gen1dep8_batch128_eplb0_mtp1"
+
+# ctx: 7 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP8/EP8, enable_attention_dp=true, max_batch=128
+# concurrency: 1076
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+
+  prefill_nodes: 7
+  prefill_workers: 7
+  gpus_per_prefill: 4
+
+  decode_workers: 1
+  decode_nodes: 2
+  gpus_per_decode: 8
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 2
+      max_num_tokens: 16640
+      max_seq_len: 8232
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 1
+
+    decode:
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 128
+      max_num_tokens: 256
+      max_seq_len: 9256
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+          - 40
+          - 48
+          - 56
+          - 64
+          - 72
+          - 80
+          - 88
+          - 96
+          - 104
+          - 112
+          - 120
+          - 128
+      moe_config:
+        backend: CUTEDSL
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 1
+
+benchmark:
+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  concurrencies: "1076"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx10dep4_gen1dep16_batch128_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx10dep4_gen1dep16_batch128_eplb0_mtp0.yaml
new file mode 100644
index 00000000..0fa8566d
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx10dep4_gen1dep16_batch128_eplb0_mtp0.yaml
@@ -0,0 +1,141 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx10dep4_gen1dep16_batch128_eplb0_mtp0"
+
+# ctx: 10 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=128
+# concurrency: 2253
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+
+  prefill_nodes: 10
+  prefill_workers: 10
+  gpus_per_prefill: 4
+
+  decode_workers: 1
+  decode_nodes: 4
+  gpus_per_decode: 16
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 2
+      max_num_tokens: 16640
+      max_seq_len: 8232
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+
+    decode:
+      tensor_parallel_size: 16
+      moe_expert_parallel_size: 16
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: false
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 128
+      max_num_tokens: 128
+      max_seq_len: 9256
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+          - 40
+          - 48
+          - 56
+          - 64
+          - 72
+          - 80
+          - 88
+          - 96
+          - 104
+          - 112
+          - 120
+          - 128
+      moe_config:
+        backend: CUTEDSL
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+
+benchmark:
+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  concurrencies: "2253"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen2tep8_batch32_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen2tep8_batch32_eplb0_mtp0.yaml
new file mode 100644
index 00000000..478f6203
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen2tep8_batch32_eplb0_mtp0.yaml
@@ -0,0 +1,130 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep4_gen2tep8_batch32_eplb0_mtp0"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 2 decode workers, TP8/EP8, enable_attention_dp=false, max_batch=32
+# concurrency: 84
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 4
+
+  decode_workers: 2
+  decode_nodes: 4
+  gpus_per_decode: 8
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 2
+      max_num_tokens: 16640
+      max_seq_len: 8232
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+
+    decode:
+      allreduce_strategy: MNNVL
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 32
+      max_num_tokens: 32
+      max_seq_len: 9256
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+      moe_config:
+        backend: TRTLLM
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+
+benchmark:
+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  concurrencies: "84"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
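+
+# Note (reviewer annotation, not part of the tuned recipe):
+# allreduce_strategy: MNNVL shows up only in the TP8 decode workers that
+# span two GB200 nodes (8 GPUs at 4 per node), presumably to route the TP
+# allreduce over multi-node NVLink; single-node TP4 decode recipes leave the
+# strategy at its default. Inferred from the recipe set, not documented here.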
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen3tep4_batch32_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen3tep4_batch32_eplb0_mtp0.yaml
new file mode 100644
index 00000000..462401b6
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen3tep4_batch32_eplb0_mtp0.yaml
@@ -0,0 +1,129 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep4_gen3tep4_batch32_eplb0_mtp0"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 3 decode workers, TP4/EP4, enable_attention_dp=false, max_batch=32
+# concurrency: 117
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 4
+
+  decode_workers: 3
+  decode_nodes: 3
+  gpus_per_decode: 4
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 2
+      max_num_tokens: 16640
+      max_seq_len: 8232
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+
+    decode:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      pipeline_parallel_size: 1
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 32
+      max_num_tokens: 32
+      max_seq_len: 9256
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+      moe_config:
+        backend: TRTLLM
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+
+benchmark:
+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  concurrencies: "117"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml
new file mode 100644
index 00000000..90e62af3
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml
@@ -0,0 +1,126 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 5 decode workers, TP4/EP4, enable_attention_dp=false, max_batch=8
+# Merged concurrencies: batch1(5), batch2(10), batch4(25), batch8(50)
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 4
+
+  decode_workers: 5
+  decode_nodes: 5
+  gpus_per_decode: 4
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 2
+      max_num_tokens: 16640
+      max_seq_len: 8232
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+
+    decode:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      pipeline_parallel_size: 1
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 8
+      max_num_tokens: 8
+      max_seq_len: 9256
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+      moe_config:
+        backend: TRTLLM
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+
+benchmark:
+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  concurrencies: "5x10x25x50"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx5dep4_gen1dep32_batch16_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx5dep4_gen1dep32_batch16_eplb0_mtp0.yaml
new file mode 100644
index 00000000..7a6ece31
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx5dep4_gen1dep32_batch16_eplb0_mtp0.yaml
@@ -0,0 +1,127 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx5dep4_gen1dep32_batch16_eplb0_mtp0"
+
+# ctx: 5 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=16
+# concurrency: 615
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+
+  prefill_nodes: 5
+  prefill_workers: 5
+  gpus_per_prefill: 4
+
+  decode_workers: 1
+  decode_nodes: 8
+  gpus_per_decode: 32
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 2
+      max_num_tokens: 16640
+      max_seq_len: 8232
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+
+    decode:
+      tensor_parallel_size: 32
+      moe_expert_parallel_size: 32
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: false
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 16
+      max_num_tokens: 16
+      max_seq_len: 9256
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+      moe_config:
+        backend: CUTEDSL
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.75
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+
+benchmark:
+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  concurrencies: "615"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx8dep4_gen1dep32_batch32_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx8dep4_gen1dep32_batch32_eplb0_mtp0.yaml
new file mode 100644
index 00000000..7e34b6d9
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx8dep4_gen1dep32_batch32_eplb0_mtp0.yaml
@@ -0,0 +1,129 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx8dep4_gen1dep32_batch32_eplb0_mtp0"
+
+# ctx: 8 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=32
+# concurrency: 1229
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+
+  prefill_nodes: 8
+  prefill_workers: 8
+  gpus_per_prefill: 4
+
+  decode_workers: 1
+  decode_nodes: 8
+  gpus_per_decode: 32
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 2
+      max_num_tokens: 16640
+      max_seq_len: 8232
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+
+    decode:
+      tensor_parallel_size: 32
+      moe_expert_parallel_size: 32
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: false
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 32
+      max_num_tokens: 32
+      max_seq_len: 9256
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+      moe_config:
+        backend: CUTEDSL
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.75
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+
+benchmark:
+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  concurrencies: "1229"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
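+
+# Note (reviewer annotation, not part of the tuned recipe): the STP/
+# directory holds the mtp0 variants, which drop the speculative_config block
+# entirely; the MTP/ variants add
+#   speculative_config:
+#     decoding_type: MTP
+#     num_nextn_predict_layers: <1|2|3>
+# to both prefill and decode. "STP" presumably reads as single-token
+# prediction, i.e. no speculative decoding (an assumption from the layout).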
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen1dep32_batch8_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen1dep32_batch8_eplb0_mtp3.yaml
new file mode 100644
index 00000000..80aacc6a
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen1dep32_batch8_eplb0_mtp3.yaml
@@ -0,0 +1,132 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep2_gen1dep32_batch8_eplb0_mtp3"
+
+# ctx: 1 prefill worker, TP2/EP2
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=8
+# concurrency: 333
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb300"
+
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 2
+
+  decode_workers: 1
+  decode_nodes: 8
+  gpus_per_decode: 32
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 2
+      moe_expert_parallel_size: 2
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 16
+      max_num_tokens: 16384
+      max_seq_len: 1064
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+    decode:
+      tensor_parallel_size: 32
+      moe_expert_parallel_size: 32
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 8
+      max_num_tokens: 32
+      max_seq_len: 2088
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+      moe_config:
+        backend: CUTEDSL
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+benchmark:
+  type: "sa-bench"
+  isl: 1024
+  osl: 1024
+  concurrencies: "333"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
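+
+# Note (reviewer annotation, not part of the tuned recipe): the prefill
+# token budget appears sized as
+#   max_num_tokens >= max_batch_size * ISL
+# (16 * 1024 = 16384 here; the ISL8K recipes budget 16640 >= 2 * 8232), so a
+# full prefill batch fits in one scheduling step. Observed sizing, not a
+# stated constraint.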
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen4tep8_batch16_allconc_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen4tep8_batch16_allconc_eplb0_mtp3.yaml
new file mode 100644
index 00000000..648ec949
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen4tep8_batch16_allconc_eplb0_mtp3.yaml
@@ -0,0 +1,134 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep2_gen4tep8_batch16_allconc_eplb0_mtp3"
+
+# ctx: 1 prefill worker, TP2/EP2
+# gen: 4 decode workers, TP8/EP8, enable_attention_dp=false, max_batch=16
+# concurrencies: 24 (batch4), 44 (batch8), 92 (batch16)
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb300"
+
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 2
+
+  decode_workers: 4
+  decode_nodes: 8
+  gpus_per_decode: 8
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 2
+      moe_expert_parallel_size: 2
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 16
+      max_num_tokens: 16384
+      max_seq_len: 1064
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+    decode:
+      allreduce_strategy: MNNVL
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 16
+      max_num_tokens: 64
+      max_seq_len: 2088
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+      moe_config:
+        backend: TRTLLM
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+benchmark:
+  type: "sa-bench"
+  isl: 1024
+  osl: 1024
+  concurrencies: "24x44x92"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen4tep8_batch32_eplb0_mtp2.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen4tep8_batch32_eplb0_mtp2.yaml
new file mode 100644
index 00000000..823624ac
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen4tep8_batch32_eplb0_mtp2.yaml
@@ -0,0 +1,136 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep2_gen4tep8_batch32_eplb0_mtp2"
+
+# ctx: 1 prefill worker, TP2/EP2
+# gen: 4 decode workers, TP8/EP8, enable_attention_dp=false, max_batch=32
+# concurrency: 180
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb300"
+
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 2
+
+  decode_workers: 4
+  decode_nodes: 8
+  gpus_per_decode: 8
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 2
+      moe_expert_parallel_size: 2
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 16
+      max_num_tokens: 16384
+      max_seq_len: 1064
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 2
+
+    decode:
+      allreduce_strategy: MNNVL
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 32
+      max_num_tokens: 96
+      max_seq_len: 2088
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+      moe_config:
+        backend: TRTLLM
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 2
+
+benchmark:
+  type: "sa-bench"
+  isl: 1024
+  osl: 1024
+  concurrencies: "180"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen5tep4_batch1_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen5tep4_batch1_eplb0_mtp3.yaml
new file mode 100644
index 00000000..64b61b9f
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen5tep4_batch1_eplb0_mtp3.yaml
@@ -0,0 +1,131 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep2_gen5tep4_batch1_eplb0_mtp3"
+
+# ctx: 1 prefill worker, TP2/EP2
+# gen: 5 decode workers, TP4/EP4, enable_attention_dp=false, max_batch=1
+# concurrency: 10
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb300"
+
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 2
+
+  decode_workers: 5
+  decode_nodes: 5
+  gpus_per_decode: 4
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 2
+      moe_expert_parallel_size: 2
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 16
+      max_num_tokens: 16384
+      max_seq_len: 1064
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+    decode:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      pipeline_parallel_size: 1
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 1
+      max_num_tokens: 4
+      max_seq_len: 2088
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+      moe_config:
+        backend: TRTLLM
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.85
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+benchmark:
+  type: "sa-bench"
+  isl: 1024
+  osl: 1024
+  concurrencies: "10"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx2dep2_gen1dep16_batch64_eplb0_mtp2.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx2dep2_gen1dep16_batch64_eplb0_mtp2.yaml
new file mode 100644
index 00000000..66d211aa
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx2dep2_gen1dep16_batch64_eplb0_mtp2.yaml
@@ -0,0 +1,139 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx2dep2_gen1dep16_batch64_eplb0_mtp2"
+
+# ctx: 2 prefill workers, TP2/EP2
+# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=64
+# concurrency: 1229
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb300"
+
+  prefill_nodes: 1
+  prefill_workers: 2
+  gpus_per_prefill: 2
+
+  decode_workers: 1
+  decode_nodes: 4
+  gpus_per_decode: 16
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 2
+      moe_expert_parallel_size: 2
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 16
+      max_num_tokens: 16384
+      max_seq_len: 1064
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 2
+
+    decode:
+      tensor_parallel_size: 16
+      moe_expert_parallel_size: 16
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 64
+      max_num_tokens: 192
+      max_seq_len: 2088
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+          - 40
+          - 48
+          - 56
+          - 64
+      moe_config:
+        backend: CUTEDSL
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.7
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 2
+
+benchmark:
+  type: "sa-bench"
+  isl: 1024
+  osl: 1024
+  concurrencies: "1229"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
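+
+# Note (reviewer annotation, not part of the tuned recipe): prefill_workers
+# may exceed prefill_nodes when workers are smaller than a node: here 2 TP2
+# prefill workers (2 GPUs each) pack onto a single 4-GPU GB300 node, i.e.
+#   prefill_nodes = ceil(prefill_workers * gpus_per_prefill / gpus_per_node)
+# Inferred from the resource blocks, not a documented formula.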
b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx2dep2_gen1dep32_batch16_eplb0_mtp3.yaml new file mode 100644 index 00000000..fe754372 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx2dep2_gen1dep32_batch16_eplb0_mtp3.yaml @@ -0,0 +1,133 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx2dep2_gen1dep32_batch16_eplb0_mtp3" + +# ctx: 2 prefill workers, TP2/EP2 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=16 +# concurrency: 666 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 1 + prefill_workers: 2 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 64 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "666" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx3dep2_gen1dep32_batch32_eplb0_mtp2.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx3dep2_gen1dep32_batch32_eplb0_mtp2.yaml new file mode 100644 index 00000000..70821f3e --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx3dep2_gen1dep32_batch32_eplb0_mtp2.yaml @@ -0,0 +1,135 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx3dep2_gen1dep32_batch32_eplb0_mtp2" + +# 
ctx: 3 prefill workers, TP2/EP2 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=32 +# concurrency: 1229 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 2 + prefill_workers: 3 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 32 + max_num_tokens: 96 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "1229" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx4dep2_gen1dep16_batch256_eplb256_mtp1.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx4dep2_gen1dep16_batch256_eplb256_mtp1.yaml new file mode 100644 index 00000000..bf3183b7 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx4dep2_gen1dep16_batch256_eplb256_mtp1.yaml @@ -0,0 +1,166 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx4dep2_gen1dep16_batch256_eplb256_mtp1" + +# ctx: 4 prefill workers, TP2/EP2, EPLB: num_slots=256, max_batch=256 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=256 +# concurrency: 4301 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + 
prefill_nodes: 2 + prefill_workers: 4 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 256 + max_num_tokens: 512 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + layer_updates_per_iter: 1 + num_slots: 256 + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "4301" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx5dep2_gen2dep8_batch512_eplb0_mtp1.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx5dep2_gen2dep8_batch512_eplb0_mtp1.yaml new file mode 100644 index 00000000..1d9f4f10 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx5dep2_gen2dep8_batch512_eplb0_mtp1.yaml @@ -0,0 +1,195 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx5dep2_gen2dep8_batch512_eplb0_mtp1" + +# ctx: 5 prefill workers, TP2/EP2 +# gen: 2 decode workers, TP8/EP8, enable_attention_dp=true, max_batch=512 +# concurrency: 8602 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 3 + prefill_workers: 5 + gpus_per_prefill: 2 + + 
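The `resources` blocks above and below all follow the same packing arithmetic: a phase's node count is its worker count times GPUs per worker, rounded up to whole 4-GPU GB300 nodes (here, 5 prefill workers × 2 GPUs land on 3 nodes). A minimal sketch of that check, assuming only the field names visible in these recipes (the validator itself is illustrative, not something srt-slurm ships):

```python
import math

GPUS_PER_NODE = 4  # every gb300 recipe in this directory sets gpus_per_node: 4


def expected_nodes(workers: int, gpus_per_worker: int) -> int:
    """Pack worker GPUs onto whole nodes, rounding up for odd worker counts."""
    return math.ceil(workers * gpus_per_worker / GPUS_PER_NODE)


def check_resources(res: dict) -> None:
    # Prefill and decode phases are packed independently in these recipes.
    assert res["prefill_nodes"] == expected_nodes(res["prefill_workers"], res["gpus_per_prefill"])
    assert res["decode_nodes"] == expected_nodes(res["decode_workers"], res["gpus_per_decode"])


# This recipe: 5 x 2 prefill GPUs -> 3 nodes, 2 x 8 decode GPUs -> 4 nodes.
check_resources({
    "prefill_nodes": 3, "prefill_workers": 5, "gpus_per_prefill": 2,
    "decode_nodes": 4, "decode_workers": 2, "gpus_per_decode": 8,
})
```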
decode_workers: 2 + decode_nodes: 4 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 512 + max_num_tokens: 1024 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "8602" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx6dep2_gen1dep32_batch128_eplb288_mtp1.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx6dep2_gen1dep32_batch128_eplb288_mtp1.yaml new file mode 100644 index 00000000..44b81b3c --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx6dep2_gen1dep32_batch128_eplb288_mtp1.yaml @@ -0,0 +1,150 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx6dep2_gen1dep32_batch128_eplb288_mtp1" + +# ctx: 6 prefill workers, TP2/EP2, EPLB: num_slots=288, max_batch=128 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=128 +# concurrency: 4301 + +model: + path: "nvidia/GLM5-NVFP4" + container: 
"nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 3 + prefill_workers: 6 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 128 + max_num_tokens: 256 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + layer_updates_per_iter: 1 + num_slots: 288 + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "4301" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx1dep2_gen1dep32_batch16_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx1dep2_gen1dep32_batch16_eplb0_mtp0.yaml new file mode 100644 index 00000000..0410623b --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx1dep2_gen1dep32_batch16_eplb0_mtp0.yaml @@ -0,0 +1,127 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep2_gen1dep32_batch16_eplb0_mtp0" + +# ctx: 1 prefill worker, TP2/EP2 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=16 +# concurrency: 615 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 1 + 
decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "615" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx1dep2_gen4tep8_batch64_allconc_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx1dep2_gen4tep8_batch64_allconc_eplb0_mtp0.yaml new file mode 100644 index 00000000..d967e3b2 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx1dep2_gen4tep8_batch64_allconc_eplb0_mtp0.yaml @@ -0,0 +1,134 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep2_gen4tep8_batch64_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP2/EP2 +# gen: 4 decode workers, TP8/EP8, enable_attention_dp=false, max_batch=64 +# Merged concurrencies: batch16(84), batch32(180), batch64(336) + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: 
"INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + allreduce_strategy: MNNVL + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "84x180x336" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx1dep2_gen5tep4_batch4_allconc_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx1dep2_gen5tep4_batch4_allconc_eplb0_mtp0.yaml new file mode 100644 index 00000000..d9f9ea2f --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx1dep2_gen5tep4_batch4_allconc_eplb0_mtp0.yaml @@ -0,0 +1,125 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep2_gen5tep4_batch4_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP2/EP2 +# gen: 5 decode workers, TP4/EP4, enable_attention_dp=false, max_batch=4 +# Merged concurrencies: batch1(5), batch2(10), batch4(25) + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 5 + decode_nodes: 5 + gpus_per_decode: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + 
max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 4 + max_num_tokens: 4 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "5x10x25" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx2dep2_gen1dep32_batch32_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx2dep2_gen1dep32_batch32_eplb0_mtp0.yaml new file mode 100644 index 00000000..26ddd7b1 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx2dep2_gen1dep32_batch32_eplb0_mtp0.yaml @@ -0,0 +1,129 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx2dep2_gen1dep32_batch32_eplb0_mtp0" + +# ctx: 2 prefill workers, TP2/EP2, max_batch=32 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=32 +# concurrency: 1229 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 1 + prefill_workers: 2 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 32 + 
max_num_tokens: 32 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "1229" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx3dep2_gen1dep32_batch64_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx3dep2_gen1dep32_batch64_eplb0_mtp0.yaml new file mode 100644 index 00000000..081e96da --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx3dep2_gen1dep32_batch64_eplb0_mtp0.yaml @@ -0,0 +1,133 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx3dep2_gen1dep32_batch64_eplb0_mtp0" + +# ctx: 3 prefill workers, TP2/EP2, max_batch=64 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=64 +# concurrency: 2253 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 2 + prefill_workers: 3 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + 
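The benchmark concurrencies in these recipes sit a little above the deployment's aggregate decode batch capacity: max batch per rank, times attention-DP width, times decode workers, times a headroom factor that varies between roughly 1.05 and 1.3 from recipe to recipe (this recipe's 2253 is 64 × 32 × 1.1, rounded up). For the `tep` recipes, where `enable_attention_dp` is false, the per-worker batch is not multiplied by TP ranks. Recipes that sweep several points encode them as an `x`-separated string, one value per batch size. A sketch of both conventions; the headroom factors are read off the recipe values themselves, not documented:

```python
import math


def target_concurrency(max_batch: int, dp_ranks: int, workers: int = 1,
                       headroom: float = 1.2) -> int:
    """Aggregate decode capacity times a queueing headroom factor."""
    return math.ceil(max_batch * dp_ranks * workers * headroom)


def parse_concurrencies(spec: str) -> list[int]:
    """Merged sweeps are encoded as an 'x'-separated string, e.g. "84x180x336"."""
    return [int(v) for v in spec.split("x")]


assert target_concurrency(64, 32, headroom=1.1) == 2253              # this recipe
assert target_concurrency(16, 32, headroom=1.2) == 615               # ctx1dep2_gen1dep32_batch16 (STP)
assert target_concurrency(512, 8, workers=2, headroom=1.05) == 8602  # ctx5dep2_gen2dep8_batch512 (MTP)
assert parse_concurrencies("84x180x336") == [84, 180, 336]
```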
nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "2253" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx4dep2_gen1dep16_batch512_eplb256_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx4dep2_gen1dep16_batch512_eplb256_mtp0.yaml new file mode 100644 index 00000000..dbca4fd5 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx4dep2_gen1dep16_batch512_eplb256_mtp0.yaml @@ -0,0 +1,191 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx4dep2_gen1dep16_batch512_eplb256_mtp0" + +# ctx: 4 prefill workers, TP2/EP2 +# gen: 1 decode worker, TP16/EP16, EPLB: num_slots=256, max_batch=512, concurrency: 8192 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 2 + prefill_workers: 4 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 512 + max_num_tokens: 512 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + layer_updates_per_iter: 1 + num_slots: 256 + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + cache_transceiver_config: + 
backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "8192" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx4dep2_gen1dep32_batch128_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx4dep2_gen1dep32_batch128_eplb0_mtp0.yaml new file mode 100644 index 00000000..1c8d2d78 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx4dep2_gen1dep32_batch128_eplb0_mtp0.yaml @@ -0,0 +1,141 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx4dep2_gen1dep32_batch128_eplb0_mtp0" + +# ctx: 4 prefill workers, TP2/EP2, max_batch=128 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=128 +# concurrency: 4301 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 2 + prefill_workers: 4 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "4301" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + 
max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx6dep2_gen1dep32_batch256_eplb288_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx6dep2_gen1dep32_batch256_eplb288_mtp0.yaml new file mode 100644 index 00000000..0d6870ff --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx6dep2_gen1dep32_batch256_eplb288_mtp0.yaml @@ -0,0 +1,160 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx6dep2_gen1dep32_batch256_eplb288_mtp0" + +# ctx: 6 prefill workers, EPLB: num_slots=288, TP2/EP2, max_batch=256 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=256 +# concurrency: 8192 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 3 + prefill_workers: 6 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + layer_updates_per_iter: 1 + num_slots: 288 + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "8192" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git 
a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx12dep2_gen1dep16_batch32_eplb0_mtp2.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx12dep2_gen1dep16_batch32_eplb0_mtp2.yaml new file mode 100644 index 00000000..8940ea72 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx12dep2_gen1dep16_batch32_eplb0_mtp2.yaml @@ -0,0 +1,135 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx12dep2_gen1dep16_batch32_eplb0_mtp2" + +# ctx: 12 prefill workers, TP2/EP2 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=32 +# concurrency: 666 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 6 + prefill_workers: 12 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 32 + max_num_tokens: 96 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "666" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx13dep2_gen1dep8_batch128_eplb0_mtp1.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx13dep2_gen1dep8_batch128_eplb0_mtp1.yaml new file mode 100644 index 00000000..29eba0b3 --- /dev/null +++ 
b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx13dep2_gen1dep8_batch128_eplb0_mtp1.yaml @@ -0,0 +1,147 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx13dep2_gen1dep8_batch128_eplb0_mtp1" + +# ctx: 13 prefill workers, TP2/EP2 +# gen: 1 decode worker, TP8/EP8, enable_attention_dp=true, max_batch=128 +# concurrency: 1076 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 7 + prefill_workers: 13 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 128 + max_num_tokens: 256 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "1076" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx15dep2_gen1dep32_batch16_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx15dep2_gen1dep32_batch16_eplb0_mtp3.yaml new file mode 100644 index 00000000..f8fcdac9 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx15dep2_gen1dep32_batch16_eplb0_mtp3.yaml @@ -0,0 +1,133 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx15dep2_gen1dep32_batch16_eplb0_mtp3" + +# ctx: 15 prefill workers, TP2/EP2 +# gen: 1 decode worker, 
TP32/EP32, enable_attention_dp=true, max_batch=16 +# concurrency: 666 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 8 + prefill_workers: 15 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 64 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "666" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx18dep2_gen1dep16_batch64_eplb0_mtp1.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx18dep2_gen1dep16_batch64_eplb0_mtp1.yaml new file mode 100644 index 00000000..775fa68f --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx18dep2_gen1dep16_batch64_eplb0_mtp1.yaml @@ -0,0 +1,139 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx18dep2_gen1dep16_batch64_eplb0_mtp1" + +# ctx: 18 prefill workers, TP2/EP2 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=64 +# concurrency: 1229 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 9 + prefill_workers: 18 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + 
gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 64 + max_num_tokens: 128 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "1229" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen1tep8_batch16_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen1tep8_batch16_eplb0_mtp3.yaml new file mode 100644 index 00000000..c457cce0 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen1tep8_batch16_eplb0_mtp3.yaml @@ -0,0 +1,134 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep2_gen1tep8_batch16_eplb0_mtp3" + +# ctx: 1 prefill worker, TP2/EP2 +# gen: 1 decode worker, TP8/EP8 (MNNVL), max_batch=16 +# concurrency: 24 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + 
NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + allreduce_strategy: MNNVL + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 64 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "24" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen2tep8_batch8_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen2tep8_batch8_eplb0_mtp3.yaml new file mode 100644 index 00000000..517cf361 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen2tep8_batch8_eplb0_mtp3.yaml @@ -0,0 +1,133 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep2_gen2tep8_batch8_eplb0_mtp3" + +# ctx: 1 prefill worker, TP2/EP2 +# gen: 2 decode workers, TP8/EP8 (MNNVL), max_batch=8 +# concurrency: 22 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 2 + decode_nodes: 4 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: 
"glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + allreduce_strategy: MNNVL + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "22" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen4tep8_batch4_allconc_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen4tep8_batch4_allconc_eplb0_mtp3.yaml new file mode 100644 index 00000000..20599c3f --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen4tep8_batch4_allconc_eplb0_mtp3.yaml @@ -0,0 +1,132 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep2_gen4tep8_batch4_allconc_eplb0_mtp3" + +# ctx: 1 prefill worker, TP2/EP2 +# gen: 4 decode workers, TP8/EP8 (MNNVL), max_batch=4 +# concurrencies: 4 (batch1), 24 (batch4) + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + 
speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + allreduce_strategy: MNNVL + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 4 + max_num_tokens: 16 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "4x24" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen5tep4_batch1_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen5tep4_batch1_eplb0_mtp3.yaml new file mode 100644 index 00000000..0037f722 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen5tep4_batch1_eplb0_mtp3.yaml @@ -0,0 +1,131 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep2_gen5tep4_batch1_eplb0_mtp3" + +# ctx: 1 prefill worker, TP2/EP2 +# gen: 5 decode workers, TP4/EP4, max_batch=1 +# concurrency: 5 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 5 + decode_nodes: 5 + gpus_per_decode: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 1 + max_num_tokens: 4 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + 
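+ # stream_interval: 100 emits streamed responses only every 100 iterations;
+ # with the postprocess workers below, this trades per-token streaming
+ # latency for frontend throughput (an inference from the knob names).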
num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "5" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx5dep2_gen1dep32_batch4_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx5dep2_gen1dep32_batch4_eplb0_mtp3.yaml new file mode 100644 index 00000000..6e233408 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx5dep2_gen1dep32_batch4_eplb0_mtp3.yaml @@ -0,0 +1,131 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx5dep2_gen1dep32_batch4_eplb0_mtp3" + +# ctx: 5 prefill workers, TP2/EP2 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, enable_lm_head_tp_in_adp=true, max_batch=4 +# concurrency: 180 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 3 + prefill_workers: 5 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 4 + max_num_tokens: 16 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: 
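+ # allowed_backends lists the NVFP4 GEMM kernel providers the runtime may
+ # choose from; the actual per-shape selection is left to TRT-LLM's tuner
+ # (an assumption based on the backend names).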
+ allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "180" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx9dep2_gen1dep32_batch8_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx9dep2_gen1dep32_batch8_eplb0_mtp3.yaml new file mode 100644 index 00000000..bd1cb583 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx9dep2_gen1dep32_batch8_eplb0_mtp3.yaml @@ -0,0 +1,132 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx9dep2_gen1dep32_batch8_eplb0_mtp3" + +# ctx: 9 prefill workers, TP2/EP2 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=8 +# concurrency: 333 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 5 + prefill_workers: 9 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "333" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: 
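+ # Worst-case startup wait is max_attempts x interval_seconds = 360 x 10 s,
+ # i.e. up to one hour of endpoint polling before the health check gives up.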
+ max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx12dep2_gen1dep16_batch64_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx12dep2_gen1dep16_batch64_eplb0_mtp0.yaml new file mode 100644 index 00000000..611aebb6 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx12dep2_gen1dep16_batch64_eplb0_mtp0.yaml @@ -0,0 +1,133 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx12dep2_gen1dep16_batch64_eplb0_mtp0" + +# ctx: 12 prefill workers, TP2/EP2 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=64 +# concurrency: 1127 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 6 + prefill_workers: 12 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "1127" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx15dep2_gen1dep32_batch32_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx15dep2_gen1dep32_batch32_eplb0_mtp0.yaml new file mode 100644 index 00000000..831e703d --- /dev/null +++ 
b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx15dep2_gen1dep32_batch32_eplb0_mtp0.yaml @@ -0,0 +1,129 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx15dep2_gen1dep32_batch32_eplb0_mtp0" + +# ctx: 15 prefill workers, TP2/EP2 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=32 +# concurrency: 1229 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 8 + prefill_workers: 15 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "1229" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx1dep2_gen2tep8_batch16_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx1dep2_gen2tep8_batch16_eplb0_mtp0.yaml new file mode 100644 index 00000000..8ff2f420 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx1dep2_gen2tep8_batch16_eplb0_mtp0.yaml @@ -0,0 +1,127 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep2_gen2tep8_batch16_eplb0_mtp0" + +# ctx: 1 prefill worker, TP2/EP2 +# gen: 2 decode workers, TP8/EP8 (MNNVL), max_batch=16 +# concurrency: 42 +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 1 + 
prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 2 + decode_nodes: 4 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + allreduce_strategy: MNNVL + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "42" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx1dep2_gen4tep8_batch1_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx1dep2_gen4tep8_batch1_eplb0_mtp0.yaml new file mode 100644 index 00000000..cc8faa11 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx1dep2_gen4tep8_batch1_eplb0_mtp0.yaml @@ -0,0 +1,126 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep2_gen4tep8_batch1_eplb0_mtp0" + +# ctx: 1 prefill worker, TP2/EP2 +# gen: 4 decode workers, TP8/EP8 (MNNVL), max_batch=1 +# concurrency: 4 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + 
TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + allreduce_strategy: MNNVL + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "4" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx1dep2_gen5tep4_batch4_allconc_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx1dep2_gen5tep4_batch4_allconc_eplb0_mtp0.yaml new file mode 100644 index 00000000..06d02024 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx1dep2_gen5tep4_batch4_allconc_eplb0_mtp0.yaml @@ -0,0 +1,125 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep2_gen5tep4_batch4_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP2/EP2 +# gen: 5 decode workers, TP4/EP4, max_batch=4 +# concurrencies: 5 (batch1), 10 (batch2), 25 (batch4) — merged as 5x10x25 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 5 + decode_nodes: 5 + gpus_per_decode: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + 
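+ # CUTEDSL is read here as the CuTe-DSL-based MoE kernel backend (an
+ # assumption from the name; these recipes only pin the choice per stage).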
backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 4 + max_num_tokens: 4 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "5x10x25" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx20dep2_gen1dep16_batch128_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx20dep2_gen1dep16_batch128_eplb0_mtp0.yaml new file mode 100644 index 00000000..ead937c9 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx20dep2_gen1dep16_batch128_eplb0_mtp0.yaml @@ -0,0 +1,141 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx20dep2_gen1dep16_batch128_eplb0_mtp0" + +# ctx: 20 prefill workers, TP2/EP2 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=128 +# concurrency: 2151 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 10 + prefill_workers: 20 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + 
num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "2151" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx2dep2_gen3tep8_batch32_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx2dep2_gen3tep8_batch32_eplb0_mtp0.yaml new file mode 100644 index 00000000..e06ea268 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx2dep2_gen3tep8_batch32_eplb0_mtp0.yaml @@ -0,0 +1,130 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx2dep2_gen3tep8_batch32_eplb0_mtp0" + +# ctx: 2 prefill workers, TP2/EP2 +# gen: 3 decode workers, TP8/EP8 (MNNVL), max_batch=32 +# concurrency: 117 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 1 + prefill_workers: 2 + gpus_per_prefill: 2 + + decode_workers: 3 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + allreduce_strategy: MNNVL + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - 
cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "117" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx4dep2_gen3tep8_batch64_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx4dep2_gen3tep8_batch64_eplb0_mtp0.yaml new file mode 100644 index 00000000..f4b3cc09 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx4dep2_gen3tep8_batch64_eplb0_mtp0.yaml @@ -0,0 +1,134 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx4dep2_gen3tep8_batch64_eplb0_mtp0" + +# ctx: 4 prefill workers, TP2/EP2 +# gen: 3 decode workers, TP8/EP8 (MNNVL), max_batch=64 +# concurrency: 231 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 2 + prefill_workers: 4 + gpus_per_prefill: 2 + + decode_workers: 3 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + allreduce_strategy: MNNVL + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "231" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git 
a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx9dep2_gen1dep32_batch16_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx9dep2_gen1dep32_batch16_eplb0_mtp0.yaml new file mode 100644 index 00000000..75f56785 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx9dep2_gen1dep32_batch16_eplb0_mtp0.yaml @@ -0,0 +1,127 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx9dep2_gen1dep32_batch16_eplb0_mtp0" + +# ctx: 9 prefill workers, TP2/EP2 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=16 +# concurrency: 615 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 5 + prefill_workers: 9 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "615" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep16_batch32_eplb0_mtp3.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep16_batch32_eplb0_mtp3.yaml new file mode 100644 index 00000000..03462b07 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep16_batch32_eplb0_mtp3.yaml @@ -0,0 +1,136 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep16_batch32_eplb0_mtp3" + +# ctx: 1 prefill worker, 
TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=32 +# MTP (Eagle speculative decoding, max_draft_len=3) +# concurrency: 666 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + +extra_mount: + - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model" + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "666" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep32_batch16_eplb0_mtp3.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep32_batch16_eplb0_mtp3.yaml new file mode 100644 index 00000000..6a29059c --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep32_batch16_eplb0_mtp3.yaml @@ -0,0 +1,134 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep32_batch16_eplb0_mtp3" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=16 +# MTP (Eagle speculative decoding, max_draft_len=3) +# concurrency: 666 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + 
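+# Node accounting for this recipe: the TP4/EP4 prefill worker fits on one
+# gb200 node (gpus_per_node: 4), while the TP32/EP32 decode worker spans
+# 32 / 4 = 8 nodes, matching decode_nodes below.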
+resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 64 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + +extra_mount: + - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model" + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "666" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep8_batch512_eplb0_mtp1.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep8_batch512_eplb0_mtp1.yaml new file mode 100644 index 00000000..739bd487 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep8_batch512_eplb0_mtp1.yaml @@ -0,0 +1,196 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep8_batch512_eplb0_mtp1" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 1 decode worker, TP8/EP8, enable_attention_dp=true, max_batch=512 +# MTP (Eagle speculative decoding, max_draft_len=1) +# concurrency: 4301 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" 
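+ # TRTLLM_ENABLE_PDL appears to enable programmatic dependent launch (PDL),
+ # letting dependent kernels overlap launch latency on recent GPUs (the
+ # acronym expansion is an assumption; the flag is passed through verbatim).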
+ TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: Eagle + max_draft_len: 1 + speculative_model_dir: "/eagle-model" + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + max_batch_size: 512 + max_num_tokens: 1024 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: Eagle + max_draft_len: 1 + speculative_model_dir: "/eagle-model" + +extra_mount: + - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model" + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "4301" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen4tep8_batch64_allconc_eplb0_mtp3.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen4tep8_batch64_allconc_eplb0_mtp3.yaml new file mode 100644 index 00000000..a768bec4 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen4tep8_batch64_allconc_eplb0_mtp3.yaml @@ -0,0 +1,141 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen4tep8_batch64_allconc_eplb0_mtp3" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 4 decode workers, TP8/EP8, allreduce_strategy=MNNVL, max_batch=64 +# MTP (Eagle speculative decoding, max_draft_len=3) +# Covers all gen4tep8 concurrencies: 8, 48, 92, 192, 336 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + 
gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + allreduce_strategy: MNNVL + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 64 + max_num_tokens: 256 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + +extra_mount: + - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model" + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "8x48x92x192x336" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen5tep4_batch2_allconc_eplb0_mtp3.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen5tep4_batch2_allconc_eplb0_mtp3.yaml new file mode 100644 index 00000000..c2e24b41 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen5tep4_batch2_allconc_eplb0_mtp3.yaml @@ -0,0 +1,132 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen5tep4_batch2_allconc_eplb0_mtp3" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 5 decode workers, TP4/EP4, max_batch=2 +# MTP (Eagle speculative decoding, max_draft_len=3) +# Covers all gen5tep4 concurrencies: 10, 15 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + gpus_per_decode: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: 
"0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 8 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + +extra_mount: + - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model" + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "10x15" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen1dep16_batch128_eplb0_mtp1.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen1dep16_batch128_eplb0_mtp1.yaml new file mode 100644 index 00000000..68d7dd06 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen1dep16_batch128_eplb0_mtp1.yaml @@ -0,0 +1,148 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx2dep4_gen1dep16_batch128_eplb0_mtp1" + +# ctx: 2 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=128 +# MTP (Eagle speculative decoding, max_draft_len=1) +# concurrency: 2253 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + 
TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: Eagle + max_draft_len: 1 + speculative_model_dir: "/eagle-model" + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + max_batch_size: 128 + max_num_tokens: 256 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: Eagle + max_draft_len: 1 + speculative_model_dir: "/eagle-model" + +extra_mount: + - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model" + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "2253" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen1dep32_batch64_eplb0_mtp1.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen1dep32_batch64_eplb0_mtp1.yaml new file mode 100644 index 00000000..1cb17478 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen1dep32_batch64_eplb0_mtp1.yaml @@ -0,0 +1,140 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx2dep4_gen1dep32_batch64_eplb0_mtp1" + +# ctx: 2 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=64 +# MTP (Eagle speculative decoding, max_draft_len=1) +# concurrency: 2253 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + 
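+ # Prefill workers disable the overlap scheduler; overlapping scheduling
+ # with execution mainly pays off on decode, and context-only servers in
+ # these recipes consistently run without it (a pattern observation).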
disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: Eagle + max_draft_len: 1 + speculative_model_dir: "/eagle-model" + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + max_batch_size: 64 + max_num_tokens: 128 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: Eagle + max_draft_len: 1 + speculative_model_dir: "/eagle-model" + +extra_mount: + - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model" + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "2253" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen3dep8_batch256_eplb0_mtp1.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen3dep8_batch256_eplb0_mtp1.yaml new file mode 100644 index 00000000..eb43aab7 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen3dep8_batch256_eplb0_mtp1.yaml @@ -0,0 +1,164 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx2dep4_gen3dep8_batch256_eplb0_mtp1" + +# ctx: 2 prefill workers, TP4/EP4 +# gen: 3 decode workers, TP8/EP8, enable_attention_dp=true, max_batch=256 +# MTP (Eagle speculative decoding, max_draft_len=1) +# concurrency: 6759 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 3 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + 
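# block reuse is kept off, presumably so prefix-cache hits cannot skew benchmark numbers; prefill also reserves less KV memory than decode +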
free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: Eagle + max_draft_len: 1 + speculative_model_dir: "/eagle-model" + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + max_batch_size: 256 + max_num_tokens: 512 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: Eagle + max_draft_len: 1 + speculative_model_dir: "/eagle-model" + +extra_mount: + - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model" + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "6759" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep16_batch32_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep16_batch32_eplb0_mtp0.yaml new file mode 100644 index 00000000..ce3eff43 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep16_batch32_eplb0_mtp0.yaml @@ -0,0 +1,125 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep16_batch32_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=32 +# STP (no speculative decoding) +# concurrency: 666 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 
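+ # STP decode schedules one token per request per step, so max_num_tokens matches max_batch_size throughout the STP recipes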
+ + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "666" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep32_batch64_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep32_batch64_eplb0_mtp0.yaml new file mode 100644 index 00000000..105b84bf --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep32_batch64_eplb0_mtp0.yaml @@ -0,0 +1,129 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep32_batch64_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=64 +# STP (no speculative decoding) +# concurrency: 2253 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + 
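# decode holds the full ISL+OSL KV cache, hence the larger fraction (0.6-0.9 across these recipes vs 0.4 for prefill) +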
free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "2253" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: true + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep8_batch768_allconc_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep8_batch768_allconc_eplb0_mtp0.yaml new file mode 100644 index 00000000..9fb194dd --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep8_batch768_allconc_eplb0_mtp0.yaml @@ -0,0 +1,217 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep8_batch768_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 1 decode worker, TP8/EP8, enable_attention_dp=true, max_batch=768 +# STP (no speculative decoding) +# Covers all dep8 concurrencies: 4301, 6452 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 768 + max_num_tokens: 768 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + - 520 + - 528 + - 536 + - 544 + - 552 + - 560 + - 568 + - 576 + - 584 + - 592 + - 600 + - 608 + - 616 + - 624 + - 632 + - 640 + - 648 + - 656 + - 664 + - 672 + - 680 + - 688 + - 696 + - 704 + - 712 
+ - 720 + - 728 + - 736 + - 744 + - 752 + - 760 + - 768 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "4301x6452" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0.yaml new file mode 100644 index 00000000..5639da41 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0.yaml @@ -0,0 +1,138 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 4 decode workers, TP8/EP8, allreduce_strategy=MNNVL, max_batch=128 +# STP (no speculative decoding) +# Covers all gen4tep8 concurrencies: 4, 192, 360, 668 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + allreduce_strategy: MNNVL + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: 
"sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "4x192x360x668" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml new file mode 100644 index 00000000..f9496feb --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml @@ -0,0 +1,122 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 5 decode workers, TP4/EP4, max_batch=8 +# STP (no speculative decoding) +# Covers all gen5tep4 concurrencies: 5, 15, 30, 55 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + gpus_per_decode: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 8 + max_num_tokens: 8 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "5x15x30x55" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep16_batch256_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep16_batch256_eplb0_mtp0.yaml new file mode 100644 index 00000000..71b016c4 --- /dev/null +++ 
b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep16_batch256_eplb0_mtp0.yaml @@ -0,0 +1,153 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx2dep4_gen1dep16_batch256_eplb0_mtp0" + +# ctx: 2 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=256 +# STP (no speculative decoding) +# concurrency: 4301 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "4301" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep32_batch128_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep32_batch128_eplb0_mtp0.yaml new file mode 100644 index 00000000..52b75bb4 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep32_batch128_eplb0_mtp0.yaml @@ -0,0 +1,137 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx2dep4_gen1dep32_batch128_eplb0_mtp0" + +# ctx: 2 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=128 +# STP (no speculative decoding) +# concurrency: 
4301 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "4301" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen2tep8_batch32_eplb0_mtp3.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen2tep8_batch32_eplb0_mtp3.yaml new file mode 100644 index 00000000..bb3f8d1e --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen2tep8_batch32_eplb0_mtp3.yaml @@ -0,0 +1,137 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx1dep4_gen2tep8_batch32_eplb0_mtp3" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 2 decode workers, TP8/EP8, allreduce_strategy=MNNVL, max_batch=32 +# MTP Eagle speculative decoding, max_draft_len=3 +# concurrency: 90 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 2 + decode_nodes: 4 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + 
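# the two DISABLE_GC flags below presumably switch off Python garbage collection in the server and worker processes to avoid pause spikes (name-based assumption) +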
TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + allreduce_strategy: MNNVL + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + +extra_mount: + - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model" + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "90" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen4tep8_batch1_eplb0_mtp3.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen4tep8_batch1_eplb0_mtp3.yaml new file mode 100644 index 00000000..8b7f02d6 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen4tep8_batch1_eplb0_mtp3.yaml @@ -0,0 +1,133 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx1dep4_gen4tep8_batch1_eplb0_mtp3" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 4 decode workers, TP8/EP8, allreduce_strategy=MNNVL, max_batch=1 +# MTP Eagle speculative decoding, max_draft_len=3 +# concurrency: 8 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + 
TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + allreduce_strategy: MNNVL + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 1 + max_num_tokens: 4 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + +extra_mount: + - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model" + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "8" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp3.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp3.yaml new file mode 100644 index 00000000..1883e739 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp3.yaml @@ -0,0 +1,133 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp3" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 5 decode workers, TP4/EP4, max_batch=8 +# MTP Eagle speculative decoding, max_draft_len=3 +# Covers all gen5tep4 concurrencies: 10, 15, 60 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + gpus_per_decode: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + 
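# 8232 = ISL 8192 plus a 40-token margin; the decode side adds the 1024-token OSL on top (max_seq_len 9256) +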
max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + +extra_mount: + - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model" + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "10x15x60" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx2dep4_gen1dep16_batch8_eplb0_mtp3.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx2dep4_gen1dep16_batch8_eplb0_mtp3.yaml new file mode 100644 index 00000000..5aced422 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx2dep4_gen1dep16_batch8_eplb0_mtp3.yaml @@ -0,0 +1,133 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx2dep4_gen1dep16_batch8_eplb0_mtp3" + +# ctx: 2 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=8 +# MTP Eagle speculative decoding, max_draft_len=3 +# concurrency: 180 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: Eagle + 
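# MTP recipes pair Eagle drafting with a decode token budget of max_batch_size x (1 + max_draft_len), e.g. 8 x 4 = 32 here +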
max_draft_len: 3 + speculative_model_dir: "/eagle-model" + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + +extra_mount: + - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model" + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "180" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx5dep4_gen1dep32_batch16_eplb0_mtp3.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx5dep4_gen1dep32_batch16_eplb0_mtp3.yaml new file mode 100644 index 00000000..764f2d46 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx5dep4_gen1dep32_batch16_eplb0_mtp3.yaml @@ -0,0 +1,134 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx5dep4_gen1dep32_batch16_eplb0_mtp3" + +# ctx: 5 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=16 +# MTP Eagle speculative decoding, max_draft_len=3 +# concurrency: 666 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 5 + prefill_workers: 5 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 64 + max_seq_len: 9256 + print_iter_log: 
true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + +extra_mount: + - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model" + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "666" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp1.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp1.yaml new file mode 100644 index 00000000..31308fe6 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp1.yaml @@ -0,0 +1,164 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp1" + +# ctx: 5 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP8/EP8, enable_attention_dp=true, max_batch=256 +# MTP Eagle speculative decoding, max_draft_len=1 +# Covers all dep8 mtp1 concurrencies: 1229, 2253 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 5 + prefill_workers: 5 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: Eagle + max_draft_len: 1 + speculative_model_dir: "/eagle-model" + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + max_batch_size: 256 + max_num_tokens: 512 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 
136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: Eagle + max_draft_len: 1 + speculative_model_dir: "/eagle-model" + +extra_mount: + - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model" + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "1229x2253" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx8dep4_gen1dep32_batch32_eplb0_mtp3.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx8dep4_gen1dep32_batch32_eplb0_mtp3.yaml new file mode 100644 index 00000000..9bd03c05 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx8dep4_gen1dep32_batch32_eplb0_mtp3.yaml @@ -0,0 +1,136 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx8dep4_gen1dep32_batch32_eplb0_mtp3" + +# ctx: 8 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=32 +# MTP Eagle speculative decoding, max_draft_len=3 +# concurrency: 1229 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 8 + prefill_workers: 8 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + 
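# the UCX cache transceiver below streams KV blocks from prefill to decode workers; its 16384-token buffer matches the prefill max_num_tokens +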
cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + +extra_mount: + - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model" + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "1229" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen4tep4_batch32_allconc_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen4tep4_batch32_allconc_eplb0_mtp0.yaml new file mode 100644 index 00000000..8c1f0aa8 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen4tep4_batch32_allconc_eplb0_mtp0.yaml @@ -0,0 +1,126 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx1dep4_gen4tep4_batch32_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 4 decode workers, TP4/EP4, max_batch=32 +# Single concurrency point: 156 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + # Prefill: 1 worker x TP4 = 4 GPUs = 1 node + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + # Decode: 4 workers x TP4 = 16 GPUs = 4 nodes + decode_workers: 4 + decode_nodes: 4 + gpus_per_decode: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "156" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + 
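# 360 attempts x 10 s interval allows up to one hour for all workers to report healthy +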
max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen4tep8_batch1_allconc_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen4tep8_batch1_allconc_eplb0_mtp0.yaml new file mode 100644 index 00000000..d4c5086b --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen4tep8_batch1_allconc_eplb0_mtp0.yaml @@ -0,0 +1,123 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx1dep4_gen4tep8_batch1_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 4 decode workers, TP8/EP8, allreduce_strategy=MNNVL, max_batch=1 +# Single concurrency point: 4 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + # Prefill: 1 worker x TP4 = 4 GPUs = 1 node + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + # Decode: 4 workers x TP8 = 32 GPUs = 8 nodes + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + allreduce_strategy: MNNVL + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "4" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen5tep4_batch16_allconc_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen5tep4_batch16_allconc_eplb0_mtp0.yaml new file mode 100644 index 00000000..8f6ea063 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen5tep4_batch16_allconc_eplb0_mtp0.yaml @@ -0,0 +1,126 
@@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx1dep4_gen5tep4_batch16_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 5 decode workers, TP4/EP4, max_batch=16 +# Covers all concurrencies: 5, 15, 30, 60, 105 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + # Prefill: 1 worker x TP4 = 4 GPUs = 1 node + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + # Decode: 5 workers x TP4 = 20 GPUs = 5 nodes + decode_workers: 5 + decode_nodes: 5 + gpus_per_decode: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + # max_batch_size=16 covers all concs: 5, 15, 30, 60, 105 + # cuda_graph pre-compiles graphs for each batch size up to the max + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "5x15x30x60x105" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx2dep4_gen1dep16_batch16_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx2dep4_gen1dep16_batch16_eplb0_mtp0.yaml new file mode 100644 index 00000000..4bfaa0e2 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx2dep4_gen1dep16_batch16_eplb0_mtp0.yaml @@ -0,0 +1,124 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx2dep4_gen1dep16_batch16_eplb0_mtp0" + +# ctx: 2 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=16 +# concurrency: 333 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + # Prefill: 2 workers x 
TP4 = 8 GPUs = 2 nodes + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + # Decode: 1 worker x TP16 = 16 GPUs = 4 nodes + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "333" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx3dep4_gen1dep16_batch32_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx3dep4_gen1dep16_batch32_eplb0_mtp0.yaml new file mode 100644 index 00000000..d7d51627 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx3dep4_gen1dep16_batch32_eplb0_mtp0.yaml @@ -0,0 +1,126 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx3dep4_gen1dep16_batch32_eplb0_mtp0" + +# ctx: 3 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=32 +# concurrency: 615 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + # Prefill: 3 workers x TP4 = 12 GPUs = 3 nodes + prefill_nodes: 3 + prefill_workers: 3 + gpus_per_prefill: 4 + + # Decode: 1 worker x TP16 = 16 GPUs = 4 nodes + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + 
TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "615" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp0.yaml new file mode 100644 index 00000000..e8df1179 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp0.yaml @@ -0,0 +1,155 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp0" + +# ctx: 5 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP8/EP8, enable_attention_dp=true, max_batch=256 +# Single concurrency point: 2151 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + # Prefill: 5 workers x TP4 = 20 GPUs = 5 nodes + prefill_nodes: 5 + prefill_workers: 5 + gpus_per_prefill: 4 + + # Decode: 1 worker x TP8 = 8 GPUs = 2 nodes + decode_workers: 1 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + 
backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + # max_batch_size=256, cuda_graph pre-compiles graphs for all batch sizes up to 256 + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "2151" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx7dep4_gen1dep16_batch128_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx7dep4_gen1dep16_batch128_eplb0_mtp0.yaml new file mode 100644 index 00000000..db177892 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx7dep4_gen1dep16_batch128_eplb0_mtp0.yaml @@ -0,0 +1,138 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx7dep4_gen1dep16_batch128_eplb0_mtp0" + +# ctx: 7 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=128 +# concurrency: 2253 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + # Prefill: 7 workers x TP4 = 28 GPUs = 7 nodes + prefill_nodes: 7 + prefill_workers: 7 + gpus_per_prefill: 4 + + # Decode: 1 worker x TP16 = 16 GPUs = 4 nodes + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 16 + 
moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "2253" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-1p1d-dep8-dep8.yaml b/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-1p1d-dep8-dep8.yaml new file mode 100644 index 00000000..10d038a5 --- /dev/null +++ b/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-1p1d-dep8-dep8.yaml @@ -0,0 +1,88 @@ +name: "svf-vllm-disagg-gb200-1p1d-dep8-dep8" +model: + path: "deepseekv4-fp4" + container: "vllm/vllm-openai:deepseekv4-cu130" + precision: "fp4" +dynamo: + hash: 6a159fedd8e4a1563aa647c31f622aedbf254b5b +setup_script: vllm-container-deps.sh +resources: + gpu_type: "gb200" + gpus_per_node: 4 + prefill_nodes: 2 + decode_nodes: 2 + prefill_workers: 1 + decode_workers: 1 + gpus_per_prefill: 8 + gpus_per_decode: 8 +frontend: + type: dynamo + enable_multiple_frontends: false +backend: + type: vllm + connector: null + prefill_environment: + TILELANG_CLEANUP_TEMP_FILES: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + VLLM_SERVER_DEV_MODE: "1" + decode_environment: + TILELANG_CLEANUP_TEMP_FILES: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + VLLM_SERVER_DEV_MODE: "1" + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "deepseek-ai/DeepSeek-V4-Pro" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 8 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + enforce-eager: true + max-model-len: auto + max-num-seqs: 4 + max-num-batched-tokens: 16384 + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-flashinfer-autotune: true + no-async-scheduling: true + block-size: 256 + gpu-memory-utilization: 0.9 + no-disable-hybrid-kv-cache-manager: true + enable-sleep-mode: true + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "deepseek-ai/DeepSeek-V4-Pro" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 8 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: auto + max-num-seqs: 64 + max-cudagraph-capture-size: 64 + max-num-batched-tokens: 64 + trust-remote-code: true + no-enable-prefix-caching: true + block-size: 256 + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","mode":0}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + no-disable-hybrid-kv-cache-manager: 
true
+      enable-sleep-mode: true
+benchmark:
+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  concurrencies: "4x8x16x32x64x256"
+  req_rate: "inf"
+  custom_tokenizer: "deepseek_v4"
+  use_chat_template: false
diff --git a/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-2p1d-dep8-dep16.yaml b/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-2p1d-dep8-dep16.yaml
new file mode 100644
index 00000000..a46d9bf7
--- /dev/null
+++ b/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-2p1d-dep8-dep16.yaml
@@ -0,0 +1,88 @@
+name: "svf-vllm-disagg-gb200-2p1d-dep8-dep16"
+model:
+  path: "deepseekv4-fp4"
+  container: "vllm/vllm-openai:deepseekv4-cu130"
+  precision: "fp4"
+dynamo:
+  hash: 6a159fedd8e4a1563aa647c31f622aedbf254b5b
+setup_script: vllm-container-deps.sh
+resources:
+  gpu_type: "gb200"
+  gpus_per_node: 4
+  prefill_nodes: 4
+  decode_nodes: 4
+  prefill_workers: 2
+  decode_workers: 1
+  gpus_per_prefill: 8
+  gpus_per_decode: 16
+frontend:
+  type: dynamo
+  enable_multiple_frontends: false
+backend:
+  type: vllm
+  connector: null
+  prefill_environment:
+    TILELANG_CLEANUP_TEMP_FILES: "1"
+    VLLM_USE_NCCL_SYMM_MEM: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_NVLS_ENABLE: "1"
+    VLLM_SERVER_DEV_MODE: "1"
+  decode_environment:
+    TILELANG_CLEANUP_TEMP_FILES: "1"
+    VLLM_USE_NCCL_SYMM_MEM: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_NVLS_ENABLE: "1"
+    VLLM_SERVER_DEV_MODE: "1"
+  vllm_config:
+    prefill:
+      kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
+      served-model-name: "deepseek-ai/DeepSeek-V4-Pro"
+      kv-cache-dtype: "fp8"
+      tensor-parallel-size: 1
+      pipeline-parallel-size: 1
+      data-parallel-size: 8
+      data-parallel-rpc-port: 13345
+      enable-expert-parallel: true
+      enforce-eager: true
+      max-model-len: auto
+      max-num-seqs: 4
+      max-num-batched-tokens: 16384
+      trust-remote-code: true
+      no-enable-prefix-caching: true
+      no-enable-flashinfer-autotune: true
+      no-async-scheduling: true
+      block-size: 256
+      gpu-memory-utilization: 0.9
+      no-disable-hybrid-kv-cache-manager: true
+      enable-sleep-mode: true
+    decode:
+      kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
+      served-model-name: "deepseek-ai/DeepSeek-V4-Pro"
+      kv-cache-dtype: "fp8"
+      tensor-parallel-size: 1
+      pipeline-parallel-size: 1
+      data-parallel-size: 16
+      data-parallel-rpc-port: 13345
+      enable-expert-parallel: true
+      max-model-len: auto
+      max-num-seqs: 64
+      max-cudagraph-capture-size: 64
+      max-num-batched-tokens: 64
+      trust-remote-code: true
+      no-enable-prefix-caching: true
+      block-size: 256
+      compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","mode":0}'
+      gpu-memory-utilization: 0.9
+      stream-interval: 50
+      no-disable-hybrid-kv-cache-manager: true
+      enable-sleep-mode: true
+benchmark:
+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  concurrencies: "1024"
+  req_rate: "inf"
+  custom_tokenizer: "deepseek_v4"
+  use_chat_template: false
diff --git a/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-4p1d-dep8-dep16.yaml b/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-4p1d-dep8-dep16.yaml
new file mode 100644
index 00000000..32089c84
--- /dev/null
+++ b/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-4p1d-dep8-dep16.yaml
@@ -0,0 +1,88 @@
+name: "svf-vllm-disagg-gb200-4p1d-dep8-dep16"
+model:
+  path: "deepseekv4-fp4"
+  container: "vllm/vllm-openai:deepseekv4-cu130"
+  precision: "fp4"
+dynamo:
+  hash: 6a159fedd8e4a1563aa647c31f622aedbf254b5b
+setup_script: vllm-container-deps.sh
+resources:
+  gpu_type: "gb200"
+  gpus_per_node: 4
+  prefill_nodes: 8
+  decode_nodes: 4
+  prefill_workers: 4
+  decode_workers: 1
+  gpus_per_prefill: 8
+  gpus_per_decode: 16
+frontend:
+  type: dynamo
+  enable_multiple_frontends: false
+backend:
+  type: vllm
+  connector: null
+  prefill_environment:
+    TILELANG_CLEANUP_TEMP_FILES: "1"
+    VLLM_USE_NCCL_SYMM_MEM: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_NVLS_ENABLE: "1"
+    VLLM_SERVER_DEV_MODE: "1"
+  decode_environment:
+    TILELANG_CLEANUP_TEMP_FILES: "1"
+    VLLM_USE_NCCL_SYMM_MEM: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_NVLS_ENABLE: "1"
+    VLLM_SERVER_DEV_MODE: "1"
+  vllm_config:
+    prefill:
+      kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
+      served-model-name: "deepseek-ai/DeepSeek-V4-Pro"
+      kv-cache-dtype: "fp8"
+      tensor-parallel-size: 1
+      pipeline-parallel-size: 1
+      data-parallel-size: 8
+      data-parallel-rpc-port: 13345
+      enable-expert-parallel: true
+      enforce-eager: true
+      max-model-len: auto
+      max-num-seqs: 4
+      max-num-batched-tokens: 16384
+      trust-remote-code: true
+      no-enable-prefix-caching: true
+      no-enable-flashinfer-autotune: true
+      no-async-scheduling: true
+      block-size: 256
+      gpu-memory-utilization: 0.9
+      no-disable-hybrid-kv-cache-manager: true
+      enable-sleep-mode: true
+    decode:
+      kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
+      served-model-name: "deepseek-ai/DeepSeek-V4-Pro"
+      kv-cache-dtype: "fp8"
+      tensor-parallel-size: 1
+      pipeline-parallel-size: 1
+      data-parallel-size: 16
+      data-parallel-rpc-port: 13345
+      enable-expert-parallel: true
+      max-model-len: auto
+      max-num-seqs: 256
+      max-cudagraph-capture-size: 256
+      max-num-batched-tokens: 256
+      trust-remote-code: true
+      no-enable-prefix-caching: true
+      block-size: 256
+      compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","mode":0}'
+      gpu-memory-utilization: 0.9
+      stream-interval: 50
+      no-disable-hybrid-kv-cache-manager: true
+      enable-sleep-mode: true
+benchmark:
+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  concurrencies: "2048"
+  req_rate: "inf"
+  custom_tokenizer: "deepseek_v4"
+  use_chat_template: false
diff --git a/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-7p1d-dep8-dep16.yaml b/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-7p1d-dep8-dep16.yaml
new file mode 100644
index 00000000..1568e492
--- /dev/null
+++ b/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-7p1d-dep8-dep16.yaml
@@ -0,0 +1,87 @@
+name: "svf-vllm-disagg-gb200-7p1d-dep8-dep16"
+model:
+  path: "deepseekv4-fp4"
+  container: "vllm/vllm-openai:deepseekv4-cu130"
+  precision: "fp4"
+dynamo:
+  hash: 6a159fedd8e4a1563aa647c31f622aedbf254b5b
+setup_script: vllm-container-deps.sh
+resources:
+  gpu_type: "gb200"
+  gpus_per_node: 4
+  prefill_nodes: 14
+  decode_nodes: 4
+  prefill_workers: 7
+  decode_workers: 1
+  gpus_per_prefill: 8
+  gpus_per_decode: 16
+frontend:
+  type: dynamo
+  enable_multiple_frontends: false
+backend:
+  type: vllm
+  connector: null
+  prefill_environment:
+    TILELANG_CLEANUP_TEMP_FILES: "1"
+    VLLM_USE_NCCL_SYMM_MEM: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_NVLS_ENABLE: "1"
+    VLLM_SERVER_DEV_MODE: "1"
+  decode_environment:
+    TILELANG_CLEANUP_TEMP_FILES: "1"
+    VLLM_USE_NCCL_SYMM_MEM: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_NVLS_ENABLE: "1"
+    VLLM_SERVER_DEV_MODE: "1"
+  vllm_config:
+    prefill:
+      kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
+      served-model-name: "deepseek-ai/DeepSeek-V4-Pro"
+      kv-cache-dtype: "fp8"
+      tensor-parallel-size: 1
+      pipeline-parallel-size: 1
+      data-parallel-size: 8
+      data-parallel-rpc-port: 13345
+
enable-expert-parallel: true + enforce-eager: true + max-model-len: auto + max-num-seqs: 2 + max-num-batched-tokens: 16384 + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-flashinfer-autotune: true + block-size: 256 + gpu-memory-utilization: 0.88 + no-disable-hybrid-kv-cache-manager: true + enable-sleep-mode: true + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "deepseek-ai/DeepSeek-V4-Pro" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 16 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: auto + max-num-seqs: 256 + max-cudagraph-capture-size: 256 + max-num-batched-tokens: 256 + trust-remote-code: true + no-enable-prefix-caching: true + block-size: 256 + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","mode":0}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + no-disable-hybrid-kv-cache-manager: true + enable-sleep-mode: true +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "4096" + req_rate: "inf" + custom_tokenizer: "deepseek_v4" + use_chat_template: false diff --git a/recipes/vllm/kimi-k2.5/1k1k/disagg-gb200-1p1d-dep4-dep16.yaml b/recipes/vllm/kimi-k2.5/1k1k/disagg-gb200-1p1d-dep4-dep16.yaml new file mode 100644 index 00000000..ecdc9233 --- /dev/null +++ b/recipes/vllm/kimi-k2.5/1k1k/disagg-gb200-1p1d-dep4-dep16.yaml @@ -0,0 +1,101 @@ +name: "kimi-vllm-disagg-gb200-1p1d-dep4-dep16" + +model: + path: "kimi-k2.5-nvfp4" + container: "vllm/vllm-openai:v0.18.0-cu130" + precision: "fp4" + +dynamo: + version: 1.0.1 + install: true + +setup_script: vllm-container-deps.sh + +resources: + gpu_type: "gb200" + gpus_per_node: 4 + prefill_nodes: 1 + decode_nodes: 4 + prefill_workers: 1 + decode_workers: 1 + gpus_per_prefill: 4 + gpus_per_decode: 16 + +frontend: + type: dynamo + enable_multiple_frontends: false + +backend: + type: vllm + connector: null + + prefill_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + decode_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 4 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 3072 + max-num-seqs: 4096 + enforce-eager: true + compilation-config: '{"custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + max-num-batched-tokens: 16384 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + attention-config: '{"use_trtllm_ragged_deepseek_prefill": true}' + all2all-backend: "flashinfer_nvlink_one_sided" + gpu-memory-utilization: 0.9 + + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 16 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 3072 + max-num-seqs: 4096 + 
max-num-batched-tokens: 10240 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + async-scheduling: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + all2all-backend: "flashinfer_nvlink_one_sided" + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + max-cudagraph-capture-size: 512 + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "256x512x1024x2048x3072x4096" + req_rate: "inf" diff --git a/recipes/vllm/kimi-k2.5/1k1k/disagg-gb200-1p4d-dep4-tep4.yaml b/recipes/vllm/kimi-k2.5/1k1k/disagg-gb200-1p4d-dep4-tep4.yaml new file mode 100644 index 00000000..43167b5f --- /dev/null +++ b/recipes/vllm/kimi-k2.5/1k1k/disagg-gb200-1p4d-dep4-tep4.yaml @@ -0,0 +1,98 @@ +name: "kimi-vllm-disagg-gb200-1p4d-dep4-tep4" + +model: + path: "kimi-k2.5-nvfp4" + container: "vllm/vllm-openai:v0.18.0-cu130" + precision: "fp4" + +dynamo: + version: 1.0.1 + install: true + +setup_script: vllm-container-deps.sh + +resources: + gpu_type: "gb200" + gpus_per_node: 4 + prefill_nodes: 1 + decode_nodes: 4 + prefill_workers: 1 + decode_workers: 4 + gpus_per_prefill: 4 + gpus_per_decode: 4 + +frontend: + type: dynamo + enable_multiple_frontends: false + +backend: + type: vllm + connector: null + + prefill_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + decode_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 4 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 3072 + max-num-seqs: 1024 + enforce-eager: true + compilation-config: '{"custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + max-num-batched-tokens: 16384 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + attention-config: '{"use_trtllm_ragged_deepseek_prefill": true}' + all2all-backend: "flashinfer_nvlink_one_sided" + gpu-memory-utilization: 0.9 + + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 4 + pipeline-parallel-size: 1 + enable-expert-parallel: true + max-model-len: 3072 + max-num-seqs: 1024 + max-num-batched-tokens: 10240 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + async-scheduling: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + max-cudagraph-capture-size: 1024 + +benchmark: + type: "sa-bench" + 
isl: 1024 + osl: 1024 + concurrencies: "4x8x16x32x64x128" + req_rate: "inf" diff --git a/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-1p4d-dep4-tep4.yaml b/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-1p4d-dep4-tep4.yaml new file mode 100644 index 00000000..1ab6ca27 --- /dev/null +++ b/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-1p4d-dep4-tep4.yaml @@ -0,0 +1,98 @@ +name: "kimi-vllm-disagg-gb200-1p4d-dep4-tep4" + +model: + path: "kimi-k2.5-nvfp4" + container: "vllm/vllm-openai:v0.18.0-cu130" + precision: "fp4" + +dynamo: + version: 1.0.1 + install: true + +setup_script: vllm-container-deps.sh + +resources: + gpu_type: "gb200" + gpus_per_node: 4 + prefill_nodes: 1 + decode_nodes: 4 + prefill_workers: 1 + decode_workers: 4 + gpus_per_prefill: 4 + gpus_per_decode: 4 + +frontend: + type: dynamo + enable_multiple_frontends: false + +backend: + type: vllm + connector: null + + prefill_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + decode_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 4 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 64 + enforce-eager: true + compilation-config: '{"custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + max-num-batched-tokens: 16384 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + attention-config: '{"use_trtllm_ragged_deepseek_prefill": true}' + all2all-backend: "flashinfer_nvlink_one_sided" + gpu-memory-utilization: 0.9 + + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 4 + pipeline-parallel-size: 1 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 16 + max-num-batched-tokens: 10240 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + async-scheduling: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + max-cudagraph-capture-size: 16 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "4x8x16x32x128" + req_rate: "inf" diff --git a/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-3p1d-dep4-dep16.yaml b/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-3p1d-dep4-dep16.yaml new file mode 100644 index 00000000..ca4e9813 --- /dev/null +++ b/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-3p1d-dep4-dep16.yaml @@ -0,0 +1,101 @@ +name: "kimi-vllm-disagg-gb200-3p1d-dep4-dep16" + +model: + path: "kimi-k2.5-nvfp4" + container: "vllm/vllm-openai:v0.18.0-cu130" + precision: "fp4" + +dynamo: + version: 1.0.1 + install: true + +setup_script: vllm-container-deps.sh + +resources: + gpu_type: 
"gb200" + gpus_per_node: 4 + prefill_nodes: 3 + decode_nodes: 4 + prefill_workers: 3 + decode_workers: 1 + gpus_per_prefill: 4 + gpus_per_decode: 16 + +frontend: + type: dynamo + enable_multiple_frontends: false + +backend: + type: vllm + connector: null + + prefill_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + decode_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 4 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 64 + enforce-eager: true + compilation-config: '{"custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + max-num-batched-tokens: 16384 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + attention-config: '{"use_trtllm_ragged_deepseek_prefill": true}' + all2all-backend: "flashinfer_nvlink_one_sided" + gpu-memory-utilization: 0.9 + + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 16 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 256 + max-num-batched-tokens: 10240 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + async-scheduling: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + all2all-backend: "flashinfer_nvlink_one_sided" + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + max-cudagraph-capture-size: 256 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "512x1024" + req_rate: "inf" diff --git a/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-5p1d-dep4-dep8.yaml b/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-5p1d-dep4-dep8.yaml new file mode 100644 index 00000000..cd9f94a9 --- /dev/null +++ b/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-5p1d-dep4-dep8.yaml @@ -0,0 +1,101 @@ +name: "kimi-vllm-disagg-gb200-5p1d-dep4-dep8" + +model: + path: "kimi-k2.5-nvfp4" + container: "vllm/vllm-openai:v0.18.0-cu130" + precision: "fp4" + +dynamo: + version: 1.0.1 + install: true + +setup_script: vllm-container-deps.sh + +resources: + gpu_type: "gb200" + gpus_per_node: 4 + prefill_nodes: 5 + decode_nodes: 2 + prefill_workers: 5 + decode_workers: 1 + gpus_per_prefill: 4 + gpus_per_decode: 8 + +frontend: + type: dynamo + enable_multiple_frontends: false + +backend: + type: vllm + connector: null + + prefill_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + decode_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + 
NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 4 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 64 + enforce-eager: true + compilation-config: '{"custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + max-num-batched-tokens: 16384 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + attention-config: '{"use_trtllm_ragged_deepseek_prefill": true}' + all2all-backend: "flashinfer_nvlink_one_sided" + gpu-memory-utilization: 0.9 + + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 8 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 512 + max-num-batched-tokens: 10240 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + async-scheduling: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + all2all-backend: "flashinfer_nvlink_one_sided" + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + max-cudagraph-capture-size: 512 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "2048" + req_rate: "inf" diff --git a/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-6p1d-dep4-dep16.yaml b/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-6p1d-dep4-dep16.yaml new file mode 100644 index 00000000..47d3d7ee --- /dev/null +++ b/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-6p1d-dep4-dep16.yaml @@ -0,0 +1,101 @@ +name: "kimi-vllm-disagg-gb200-6p1d-dep4-dep16" + +model: + path: "kimi-k2.5-nvfp4" + container: "vllm/vllm-openai:v0.18.0-cu130" + precision: "fp4" + +dynamo: + version: 1.0.1 + install: true + +setup_script: vllm-container-deps.sh + +resources: + gpu_type: "gb200" + gpus_per_node: 4 + prefill_nodes: 6 + decode_nodes: 4 + prefill_workers: 6 + decode_workers: 1 + gpus_per_prefill: 4 + gpus_per_decode: 16 + +frontend: + type: dynamo + enable_multiple_frontends: false + +backend: + type: vllm + connector: null + + prefill_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + decode_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 4 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 64 + enforce-eager: true + compilation-config: 
'{"custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + max-num-batched-tokens: 16384 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + attention-config: '{"use_trtllm_ragged_deepseek_prefill": true}' + all2all-backend: "flashinfer_nvlink_one_sided" + gpu-memory-utilization: 0.9 + + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 16 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 512 + max-num-batched-tokens: 10240 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + async-scheduling: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + all2all-backend: "flashinfer_nvlink_one_sided" + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + max-cudagraph-capture-size: 512 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "3072x4096" + req_rate: "inf" diff --git a/recipes/vllm/minimax-m2.5/b200-fp4/1k1k.yaml b/recipes/vllm/minimax-m2.5/b200-fp4/1k1k.yaml new file mode 100644 index 00000000..daef7b0d --- /dev/null +++ b/recipes/vllm/minimax-m2.5/b200-fp4/1k1k.yaml @@ -0,0 +1,103 @@ +# MiniMax-M2.5 NVFP4 B200 — 1K/1K ISL/OSL +# Aggregated vLLM, single-node +# requires github.com/NVIDIA/srt-slurm, branch sa-submission-q2-2026 +# usage examples: +# srtctl apply -f 1k1k.yaml # run all variants +# srtctl apply -f 1k1k.yaml:zip_override_lowlat # full lowlat sweep +# srtctl apply -f 1k1k.yaml:zip_override_lowlat[2] # lowlat, tep2 variant only +# srtctl apply -f 1k1k.yaml:zip_override_hightput # full high tput sweep +# srtctl dry-run -f 1k1k.yaml # preview the variants + +base: + name: "minimax-m2.5-nvfp4-b200-1k1k" + + model: + path: "minimax_m2.5_fp4" + container: "vllm/vllm-openai:v0.19.0-cu130" + precision: "fp4" + + resources: + gpu_type: "b200" + gpus_per_node: 8 + agg_nodes: 1 + agg_workers: 1 + gpus_per_agg: 1 + + frontend: + type: dynamo + enable_multiple_frontends: false + + dynamo: + install: true + top_of_tree: true # currently need ToT for vllm 0.19.0 + + setup_script: vllm-container-deps.sh + + backend: + type: vllm + + aggregated_environment: + DYN_HEALTH_CHECK_ENABLED: "false" + PYTHONUNBUFFERED: "1" + + vllm_config: + aggregated: + tensor-parallel-size: 1 + gpu-memory-utilization: 0.90 + max-model-len: 2248 + max-num-batched-tokens: 2048 + kv-cache-dtype: fp8 + max-cudagraph-capture-size: 2048 + stream-interval: 20 + no-enable-prefix-caching: true + trust-remote-code: true + + benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + req_rate: "inf" + + +zip_override_lowlat: + name: + - "minimax-m2.5-nvfp4-b200-1k1k-lowlat-tp1" + - "minimax-m2.5-nvfp4-b200-1k1k-lowlat-tp2" + - "minimax-m2.5-nvfp4-b200-1k1k-lowlat-tep2" + resources: + gpus_per_agg: [1, 2, 2] + backend: + vllm_config: + aggregated: + tensor-parallel-size: [1, 2, 2] + enable-expert-parallel: [false, false, true] + benchmark: + concurrencies: ["4","4x8x16x32x64x128x256x512","128x256"] 
+ +override_maxtput: + name: "minimax-m2.5-nvfp4-b200-1k1k-maxtput-dep2" + resources: + gpus_per_agg: 2 + backend: + vllm_config: + aggregated: + tensor-parallel-size: 1 + enable-expert-parallel: true + data-parallel-size: 2 + benchmark: + concurrencies: "512" + +zip_override_hightput: + name: + - "minimax-m2.5-nvfp4-b200-1k1k-hightput-tp4" + - "minimax-m2.5-nvfp4-b200-1k1k-hightput-tep4" + - "minimax-m2.5-nvfp4-b200-1k1k-hightput-tp8" + resources: + gpus_per_agg: [4, 4, 8] + backend: + vllm_config: + aggregated: + tensor-parallel-size: [4, 4, 8] + enable-expert-parallel: [false, true, false] + benchmark: + concurrencies: ["4x8x16x32x64x128x256x512", "32x64x128", "4"] diff --git a/recipes/vllm/minimax-m2.5/b200-fp4/8k1k.yaml b/recipes/vllm/minimax-m2.5/b200-fp4/8k1k.yaml new file mode 100644 index 00000000..7d817e73 --- /dev/null +++ b/recipes/vllm/minimax-m2.5/b200-fp4/8k1k.yaml @@ -0,0 +1,88 @@ +# MiniMax-M2.5 NVFP4 B200 — 8K/1K ISL/OSL +# Aggregated vLLM, single-node +# requires github.com/NVIDIA/srt-slurm, branch sa-submission-q2-2026 +# usage examples: +# srtctl apply -f 8k1k.yaml # run all variants +# srtctl apply -f 8k1k.yaml:zip_override_lowlat # full lowlat sweep +# srtctl apply -f 8k1k.yaml:zip_override_lowlat[2] # lowlat, tep2 variant only +# srtctl apply -f 8k1k.yaml:zip_override_maxtput # full max tput sweep +# srtctl dry-run -f 8k1k.yaml # preview the variants + +base: + name: "minimax-m2.5-nvfp4-b200-8k1k" + + model: + path: "minimax_m2.5_fp4" + container: "vllm/vllm-openai:v0.19.0-cu130" + precision: "fp4" + + resources: + gpu_type: "b200" + gpus_per_node: 8 + agg_nodes: 1 + agg_workers: 1 + gpus_per_agg: 1 + + frontend: + type: dynamo + enable_multiple_frontends: false + + dynamo: + install: true + top_of_tree: true # currently need ToT for vllm 0.19.0 + + setup_script: vllm-container-deps.sh + + backend: + type: vllm + + aggregated_environment: + DYN_HEALTH_CHECK_ENABLED: "false" + PYTHONUNBUFFERED: "1" + + vllm_config: + aggregated: + tensor-parallel-size: 1 + gpu-memory-utilization: 0.90 + max-model-len: 9416 + max-num-batched-tokens: 16384 + kv-cache-dtype: fp8 + max-cudagraph-capture-size: 2048 + stream-interval: 20 + no-enable-prefix-caching: true + trust-remote-code: true + + benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + req_rate: "inf" + +zip_override_lowlat: + name: + - "minimax-m2.5-nvfp4-b200-8k1k-lowlat-tp1" + - "minimax-m2.5-nvfp4-b200-8k1k-lowlat-tp2" + - "minimax-m2.5-nvfp4-b200-8k1k-lowlat-tep2" + resources: + gpus_per_agg: [1, 2, 2] + backend: + vllm_config: + aggregated: + tensor-parallel-size: [1, 2, 2] + enable-expert-parallel: [false, false, true] + benchmark: + concurrencies: ["4x8x16x32x256x512", "4x8x16x32x64x128x256x512", "128x256x512"] + +zip_override_maxtput: + name: + - "minimax-m2.5-nvfp4-b200-8k1k-maxtput-tp4" + - "minimax-m2.5-nvfp4-b200-8k1k-maxtput-tp8" + resources: + gpus_per_agg: [4, 8] + backend: + vllm_config: + aggregated: + tensor-parallel-size: [4, 8] + enable-expert-parallel: false + benchmark: + concurrencies: ["4x8x16x32x64x128x256x512", "4"] diff --git a/src/srtctl/backends/vllm.py b/src/srtctl/backends/vllm.py index ff20cb40..1acbd50c 100644 --- a/src/srtctl/backends/vllm.py +++ b/src/srtctl/backends/vllm.py @@ -132,12 +132,16 @@ def get_process_environment(self, process: Process) -> dict[str, str]: vLLM with dynamo requires unique ports for each worker: - DYN_VLLM_KV_EVENT_PORT: ZMQ port for KV events publishing - VLLM_NIXL_SIDE_CHANNEL_PORT: Port for NIXL side channel transfers + - VLLM_NIXL_SIDE_CHANNEL_HOST: 
Routable IP for NIXL side channel (not 0.0.0.0/localhost) """ + from srtctl.core.slurm import get_hostname_ip + env: dict[str, str] = {} if process.kv_events_port is not None: env["DYN_VLLM_KV_EVENT_PORT"] = str(process.kv_events_port) if process.nixl_port is not None: env["VLLM_NIXL_SIDE_CHANNEL_PORT"] = str(process.nixl_port) + env["VLLM_NIXL_SIDE_CHANNEL_HOST"] = get_hostname_ip(process.node) return env def get_served_model_name(self, default: str) -> str: diff --git a/src/srtctl/benchmarks/__init__.py b/src/srtctl/benchmarks/__init__.py index 3a2d6449..088617a6 100644 --- a/src/srtctl/benchmarks/__init__.py +++ b/src/srtctl/benchmarks/__init__.py @@ -4,7 +4,7 @@ """Benchmark runners for srtctl.""" # Import runners to trigger registration -from srtctl.benchmarks import gpqa, gsm8k, longbenchv2, mmlu, mooncake_router, router, sa_bench, sglang_bench +from srtctl.benchmarks import gpqa, gsm8k, lm_eval, longbenchv2, mmlu, mooncake_router, router, sa_bench, sglang_bench from srtctl.benchmarks.base import ( BenchmarkRunner, get_runner, @@ -18,6 +18,7 @@ "list_benchmarks", "register_benchmark", # Runners + "lm_eval", "sa_bench", "sglang_bench", "mmlu", diff --git a/src/srtctl/benchmarks/lm_eval.py b/src/srtctl/benchmarks/lm_eval.py new file mode 100644 index 00000000..c63ec097 --- /dev/null +++ b/src/srtctl/benchmarks/lm_eval.py @@ -0,0 +1,58 @@ +# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-FileCopyrightText: Copyright (c) 2026 SemiAnalysis LLC. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +"""lm-eval benchmark runner for InferenceX evals.""" + +from __future__ import annotations + +from typing import TYPE_CHECKING + +from srtctl.benchmarks.base import SCRIPTS_DIR, BenchmarkRunner, register_benchmark + +if TYPE_CHECKING: + from srtctl.core.runtime import RuntimeContext + from srtctl.core.schema import SrtConfig + + +@register_benchmark("lm-eval") +class LMEvalRunner(BenchmarkRunner): + """lm-eval accuracy evaluation using InferenceX benchmark_lib. + + Runs lm-eval via the InferenceX benchmark_lib.sh harness, + which handles task selection, result collection, and summary generation. + """ + + @property + def name(self) -> str: + return "lm-eval" + + @property + def script_path(self) -> str: + return "/srtctl-benchmarks/lm-eval/bench.sh" + + @property + def local_script_dir(self) -> str: + return str(SCRIPTS_DIR / "lm-eval") + + def validate_config(self, config: SrtConfig) -> list[str]: + # lm-eval has sensible defaults + return [] + + def build_command( + self, + config: SrtConfig, + runtime: RuntimeContext, + ) -> list[str]: + endpoint = f"http://localhost:{runtime.frontend_port}" + # Always use the container mount path, not the host path. + # INFMAX_WORKSPACE env var contains the host path (used for mount setup + # in runtime.py), but inside the container it's at /infmax-workspace. 
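+        # Illustrative sketch (the port value is hypothetical): with
+        # runtime.frontend_port == 8000, build_command returns
+        #   ["bash", "/srtctl-benchmarks/lm-eval/bench.sh",
+        #    "http://localhost:8000", "/infmax-workspace"]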
+ infmax_workspace = "/infmax-workspace" + + return [ + "bash", + self.script_path, + endpoint, + infmax_workspace, + ] diff --git a/src/srtctl/benchmarks/sa_bench.py b/src/srtctl/benchmarks/sa_bench.py index 9adc6678..5f220393 100644 --- a/src/srtctl/benchmarks/sa_bench.py +++ b/src/srtctl/benchmarks/sa_bench.py @@ -97,5 +97,9 @@ def build_command( str(prefill_gpus), str(decode_gpus), str(b.random_range_ratio) if b.random_range_ratio is not None else "0.8", + str(b.num_prompts_mult) if b.num_prompts_mult is not None else "10", + str(b.num_warmup_mult) if b.num_warmup_mult is not None else "2", + b.custom_tokenizer or "", + str(b.use_chat_template).lower(), ] return cmd diff --git a/src/srtctl/benchmarks/scripts/lm-eval/bench.sh b/src/srtctl/benchmarks/scripts/lm-eval/bench.sh new file mode 100755 index 00000000..a10e4e7d --- /dev/null +++ b/src/srtctl/benchmarks/scripts/lm-eval/bench.sh @@ -0,0 +1,77 @@ +#!/bin/bash +# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-FileCopyrightText: Copyright (c) 2026 SemiAnalysis LLC. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +# lm-eval accuracy evaluation using InferenceX benchmark_lib +# Expects: endpoint [infmax_workspace] + +set -e + +ENDPOINT=$1 +INFMAX_WORKSPACE=${2:-/infmax-workspace} + +# Extract HOST and PORT from endpoint (e.g., http://localhost:8000) +HOST=$(echo "$ENDPOINT" | sed -E 's|https?://||; s|:.*||') +PORT=$(echo "$ENDPOINT" | sed -E 's|.*:([0-9]+).*|\1|') + +echo "lm-eval Config: endpoint=${ENDPOINT}; host=${HOST}; port=${PORT}; workspace=${INFMAX_WORKSPACE}" + +# Auto-discover the served model name from /v1/models if MODEL_NAME is not set. +# This ensures we use the exact name the server recognizes, regardless of what +# $MODEL (the HuggingFace ID from the workflow) is set to. +if [[ -z "${MODEL_NAME:-}" ]]; then + DISCOVERED_MODEL=$(curl -sf "${ENDPOINT}/v1/models" 2>/dev/null \ + | python3 -c "import sys,json; d=json.load(sys.stdin); print(d['data'][0]['id'])" 2>/dev/null || true) + if [[ -n "$DISCOVERED_MODEL" ]]; then + export MODEL_NAME="$DISCOVERED_MODEL" + echo "Auto-discovered MODEL_NAME from /v1/models: ${MODEL_NAME}" + else + echo "WARNING: Could not discover model name from /v1/models, using MODEL_NAME=${MODEL_NAME:-$MODEL}" + fi +else + echo "Using MODEL_NAME from environment: ${MODEL_NAME}" +fi + +# cd to workspace so that relative paths (e.g., utils/evals/*.yaml) resolve +cd "${INFMAX_WORKSPACE}" + +# Source the InferenceX benchmark library +source "${INFMAX_WORKSPACE}/benchmarks/benchmark_lib.sh" + +# Run lm-eval via benchmark_lib +# EVAL_CONC is set by the InferenceX workflow (median of conc list). +# benchmark_lib reads concurrency from EVAL_CONCURRENT_REQUESTS env var. +export EVAL_CONCURRENT_REQUESTS="${EVAL_CONC:-${EVAL_CONCURRENT_REQUESTS:-256}}" +echo "Running lm-eval with concurrent-requests=${EVAL_CONCURRENT_REQUESTS}..." +eval_rc=0 +run_eval --framework lm-eval --port "$PORT" || eval_rc=$? + +# Derive metadata env vars that append_lm_eval_summary needs but do_sweep.py +# does not pass directly (it passes PREFILL_TP/EP/etc, not TP/EP_SIZE/CONC). 
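+# Worked example (hypothetical values): if do_sweep.py exported
+# PREFILL_TP=4, PREFILL_EP=4 and EVAL_CONC=64, the fallback chains below
+# resolve to TP=4, EP_SIZE=4, CONC=64 and DP_ATTENTION=false, unless the
+# caller already set those variables explicitly.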
+export IS_MULTINODE="${IS_MULTINODE:-true}" +export TP="${TP:-${PREFILL_TP:-1}}" +export CONC="${CONC:-${EVAL_CONC:-${EVAL_CONCURRENT_REQUESTS:-1}}}" +export EP_SIZE="${EP_SIZE:-${PREFILL_EP:-1}}" +export DP_ATTENTION="${DP_ATTENTION:-${PREFILL_DP_ATTN:-false}}" +# Remap srt-slurm's DP_ATTN names to InferenceX's DP_ATTENTION names +export PREFILL_DP_ATTENTION="${PREFILL_DP_ATTENTION:-${PREFILL_DP_ATTN:-${DP_ATTENTION:-false}}}" +export DECODE_DP_ATTENTION="${DECODE_DP_ATTENTION:-${DECODE_DP_ATTN:-${DP_ATTENTION:-false}}}" + +# Generate the lm-eval summary +echo "Generating lm-eval summary..." +append_lm_eval_summary || true + +# Copy eval artifacts to /logs/eval_results/ +mkdir -p /logs/eval_results +echo "Copying eval artifacts to /logs/eval_results/..." +cp -v meta_env.json /logs/eval_results/ 2>/dev/null || true +cp -v results*.json /logs/eval_results/ 2>/dev/null || true +cp -v sample*.jsonl /logs/eval_results/ 2>/dev/null || true + +if [[ "$eval_rc" -ne 0 ]]; then + echo "lm-eval evaluation failed with exit code ${eval_rc}" + exit "$eval_rc" +fi + +echo "lm-eval evaluation complete" diff --git a/src/srtctl/benchmarks/scripts/sa-bench/backend_request_func.py b/src/srtctl/benchmarks/scripts/sa-bench/backend_request_func.py index dd2cac44..ded56a80 100644 --- a/src/srtctl/benchmarks/scripts/sa-bench/backend_request_func.py +++ b/src/srtctl/benchmarks/scripts/sa-bench/backend_request_func.py @@ -511,10 +511,107 @@ def get_model(pretrained_model_name_or_path: str) -> str: return pretrained_model_name_or_path +def _resolve_tokenizer_file(model_name_or_path): + """Resolve tokenizer.json from a local directory or HF hub cache.""" + from pathlib import Path + + local_path = Path(model_name_or_path) / "tokenizer.json" + if local_path.is_file(): + return str(local_path) + try: + from huggingface_hub import hf_hub_download + + return hf_hub_download(model_name_or_path, "tokenizer.json", local_files_only=True) + except Exception: + return None + + +def _fix_v5_tokenizer_components(tokenizer, model_name_or_path): + """Fix pre_tokenizer/decoder when transformers v5 LlamaTokenizerFast overwrites them. + + In transformers v5, LlamaTokenizerFast.__init__ rebuilds the pre_tokenizer + and decoder from scratch, discarding the originals from tokenizer.json. + This breaks models like DeepSeek-R1 that declare LlamaTokenizerFast but + actually use a ByteLevel pre_tokenizer. + + Ported from sglang/python/sglang/srt/utils/hf_transformers_utils.py. + """ + backend = getattr(tokenizer, "_tokenizer", None) + if backend is None: + return + + try: + from tokenizers import Tokenizer as RawTokenizer + + tok_file = _resolve_tokenizer_file(model_name_or_path) + if tok_file is None: + return + raw = RawTokenizer.from_file(tok_file) + except Exception: + return + + raw_pre = type(raw.pre_tokenizer).__name__ if raw.pre_tokenizer else None + loaded_pre = type(backend.pre_tokenizer).__name__ if backend.pre_tokenizer else None + + if raw_pre and loaded_pre and raw_pre != loaded_pre: + print( + f"[sa-bench] Fixing v5 tokenizer component mismatch for {model_name_or_path}: " + f"pre_tokenizer {loaded_pre} -> {raw_pre}, " + f"decoder {type(backend.decoder).__name__ if backend.decoder else None} " + f"-> {type(raw.decoder).__name__ if raw.decoder else None}", + flush=True, + ) + backend.pre_tokenizer = raw.pre_tokenizer + backend.decoder = raw.decoder + + +def _load_glm_moe_dsa_tokenizer(pretrained_model_name_or_path: str) -> "PreTrainedTokenizerFast": + """Load GLM-Moe-Dsa / GLM-5 tokenizer directly from tokenizer.json. 
+ + Works around incompatibilities when the checkpoint was saved with + transformers 5.x (TokenizersBackend / list-style extra_special_tokens). + """ + import json + from pathlib import Path + + from tokenizers import Tokenizer as RustTokenizer + from transformers import PreTrainedTokenizerFast + + _SAFE_CONFIG_KEYS = ( + "pad_token", "pad_token_id", "eos_token", "eos_token_id", + "bos_token", "bos_token_id", "unk_token", "unk_token_id", + "model_max_length", "padding_side", "truncation_side", + ) + + path = Path(pretrained_model_name_or_path) + tokenizer_json = path / "tokenizer.json" + if not tokenizer_json.exists(): + raise FileNotFoundError( + f"Expected tokenizer.json at {tokenizer_json}. " + "GlmMoeDsaTokenizer loads from tokenizer.json only." + ) + + rust_tok = RustTokenizer.from_file(str(tokenizer_json)) + init_kwargs = {} + config_path = path / "tokenizer_config.json" + if config_path.exists(): + with open(config_path, encoding="utf-8") as f: + config = json.load(f) + for key in _SAFE_CONFIG_KEYS: + if key in config: + init_kwargs[key] = config[key] + if "extra_special_tokens" in config: + init_kwargs["additional_special_tokens"] = config["extra_special_tokens"] + + return PreTrainedTokenizerFast(tokenizer_object=rust_tok, **init_kwargs) + + def get_tokenizer( pretrained_model_name_or_path: str, tokenizer_mode: str = "auto", trust_remote_code: bool = False, + custom_tokenizer: str | None = None, + backend: str | None = None, **kwargs, ) -> PreTrainedTokenizer | PreTrainedTokenizerFast: if pretrained_model_name_or_path is not None and not os.path.exists(pretrained_model_name_or_path): @@ -533,12 +630,60 @@ def get_tokenizer( "to use mistral tokenizer mode." ) from e return MistralTokenizer.from_pretrained(str(pretrained_model_name_or_path)) - else: - return AutoTokenizer.from_pretrained( - pretrained_model_name_or_path, - trust_remote_code=trust_remote_code, - **kwargs, - ) + + if custom_tokenizer: + if custom_tokenizer == "glm_moe_dsa": + return _load_glm_moe_dsa_tokenizer(pretrained_model_name_or_path) + if custom_tokenizer == "deepseek_v4": + if backend == "sglang": + # SGLang has no client-side DeepseekV4Tokenizer package; we + # vendor sglang's own server-side encoder (encoding_dsv4.py) + # under ./tokenizers/ so the sa-bench client renders the + # exact same DSML prompt the sglang server builds. + from tokenizers.sglang_deepseek_v4 import ( + SGLangDeepseekV4Tokenizer, + ) + return SGLangDeepseekV4Tokenizer.from_pretrained( + str(pretrained_model_name_or_path) + ) + if backend in (None, "vllm"): + try: + from vllm.tokenizers.deepseek_v4 import DeepseekV4Tokenizer + except ImportError as e: + raise ImportError( + "DeepseekV4Tokenizer requires vllm package.\n" + "Please install it with `pip install vllm` " + "to use deepseek_v4 tokenizer." + ) from e + return DeepseekV4Tokenizer.from_pretrained( + str(pretrained_model_name_or_path) + ) + raise ValueError( + f"custom_tokenizer='deepseek_v4' does not support backend={backend!r}; " + "expected 'vllm' or 'sglang'." + ) + from importlib import import_module + try: + module_path, class_name = custom_tokenizer.rsplit('.', 1) + module = import_module(module_path) + tokenizer_class = getattr(module, class_name) + return tokenizer_class.from_pretrained( + pretrained_model_name_or_path, + trust_remote_code=trust_remote_code, + **kwargs, + ) + except (ValueError, ImportError, AttributeError) as e: + raise ValueError( + f"Failed to load custom_tokenizer '{custom_tokenizer}'. 
" + "Expected 'glm_moe_dsa' or 'module.path.ClassName'.") from e + + tokenizer = AutoTokenizer.from_pretrained( + pretrained_model_name_or_path, + trust_remote_code=trust_remote_code, + **kwargs, + ) + _fix_v5_tokenizer_components(tokenizer, pretrained_model_name_or_path) + return tokenizer ASYNC_REQUEST_FUNCS = { diff --git a/src/srtctl/benchmarks/scripts/sa-bench/bench.sh b/src/srtctl/benchmarks/scripts/sa-bench/bench.sh index ed907308..acddf754 100644 --- a/src/srtctl/benchmarks/scripts/sa-bench/bench.sh +++ b/src/srtctl/benchmarks/scripts/sa-bench/bench.sh @@ -60,6 +60,22 @@ TOTAL_GPUS=${9:-0} PREFILL_GPUS=${10:-0} DECODE_GPUS=${11:-0} RANDOM_RANGE_RATIO=${12:-0.8} +NUM_PROMPTS_MULT=${13:-10} +NUM_WARMUP_MULT=${14:-2} +CUSTOM_TOKENIZER=${15:-} +USE_CHAT_TEMPLATE=${16:-true} + +# Build optional custom tokenizer args +CUSTOM_TOKENIZER_ARGS=() +if [ -n "$CUSTOM_TOKENIZER" ]; then + CUSTOM_TOKENIZER_ARGS=(--custom-tokenizer "$CUSTOM_TOKENIZER") +fi + +# Build optional chat template args +CHAT_TEMPLATE_ARGS=() +if [ "$USE_CHAT_TEMPLATE" = "true" ]; then + CHAT_TEMPLATE_ARGS=(--use-chat-template) +fi # Parse endpoint into host:port HOST=$(echo "$ENDPOINT" | sed 's|http://||' | cut -d: -f1) @@ -119,7 +135,8 @@ for concurrency in "${CONCURRENCY_LIST[@]}"; do --request-rate 250 \ --percentile-metrics ttft,tpot,itl,e2el \ --max-concurrency "$concurrency" \ - --trust-remote-code + --trust-remote-code \ + "${CUSTOM_TOKENIZER_ARGS[@]}" num_prompts=$((concurrency * 10)) @@ -149,7 +166,8 @@ for concurrency in "${CONCURRENCY_LIST[@]}"; do --percentile-metrics ttft,tpot,itl,e2el \ --max-concurrency "$concurrency" \ --trust-remote-code \ - --use-chat-template \ + "${CHAT_TEMPLATE_ARGS[@]}" \ + "${CUSTOM_TOKENIZER_ARGS[@]}" \ --save-result --result-dir "$result_dir" --result-filename "$result_filename" set +x diff --git a/src/srtctl/benchmarks/scripts/sa-bench/benchmark_serving.py b/src/srtctl/benchmarks/scripts/sa-bench/benchmark_serving.py index 4363ef6e..75b3a97f 100644 --- a/src/srtctl/benchmarks/scripts/sa-bench/benchmark_serving.py +++ b/src/srtctl/benchmarks/scripts/sa-bench/benchmark_serving.py @@ -837,6 +837,8 @@ def main(args: argparse.Namespace): tokenizer_id, tokenizer_mode=tokenizer_mode, trust_remote_code=args.trust_remote_code, + custom_tokenizer=args.custom_tokenizer, + backend=backend, ) if args.dataset is not None: @@ -1279,6 +1281,14 @@ def main(args: argparse.Namespace): '"custom" will use --tokenizer to select the preregistered tokenizer.', ) + parser.add_argument( + "--custom-tokenizer", + type=str, + default=None, + help="Custom tokenizer to use (e.g., 'glm_moe_dsa' or 'module.path.ClassName'). " + "When set, overrides the default tokenizer loading.", + ) + parser.add_argument( "--served-model-name", type=str, diff --git a/src/srtctl/benchmarks/scripts/sa-bench/tokenizers/__init__.py b/src/srtctl/benchmarks/scripts/sa-bench/tokenizers/__init__.py new file mode 100644 index 00000000..42d334ba --- /dev/null +++ b/src/srtctl/benchmarks/scripts/sa-bench/tokenizers/__init__.py @@ -0,0 +1 @@ +"""Custom tokenizers bundled with sa-bench.""" diff --git a/src/srtctl/benchmarks/scripts/sa-bench/tokenizers/_sglang_encoding_dsv4.py b/src/srtctl/benchmarks/scripts/sa-bench/tokenizers/_sglang_encoding_dsv4.py new file mode 100644 index 00000000..2212e090 --- /dev/null +++ b/src/srtctl/benchmarks/scripts/sa-bench/tokenizers/_sglang_encoding_dsv4.py @@ -0,0 +1,856 @@ +# SPDX-License-Identifier: Apache-2.0 +# +# Vendored from sgl-project/sglang PR #23600 (currently unmerged). 
+# Source: https://github.com/sgl-project/sglang/blob/f5d03db853862c8fb0e805df591bed883a71868b/python/sglang/srt/entrypoints/openai/encoding_dsv4.py
+# Upstream SHA-256: 106b471e559153d93c4af34a4865b2a68b205b72ddd688dbed93dfd86e4b92cb
+#
+# This file is vendored because sglang does not ship a client-side
+# tokenizer package equivalent to vllm.tokenizers.deepseek_v4. Keeping
+# a byte-identical copy here lets the sa-bench client render the exact
+# DeepSeek-V4 DSML prompt that sglang server builds internally, so
+# input_tokens reported by the client match the server's #new-token.
+#
+# When sglang upstream merges an official client-side tokenizer package,
+# this vendored copy can be removed in favor of that import.
+#
+# -------------------- Original sglang file begins below --------------------
+# Adapted from the DeepSeek-V4 release reference implementation.
+"""
+DeepSeek-V4 Encoding
+
+A self-contained implementation for encoding/decoding DeepSeek-V4 chat messages
+with tool calling, thinking mode, and quick instruction task support.
+"""
+
+import copy
+import json
+import re
+from typing import Any, Dict, List, Optional, Tuple, Union
+
+# ============================================================
+# Special Tokens
+# ============================================================
+
+bos_token: str = "<|begin▁of▁sentence|>"
+eos_token: str = "<|end▁of▁sentence|>"
+thinking_start_token: str = "<think>"
+thinking_end_token: str = "</think>"
+dsml_token: str = "|DSML|"
+
+USER_SP_TOKEN = "<|User|>"
+ASSISTANT_SP_TOKEN = "<|Assistant|>"
+LATEST_REMINDER_SP_TOKEN = "<|latest_reminder|>"
+
+# Task special tokens for internal classification tasks
+DS_TASK_SP_TOKENS = {
+    "action": "<|action|>",
+    "query": "<|query|>",
+    "authority": "<|authority|>",
+    "domain": "<|domain|>",
+    "title": "<|title|>",
+    "read_url": "<|read_url|>",
+}
+VALID_TASKS = set(DS_TASK_SP_TOKENS.keys())
+
+# ============================================================
+# Templates
+# ============================================================
+
+system_msg_template: str = "{content}"
+user_msg_template: str = "{content}"
+latest_reminder_msg_template: str = "{content}"
+assistant_msg_template: str = "{reasoning}{content}{tool_calls}" + eos_token
+assistant_msg_wo_eos_template: str = "{reasoning}{content}{tool_calls}"
+thinking_template: str = "{reasoning_content}"
+
+response_format_template: str = (
+    "## Response Format:\n\nYou MUST strictly adhere to the following schema to reply:\n{schema}"
+)
+tool_call_template: str = (
+    '<{dsml_token}invoke name="{name}">\n{arguments}\n</{dsml_token}invoke>'
+)
+tool_calls_template = (
+    "<{dsml_token}{tc_block_name}>\n{tool_calls}\n</{dsml_token}{tc_block_name}>"
+)
+tool_calls_block_name: str = "tool_calls"
+
+tool_output_template: str = "{content}"
+
+REASONING_EFFORT_MAX = (
+    "Reasoning Effort: Absolute maximum with no shortcuts permitted.\n"
+    "You MUST be very thorough in your thinking and comprehensively decompose the problem to resolve the root cause, rigorously stress-testing your logic against all potential paths, edge cases, and adversarial scenarios.\n"
+    "Explicitly write out your entire deliberation process, documenting every intermediate step, considered alternative, and rejected hypothesis to ensure absolutely no assumption is left unchecked.\n\n"
+)
+
+TOOLS_TEMPLATE = """## Tools
+
+You have access to a set of tools to help answer the user's question.
You can invoke tools by writing a "<{dsml_token}tool_calls>" block like the following:
+
+<{dsml_token}tool_calls>
+<{dsml_token}invoke name="$TOOL_NAME">
+<{dsml_token}parameter name="$PARAMETER_NAME" string="true|false">$PARAMETER_VALUE</{dsml_token}parameter>
+...
+</{dsml_token}invoke>
+<{dsml_token}invoke name="$TOOL_NAME2">
+...
+</{dsml_token}invoke>
+</{dsml_token}tool_calls>
+
+String parameters should be specified as is and set `string="true"`. For all other types (numbers, booleans, arrays, objects), pass the value in JSON format and set `string="false"`.
+
+If thinking_mode is enabled (triggered by {thinking_start_token}), you MUST output your complete reasoning inside {thinking_start_token}...{thinking_end_token} BEFORE any tool calls or final response.
+
+Otherwise, output directly after {thinking_end_token} with tool calls or final response.
+
+### Available Tool Schemas
+
+{tool_schemas}
+
+You MUST strictly follow the above defined tool name and parameter schemas to invoke tool calls.
+"""
+
+# ============================================================
+# Utility Functions
+# ============================================================
+
+
+def to_json(value: Any) -> str:
+    """Serialize a value to JSON string."""
+    try:
+        return json.dumps(value, ensure_ascii=False)
+    except:
+        return json.dumps(value, ensure_ascii=True)
+
+
+def tools_from_openai_format(tools):
+    """Extract function definitions from OpenAI-format tool list."""
+    return [tool["function"] for tool in tools]
+
+
+def tool_calls_from_openai_format(tool_calls):
+    """Convert OpenAI-format tool calls to internal format."""
+    return [
+        {
+            "name": tool_call["function"]["name"],
+            "arguments": tool_call["function"]["arguments"],
+        }
+        for tool_call in tool_calls
+    ]
+
+
+def tool_calls_to_openai_format(tool_calls):
+    """Convert internal tool calls to OpenAI format."""
+    return [
+        {
+            "type": "function",
+            "function": {
+                "name": tool_call["name"],
+                "arguments": tool_call["arguments"],
+            },
+        }
+        for tool_call in tool_calls
+    ]
+
+
+def encode_arguments_to_dsml(tool_call: Dict[str, str]) -> str:
+    """
+    Encode tool call arguments into DSML parameter format.
+
+    Args:
+        tool_call: Dict with "name" and "arguments" (JSON string) keys.
+
+    Returns:
+        DSML-formatted parameter string.
+    """
+    p_dsml_template = '<{dsml_token}parameter name="{key}" string="{is_str}">{value}</{dsml_token}parameter>'
+    P_dsml_strs = []
+
+    try:
+        arguments = json.loads(tool_call["arguments"])
+    except Exception as err:
+        arguments = {"arguments": tool_call["arguments"]}
+
+    for k, v in arguments.items():
+        p_dsml_str = p_dsml_template.format(
+            dsml_token=dsml_token,
+            key=k,
+            is_str="true" if isinstance(v, str) else "false",
+            value=v if isinstance(v, str) else to_json(v),
+        )
+        P_dsml_strs.append(p_dsml_str)
+
+    return "\n".join(P_dsml_strs)
+
+
+def decode_dsml_to_arguments(
+    tool_name: str, tool_args: Dict[str, Tuple[str, str]]
+) -> Dict[str, str]:
+    """
+    Decode DSML parameters back to a tool call dict.
+
+    Args:
+        tool_name: Name of the tool.
+        tool_args: Dict mapping param_name -> (value, is_string_flag).
+
+    Returns:
+        Dict with "name" and "arguments" (JSON string) keys.
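+
+    Illustrative example (derived from the decoding logic below):
+        decode_dsml_to_arguments("get_weather", {"city": ("Paris", "true")})
+        -> {"name": "get_weather", "arguments": '{"city": "Paris"}'}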
+ """ + + def _decode_value(key: str, value: str, string: str): + if string == "true": + value = to_json(value) + return f"{to_json(key)}: {value}" + + tool_args_json = ( + "{" + + ", ".join( + [_decode_value(k, v, string=is_str) for k, (v, is_str) in tool_args.items()] + ) + + "}" + ) + return dict(name=tool_name, arguments=tool_args_json) + + +def render_tools(tools: List[Dict[str, Union[str, Dict[str, Any]]]]) -> str: + """ + Render tool schemas into the system prompt format. + + Args: + tools: List of tool schema dicts (each with name, description, parameters). + + Returns: + Formatted tools section string. + """ + tools_json = [to_json(t) for t in tools] + + return TOOLS_TEMPLATE.format( + tool_schemas="\n".join(tools_json), + dsml_token=dsml_token, + thinking_start_token=thinking_start_token, + thinking_end_token=thinking_end_token, + ) + + +def find_last_user_index(messages: List[Dict[str, Any]]) -> int: + """Find the index of the last user/developer message.""" + last_user_index = -1 + for idx in range(len(messages) - 1, -1, -1): + if messages[idx].get("role") in ["user", "developer"]: + last_user_index = idx + break + return last_user_index + + +# ============================================================ +# Message Rendering +# ============================================================ + + +def render_message( + index: int, + messages: List[Dict[str, Any]], + thinking_mode: str, + drop_thinking: bool = True, + reasoning_effort: Optional[str] = None, +) -> str: + """ + Render a single message at the given index into its encoded string form. + + This is the core function that converts each message in the conversation + into the DeepSeek-V4 format. + + Args: + index: Index of the message to render. + messages: Full list of messages in the conversation. + thinking_mode: Either "chat" or "thinking". + drop_thinking: Whether to drop reasoning content from earlier turns. + reasoning_effort: Optional reasoning effort level ("max", "high", or None). + + Returns: + Encoded string for this message. 
+ """ + assert 0 <= index < len(messages) + assert thinking_mode in [ + "chat", + "thinking", + ], f"Invalid thinking_mode `{thinking_mode}`" + + prompt = "" + msg = messages[index] + last_user_idx = find_last_user_index(messages) + + role = msg.get("role") + content = msg.get("content") + tools = msg.get("tools") + response_format = msg.get("response_format") + tool_calls = msg.get("tool_calls") + reasoning_content = msg.get("reasoning_content") + wo_eos = msg.get("wo_eos", False) + + if tools: + tools = tools_from_openai_format(tools) + if tool_calls: + tool_calls = tool_calls_from_openai_format(tool_calls) + + # Reasoning effort prefix (only at index 0 in thinking mode with max effort) + assert reasoning_effort in [ + "max", + None, + "high", + ], f"Invalid reasoning effort: {reasoning_effort}" + if index == 0 and thinking_mode == "thinking" and reasoning_effort == "max": + prompt += REASONING_EFFORT_MAX + + if role == "system": + prompt += system_msg_template.format(content=content or "") + if tools: + prompt += "\n\n" + render_tools(tools) + if response_format: + prompt += "\n\n" + response_format_template.format( + schema=to_json(response_format) + ) + + elif role == "developer": + assert content, f"Invalid message for role `{role}`: {msg}" + + content_developer = USER_SP_TOKEN + content_developer += content + + if tools: + content_developer += "\n\n" + render_tools(tools) + if response_format: + content_developer += "\n\n" + response_format_template.format( + schema=to_json(response_format) + ) + + prompt += user_msg_template.format(content=content_developer) + + elif role == "user": + prompt += USER_SP_TOKEN + + # Handle content blocks (tool results mixed with text) + content_blocks = msg.get("content_blocks") + if content_blocks: + parts = [] + for block in content_blocks: + block_type = block.get("type") + if block_type == "text": + parts.append(block.get("text", "")) + elif block_type == "tool_result": + tool_content = block.get("content", "") + if isinstance(tool_content, list): + text_parts = [] + for b in tool_content: + if b.get("type") == "text": + text_parts.append(b.get("text", "")) + else: + text_parts.append(f"[Unsupported {b.get('type')}]") + tool_content = "\n\n".join(text_parts) + parts.append(tool_output_template.format(content=tool_content)) + else: + parts.append(f"[Unsupported {block_type}]") + prompt += "\n\n".join(parts) + else: + prompt += content or "" + + elif role == "latest_reminder": + prompt += LATEST_REMINDER_SP_TOKEN + latest_reminder_msg_template.format( + content=content + ) + + elif role == "tool": + raise NotImplementedError( + "deepseek_v4 merges tool messages into user; please preprocess with merge_tool_messages()" + ) + + elif role == "assistant": + thinking_part = "" + tc_content = "" + + if tool_calls: + tc_list = [ + tool_call_template.format( + dsml_token=dsml_token, + name=tc.get("name"), + arguments=encode_arguments_to_dsml(tc), + ) + for tc in tool_calls + ] + tc_content += "\n\n" + tool_calls_template.format( + dsml_token=dsml_token, + tool_calls="\n".join(tc_list), + tc_block_name=tool_calls_block_name, + ) + + summary_content = content or "" + rc = reasoning_content or "" + + # Check if previous message has a task - if so, this is a task output (no thinking) + prev_has_task = index - 1 >= 0 and messages[index - 1].get("task") is not None + + if thinking_mode == "thinking" and not prev_has_task: + if not drop_thinking or index > last_user_idx: + thinking_part = ( + thinking_template.format(reasoning_content=rc) + thinking_end_token + ) 
+ else: + thinking_part = "" + + if wo_eos: + prompt += assistant_msg_wo_eos_template.format( + reasoning=thinking_part, + content=summary_content, + tool_calls=tc_content, + ) + else: + prompt += assistant_msg_template.format( + reasoning=thinking_part, + content=summary_content, + tool_calls=tc_content, + ) + else: + raise NotImplementedError(f"Unknown role: {role}") + + # Append transition tokens based on what follows + if index + 1 < len(messages) and messages[index + 1].get("role") not in [ + "assistant", + "latest_reminder", + ]: + return prompt + + task = messages[index].get("task") + if task is not None: + # Task special token for internal classification tasks + assert ( + task in VALID_TASKS + ), f"Invalid task: '{task}'. Valid tasks are: {list(VALID_TASKS)}" + task_sp_token = DS_TASK_SP_TOKENS[task] + + if task != "action": + # Non-action tasks: append task sp token directly after the message + prompt += task_sp_token + else: + # Action task: append Assistant + thinking token + action sp token + prompt += ASSISTANT_SP_TOKEN + prompt += ( + thinking_end_token + if thinking_mode != "thinking" + else thinking_start_token + ) + prompt += task_sp_token + + elif messages[index].get("role") in ["user", "developer"]: + # Normal generation: append Assistant + thinking token + prompt += ASSISTANT_SP_TOKEN + if not drop_thinking and thinking_mode == "thinking": + prompt += thinking_start_token + elif drop_thinking and thinking_mode == "thinking" and index >= last_user_idx: + prompt += thinking_start_token + else: + prompt += thinking_end_token + + return prompt + + +# ============================================================ +# Preprocessing +# ============================================================ + + +def merge_tool_messages(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]: + """ + Merge tool messages into the preceding user message using content_blocks format. + + DeepSeek-V4 does not have a standalone "tool" role; instead, tool results + are encoded as blocks within user messages. + + This function converts a standard OpenAI-format conversation (with separate + "tool" role messages) into V4 format where tool results are merged into + user messages. + + Args: + messages: List of message dicts in OpenAI format. + + Returns: + Processed message list with tool messages merged into user messages. + """ + merged: List[Dict[str, Any]] = [] + + for msg in messages: + msg = copy.deepcopy(msg) + role = msg.get("role") + + if role == "tool": + # Convert tool message to a user message with tool_result block + tool_block = { + "type": "tool_result", + "tool_use_id": msg.get("tool_call_id", ""), + "content": msg.get("content", ""), + } + # Merge into previous message if it's already a user (merged tool) + if ( + merged + and merged[-1].get("role") == "user" + and "content_blocks" in merged[-1] + ): + merged[-1]["content_blocks"].append(tool_block) + else: + merged.append( + { + "role": "user", + "content_blocks": [tool_block], + } + ) + elif role == "user": + text_block = {"type": "text", "text": msg.get("content", "")} + if ( + merged + and merged[-1].get("role") == "user" + and "content_blocks" in merged[-1] + and merged[-1].get("task") is None + ): + merged[-1]["content_blocks"].append(text_block) + else: + new_msg = { + "role": "user", + "content": msg.get("content", ""), + "content_blocks": [text_block], + } + # Preserve extra fields (task, wo_eos, mask, etc.) 
+ for key in ("task", "wo_eos", "mask"): + if key in msg: + new_msg[key] = msg[key] + merged.append(new_msg) + else: + merged.append(msg) + + return merged + + +def sort_tool_results_by_call_order( + messages: List[Dict[str, Any]] +) -> List[Dict[str, Any]]: + """ + Sort tool_result blocks within user messages by the order of tool_calls + in the preceding assistant message. + + Args: + messages: Preprocessed message list (after merge_tool_messages). + + Returns: + Message list with sorted tool result blocks. + """ + last_tool_call_order: Dict[str, int] = {} + + for msg in messages: + role = msg.get("role") + if role == "assistant" and msg.get("tool_calls"): + last_tool_call_order = {} + for idx, tc in enumerate(msg["tool_calls"]): + tc_id = tc.get("id") or tc.get("function", {}).get("id", "") + if tc_id: + last_tool_call_order[tc_id] = idx + + elif role == "user" and msg.get("content_blocks"): + tool_blocks = [ + b for b in msg["content_blocks"] if b.get("type") == "tool_result" + ] + if len(tool_blocks) > 1 and last_tool_call_order: + sorted_blocks = sorted( + tool_blocks, + key=lambda b: last_tool_call_order.get(b.get("tool_use_id", ""), 0), + ) + sorted_idx = 0 + new_blocks = [] + for block in msg["content_blocks"]: + if block.get("type") == "tool_result": + new_blocks.append(sorted_blocks[sorted_idx]) + sorted_idx += 1 + else: + new_blocks.append(block) + msg["content_blocks"] = new_blocks + + return messages + + +# ============================================================ +# Main Encoding Function +# ============================================================ + + +def encode_messages( + messages: List[Dict[str, Any]], + thinking_mode: str, + context: Optional[List[Dict[str, Any]]] = None, + drop_thinking: bool = True, + add_default_bos_token: bool = True, + reasoning_effort: Optional[str] = None, +) -> str: + """ + Encode a list of messages into the DeepSeek-V4 prompt format. + + This is the main entry point for encoding conversations. It handles: + - BOS token insertion + - Thinking mode with optional reasoning content dropping + - Tool message merging into user messages + - Multi-turn conversation context + + Args: + messages: List of message dicts to encode. + thinking_mode: Either "chat" or "thinking". + context: Optional preceding context messages (already encoded prefix). + drop_thinking: If True, drop reasoning_content from earlier assistant turns + (only keep reasoning for messages after the last user message). + add_default_bos_token: Whether to prepend BOS token at conversation start. + reasoning_effort: Optional reasoning effort level ("max", "high", or None). + + Returns: + The encoded prompt string. 
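+
+    Illustrative example (single user turn, chat mode):
+        encode_messages([{"role": "user", "content": "Hi"}], "chat")
+        -> "<|begin▁of▁sentence|><|User|>Hi<|Assistant|></think>"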
+ """ + context = context if context else [] + + # Preprocess: merge tool messages and sort tool results + messages = merge_tool_messages(messages) + messages = sort_tool_results_by_call_order(context + messages)[len(context) :] + if context: + context = merge_tool_messages(context) + context = sort_tool_results_by_call_order(context) + + full_messages = context + messages + + prompt = bos_token if add_default_bos_token and len(context) == 0 else "" + + # Resolve drop_thinking: if any message has tools defined, don't drop thinking + effective_drop_thinking = drop_thinking + if any(m.get("tools") for m in full_messages): + effective_drop_thinking = False + + if thinking_mode == "thinking" and effective_drop_thinking: + full_messages = _drop_thinking_messages(full_messages) + # After dropping, recalculate how many messages to render + # (context may have shrunk too) + num_to_render = len(full_messages) - len(_drop_thinking_messages(context)) + context_len = len(full_messages) - num_to_render + else: + num_to_render = len(messages) + context_len = len(context) + + for idx in range(num_to_render): + prompt += render_message( + idx + context_len, + full_messages, + thinking_mode=thinking_mode, + drop_thinking=effective_drop_thinking, + reasoning_effort=reasoning_effort, + ) + + return prompt + + +def _drop_thinking_messages(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]: + """ + Drop reasoning_content and non-essential messages before the last user message. + + Behavior: + - Messages with role in ["user", "system", "tool", "latest_reminder"] are always kept. + - Messages at or after the last user index are always kept. + - Assistant messages before the last user get reasoning_content removed. + - Developer messages before the last user are dropped entirely. + """ + last_user_idx = find_last_user_index(messages) + result = [] + keep_roles = {"user", "system", "tool", "latest_reminder", "direct_search_results"} + + for idx, msg in enumerate(messages): + role = msg.get("role") + if role in keep_roles or idx >= last_user_idx: + result.append(msg) + elif role == "assistant": + msg = copy.copy(msg) + msg.pop("reasoning_content", None) + result.append(msg) + # developer and other roles before last_user_idx are dropped + + return result + + +# ============================================================ +# Parsing (Decoding model output) +# ============================================================ + + +def _read_until_stop( + index: int, text: str, stop: List[str] +) -> Tuple[int, str, Optional[str]]: + """ + Read text from index until one of the stop strings is found. + + Returns: + Tuple of (new_index, content_before_stop, matched_stop_string_or_None). + """ + min_pos = len(text) + matched_stop = None + + for s in stop: + pos = text.find(s, index) + if pos != -1 and pos < min_pos: + min_pos = pos + matched_stop = s + + if matched_stop: + content = text[index:min_pos] + return min_pos + len(matched_stop), content, matched_stop + else: + content = text[index:] + return len(text), content, None + + +def parse_tool_calls( + index: int, text: str +) -> Tuple[int, Optional[str], List[Dict[str, str]]]: + """ + Parse DSML tool calls from text starting at the given index. + + Args: + index: Starting position in text. + text: The full text to parse. + + Returns: + Tuple of (new_index, last_stop_token, list_of_tool_call_dicts). + Each tool call dict has "name" and "arguments" keys. 
+ """ + tool_calls: List[Dict[str, Any]] = [] + stop_token = None + tool_calls_end_token = f"" + + while index < len(text): + index, _, stop_token = _read_until_stop( + index, text, [f"<{dsml_token}invoke", tool_calls_end_token] + ) + if _ != ">\n": + raise ValueError(f"Tool call format error: expected '>\\n' but got '{_}'") + + if stop_token == tool_calls_end_token: + break + + if stop_token is None: + raise ValueError("Missing special token in tool calls") + + index, tool_name_content, stop_token = _read_until_stop( + index, text, [f"<{dsml_token}parameter", f"\n$', tool_name_content, flags=re.DOTALL + ) + if len(p_tool_name) != 1: + raise ValueError(f"Tool name format error: '{tool_name_content}'") + tool_name = p_tool_name[0] + + tool_args: Dict[str, Tuple[str, str]] = {} + while stop_token == f"<{dsml_token}parameter": + index, param_content, stop_token = _read_until_stop( + index, text, [f"/{dsml_token}parameter"] + ) + + param_kv = re.findall( + r'^ name="(.*?)" string="(true|false)">(.*?)<$', + param_content, + flags=re.DOTALL, + ) + if len(param_kv) != 1: + raise ValueError(f"Parameter format error: '{param_content}'") + param_name, string, param_value = param_kv[0] + + if param_name in tool_args: + raise ValueError(f"Duplicate parameter name: '{param_name}'") + tool_args[param_name] = (param_value, string) + + index, content, stop_token = _read_until_stop( + index, text, [f"<{dsml_token}parameter", f"\n": + raise ValueError( + f"Parameter format error: expected '>\\n' but got '{content}'" + ) + + tool_call = decode_dsml_to_arguments(tool_name=tool_name, tool_args=tool_args) + tool_calls.append(tool_call) + + return index, stop_token, tool_calls + + +def parse_message_from_completion_text(text: str, thinking_mode: str) -> Dict[str, Any]: + """ + Parse a model completion text into a structured assistant message. + + This function takes the raw text output from the model (a single assistant turn) + and extracts: + - reasoning_content (thinking block) + - content (summary/response) + - tool_calls (if any) + + NOTE: This function is designed to parse only correctly formatted strings and + will raise ValueError for malformed output. + + Args: + text: The raw completion text (including EOS token). + thinking_mode: Either "chat" or "thinking". + + Returns: + Dict with keys: "role", "content", "reasoning_content", "tool_calls". + tool_calls are in OpenAI format. 
+ """ + summary_content, reasoning_content, tool_calls = "", "", [] + index, stop_token = 0, None + tool_calls_start_token = f"\n\n<{dsml_token}{tool_calls_block_name}" + + is_thinking = thinking_mode == "thinking" + is_tool_calling = False + + if is_thinking: + index, content_delta, stop_token = _read_until_stop( + index, text, [thinking_end_token, tool_calls_start_token] + ) + reasoning_content = content_delta + assert ( + stop_token == thinking_end_token + ), "Invalid thinking format: missing " + + index, content_delta, stop_token = _read_until_stop( + index, text, [eos_token, tool_calls_start_token] + ) + summary_content = content_delta + if stop_token == tool_calls_start_token: + is_tool_calling = True + else: + assert stop_token == eos_token, "Invalid format: missing EOS token" + + if is_tool_calling: + index, stop_token, tool_calls = parse_tool_calls(index, text) + + index, tool_ends_text, stop_token = _read_until_stop(index, text, [eos_token]) + assert not tool_ends_text, "Unexpected content after tool calls" + + assert len(text) == index and stop_token in [ + eos_token, + None, + ], "Unexpected content at end" + + for sp_token in [ + bos_token, + eos_token, + thinking_start_token, + thinking_end_token, + dsml_token, + ]: + assert ( + sp_token not in summary_content and sp_token not in reasoning_content + ), f"Unexpected special token '{sp_token}' in content" + + return { + "role": "assistant", + "content": summary_content, + "reasoning_content": reasoning_content, + "tool_calls": tool_calls_to_openai_format(tool_calls), + } diff --git a/src/srtctl/benchmarks/scripts/sa-bench/tokenizers/sglang_deepseek_v4.py b/src/srtctl/benchmarks/scripts/sa-bench/tokenizers/sglang_deepseek_v4.py new file mode 100644 index 00000000..595e7b2f --- /dev/null +++ b/src/srtctl/benchmarks/scripts/sa-bench/tokenizers/sglang_deepseek_v4.py @@ -0,0 +1,125 @@ + +# SPDX-License-Identifier: Apache-2.0 +""" +SGLang-side DeepSeek-V4 tokenizer for sa-bench. + +Mirrors what sglang's ``serving_chat._apply_jinja_template`` does +when ``chat_encoding_spec == "dsv4"`` (see +sgl-project/sglang PR #23600), so that the tokens counted on the +sa-bench client side match the tokens the sglang server actually +feeds into the model. + +The vllm counterpart lives in ``vllm.tokenizers.deepseek_v4``; sglang +has no equivalent client-side package, so we vendor the rendering +logic from ``encoding_dsv4.py`` in ``_sglang_encoding_dsv4.py``. +""" +from __future__ import annotations + +from typing import Any, Dict, List, Optional + +from transformers import AutoTokenizer + +from ._sglang_encoding_dsv4 import encode_messages as _encode_messages + + +class SGLangDeepseekV4Tokenizer: + """Client-side DeepSeek-V4 tokenizer matching sglang server behavior. + + The server-side call chain (sglang PR #23600) is: + + messages = request.messages # OpenAI-style + if messages[0]["role"] != "system": + messages.insert(0, {"role": "system", "content": ""}) + real_input = encoding_dsv4.encode_messages( + messages, + thinking_mode="chat", # default + reasoning_effort=None, # "medium" dropped + ) + prompt_ids = tokenizer.encode(real_input) + + We reproduce the exact same steps here. 
+ """ + + def __init__(self, hf_tokenizer): + self._hf = hf_tokenizer + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path: str, **kwargs): + kwargs.setdefault("trust_remote_code", True) + hf = AutoTokenizer.from_pretrained( + pretrained_model_name_or_path, **kwargs + ) + return cls(hf) + + def _render_prompt( + self, + messages: List[Dict[str, Any]], + thinking_mode: str = "chat", + reasoning_effort: Optional[str] = None, + ) -> str: + msgs = [dict(m) for m in messages] + if not msgs or msgs[0].get("role") != "system": + msgs.insert(0, {"role": "system", "content": ""}) + + if reasoning_effort not in ("max", "high"): + reasoning_effort = None + + return _encode_messages( + msgs, + thinking_mode=thinking_mode, + reasoning_effort=reasoning_effort, + ) + + def apply_chat_template( + self, + messages: List[Dict[str, Any]], + tokenize: bool = True, + add_generation_prompt: bool = True, # noqa: ARG002 (encoder always adds the <|Assistant|>... tail) + tools: Optional[List[Dict[str, Any]]] = None, + thinking: bool = False, + reasoning_effort: Optional[str] = None, + **_: Any, + ): + msgs = [dict(m) for m in messages] + if tools: + if not msgs or msgs[0].get("role") != "system": + msgs.insert(0, {"role": "system", "content": ""}) + msgs[0]["tools"] = list(tools) + + thinking_mode = "thinking" if thinking else "chat" + prompt = self._render_prompt( + msgs, + thinking_mode=thinking_mode, + reasoning_effort=reasoning_effort, + ) + if not tokenize: + return prompt + return self._hf.encode(prompt, add_special_tokens=False) + + def encode(self, text, **kwargs): + return self._hf.encode(text, **kwargs) + + def decode(self, token_ids, **kwargs): + return self._hf.decode(token_ids, **kwargs) + + def __len__(self): + return len(self._hf) + + @property + def vocab_size(self): + return self._hf.vocab_size + + @property + def eos_token_id(self): + return self._hf.eos_token_id + + @property + def bos_token_id(self): + return self._hf.bos_token_id + + @property + def pad_token_id(self): + return self._hf.pad_token_id + + def __getattr__(self, name): + return getattr(self._hf, name) diff --git a/src/srtctl/cli/do_sweep.py b/src/srtctl/cli/do_sweep.py index ff6eaa91..77b79ac5 100644 --- a/src/srtctl/cli/do_sweep.py +++ b/src/srtctl/cli/do_sweep.py @@ -18,6 +18,7 @@ import os import sys import threading +import time from dataclasses import dataclass from pathlib import Path @@ -179,6 +180,118 @@ def _print_connection_info(self) -> None: logger.info("=" * 60) logger.info("") + def _run_post_eval(self, stop_event: threading.Event) -> int: + """Run lm-eval after the main benchmark completes (or directly in eval-only mode).""" + from srtctl.benchmarks import get_runner + from srtctl.core.health import wait_for_model + + # In eval-only mode the benchmark health check was skipped, so do the + # full model-ready wait here. In post-benchmark mode a quick port + # check is sufficient since the server already served traffic. 
+ if os.environ.get("EVAL_ONLY", "false").lower() == "true": + r = self.config.resources + n_prefill = 0 if r.num_agg > 0 else r.num_prefill + n_decode = r.num_agg if r.num_agg > 0 else r.num_decode + hc = self.config.health_check + logger.info("EVAL_ONLY: Waiting for server health before eval...") + if not wait_for_model( + host=self.runtime.nodes.head, + port=8000, + n_prefill=n_prefill, + n_decode=n_decode, + poll_interval=float(hc.interval_seconds), + timeout=float(hc.max_attempts * hc.interval_seconds), + report_every=60.0, + frontend_type=self.config.frontend.type, + stop_event=stop_event, + ): + logger.error("Server did not become healthy for eval") + return 1 + else: + if not wait_for_port(self.runtime.nodes.head, 8000, timeout=30): + logger.error("Server health check failed before eval - skipping") + return 1 + + try: + runner = get_runner("lm-eval") + except ValueError as e: + logger.error("lm-eval runner not available: %s", e) + return 1 + + eval_log = self.runtime.log_dir / "eval.out" + cmd = runner.build_command(self.config, self.runtime) + + logger.info("Eval command: %s", " ".join(cmd)) + logger.info("Eval log: %s", eval_log) + + # Pass through eval-related env vars. InferenceX writes multi-node + # metadata from these variables in append_lm_eval_summary(). + env_to_set = {} + for var in [ + "RUN_EVAL", + "EVAL_ONLY", + "IS_MULTINODE", + "FRAMEWORK", + "PRECISION", + "MODEL_PREFIX", + "RUNNER_TYPE", + "RESULT_FILENAME", + "SPEC_DECODING", + "ISL", + "OSL", + "MODEL", + "MODEL_PATH", + "MAX_MODEL_LEN", + "EVAL_MAX_MODEL_LEN", + "PREFILL_TP", + "PREFILL_EP", + "PREFILL_DP_ATTN", + "PREFILL_NUM_WORKERS", + "DECODE_TP", + "DECODE_EP", + "DECODE_DP_ATTN", + "DECODE_NUM_WORKERS", + ]: + val = os.environ.get(var) + if val: + env_to_set[var] = val + + # Set MODEL_NAME to the served model name so lm-eval uses the correct + # name for API requests. Without this, benchmark_lib.sh falls back to + # $MODEL (the HuggingFace ID) which the server doesn't recognize. + env_to_set["MODEL_NAME"] = self.config.served_model_name + logger.info("Eval MODEL_NAME: %s", env_to_set["MODEL_NAME"]) + + # Use EVAL_CONC from workflow (median chosen by InferenceX mark_eval_entries), + # falling back to max of benchmark concurrency list. 
+ eval_conc = os.environ.get("EVAL_CONC") + if eval_conc: + env_to_set["EVAL_CONC"] = eval_conc + logger.info("Eval concurrency (from workflow): %s", eval_conc) + else: + conc_list = self.config.benchmark.get_concurrency_list() + if conc_list: + env_to_set["EVAL_CONC"] = str(max(conc_list)) + logger.info("Eval concurrency (max of %s): %s", conc_list, env_to_set["EVAL_CONC"]) + + proc = start_srun_process( + command=cmd, + nodelist=[self.runtime.nodes.head], + output=str(eval_log), + container_image=str(self.runtime.container_image), + container_mounts=self.runtime.container_mounts, + env_to_set=env_to_set, + ) + + while proc.poll() is None: + if stop_event.is_set(): + logger.info("Stop requested, terminating eval") + proc.terminate() + return 1 + time.sleep(1) + + return proc.returncode or 0 + def run(self) -> int: """Run the complete sweep.""" # Create status reporter (fire-and-forget, no-op if not configured) @@ -221,8 +334,27 @@ def run(self) -> int: self._print_connection_info() - # Stage 4: Benchmark (status reported AFTER health check passes) - exit_code = self.run_benchmark(registry, stop_event, reporter) + if os.environ.get("EVAL_ONLY", "false").lower() == "true": + reporter.report(JobStatus.BENCHMARK, JobStage.BENCHMARK, "Running eval-only evaluation") + logger.info("EVAL_ONLY=true: Skipping benchmark stage and running lm-eval evaluation...") + exit_code = self._run_post_eval(stop_event) + if exit_code != 0: + logger.error("Eval-only evaluation failed with exit code %d", exit_code) + else: + logger.info("Eval-only evaluation completed successfully") + else: + # Stage 4: Benchmark (status reported AFTER health check passes) + exit_code = self.run_benchmark(registry, stop_event, reporter) + + # Stage 5: Post-benchmark eval (optional, non-fatal) + if os.environ.get("RUN_EVAL", "false").lower() == "true" and exit_code == 0: + reporter.report(JobStatus.BENCHMARK, JobStage.BENCHMARK, "Running post-benchmark evaluation") + logger.info("RUN_EVAL=true: Running post-benchmark lm-eval evaluation...") + eval_exit = self._run_post_eval(stop_event) + if eval_exit != 0: + logger.warning("Eval failed with exit code %d (benchmark result is still valid)", eval_exit) + else: + logger.info("Post-benchmark eval completed successfully") except Exception as e: logger.exception("Error during sweep: %s", e) diff --git a/src/srtctl/core/config.py b/src/srtctl/core/config.py index 8cea4e17..f30fc7fc 100644 --- a/src/srtctl/core/config.py +++ b/src/srtctl/core/config.py @@ -141,6 +141,20 @@ def resolve_config_with_defaults(user_config: dict[str, Any], cluster_config: di config["reporting"] = cluster_config["reporting"] logger.debug("Applied cluster reporting config") + # Resolve extra_mount host path aliases through model_paths + extra_mounts = config.get("extra_mount", []) + if model_paths and extra_mounts: + resolved_mounts = [] + for mount_spec in extra_mounts: + host_path, container_path = mount_spec.split(":", 1) + if host_path in model_paths: + resolved_host = model_paths[host_path] + resolved_mounts.append(f"{resolved_host}:{container_path}") + logger.debug(f"Resolved extra_mount alias '{host_path}' -> '{resolved_host}'") + else: + resolved_mounts.append(mount_spec) + config["extra_mount"] = resolved_mounts + # Resolve frontend nginx_container alias frontend = config.get("frontend", {}) nginx_container = frontend.get("nginx_container", "") diff --git a/src/srtctl/core/runtime.py b/src/srtctl/core/runtime.py index 3e68bdd5..31195ed3 100644 --- a/src/srtctl/core/runtime.py +++ 
b/src/srtctl/core/runtime.py @@ -231,6 +231,14 @@ def from_config( host_path, container_path = mount_spec.split(":", 1) container_mounts[Path(host_path).resolve()] = Path(container_path) + # Mount InferenceX workspace if available (for lm-eval support). + # Skip exists() check: the orchestrator runs on the SLURM head node + # where the GH Actions workspace path may not be directly accessible, + # but it IS accessible from compute nodes via shared filesystem. + infmax_ws = os.environ.get("INFMAX_WORKSPACE") + if infmax_ws: + container_mounts[Path(infmax_ws)] = Path("/infmax-workspace") + # Add FormattablePath mounts from config.container_mounts # These need to be expanded with the runtime context, so we create a # temporary context first and then update diff --git a/src/srtctl/core/schema.py b/src/srtctl/core/schema.py index 97547fec..c535be39 100644 --- a/src/srtctl/core/schema.py +++ b/src/srtctl/core/schema.py @@ -539,6 +539,12 @@ class BenchmarkConfig: ttft_threshold_ms: int | None = None # Goodput TTFT threshold in ms (default: 2000) itl_threshold_ms: int | None = None # Goodput ITL threshold in ms (default: 25) random_range_ratio: float | None = None # Random input/output length range ratio (default: 0.8) + num_prompts_mult: int | None = None # Multiplier for num_prompts = concurrency * mult (default: 10) + num_warmup_mult: int | None = None # Multiplier for warmup prompts = concurrency * mult (default: 2) + # Trace replay benchmark fields (uses aiperf with mooncake_trace dataset type) + trace_file: str | None = None # Path to trace JSONL file (container path, e.g., /traces/dataset.jsonl) + custom_tokenizer: str | None = None # Custom tokenizer class (e.g., "module.path.ClassName") + use_chat_template: bool = True # Pass --use-chat-template to benchmark (default: true) def get_concurrency_list(self) -> list[int]: if self.concurrencies is None: @@ -711,7 +717,7 @@ def get_install_commands(self) -> str: if self.version is not None: return ( f"echo 'Installing dynamo {self.version}...' && " - f"pip install --break-system-packages --quiet ai-dynamo-runtime=={self.version} ai-dynamo=={self.version} && " + f"pip install --break-system-packages --quiet --extra-index-url https://pypi.nvidia.com ai-dynamo-runtime=={self.version} ai-dynamo=={self.version} && " f"echo 'Dynamo {self.version} installed'" ) @@ -719,8 +725,8 @@ def get_install_commands(self) -> str: git_ref = self.hash if self.hash else "HEAD" checkout_cmd = f"git checkout {self.hash}" if self.hash else "" - return ( - f"echo 'Installing dynamo from source ({git_ref})...' && " + # Original SGLang container path, UNCHANGED + sglang = ( "apt-get update -qq && apt-get install -y -qq libclang-dev > /dev/null 2>&1 && " "cd /sgl-workspace/ && " "git clone https://github.com/ai-dynamo/dynamo.git && " @@ -736,6 +742,34 @@ def get_install_commands(self) -> str: f"echo 'Dynamo installed from source ({git_ref})'" ) + # Portable path for non-SGLang containers (vLLM, etc.) + portable = ( + "if ! command -v cargo &> /dev/null || ! command -v maturin &> /dev/null; then " + "apt-get update -qq && apt-get install -y -qq git curl libclang-dev protobuf-compiler > /dev/null 2>&1 && " + "if ! command -v cargo &> /dev/null; then " + "curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y && source $HOME/.cargo/env; fi && " + "if ! 
command -v maturin &> /dev/null; then " + "pip install --break-system-packages maturin; fi; fi && " + "ORIG_DIR=$(pwd) && rm -rf /tmp/dynamo_build && mkdir -p /tmp/dynamo_build && cd /tmp/dynamo_build && " + "git clone https://github.com/ai-dynamo/dynamo.git && " + "cd dynamo && " + f"{checkout_cmd + ' && ' if checkout_cmd else ''}" + "cd lib/bindings/python/ && " + 'export RUSTFLAGS="${RUSTFLAGS:-} -C target-cpu=native --cfg tokio_unstable" && ' + "rm -f /tmp/ai_dynamo_runtime*.whl && " + "maturin build -o /tmp && " + "pip install --break-system-packages /tmp/ai_dynamo_runtime*.whl --force-reinstall && " + "cd /tmp/dynamo_build/dynamo/ && " + "pip install --break-system-packages -e . && " + "cd $ORIG_DIR && " + f"echo 'Dynamo installed from source ({git_ref})'" + ) + + return ( + f"echo 'Installing dynamo from source ({git_ref})...' && " + f"if [ -d /sgl-workspace ]; then {sglang}; else {portable}; fi" + ) + Schema: ClassVar[type[Schema]] = Schema diff --git a/tests/test_benchmarks.py b/tests/test_benchmarks.py index 261020c7..c15759b2 100644 --- a/tests/test_benchmarks.py +++ b/tests/test_benchmarks.py @@ -193,6 +193,62 @@ def test_build_command_includes_tokenizer_path(self): assert cmd[7] == "/model" # tokenizer path +class TestLMEvalRunner: + """Test LM-Eval runner.""" + + def test_registry_includes_lm_eval(self): + """lm-eval is in the benchmark registry.""" + assert "lm-eval" in list_benchmarks() + + def test_get_runner(self): + """Can get lm-eval runner.""" + runner = get_runner("lm-eval") + assert runner.name == "lm-eval" + + def test_script_path(self): + """Script path points to lm-eval bench.sh.""" + runner = get_runner("lm-eval") + assert "lm-eval/bench.sh" in runner.script_path + + def test_local_script_dir(self): + """Local script dir points to lm-eval scripts.""" + runner = get_runner("lm-eval") + assert runner.local_script_dir.endswith("lm-eval") + + def test_validate_config_always_valid(self): + """lm-eval accepts any config.""" + from srtctl.benchmarks.lm_eval import LMEvalRunner + from srtctl.core.schema import BenchmarkConfig, ModelConfig, ResourceConfig, SrtConfig + + runner = LMEvalRunner() + config = SrtConfig( + name="test", + model=ModelConfig(path="/model", container="/image", precision="fp4"), + resources=ResourceConfig(gpu_type="h100"), + benchmark=BenchmarkConfig(type="sa-bench"), + ) + assert runner.validate_config(config) == [] + + def test_build_command(self): + """build_command returns correct bash command.""" + from unittest.mock import MagicMock + + from srtctl.benchmarks.lm_eval import LMEvalRunner + + runner = LMEvalRunner() + runtime = MagicMock() + runtime.frontend_port = 8000 + + config = MagicMock() + cmd = runner.build_command(config, runtime) + assert cmd == [ + "bash", + "/srtctl-benchmarks/lm-eval/bench.sh", + "http://localhost:8000", + "/infmax-workspace", + ] + + class TestScriptsExist: """Test that benchmark scripts exist.""" @@ -209,3 +265,365 @@ def test_mmlu_script_exists(self): """MMLU script exists.""" script = SCRIPTS_DIR / "mmlu" / "bench.sh" assert script.exists() + + +class TestRunPostEval: + """Test SweepOrchestrator._run_post_eval method.""" + + @staticmethod + def _make_orchestrator(): + """Create a SweepOrchestrator with mocked config/runtime.""" + from pathlib import Path + + from srtctl.cli.do_sweep import SweepOrchestrator + from srtctl.core.runtime import Nodes, RuntimeContext + from srtctl.core.schema import ( + BenchmarkConfig, + FrontendConfig, + HealthCheckConfig, + ModelConfig, + ResourceConfig, + SrtConfig, + ) + + 
config = SrtConfig( + name="test", + model=ModelConfig(path="/model/test-model", container="/image", precision="fp4"), + resources=ResourceConfig( + gpu_type="h100", + gpus_per_node=8, + prefill_nodes=1, + decode_nodes=2, + prefill_workers=1, + decode_workers=2, + ), + benchmark=BenchmarkConfig(type="sa-bench", isl=1024, osl=1024, concurrencies="128x256x512"), + health_check=HealthCheckConfig(max_attempts=3, interval_seconds=1), + frontend=FrontendConfig(type="dynamo"), + ) + runtime = RuntimeContext( + job_id="12345", + run_name="test-run", + nodes=Nodes(head="node0", bench="node0", infra="node0", worker=("node0", "node1", "node2")), + head_node_ip="10.0.0.1", + infra_node_ip="10.0.0.1", + log_dir=Path("/tmp/logs"), + model_path=Path("/model/test-model"), + container_image=Path("/path/to/container.sqsh"), + gpus_per_node=8, + network_interface=None, + container_mounts={}, + environment={}, + ) + return SweepOrchestrator(config=config, runtime=runtime) + + def test_post_benchmark_port_check_fails(self): + """Returns 1 when port check fails in post-benchmark mode.""" + import os + import threading + from unittest.mock import patch + + orch = self._make_orchestrator() + stop = threading.Event() + with patch.dict(os.environ, {"EVAL_ONLY": "false"}, clear=False): + with patch("srtctl.cli.do_sweep.wait_for_port", return_value=False): + result = orch._run_post_eval(stop) + assert result == 1 + + def test_eval_only_health_check_fails(self): + """Returns 1 when health check fails in eval-only mode.""" + import os + import threading + from unittest.mock import patch + + orch = self._make_orchestrator() + stop = threading.Event() + with patch.dict(os.environ, {"EVAL_ONLY": "true"}, clear=False): + with patch("srtctl.core.health.wait_for_model", return_value=False): + result = orch._run_post_eval(stop) + assert result == 1 + + def test_runner_not_available(self): + """Returns 1 when lm-eval runner is not registered.""" + import os + import threading + from unittest.mock import patch + + orch = self._make_orchestrator() + stop = threading.Event() + with patch.dict(os.environ, {"EVAL_ONLY": "false"}, clear=False): + with patch("srtctl.cli.do_sweep.wait_for_port", return_value=True): + with patch("srtctl.benchmarks.get_runner", side_effect=ValueError("not found")): + result = orch._run_post_eval(stop) + assert result == 1 + + def test_successful_eval(self): + """Returns 0 when eval completes successfully.""" + import os + import threading + from unittest.mock import MagicMock, patch + + orch = self._make_orchestrator() + stop = threading.Event() + + mock_proc = MagicMock() + mock_proc.poll.side_effect = [None, 0] + mock_proc.returncode = 0 + + with patch.dict(os.environ, {"EVAL_ONLY": "false"}, clear=False): + with patch("srtctl.cli.do_sweep.wait_for_port", return_value=True): + with patch("srtctl.cli.do_sweep.start_srun_process", return_value=mock_proc): + result = orch._run_post_eval(stop) + assert result == 0 + + def test_eval_only_successful(self): + """Returns 0 in eval-only mode when health check and eval succeed.""" + import os + import threading + from unittest.mock import MagicMock, patch + + orch = self._make_orchestrator() + stop = threading.Event() + + mock_proc = MagicMock() + mock_proc.poll.side_effect = [None, 0] + mock_proc.returncode = 0 + + with patch.dict(os.environ, {"EVAL_ONLY": "true"}, clear=False): + with patch("srtctl.core.health.wait_for_model", return_value=True): + with patch("srtctl.cli.do_sweep.start_srun_process", return_value=mock_proc): + result = 
orch._run_post_eval(stop) + assert result == 0 + + def test_env_var_passthrough(self): + """Eval env vars are passed through to srun.""" + import os + import threading + from unittest.mock import MagicMock, patch + + orch = self._make_orchestrator() + stop = threading.Event() + + mock_proc = MagicMock() + mock_proc.poll.return_value = 0 + mock_proc.returncode = 0 + + env_vars = { + "EVAL_ONLY": "false", + "RUN_EVAL": "true", + "FRAMEWORK": "sglang", + "PRECISION": "fp4", + "MODEL": "test-model", + } + + captured_kwargs = {} + + def capture_srun(**kwargs): + captured_kwargs.update(kwargs) + return mock_proc + + with patch.dict(os.environ, env_vars, clear=False): + with patch("srtctl.cli.do_sweep.wait_for_port", return_value=True): + with patch("srtctl.cli.do_sweep.start_srun_process", side_effect=capture_srun): + orch._run_post_eval(stop) + + env_to_set = captured_kwargs["env_to_set"] + assert env_to_set["RUN_EVAL"] == "true" + assert env_to_set["FRAMEWORK"] == "sglang" + assert env_to_set["PRECISION"] == "fp4" + assert env_to_set["MODEL"] == "test-model" + assert env_to_set["MODEL_NAME"] == "test-model" + + def test_eval_conc_from_env(self): + """EVAL_CONC from env takes priority over benchmark concurrencies.""" + import os + import threading + from unittest.mock import MagicMock, patch + + orch = self._make_orchestrator() + stop = threading.Event() + + mock_proc = MagicMock() + mock_proc.poll.return_value = 0 + mock_proc.returncode = 0 + + captured_kwargs = {} + + def capture_srun(**kwargs): + captured_kwargs.update(kwargs) + return mock_proc + + with patch.dict(os.environ, {"EVAL_ONLY": "false", "EVAL_CONC": "64"}, clear=False): + with patch("srtctl.cli.do_sweep.wait_for_port", return_value=True): + with patch("srtctl.cli.do_sweep.start_srun_process", side_effect=capture_srun): + orch._run_post_eval(stop) + + assert captured_kwargs["env_to_set"]["EVAL_CONC"] == "64" + + def test_eval_conc_fallback_to_max_concurrency(self): + """EVAL_CONC falls back to max of benchmark concurrencies.""" + import os + import threading + from unittest.mock import MagicMock, patch + + orch = self._make_orchestrator() + stop = threading.Event() + + mock_proc = MagicMock() + mock_proc.poll.return_value = 0 + mock_proc.returncode = 0 + + captured_kwargs = {} + + def capture_srun(**kwargs): + captured_kwargs.update(kwargs) + return mock_proc + + env = {"EVAL_ONLY": "false"} + # Remove EVAL_CONC if present + with patch.dict(os.environ, env, clear=False): + os.environ.pop("EVAL_CONC", None) + with patch("srtctl.cli.do_sweep.wait_for_port", return_value=True): + with patch("srtctl.cli.do_sweep.start_srun_process", side_effect=capture_srun): + orch._run_post_eval(stop) + + # concurrencies="128x256x512", max is 512 + assert captured_kwargs["env_to_set"]["EVAL_CONC"] == "512" + + def test_stop_event_terminates_eval(self): + """Stop event terminates the eval process.""" + import os + import threading + from unittest.mock import MagicMock, patch + + orch = self._make_orchestrator() + stop = threading.Event() + stop.set() + + mock_proc = MagicMock() + mock_proc.poll.return_value = None + + with patch.dict(os.environ, {"EVAL_ONLY": "false"}, clear=False): + with patch("srtctl.cli.do_sweep.wait_for_port", return_value=True): + with patch("srtctl.cli.do_sweep.start_srun_process", return_value=mock_proc): + result = orch._run_post_eval(stop) + + assert result == 1 + mock_proc.terminate.assert_called_once() + + +class TestSweepRunEvalIntegration: + """Test eval-related branches in SweepOrchestrator.run().""" + + @staticmethod 
+    def _make_orchestrator():
+        return TestRunPostEval._make_orchestrator()
+
+    def test_run_eval_only_mode(self):
+        """EVAL_ONLY=true skips benchmark and runs _run_post_eval."""
+        import os
+        from unittest.mock import MagicMock, patch
+
+        orch = self._make_orchestrator()
+
+        with patch.dict(os.environ, {"EVAL_ONLY": "true"}, clear=False):
+            with patch.object(orch, "start_head_infrastructure") as mock_head:
+                mock_head.return_value = MagicMock()
+                with patch.object(orch, "start_all_workers", return_value={}):
+                    with patch.object(orch, "start_frontend", return_value=[]):
+                        with patch.object(orch, "_run_post_eval", return_value=0) as mock_eval:
+                            with patch.object(orch, "run_benchmark") as mock_bench:
+                                with patch.object(orch, "run_postprocess"):
+                                    with patch("srtctl.cli.do_sweep.StatusReporter") as mock_reporter_cls:
+                                        mock_reporter_cls.from_config.return_value = MagicMock()
+                                        exit_code = orch.run()
+
+        mock_eval.assert_called_once()
+        mock_bench.assert_not_called()
+        assert exit_code == 0
+
+    def test_run_with_post_benchmark_eval(self):
+        """RUN_EVAL=true runs benchmark then _run_post_eval."""
+        import os
+        from unittest.mock import MagicMock, patch
+
+        orch = self._make_orchestrator()
+
+        with patch.dict(os.environ, {"EVAL_ONLY": "false", "RUN_EVAL": "true"}, clear=False):
+            with patch.object(orch, "start_head_infrastructure") as mock_head:
+                mock_head.return_value = MagicMock()
+                with patch.object(orch, "start_all_workers", return_value={}):
+                    with patch.object(orch, "start_frontend", return_value=[]):
+                        with patch.object(orch, "run_benchmark", return_value=0) as mock_bench:
+                            with patch.object(orch, "_run_post_eval", return_value=0) as mock_eval:
+                                with patch.object(orch, "run_postprocess"):
+                                    with patch("srtctl.cli.do_sweep.StatusReporter") as mock_reporter_cls:
+                                        mock_reporter_cls.from_config.return_value = MagicMock()
+                                        exit_code = orch.run()
+
+        mock_bench.assert_called_once()
+        mock_eval.assert_called_once()
+        assert exit_code == 0
+
+    def test_run_eval_only_failure(self):
+        """EVAL_ONLY=true with eval failure returns non-zero exit code."""
+        import os
+        from unittest.mock import MagicMock, patch
+
+        orch = self._make_orchestrator()
+
+        with patch.dict(os.environ, {"EVAL_ONLY": "true"}, clear=False):
+            with patch.object(orch, "start_head_infrastructure") as mock_head:
+                mock_head.return_value = MagicMock()
+                with patch.object(orch, "start_all_workers", return_value={}):
+                    with patch.object(orch, "start_frontend", return_value=[]):
+                        with patch.object(orch, "_run_post_eval", return_value=1):
+                            with patch.object(orch, "run_postprocess"):
+                                with patch("srtctl.cli.do_sweep.StatusReporter") as mock_reporter_cls:
+                                    mock_reporter_cls.from_config.return_value = MagicMock()
+                                    exit_code = orch.run()
+
+        assert exit_code == 1
+
+    def test_run_post_benchmark_eval_failure_nonfatal(self):
+        """RUN_EVAL=true with eval failure still returns benchmark exit code 0."""
+        import os
+        from unittest.mock import MagicMock, patch
+
+        orch = self._make_orchestrator()
+
+        with patch.dict(os.environ, {"EVAL_ONLY": "false", "RUN_EVAL": "true"}, clear=False):
+            with patch.object(orch, "start_head_infrastructure") as mock_head:
+                mock_head.return_value = MagicMock()
+                with patch.object(orch, "start_all_workers", return_value={}):
+                    with patch.object(orch, "start_frontend", return_value=[]):
+                        with patch.object(orch, "run_benchmark", return_value=0):
+                            with patch.object(orch, "_run_post_eval", return_value=1):
+                                with patch.object(orch, "run_postprocess"):
+                                    with patch("srtctl.cli.do_sweep.StatusReporter") as mock_reporter_cls:
+                                        mock_reporter_cls.from_config.return_value = MagicMock()
+                                        exit_code = orch.run()
+
+        assert exit_code == 0
+
+    def test_run_eval_skipped_when_benchmark_fails(self):
+        """RUN_EVAL=true but benchmark fails: eval is skipped."""
+        import os
+        from unittest.mock import MagicMock, patch
+
+        orch = self._make_orchestrator()
+
+        with patch.dict(os.environ, {"EVAL_ONLY": "false", "RUN_EVAL": "true"}, clear=False):
+            with patch.object(orch, "start_head_infrastructure") as mock_head:
+                mock_head.return_value = MagicMock()
+                with patch.object(orch, "start_all_workers", return_value={}):
+                    with patch.object(orch, "start_frontend", return_value=[]):
+                        with patch.object(orch, "run_benchmark", return_value=1):
+                            with patch.object(orch, "_run_post_eval") as mock_eval:
+                                with patch.object(orch, "run_postprocess"):
+                                    with patch("srtctl.cli.do_sweep.StatusReporter") as mock_reporter_cls:
+                                        mock_reporter_cls.from_config.return_value = MagicMock()
+                                        exit_code = orch.run()
+
+        mock_eval.assert_not_called()
+        assert exit_code == 1
diff --git a/tests/test_configs.py b/tests/test_configs.py
index 1c23fb30..0b4138d5 100644
--- a/tests/test_configs.py
+++ b/tests/test_configs.py
@@ -127,7 +127,11 @@ def test_hash_install_command(self):
         assert "git clone" in cmd
         assert "git checkout abc123" in cmd
         assert "maturin build" in cmd
-        assert "pip install -e" in cmd
+        assert "if [ -d /sgl-workspace ]" in cmd
+        assert "/tmp/dynamo_build" in cmd
+        assert "protobuf-compiler" in cmd
+        assert "if ! command -v cargo" in cmd
+        assert "if ! command -v maturin" in cmd
 
     def test_top_of_tree_install_command(self):
         """Top-of-tree config generates source install without checkout."""
@@ -140,6 +144,10 @@ def test_top_of_tree_install_command(self):
         assert "git clone" in cmd
         assert "git checkout" not in cmd
         assert "maturin build" in cmd
+        assert "if [ -d /sgl-workspace ]" in cmd
+        assert "/tmp/dynamo_build" in cmd
+        assert "--break-system-packages" in cmd
+        assert "--force-reinstall" in cmd
 
     def test_hash_and_top_of_tree_not_allowed(self):
         """Cannot specify both hash and top_of_tree."""
@@ -1072,6 +1080,8 @@ def test_standard_tp_mode_still_works(self):
 
     def test_vllm_get_process_environment(self):
         """Test vLLM sets port environment variables from process."""
+        from unittest.mock import patch
+
         from srtctl.backends import VLLMProtocol
         from srtctl.core.topology import Process
 
@@ -1090,10 +1100,12 @@ def test_vllm_get_process_environment(self):
             nixl_port=6550,
         )
 
-        env = backend.get_process_environment(process)
+        with patch("srtctl.core.slurm.get_hostname_ip", return_value="10.0.0.1"):
+            env = backend.get_process_environment(process)
 
         assert env["DYN_VLLM_KV_EVENT_PORT"] == "5550"
         assert env["VLLM_NIXL_SIDE_CHANNEL_PORT"] == "6550"
+        assert env["VLLM_NIXL_SIDE_CHANNEL_HOST"] == "10.0.0.1"
 
     def test_vllm_get_process_environment_none_ports(self):
         """Test vLLM handles None ports gracefully."""
@@ -1370,3 +1382,113 @@ def test_agg_mode_no_disaggregation_flag(self):
         assert "--disaggregation-mode" not in cmd
         assert "--is-prefill-worker" not in cmd
         assert "--is-decode-worker" not in cmd
+
+
+class TestInfmaxWorkspaceMount:
+    """Test that INFMAX_WORKSPACE env var creates a container mount."""
+
+    def test_infmax_workspace_mount_added(self, tmp_path):
+        """RuntimeContext includes /infmax-workspace mount when env var is set."""
+        import os
+        import subprocess
+        from pathlib import Path
+        from unittest.mock import MagicMock, patch
+
+        from srtctl.core.runtime import RuntimeContext
+        from srtctl.core.schema import ModelConfig, ResourceConfig, SrtConfig
+
+        model_path = tmp_path / "model"
+        model_path.mkdir()
+        container_path = tmp_path / "container.sqsh"
+        container_path.touch()
+
+        slurm_env = {
+            "SLURM_JOB_ID": "12345",
+            "SLURM_JOBID": "12345",
+            "SLURM_NODELIST": "gpu-[01-02]",
+            "SLURM_JOB_NUM_NODES": "2",
+            "SRTCTL_SOURCE_DIR": str(Path(__file__).parent.parent),
+            "INFMAX_WORKSPACE": "/actions/runner/workspace",
+        }
+
+        def mock_scontrol(cmd, **kwargs):
+            if cmd[0] == "scontrol" and "hostnames" in cmd:
+                result = MagicMock()
+                result.stdout = "gpu-01\ngpu-02"
+                result.returncode = 0
+                return result
+            raise subprocess.CalledProcessError(1, cmd)
+
+        with patch.dict(os.environ, slurm_env):
+            with patch("subprocess.run", mock_scontrol):
+                with patch("srtctl.core.slurm.get_hostname_ip", return_value="10.0.0.1"):
+                    config = SrtConfig(
+                        name="test",
+                        model=ModelConfig(
+                            path=str(model_path),
+                            container=str(container_path),
+                            precision="fp8",
+                        ),
+                        resources=ResourceConfig(
+                            gpu_type="h100",
+                            gpus_per_node=8,
+                            prefill_nodes=1,
+                            decode_nodes=1,
+                        ),
+                    )
+                    runtime = RuntimeContext.from_config(config, job_id="12345")
+
+        assert Path("/infmax-workspace") in runtime.container_mounts.values()
+
+    def test_infmax_workspace_mount_not_added_without_env(self, tmp_path):
+        """RuntimeContext does not include /infmax-workspace without env var."""
+        import os
+        import subprocess
+        from pathlib import Path
+        from unittest.mock import MagicMock, patch
+
+        from srtctl.core.runtime import RuntimeContext
+        from srtctl.core.schema import ModelConfig, ResourceConfig, SrtConfig
+
+        model_path = tmp_path / "model"
+        model_path.mkdir()
+        container_path = tmp_path / "container.sqsh"
+        container_path.touch()
+
+        slurm_env = {
+            "SLURM_JOB_ID": "12345",
+            "SLURM_JOBID": "12345",
+            "SLURM_NODELIST": "gpu-[01-02]",
+            "SLURM_JOB_NUM_NODES": "2",
+            "SRTCTL_SOURCE_DIR": str(Path(__file__).parent.parent),
+        }
+
+        def mock_scontrol(cmd, **kwargs):
+            if cmd[0] == "scontrol" and "hostnames" in cmd:
+                result = MagicMock()
+                result.stdout = "gpu-01\ngpu-02"
+                result.returncode = 0
+                return result
+            raise subprocess.CalledProcessError(1, cmd)
+
+        with patch.dict(os.environ, slurm_env):
+            os.environ.pop("INFMAX_WORKSPACE", None)
+            with patch("subprocess.run", mock_scontrol):
+                with patch("srtctl.core.slurm.get_hostname_ip", return_value="10.0.0.1"):
+                    config = SrtConfig(
+                        name="test",
+                        model=ModelConfig(
+                            path=str(model_path),
+                            container=str(container_path),
+                            precision="fp8",
+                        ),
+                        resources=ResourceConfig(
+                            gpu_type="h100",
+                            gpus_per_node=8,
+                            prefill_nodes=1,
+                            decode_nodes=1,
+                        ),
+                    )
+                    runtime = RuntimeContext.from_config(config, job_id="12345")
+
+        assert Path("/infmax-workspace") not in runtime.container_mounts.values()