diff --git a/.github/workflows/ci.yaml b/.github/workflows/ci.yaml
index eba897bb..dccdba05 100644
--- a/.github/workflows/ci.yaml
+++ b/.github/workflows/ci.yaml
@@ -4,7 +4,7 @@ on:
   push:
     branches: [main, master]
   pull_request:
-    branches: [main, master]
+    branches: [main, master, sa-submission-q2-2026]
 
 jobs:
   lint:
@@ -119,3 +119,4 @@ jobs:
             exit(1)
     print(f'\nAll {len(recipes)} recipes valid')
     "
+
diff --git a/docs/accuracy.md b/docs/accuracy.md
index f5588c9f..98b69b46 100644
--- a/docs/accuracy.md
+++ b/docs/accuracy.md
@@ -1,6 +1,6 @@
 # Accuracy Benchmarks
 
-In srt-slurm, users can run different accuracy benchmarks by setting the benchmark section in the config yaml file. Supported benchmarks include `mmlu`, `gpqa` and `longbenchv2`.
+In srt-slurm, users can run different accuracy benchmarks by setting the benchmark section in the config yaml file. Supported benchmarks include `mmlu`, `gpqa`, `longbenchv2`, and `lm-eval`.
 
 ## Table of Contents
 
@@ -14,6 +14,7 @@ In srt-slurm, users can run different accuracy benchma
   - [Example: Quick Validation](#example-quick-validation)
   - [Output](#output)
   - [Important Notes](#important-notes)
+- [lm-eval (InferenceX)](#lm-eval-inferencex)
 
 ---
 
@@ -191,3 +192,106 @@ The output includes per-category scores and aggregate metrics:
 
 4. **Categories**: Running specific categories is useful for targeted validation (e.g., just testing summarization capabilities)
 
+## lm-eval (InferenceX)
+
+The `lm-eval` benchmark runner integrates [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) via InferenceX's `benchmark_lib.sh`. Unlike the built-in benchmarks above, this runner sources evaluation logic from an external InferenceX workspace mounted at `/infmax-workspace`.
+
+This is used by InferenceX CI to run evals such as GSM8K and GPQA against NVIDIA multi-node disaggregated deployments on GB200, GB300, B200, B300, H100, and H200. AMD MI355X multi-node evals are handled by InferenceX's upstreamed AMD Slurm path, not by this srt-slurm runner.
+
+In InferenceX CI, recipes normally keep their throughput benchmark configuration. `do_sweep.py` invokes the registered `lm-eval` runner as a post-step when `RUN_EVAL=true`, or as the only benchmark-like step when `EVAL_ONLY=true`. There is no separate `infmax-eval` benchmark type.
+
+### How it works
+
+1. `RuntimeContext` mounts the host path from `INFMAX_WORKSPACE` at `/infmax-workspace` inside the Slurm container.
+2. `do_sweep.py` starts infrastructure, workers, and the frontend for the normal recipe topology.
+3. For `EVAL_ONLY=true`, `do_sweep.py` skips the throughput benchmark stage and runs `_run_post_eval()` directly after frontend startup.
+4. `_run_post_eval()` waits for the OpenAI-compatible endpoint on port 8000 and, in eval-only mode, performs the full `wait_for_model()` health check for the configured prefill/decode or aggregated topology.
+5. `_run_post_eval()` launches the registered `lm-eval` runner on the head node and passes through InferenceX metadata such as framework, precision, sequence length, prefill/decode topology, and eval concurrency.
+6. The runner script (`benchmarks/scripts/lm-eval/bench.sh`) uses `MODEL_NAME` from `do_sweep.py`, or auto-discovers the served model from `/v1/models` as a fallback (see the sketch after this list).
+7. The runner sources `/infmax-workspace/benchmarks/benchmark_lib.sh`, runs `run_eval --framework lm-eval`, and calls `append_lm_eval_summary`.
+8. Eval artifacts are copied to `/logs/eval_results/` for InferenceX launcher-side artifact pickup.
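+
+As a rough sketch of the step-6 fallback (hypothetical; the actual endpoint handling and JSON parsing in `bench.sh` may differ), the served-model alias can be recovered from the frontend when `do_sweep.py` does not provide one:
+
+```bash
+# Hypothetical sketch of the MODEL_NAME fallback described in step 6.
+# Assumes the frontend's OpenAI-compatible API on port 8000 and that jq is available.
+if [ -z "${MODEL_NAME:-}" ]; then
+  MODEL_NAME="$(curl -s http://localhost:8000/v1/models | jq -r '.data[0].id')"
+fi
+```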
+
+### EVAL_ONLY mode
+
+srt-slurm supports an `EVAL_ONLY` mode for CI jobs that should only validate accuracy. This is controlled by environment variables from the InferenceX workflow:
+
+| Env var | Description |
+|---------|-------------|
+| `EVAL_ONLY` | Set to `true` to skip the throughput benchmark stage and run eval only |
+| `RUN_EVAL` | Set to `true` to run eval after the throughput benchmark completes |
+| `EVAL_CONC` | Concurrent requests for lm-eval, normally set by InferenceX from the generated `eval-conc` value |
+| `INFMAX_WORKSPACE` | Host path to the InferenceX checkout that should be mounted at `/infmax-workspace` |
+| `MODEL_NAME` | Served model alias for OpenAI-compatible requests; set by `do_sweep.py` from `config.served_model_name` |
+
+When `EVAL_ONLY=true`:
+- Stage 4 skips the throughput benchmark entirely. No throughput result JSON is expected from srt-slurm.
+- The eval path uses the full `wait_for_model()` health check before starting lm-eval.
+- `_run_post_eval()` launches the `lm-eval` runner and returns its exit code.
+- Eval failure is fatal because eval is the only purpose of the job.
+
+When `RUN_EVAL=true` (without `EVAL_ONLY`):
+- The throughput benchmark runs normally.
+- After the benchmark completes successfully, eval runs as a post-step.
+- Eval failure is non-fatal; the benchmark job still succeeds if throughput passed.
+
+### Environment variables
+
+The following env vars are passed through to the lm-eval runner container:
+
+| Env var | Purpose |
+|---------|---------|
+| `RUN_EVAL`, `EVAL_ONLY`, `IS_MULTINODE` | Control whether eval runs and how InferenceX classifies the artifact |
+| `FRAMEWORK`, `PRECISION`, `MODEL_PREFIX`, `RUNNER_TYPE`, `SPEC_DECODING` | Benchmark identity metadata for `meta_env.json` |
+| `ISL`, `OSL`, `RESULT_FILENAME` | Sequence length and result-file metadata |
+| `MODEL`, `MODEL_PATH`, `MODEL_NAME` | Model metadata and the served model alias used for requests |
+| `MAX_MODEL_LEN`, `EVAL_MAX_MODEL_LEN` | Context-length metadata used by InferenceX eval helpers when available |
+| `PREFILL_TP`, `PREFILL_EP`, `PREFILL_NUM_WORKERS`, `PREFILL_DP_ATTN` | Prefill-side topology metadata |
+| `DECODE_TP`, `DECODE_EP`, `DECODE_NUM_WORKERS`, `DECODE_DP_ATTN` | Decode-side topology metadata |
+| `EVAL_CONC`, `EVAL_CONCURRENT_REQUESTS` | Eval concurrency controls |
+
+The runner maps srt-slurm's `PREFILL_DP_ATTN` and `DECODE_DP_ATTN` names to InferenceX's `PREFILL_DP_ATTENTION` and `DECODE_DP_ATTENTION` names before calling `append_lm_eval_summary`. This is required for multi-node summary tables to preserve prefill/decode DP-attention (DPA) state.
+
+### Concurrency
+
+Eval concurrency is ultimately read by InferenceX's `benchmark_lib.sh` from `EVAL_CONCURRENT_REQUESTS`. The runner script sets that value from `EVAL_CONC` when present, preserves an existing `EVAL_CONCURRENT_REQUESTS` otherwise, and falls back to `256` only if neither variable is set:
+
+```bash
+export EVAL_CONCURRENT_REQUESTS="${EVAL_CONC:-${EVAL_CONCURRENT_REQUESTS:-256}}"
+```
+
+The InferenceX workflow sets `EVAL_CONC` from the generated `eval-conc` value. For multi-node configs, InferenceX selects the `8k1k` entry with the highest max eligible concurrency for each `(model, runner, framework, precision, spec-decoding, prefill-dp-attn, decode-dp-attn)` group, then sets `eval-conc` to the upper median of that config's eligible concurrency list (illustrated below). If `EVAL_CONC` is not set in the environment, `do_sweep.py` falls back to the max of the recipe benchmark concurrency list.
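+
+As a concrete illustration of the "upper median" rule (a hypothetical sketch, not the actual InferenceX selection code), picking from an eligible concurrency list works like this:
+
+```bash
+# Hypothetical illustration of the upper-median pick described above.
+# Sorts the eligible concurrencies and takes the element at index len/2.
+upper_median() {
+  local sorted=($(printf '%s\n' "$@" | sort -n))
+  echo "${sorted[$(( ${#sorted[@]} / 2 ))]}"
+}
+upper_median 4 180 360 616   # prints 360
+```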
+
+### Output
+
+Eval artifacts are written to `/logs/eval_results/` inside the container:
+- `meta_env.json` - metadata used by InferenceX aggregation and summary tables
+- `results*.json` - lm-eval scores per task
+- `sample*.jsonl` - per-sample outputs
+
+These are collected by the InferenceX NVIDIA launch scripts and uploaded as workflow artifacts. In eval-only mode the InferenceX workflow expects eval artifacts, not throughput benchmark artifacts.
+
+### Intricacies
+1. Eval concurrency floor of 16
+   - One sweep config uses `conc: [1]`; running lm-eval at a concurrency of 1 takes more than 4 hours to complete, so eval concurrency is floored at 16.
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep16_batch32_eplb0_mtp2.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep16_batch32_eplb0_mtp2.yaml
new file mode 100644
index 00000000..21edc148
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep16_batch32_eplb0_mtp2.yaml
@@ -0,0 +1,135 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep16_batch32_eplb0_mtp2"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=32
+# concurrency: 666
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 4
+
+  decode_workers: 1
+  decode_nodes: 4
+  gpus_per_decode: 16
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 16
+      max_num_tokens: 16384
+      max_seq_len: 1064
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 2
+
+    decode:
+      tensor_parallel_size: 16
+      moe_expert_parallel_size: 16
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 32
+      max_num_tokens: 96
+      max_seq_len: 2088
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+      moe_config:
+        backend: CUTEDSL
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.7
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 2
+
+benchmark:
+  type: "sa-bench"
+  isl: 1024
+  osl: 1024
+  concurrencies: "666"
+
req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep16_batch64_eplb0_mtp1.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep16_batch64_eplb0_mtp1.yaml new file mode 100644 index 00000000..ebcd45d1 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep16_batch64_eplb0_mtp1.yaml @@ -0,0 +1,139 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep16_batch64_eplb0_mtp1" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=64 +# concurrency: 1229 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 64 + max_num_tokens: 128 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "1229" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git 
a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep32_batch16_allconc_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep32_batch16_allconc_eplb0_mtp3.yaml new file mode 100644 index 00000000..68af65ee --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep32_batch16_allconc_eplb0_mtp3.yaml @@ -0,0 +1,133 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep32_batch16_allconc_eplb0_mtp3" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=16 +# concurrencies: 333 (batch8), 666 (batch16) + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 64 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "333x666" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen4tep8_batch16_eplb0_mtp2.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen4tep8_batch16_eplb0_mtp2.yaml new file mode 100644 index 00000000..d6d3dcf1 --- /dev/null +++ 
b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen4tep8_batch16_eplb0_mtp2.yaml @@ -0,0 +1,134 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep4_gen4tep8_batch16_eplb0_mtp2" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 4 decode workers, TP8/EP8, enable_attention_dp=false, max_batch=16 +# concurrency: 96 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + + decode: + allreduce_strategy: MNNVL + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 48 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "96" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen4tep8_batch32_allconc_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen4tep8_batch32_allconc_eplb0_mtp3.yaml new file mode 100644 index 00000000..da187faf --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen4tep8_batch32_allconc_eplb0_mtp3.yaml @@ -0,0 +1,136 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep4_gen4tep8_batch32_allconc_eplb0_mtp3" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 4 decode workers, TP8/EP8, enable_attention_dp=false, max_batch=32 +# 
concurrencies: 8 (batch1), 44 (batch8), 192 (batch32) + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + allreduce_strategy: MNNVL + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "8x44x192" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen5tep4_batch1_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen5tep4_batch1_eplb0_mtp3.yaml new file mode 100644 index 00000000..a6121cd0 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen5tep4_batch1_eplb0_mtp3.yaml @@ -0,0 +1,131 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep4_gen5tep4_batch1_eplb0_mtp3" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 5 decode workers, TP4/EP4, enable_attention_dp=false, max_batch=1 +# concurrency: 10 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + 
gpus_per_decode: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 1 + max_num_tokens: 4 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "10" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen1dep16_batch256_eplb256_mtp1.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen1dep16_batch256_eplb256_mtp1.yaml new file mode 100644 index 00000000..dc176b2d --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen1dep16_batch256_eplb256_mtp1.yaml @@ -0,0 +1,167 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx2dep4_gen1dep16_batch256_eplb256_mtp1" + +# ctx: 2 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=256 +# EPLB: num_slots=256 +# concurrency: 4301 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + 
MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 256 + max_num_tokens: 512 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + layer_updates_per_iter: 1 + num_slots: 256 + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "4301" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx3dep4_gen1dep32_batch128_eplb288_mtp1.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx3dep4_gen1dep32_batch128_eplb288_mtp1.yaml new file mode 100644 index 00000000..a7a1c790 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/MTP/ctx3dep4_gen1dep32_batch128_eplb288_mtp1.yaml @@ -0,0 +1,151 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx3dep4_gen1dep32_batch128_eplb288_mtp1" + +# ctx: 3 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=128 +# EPLB: num_slots=288 +# concurrency: 4301 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 3 + prefill_workers: 3 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + 
NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 128 + max_num_tokens: 256 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + layer_updates_per_iter: 1 + num_slots: 288 + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "4301" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep32_batch32_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep32_batch32_eplb0_mtp0.yaml new file mode 100644 index 00000000..7412a109 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep32_batch32_eplb0_mtp0.yaml @@ -0,0 +1,129 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep32_batch32_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=32 +# concurrency: 1229 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + 
moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "1229" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0.yaml new file mode 100644 index 00000000..e969c07d --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0.yaml @@ -0,0 +1,142 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 4 decode workers, TP8/EP8, enable_attention_dp=false, max_batch=128 +# Merged concurrencies: batch1(4), batch32(180), batch64(360), batch128(616) + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + allreduce_strategy: MNNVL + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "4x180x360x616" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml new file mode 100644 index 00000000..fb583747 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml @@ -0,0 +1,126 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 5 decode workers, TP4/EP4, enable_attention_dp=false, max_batch=8 +# Merged concurrencies: batch1(5), batch2(15), batch4(30), batch8(50) + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + gpus_per_decode: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + 
custom_tokenizer: "glm_moe_dsa" + max_batch_size: 8 + max_num_tokens: 8 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "5x15x30x50" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep16_batch128_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep16_batch128_eplb0_mtp0.yaml new file mode 100644 index 00000000..e057ce05 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep16_batch128_eplb0_mtp0.yaml @@ -0,0 +1,141 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx2dep4_gen1dep16_batch128_eplb0_mtp0" + +# ctx: 2 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=128 +# concurrency: 2253 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + 
free_gpu_memory_fraction: 0.75 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "2253" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep16_batch512_eplb256_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep16_batch512_eplb256_mtp0.yaml new file mode 100644 index 00000000..d221dde2 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep16_batch512_eplb256_mtp0.yaml @@ -0,0 +1,193 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx2dep4_gen1dep16_batch512_eplb256_mtp0" + +# ctx: 2 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=512 +# EPLB: num_slots=256 +# concurrency: 8192 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 512 + max_num_tokens: 512 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + layer_updates_per_iter: 1 + 
num_slots: 256 + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "8192" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep32_batch64_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep32_batch64_eplb0_mtp0.yaml new file mode 100644 index 00000000..bbad79c1 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep32_batch64_eplb0_mtp0.yaml @@ -0,0 +1,133 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx2dep4_gen1dep32_batch64_eplb0_mtp0" + +# ctx: 2 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=64 +# concurrency: 2253 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "2253" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + 
enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx4dep4_gen1dep32_batch256_eplb288_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx4dep4_gen1dep32_batch256_eplb288_mtp0.yaml new file mode 100644 index 00000000..26d2d29e --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL1K_OSL1K/STP/ctx4dep4_gen1dep32_batch256_eplb288_mtp0.yaml @@ -0,0 +1,161 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx4dep4_gen1dep32_batch256_eplb288_mtp0" + +# ctx: 4 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=256 +# EPLB: num_slots=288 +# concurrency: 8192 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 4 + prefill_workers: 4 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + layer_updates_per_iter: 1 + num_slots: 288 + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "8192" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git 
a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx10dep4_gen1dep16_batch64_eplb0_mtp1.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx10dep4_gen1dep16_batch64_eplb0_mtp1.yaml new file mode 100644 index 00000000..420192c2 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx10dep4_gen1dep16_batch64_eplb0_mtp1.yaml @@ -0,0 +1,139 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx10dep4_gen1dep16_batch64_eplb0_mtp1" + +# ctx: 10 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=64 +# concurrency: 1229 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 10 + prefill_workers: 10 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 64 + max_num_tokens: 128 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "1229" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen2tep8_batch16_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen2tep8_batch16_eplb0_mtp3.yaml new file mode 100644 index 00000000..da3186e5 --- /dev/null +++ 
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen2tep8_batch16_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen2tep8_batch16_eplb0_mtp3.yaml
new file mode 100644
index 00000000..da3186e5
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen2tep8_batch16_eplb0_mtp3.yaml
@@ -0,0 +1,133 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep4_gen2tep8_batch16_eplb0_mtp3"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 2 decode workers, TP8/EP8, enable_attention_dp=false, max_batch=16, concurrency: 46
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 4
+
+  decode_workers: 2
+  decode_nodes: 4
+  gpus_per_decode: 8
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 2
+      max_num_tokens: 16640
+      max_seq_len: 8232
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+    decode:
+      allreduce_strategy: MNNVL
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 16
+      max_num_tokens: 64
+      max_seq_len: 9256
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+      moe_config:
+        backend: TRTLLM
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+benchmark:
+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  concurrencies: "46"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
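+
+# Note (reviewer annotation, not part of the tuned recipe): the "tep"/"dep"
+# fragments in recipe names appear to track the decode enable_attention_dp
+# flag: "gen2tep8" pairs with enable_attention_dp: false (plain TP/EP),
+# while "gen1dep32"-style recipes set enable_attention_dp: true (attention
+# data parallel). Treat this as an inferred naming convention, not a
+# guaranteed one.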
"nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + allreduce_strategy: MNNVL + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "4x48" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen5tep4_batch1_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen5tep4_batch1_eplb0_mtp3.yaml new file mode 100644 index 00000000..0a13cce4 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen5tep4_batch1_eplb0_mtp3.yaml @@ -0,0 +1,130 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep4_gen5tep4_batch1_eplb0_mtp3" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 5 decode workers, TP4/EP4, max_batch=1, concurrency: 5 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + gpus_per_decode: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: 
"0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 1 + max_num_tokens: 4 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "5" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx3dep4_gen1dep32_batch4_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx3dep4_gen1dep32_batch4_eplb0_mtp3.yaml new file mode 100644 index 00000000..440a4f73 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx3dep4_gen1dep32_batch4_eplb0_mtp3.yaml @@ -0,0 +1,130 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx3dep4_gen1dep32_batch4_eplb0_mtp3" + +# ctx: 3 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP32/EP32, max_batch=4, concurrency: 167 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 3 + prefill_workers: 3 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + 
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx3dep4_gen1dep32_batch4_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx3dep4_gen1dep32_batch4_eplb0_mtp3.yaml
new file mode 100644
index 00000000..440a4f73
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx3dep4_gen1dep32_batch4_eplb0_mtp3.yaml
@@ -0,0 +1,130 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx3dep4_gen1dep32_batch4_eplb0_mtp3"
+
+# ctx: 3 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=4, concurrency: 167
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+
+  prefill_nodes: 3
+  prefill_workers: 3
+  gpus_per_prefill: 4
+
+  decode_workers: 1
+  decode_nodes: 8
+  gpus_per_decode: 32
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 2
+      max_num_tokens: 16640
+      max_seq_len: 8232
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+    decode:
+      tensor_parallel_size: 32
+      moe_expert_parallel_size: 32
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 4
+      max_num_tokens: 16
+      max_seq_len: 9256
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+      moe_config:
+        backend: CUTEDSL
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+benchmark:
+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  concurrencies: "167"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
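+
+# Note (reviewer annotation, not part of the tuned recipe): the
+# sequence-length caps look derived from the benchmark shape with a small
+# safety margin, assuming roughly
+#   prefill max_seq_len = ISL + 40        (8192 + 40 = 8232)
+#   decode  max_seq_len = ISL + OSL + 40  (8192 + 1024 + 40 = 9256)
+# and similarly 1064 / 2088 in the ISL1K recipes. The 40-token margin is an
+# observed constant, not a documented requirement.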
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx5dep4_gen1dep32_batch8_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx5dep4_gen1dep32_batch8_eplb0_mtp3.yaml
new file mode 100644
index 00000000..492f1b4c
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx5dep4_gen1dep32_batch8_eplb0_mtp3.yaml
@@ -0,0 +1,132 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx5dep4_gen1dep32_batch8_eplb0_mtp3"
+
+# ctx: 5 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=8
+# concurrency: 333
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+
+  prefill_nodes: 5
+  prefill_workers: 5
+  gpus_per_prefill: 4
+
+  decode_workers: 1
+  decode_nodes: 8
+  gpus_per_decode: 32
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 2
+      max_num_tokens: 16640
+      max_seq_len: 8232
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+    decode:
+      tensor_parallel_size: 32
+      moe_expert_parallel_size: 32
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 8
+      max_num_tokens: 32
+      max_seq_len: 9256
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+      moe_config:
+        backend: CUTEDSL
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+benchmark:
+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  concurrencies: "333"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx7dep4_gen1dep16_batch32_eplb0_mtp2.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx7dep4_gen1dep16_batch32_eplb0_mtp2.yaml
new file mode 100644
index 00000000..d22fbcf1
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx7dep4_gen1dep16_batch32_eplb0_mtp2.yaml
@@ -0,0 +1,135 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx7dep4_gen1dep16_batch32_eplb0_mtp2"
+
+# ctx: 7 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=32
+# concurrency: 615
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+
+  prefill_nodes: 7
+  prefill_workers: 7
+  gpus_per_prefill: 4
+
+  decode_workers: 1
+  decode_nodes: 4
+  gpus_per_decode: 16
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 2
+      max_num_tokens: 16640
+      max_seq_len: 8232
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 2
+
+    decode:
+      tensor_parallel_size: 16
+      moe_expert_parallel_size: 16
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 32
+      max_num_tokens: 96
+      max_seq_len: 9256
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+      moe_config:
+        backend: CUTEDSL
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.7
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 2
+
+benchmark:
+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  concurrencies: "615"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
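+
+# Note (reviewer annotation, not part of the tuned recipe): prefill pins
+# free_gpu_memory_fraction at 0.6 in every recipe, while the decode fraction
+# varies (roughly 0.6-0.9) with the decode topology; the wide attention-DP
+# workers appear to reserve more headroom for activations, leaving the KV
+# pool a smaller share, while the small TP4/TP8 TEP workers push it up to
+# 0.85-0.9. This is an observed trend across the recipes, not a stated rule.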
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx7dep4_gen1dep8_batch128_eplb0_mtp1.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx7dep4_gen1dep8_batch128_eplb0_mtp1.yaml
new file mode 100644
index 00000000..804e89b5
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/MTP/ctx7dep4_gen1dep8_batch128_eplb0_mtp1.yaml
@@ -0,0 +1,147 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx7dep4_gen1dep8_batch128_eplb0_mtp1"
+
+# ctx: 7 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP8/EP8, enable_attention_dp=true, max_batch=128
+# concurrency: 1076
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+
+  prefill_nodes: 7
+  prefill_workers: 7
+  gpus_per_prefill: 4
+
+  decode_workers: 1
+  decode_nodes: 2
+  gpus_per_decode: 8
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 2
+      max_num_tokens: 16640
+      max_seq_len: 8232
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 1
+
+    decode:
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 128
+      max_num_tokens: 256
+      max_seq_len: 9256
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+          - 40
+          - 48
+          - 56
+          - 64
+          - 72
+          - 80
+          - 88
+          - 96
+          - 104
+          - 112
+          - 120
+          - 128
+      moe_config:
+        backend: CUTEDSL
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 1
+
+benchmark:
+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  concurrencies: "1076"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx10dep4_gen1dep16_batch128_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx10dep4_gen1dep16_batch128_eplb0_mtp0.yaml
new file mode 100644
index 00000000..0fa8566d
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx10dep4_gen1dep16_batch128_eplb0_mtp0.yaml
@@ -0,0 +1,141 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx10dep4_gen1dep16_batch128_eplb0_mtp0"
+
+# ctx: 10 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=128
+# concurrency: 2253
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+
+  prefill_nodes: 10
+  prefill_workers: 10
+  gpus_per_prefill: 4
+
+  decode_workers: 1
+  decode_nodes: 4
+  gpus_per_decode: 16
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 2
+      max_num_tokens: 16640
+      max_seq_len: 8232
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+
+    decode:
+      tensor_parallel_size: 16
+      moe_expert_parallel_size: 16
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: false
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 128
+      max_num_tokens: 128
+      max_seq_len: 9256
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+          - 40
+          - 48
+          - 56
+          - 64
+          - 72
+          - 80
+          - 88
+          - 96
+          - 104
+          - 112
+          - 120
+          - 128
+      moe_config:
+        backend: CUTEDSL
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.8
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+
+benchmark:
+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  concurrencies: "2253"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen2tep8_batch32_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen2tep8_batch32_eplb0_mtp0.yaml
new file mode 100644
index 00000000..478f6203
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen2tep8_batch32_eplb0_mtp0.yaml
@@ -0,0 +1,130 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep4_gen2tep8_batch32_eplb0_mtp0"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 2 decode workers, TP8/EP8, enable_attention_dp=false, max_batch=32
+# concurrency: 84
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 4
+
+  decode_workers: 2
+  decode_nodes: 4
+  gpus_per_decode: 8
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 2
+      max_num_tokens: 16640
+      max_seq_len: 8232
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+
+    decode:
+      allreduce_strategy: MNNVL
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 32
+      max_num_tokens: 32
+      max_seq_len: 9256
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+      moe_config:
+        backend: TRTLLM
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+
+benchmark:
+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  concurrencies: "84"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
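+
+# Note (reviewer annotation, not part of the tuned recipe):
+# allreduce_strategy: MNNVL shows up only in the TP8 decode workers that
+# span two GB200 nodes (8 GPUs at 4 per node), presumably to route the TP
+# allreduce over multi-node NVLink; single-node TP4 decode recipes leave the
+# strategy at its default. Inferred from the recipe set, not documented here.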
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen3tep4_batch32_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen3tep4_batch32_eplb0_mtp0.yaml
new file mode 100644
index 00000000..462401b6
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen3tep4_batch32_eplb0_mtp0.yaml
@@ -0,0 +1,129 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep4_gen3tep4_batch32_eplb0_mtp0"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 3 decode workers, TP4/EP4, enable_attention_dp=false, max_batch=32
+# concurrency: 117
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 4
+
+  decode_workers: 3
+  decode_nodes: 3
+  gpus_per_decode: 4
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 2
+      max_num_tokens: 16640
+      max_seq_len: 8232
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+
+    decode:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      pipeline_parallel_size: 1
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 32
+      max_num_tokens: 32
+      max_seq_len: 9256
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+      moe_config:
+        backend: TRTLLM
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+
+benchmark:
+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  concurrencies: "117"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml
new file mode 100644
index 00000000..90e62af3
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml
@@ -0,0 +1,126 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0"
+
+# ctx: 1 prefill worker, TP4/EP4
+# gen: 5 decode workers, TP4/EP4, enable_attention_dp=false, max_batch=8
+# Merged concurrencies: batch1(5), batch2(10), batch4(25), batch8(50)
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 4
+
+  decode_workers: 5
+  decode_nodes: 5
+  gpus_per_decode: 4
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 2
+      max_num_tokens: 16640
+      max_seq_len: 8232
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+
+    decode:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      pipeline_parallel_size: 1
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 8
+      max_num_tokens: 8
+      max_seq_len: 9256
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+      moe_config:
+        backend: TRTLLM
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+
+benchmark:
+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  concurrencies: "5x10x25x50"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx5dep4_gen1dep32_batch16_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx5dep4_gen1dep32_batch16_eplb0_mtp0.yaml
new file mode 100644
index 00000000..7a6ece31
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx5dep4_gen1dep32_batch16_eplb0_mtp0.yaml
@@ -0,0 +1,127 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx5dep4_gen1dep32_batch16_eplb0_mtp0"
+
+# ctx: 5 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=16
+# concurrency: 615
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+
+  prefill_nodes: 5
+  prefill_workers: 5
+  gpus_per_prefill: 4
+
+  decode_workers: 1
+  decode_nodes: 8
+  gpus_per_decode: 32
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 2
+      max_num_tokens: 16640
+      max_seq_len: 8232
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+
+    decode:
+      tensor_parallel_size: 32
+      moe_expert_parallel_size: 32
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: false
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 16
+      max_num_tokens: 16
+      max_seq_len: 9256
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+      moe_config:
+        backend: CUTEDSL
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.75
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+
+benchmark:
+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  concurrencies: "615"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx8dep4_gen1dep32_batch32_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx8dep4_gen1dep32_batch32_eplb0_mtp0.yaml
new file mode 100644
index 00000000..7e34b6d9
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb200_nvfp4/ISL8K_OSL1K/STP/ctx8dep4_gen1dep32_batch32_eplb0_mtp0.yaml
@@ -0,0 +1,129 @@
+name: "glm5_nvfp4_ISL8K_OSL1K_ctx8dep4_gen1dep32_batch32_eplb0_mtp0"
+
+# ctx: 8 prefill workers, TP4/EP4
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=32
+# concurrency: 1229
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb200"
+
+  prefill_nodes: 8
+  prefill_workers: 8
+  gpus_per_prefill: 4
+
+  decode_workers: 1
+  decode_nodes: 8
+  gpus_per_decode: 32
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 2
+      max_num_tokens: 16640
+      max_seq_len: 8232
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+
+    decode:
+      tensor_parallel_size: 32
+      moe_expert_parallel_size: 32
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: false
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 32
+      max_num_tokens: 32
+      max_seq_len: 9256
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+      moe_config:
+        backend: CUTEDSL
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.75
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+
+benchmark:
+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  concurrencies: "1229"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
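+
+# Note (reviewer annotation, not part of the tuned recipe): the STP/
+# directory holds the mtp0 variants, which drop the speculative_config block
+# entirely; the MTP/ variants add
+#   speculative_config:
+#     decoding_type: MTP
+#     num_nextn_predict_layers: <1|2|3>
+# to both prefill and decode. "STP" presumably reads as single-token
+# prediction, i.e. no speculative decoding (an assumption from the layout).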
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen1dep32_batch8_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen1dep32_batch8_eplb0_mtp3.yaml
new file mode 100644
index 00000000..80aacc6a
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen1dep32_batch8_eplb0_mtp3.yaml
@@ -0,0 +1,132 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep2_gen1dep32_batch8_eplb0_mtp3"
+
+# ctx: 1 prefill worker, TP2/EP2
+# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=8
+# concurrency: 333
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb300"
+
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 2
+
+  decode_workers: 1
+  decode_nodes: 8
+  gpus_per_decode: 32
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 2
+      moe_expert_parallel_size: 2
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 16
+      max_num_tokens: 16384
+      max_seq_len: 1064
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+    decode:
+      tensor_parallel_size: 32
+      moe_expert_parallel_size: 32
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 8
+      max_num_tokens: 32
+      max_seq_len: 2088
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+      moe_config:
+        backend: CUTEDSL
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+benchmark:
+  type: "sa-bench"
+  isl: 1024
+  osl: 1024
+  concurrencies: "333"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
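+
+# Note (reviewer annotation, not part of the tuned recipe): the prefill
+# token budget appears sized as
+#   max_num_tokens >= max_batch_size * ISL
+# (16 * 1024 = 16384 here; the ISL8K recipes budget 16640 >= 2 * 8232), so a
+# full prefill batch fits in one scheduling step. Observed sizing, not a
+# stated constraint.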
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen4tep8_batch16_allconc_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen4tep8_batch16_allconc_eplb0_mtp3.yaml
new file mode 100644
index 00000000..648ec949
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen4tep8_batch16_allconc_eplb0_mtp3.yaml
@@ -0,0 +1,134 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep2_gen4tep8_batch16_allconc_eplb0_mtp3"
+
+# ctx: 1 prefill worker, TP2/EP2
+# gen: 4 decode workers, TP8/EP8, enable_attention_dp=false, max_batch=16
+# concurrencies: 24 (batch4), 44 (batch8), 92 (batch16)
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb300"
+
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 2
+
+  decode_workers: 4
+  decode_nodes: 8
+  gpus_per_decode: 8
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 2
+      moe_expert_parallel_size: 2
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 16
+      max_num_tokens: 16384
+      max_seq_len: 1064
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+    decode:
+      allreduce_strategy: MNNVL
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 16
+      max_num_tokens: 64
+      max_seq_len: 2088
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+      moe_config:
+        backend: TRTLLM
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+benchmark:
+  type: "sa-bench"
+  isl: 1024
+  osl: 1024
+  concurrencies: "24x44x92"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen4tep8_batch32_eplb0_mtp2.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen4tep8_batch32_eplb0_mtp2.yaml
new file mode 100644
index 00000000..823624ac
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen4tep8_batch32_eplb0_mtp2.yaml
@@ -0,0 +1,136 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep2_gen4tep8_batch32_eplb0_mtp2"
+
+# ctx: 1 prefill worker, TP2/EP2
+# gen: 4 decode workers, TP8/EP8, enable_attention_dp=false, max_batch=32
+# concurrency: 180
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb300"
+
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 2
+
+  decode_workers: 4
+  decode_nodes: 8
+  gpus_per_decode: 8
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 2
+      moe_expert_parallel_size: 2
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 16
+      max_num_tokens: 16384
+      max_seq_len: 1064
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 2
+
+    decode:
+      allreduce_strategy: MNNVL
+      tensor_parallel_size: 8
+      moe_expert_parallel_size: 8
+      pipeline_parallel_size: 1
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 32
+      max_num_tokens: 96
+      max_seq_len: 2088
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+      moe_config:
+        backend: TRTLLM
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.9
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 2
+
+benchmark:
+  type: "sa-bench"
+  isl: 1024
+  osl: 1024
+  concurrencies: "180"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen5tep4_batch1_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen5tep4_batch1_eplb0_mtp3.yaml
new file mode 100644
index 00000000..64b61b9f
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx1dep2_gen5tep4_batch1_eplb0_mtp3.yaml
@@ -0,0 +1,131 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep2_gen5tep4_batch1_eplb0_mtp3"
+
+# ctx: 1 prefill worker, TP2/EP2
+# gen: 5 decode workers, TP4/EP4, enable_attention_dp=false, max_batch=1
+# concurrency: 10
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb300"
+
+  prefill_nodes: 1
+  prefill_workers: 1
+  gpus_per_prefill: 2
+
+  decode_workers: 5
+  decode_nodes: 5
+  gpus_per_decode: 4
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 2
+      moe_expert_parallel_size: 2
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 16
+      max_num_tokens: 16384
+      max_seq_len: 1064
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+    decode:
+      tensor_parallel_size: 4
+      moe_expert_parallel_size: 4
+      pipeline_parallel_size: 1
+      enable_attention_dp: false
+      enable_lm_head_tp_in_adp: false
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 1
+      max_num_tokens: 4
+      max_seq_len: 2088
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+      moe_config:
+        backend: TRTLLM
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.85
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 3
+
+benchmark:
+  type: "sa-bench"
+  isl: 1024
+  osl: 1024
+  concurrencies: "10"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx2dep2_gen1dep16_batch64_eplb0_mtp2.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx2dep2_gen1dep16_batch64_eplb0_mtp2.yaml
new file mode 100644
index 00000000..66d211aa
--- /dev/null
+++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx2dep2_gen1dep16_batch64_eplb0_mtp2.yaml
@@ -0,0 +1,139 @@
+name: "glm5_nvfp4_ISL1K_OSL1K_ctx2dep2_gen1dep16_batch64_eplb0_mtp2"
+
+# ctx: 2 prefill workers, TP2/EP2
+# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=64
+# concurrency: 1229
+
+model:
+  path: "nvidia/GLM5-NVFP4"
+  container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3"
+  precision: "fp4"
+
+resources:
+  gpu_type: "gb300"
+
+  prefill_nodes: 1
+  prefill_workers: 2
+  gpus_per_prefill: 2
+
+  decode_workers: 1
+  decode_nodes: 4
+  gpus_per_decode: 16
+
+  gpus_per_node: 4
+
+backend:
+  type: trtllm
+
+  prefill_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  decode_environment:
+    ENROOT_ALLOW_DEV: "yes"
+    MIMALLOC_PURGE_DELAY: "0"
+    NCCL_GRAPH_MIXING_SUPPORT: "0"
+    TLLM_LOG_LEVEL: "INFO"
+    TRTLLM_ENABLE_PDL: "1"
+    TRTLLM_SERVER_DISABLE_GC: "1"
+    TRTLLM_WORKER_DISABLE_GC: "1"
+
+  trtllm_config:
+    prefill:
+      tensor_parallel_size: 2
+      moe_expert_parallel_size: 2
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      disable_overlap_scheduler: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 16
+      max_num_tokens: 16384
+      max_seq_len: 1064
+      print_iter_log: true
+      cuda_graph_config: null
+      moe_config:
+        backend: CUTEDSL
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.6
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 2
+
+    decode:
+      tensor_parallel_size: 16
+      moe_expert_parallel_size: 16
+      pipeline_parallel_size: 1
+      enable_attention_dp: true
+      enable_lm_head_tp_in_adp: true
+      trust_remote_code: true
+      custom_tokenizer: "glm_moe_dsa"
+      max_batch_size: 64
+      max_num_tokens: 192
+      max_seq_len: 2088
+      print_iter_log: true
+      stream_interval: 100
+      num_postprocess_workers: 4
+      cuda_graph_config:
+        enable_padding: true
+        batch_sizes:
+          - 1
+          - 2
+          - 4
+          - 8
+          - 16
+          - 24
+          - 32
+          - 40
+          - 48
+          - 56
+          - 64
+      moe_config:
+        backend: CUTEDSL
+        use_low_precision_moe_combine: true
+      kv_cache_config:
+        dtype: fp8
+        enable_block_reuse: false
+        free_gpu_memory_fraction: 0.7
+      cache_transceiver_config:
+        backend: UCX
+        max_tokens_in_buffer: 16384
+      nvfp4_gemm_config:
+        allowed_backends:
+          - cutlass
+          - cublaslt
+          - cutedsl
+          - cuda_core
+      speculative_config:
+        decoding_type: MTP
+        num_nextn_predict_layers: 2
+
+benchmark:
+  type: "sa-bench"
+  isl: 1024
+  osl: 1024
+  concurrencies: "1229"
+  req_rate: "inf"
+  custom_tokenizer: "glm_moe_dsa"
+  use_chat_template: false
+
+frontend:
+  type: "dynamo"
+  enable_multiple_frontends: false
+
+health_check:
+  max_attempts: 360
+  interval_seconds: 10
+
+dynamo:
+  install: false
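+
+# Note (reviewer annotation, not part of the tuned recipe): prefill_workers
+# may exceed prefill_nodes when workers are smaller than a node: here 2 TP2
+# prefill workers (2 GPUs each) pack onto a single 4-GPU GB300 node, i.e.
+#   prefill_nodes = ceil(prefill_workers * gpus_per_prefill / gpus_per_node)
+# Inferred from the resource blocks, not a documented formula.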
b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx2dep2_gen1dep32_batch16_eplb0_mtp3.yaml new file mode 100644 index 00000000..fe754372 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx2dep2_gen1dep32_batch16_eplb0_mtp3.yaml @@ -0,0 +1,133 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx2dep2_gen1dep32_batch16_eplb0_mtp3" + +# ctx: 2 prefill workers, TP2/EP2 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=16 +# concurrency: 666 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 1 + prefill_workers: 2 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 64 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "666" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx3dep2_gen1dep32_batch32_eplb0_mtp2.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx3dep2_gen1dep32_batch32_eplb0_mtp2.yaml new file mode 100644 index 00000000..70821f3e --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx3dep2_gen1dep32_batch32_eplb0_mtp2.yaml @@ -0,0 +1,135 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx3dep2_gen1dep32_batch32_eplb0_mtp2" + +# 
ctx: 3 prefill workers, TP2/EP2 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=32 +# concurrency: 1229 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 2 + prefill_workers: 3 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 32 + max_num_tokens: 96 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "1229" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx4dep2_gen1dep16_batch256_eplb256_mtp1.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx4dep2_gen1dep16_batch256_eplb256_mtp1.yaml new file mode 100644 index 00000000..bf3183b7 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx4dep2_gen1dep16_batch256_eplb256_mtp1.yaml @@ -0,0 +1,166 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx4dep2_gen1dep16_batch256_eplb256_mtp1" + +# ctx: 4 prefill workers, TP2/EP2, EPLB: num_slots=256, max_batch=256 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=256 +# concurrency: 4301 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + 
prefill_nodes: 2 + prefill_workers: 4 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 256 + max_num_tokens: 512 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + layer_updates_per_iter: 1 + num_slots: 256 + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "4301" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx5dep2_gen2dep8_batch512_eplb0_mtp1.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx5dep2_gen2dep8_batch512_eplb0_mtp1.yaml new file mode 100644 index 00000000..1d9f4f10 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx5dep2_gen2dep8_batch512_eplb0_mtp1.yaml @@ -0,0 +1,195 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx5dep2_gen2dep8_batch512_eplb0_mtp1" + +# ctx: 5 prefill workers, TP2/EP2 +# gen: 2 decode workers, TP8/EP8, enable_attention_dp=true, max_batch=512 +# concurrency: 8602 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 3 + prefill_workers: 5 + gpus_per_prefill: 2 + + 
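The `resources` blocks above and below all follow the same packing arithmetic: a phase's node count is its worker count times GPUs per worker, rounded up to whole 4-GPU GB300 nodes (here, 5 prefill workers × 2 GPUs land on 3 nodes). A minimal sketch of that check, assuming only the field names visible in these recipes (the validator itself is illustrative, not something srt-slurm ships):

```python
import math

GPUS_PER_NODE = 4  # every gb300 recipe in this directory sets gpus_per_node: 4


def expected_nodes(workers: int, gpus_per_worker: int) -> int:
    """Pack worker GPUs onto whole nodes, rounding up for odd worker counts."""
    return math.ceil(workers * gpus_per_worker / GPUS_PER_NODE)


def check_resources(res: dict) -> None:
    # Prefill and decode phases are packed independently in these recipes.
    assert res["prefill_nodes"] == expected_nodes(res["prefill_workers"], res["gpus_per_prefill"])
    assert res["decode_nodes"] == expected_nodes(res["decode_workers"], res["gpus_per_decode"])


# This recipe: 5 x 2 prefill GPUs -> 3 nodes, 2 x 8 decode GPUs -> 4 nodes.
check_resources({
    "prefill_nodes": 3, "prefill_workers": 5, "gpus_per_prefill": 2,
    "decode_nodes": 4, "decode_workers": 2, "gpus_per_decode": 8,
})
```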
decode_workers: 2 + decode_nodes: 4 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 512 + max_num_tokens: 1024 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "8602" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx6dep2_gen1dep32_batch128_eplb288_mtp1.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx6dep2_gen1dep32_batch128_eplb288_mtp1.yaml new file mode 100644 index 00000000..44b81b3c --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/MTP/ctx6dep2_gen1dep32_batch128_eplb288_mtp1.yaml @@ -0,0 +1,150 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx6dep2_gen1dep32_batch128_eplb288_mtp1" + +# ctx: 6 prefill workers, TP2/EP2, EPLB: num_slots=288, max_batch=128 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=128 +# concurrency: 4301 + +model: + path: "nvidia/GLM5-NVFP4" + container: 
"nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 3 + prefill_workers: 6 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 128 + max_num_tokens: 256 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + layer_updates_per_iter: 1 + num_slots: 288 + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "4301" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx1dep2_gen1dep32_batch16_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx1dep2_gen1dep32_batch16_eplb0_mtp0.yaml new file mode 100644 index 00000000..0410623b --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx1dep2_gen1dep32_batch16_eplb0_mtp0.yaml @@ -0,0 +1,127 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep2_gen1dep32_batch16_eplb0_mtp0" + +# ctx: 1 prefill worker, TP2/EP2 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=16 +# concurrency: 615 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 1 + 
decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "615" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx1dep2_gen4tep8_batch64_allconc_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx1dep2_gen4tep8_batch64_allconc_eplb0_mtp0.yaml new file mode 100644 index 00000000..d967e3b2 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx1dep2_gen4tep8_batch64_allconc_eplb0_mtp0.yaml @@ -0,0 +1,134 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep2_gen4tep8_batch64_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP2/EP2 +# gen: 4 decode workers, TP8/EP8, enable_attention_dp=false, max_batch=64 +# Merged concurrencies: batch16(84), batch32(180), batch64(336) + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: 
"INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + allreduce_strategy: MNNVL + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "84x180x336" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx1dep2_gen5tep4_batch4_allconc_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx1dep2_gen5tep4_batch4_allconc_eplb0_mtp0.yaml new file mode 100644 index 00000000..d9f9ea2f --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx1dep2_gen5tep4_batch4_allconc_eplb0_mtp0.yaml @@ -0,0 +1,125 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx1dep2_gen5tep4_batch4_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP2/EP2 +# gen: 5 decode workers, TP4/EP4, enable_attention_dp=false, max_batch=4 +# Merged concurrencies: batch1(5), batch2(10), batch4(25) + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 5 + decode_nodes: 5 + gpus_per_decode: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + 
max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 4 + max_num_tokens: 4 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "5x10x25" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx2dep2_gen1dep32_batch32_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx2dep2_gen1dep32_batch32_eplb0_mtp0.yaml new file mode 100644 index 00000000..26ddd7b1 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx2dep2_gen1dep32_batch32_eplb0_mtp0.yaml @@ -0,0 +1,129 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx2dep2_gen1dep32_batch32_eplb0_mtp0" + +# ctx: 2 prefill workers, TP2/EP2, max_batch=32 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=32 +# concurrency: 1229 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 1 + prefill_workers: 2 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 32 + 
max_num_tokens: 32 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "1229" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx3dep2_gen1dep32_batch64_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx3dep2_gen1dep32_batch64_eplb0_mtp0.yaml new file mode 100644 index 00000000..081e96da --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx3dep2_gen1dep32_batch64_eplb0_mtp0.yaml @@ -0,0 +1,133 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx3dep2_gen1dep32_batch64_eplb0_mtp0" + +# ctx: 3 prefill workers, TP2/EP2, max_batch=64 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=64 +# concurrency: 2253 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 2 + prefill_workers: 3 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + 
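The benchmark concurrencies in these recipes sit a little above the deployment's aggregate decode batch capacity: max batch per rank, times attention-DP width, times decode workers, times a headroom factor that varies between roughly 1.05 and 1.3 from recipe to recipe (this recipe's 2253 is 64 × 32 × 1.1, rounded up). For the `tep` recipes, where `enable_attention_dp` is false, the per-worker batch is not multiplied by TP ranks. Recipes that sweep several points encode them as an `x`-separated string, one value per batch size. A sketch of both conventions; the headroom factors are read off the recipe values themselves, not documented:

```python
import math


def target_concurrency(max_batch: int, dp_ranks: int, workers: int = 1,
                       headroom: float = 1.2) -> int:
    """Aggregate decode capacity times a queueing headroom factor."""
    return math.ceil(max_batch * dp_ranks * workers * headroom)


def parse_concurrencies(spec: str) -> list[int]:
    """Merged sweeps are encoded as an 'x'-separated string, e.g. "84x180x336"."""
    return [int(v) for v in spec.split("x")]


assert target_concurrency(64, 32, headroom=1.1) == 2253              # this recipe
assert target_concurrency(16, 32, headroom=1.2) == 615               # ctx1dep2_gen1dep32_batch16 (STP)
assert target_concurrency(512, 8, workers=2, headroom=1.05) == 8602  # ctx5dep2_gen2dep8_batch512 (MTP)
assert parse_concurrencies("84x180x336") == [84, 180, 336]
```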
nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "2253" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx4dep2_gen1dep16_batch512_eplb256_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx4dep2_gen1dep16_batch512_eplb256_mtp0.yaml new file mode 100644 index 00000000..dbca4fd5 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx4dep2_gen1dep16_batch512_eplb256_mtp0.yaml @@ -0,0 +1,191 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx4dep2_gen1dep16_batch512_eplb256_mtp0" + +# ctx: 4 prefill workers, TP2/EP2 +# gen: 1 decode worker, TP16/EP16, EPLB: num_slots=256, max_batch=512, concurrency: 8192 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 2 + prefill_workers: 4 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 512 + max_num_tokens: 512 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + layer_updates_per_iter: 1 + num_slots: 256 + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + cache_transceiver_config: + 
backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "8192" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx4dep2_gen1dep32_batch128_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx4dep2_gen1dep32_batch128_eplb0_mtp0.yaml new file mode 100644 index 00000000..1c8d2d78 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx4dep2_gen1dep32_batch128_eplb0_mtp0.yaml @@ -0,0 +1,141 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx4dep2_gen1dep32_batch128_eplb0_mtp0" + +# ctx: 4 prefill workers, TP2/EP2, max_batch=128 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=128 +# concurrency: 4301 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 2 + prefill_workers: 4 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "4301" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + 
max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx6dep2_gen1dep32_batch256_eplb288_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx6dep2_gen1dep32_batch256_eplb288_mtp0.yaml new file mode 100644 index 00000000..0d6870ff --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL1K_OSL1K/STP/ctx6dep2_gen1dep32_batch256_eplb288_mtp0.yaml @@ -0,0 +1,160 @@ +name: "glm5_nvfp4_ISL1K_OSL1K_ctx6dep2_gen1dep32_batch256_eplb288_mtp0" + +# ctx: 6 prefill workers, EPLB: num_slots=288, TP2/EP2, max_batch=256 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=256 +# concurrency: 8192 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 3 + prefill_workers: 6 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + load_balancer: + layer_updates_per_iter: 1 + num_slots: 288 + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "8192" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git 
a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx12dep2_gen1dep16_batch32_eplb0_mtp2.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx12dep2_gen1dep16_batch32_eplb0_mtp2.yaml new file mode 100644 index 00000000..8940ea72 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx12dep2_gen1dep16_batch32_eplb0_mtp2.yaml @@ -0,0 +1,135 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx12dep2_gen1dep16_batch32_eplb0_mtp2" + +# ctx: 12 prefill workers, TP2/EP2 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=32 +# concurrency: 666 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 6 + prefill_workers: 12 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 32 + max_num_tokens: 96 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 2 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "666" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx13dep2_gen1dep8_batch128_eplb0_mtp1.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx13dep2_gen1dep8_batch128_eplb0_mtp1.yaml new file mode 100644 index 00000000..29eba0b3 --- /dev/null +++ 
b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx13dep2_gen1dep8_batch128_eplb0_mtp1.yaml @@ -0,0 +1,147 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx13dep2_gen1dep8_batch128_eplb0_mtp1" + +# ctx: 13 prefill workers, TP2/EP2 +# gen: 1 decode worker, TP8/EP8, enable_attention_dp=true, max_batch=128 +# concurrency: 1076 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 7 + prefill_workers: 13 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 128 + max_num_tokens: 256 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "1076" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx15dep2_gen1dep32_batch16_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx15dep2_gen1dep32_batch16_eplb0_mtp3.yaml new file mode 100644 index 00000000..f8fcdac9 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx15dep2_gen1dep32_batch16_eplb0_mtp3.yaml @@ -0,0 +1,133 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx15dep2_gen1dep32_batch16_eplb0_mtp3" + +# ctx: 15 prefill workers, TP2/EP2 +# gen: 1 decode worker, 
TP32/EP32, enable_attention_dp=true, max_batch=16 +# concurrency: 666 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 8 + prefill_workers: 15 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 64 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "666" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx18dep2_gen1dep16_batch64_eplb0_mtp1.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx18dep2_gen1dep16_batch64_eplb0_mtp1.yaml new file mode 100644 index 00000000..775fa68f --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx18dep2_gen1dep16_batch64_eplb0_mtp1.yaml @@ -0,0 +1,139 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx18dep2_gen1dep16_batch64_eplb0_mtp1" + +# ctx: 18 prefill workers, TP2/EP2 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=64 +# concurrency: 1229 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 9 + prefill_workers: 18 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + 
gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 64 + max_num_tokens: 128 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 1 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "1229" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen1tep8_batch16_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen1tep8_batch16_eplb0_mtp3.yaml new file mode 100644 index 00000000..c457cce0 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen1tep8_batch16_eplb0_mtp3.yaml @@ -0,0 +1,134 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep2_gen1tep8_batch16_eplb0_mtp3" + +# ctx: 1 prefill worker, TP2/EP2 +# gen: 1 decode worker, TP8/EP8 (MNNVL), max_batch=16 +# concurrency: 24 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + 
NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + allreduce_strategy: MNNVL + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 64 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "24" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen2tep8_batch8_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen2tep8_batch8_eplb0_mtp3.yaml new file mode 100644 index 00000000..517cf361 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen2tep8_batch8_eplb0_mtp3.yaml @@ -0,0 +1,133 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep2_gen2tep8_batch8_eplb0_mtp3" + +# ctx: 1 prefill worker, TP2/EP2 +# gen: 2 decode workers, TP8/EP8 (MNNVL), max_batch=8 +# concurrency: 22 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 2 + decode_nodes: 4 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: 
"glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + allreduce_strategy: MNNVL + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "22" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen4tep8_batch4_allconc_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen4tep8_batch4_allconc_eplb0_mtp3.yaml new file mode 100644 index 00000000..20599c3f --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen4tep8_batch4_allconc_eplb0_mtp3.yaml @@ -0,0 +1,132 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep2_gen4tep8_batch4_allconc_eplb0_mtp3" + +# ctx: 1 prefill worker, TP2/EP2 +# gen: 4 decode workers, TP8/EP8 (MNNVL), max_batch=4 +# concurrencies: 4 (batch1), 24 (batch4) + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + 
speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + allreduce_strategy: MNNVL + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 4 + max_num_tokens: 16 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "4x24" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen5tep4_batch1_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen5tep4_batch1_eplb0_mtp3.yaml new file mode 100644 index 00000000..0037f722 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx1dep2_gen5tep4_batch1_eplb0_mtp3.yaml @@ -0,0 +1,131 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep2_gen5tep4_batch1_eplb0_mtp3" + +# ctx: 1 prefill worker, TP2/EP2 +# gen: 5 decode workers, TP4/EP4, max_batch=1 +# concurrency: 5 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 5 + decode_nodes: 5 + gpus_per_decode: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 1 + max_num_tokens: 4 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + 
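+ # stream_interval: 100 emits streamed responses only every 100 iterations;
+ # with the postprocess workers below, this trades per-token streaming
+ # latency for frontend throughput (an inference from the knob names).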
num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "5" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx5dep2_gen1dep32_batch4_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx5dep2_gen1dep32_batch4_eplb0_mtp3.yaml new file mode 100644 index 00000000..6e233408 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx5dep2_gen1dep32_batch4_eplb0_mtp3.yaml @@ -0,0 +1,131 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx5dep2_gen1dep32_batch4_eplb0_mtp3" + +# ctx: 5 prefill workers, TP2/EP2 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, enable_lm_head_tp_in_adp=true, max_batch=4 +# concurrency: 180 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 3 + prefill_workers: 5 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 4 + max_num_tokens: 16 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: 
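+ # allowed_backends lists the NVFP4 GEMM kernel providers the runtime may
+ # choose from; the actual per-shape selection is left to TRT-LLM's tuner
+ # (an assumption based on the backend names).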
+ allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "180" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx9dep2_gen1dep32_batch8_eplb0_mtp3.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx9dep2_gen1dep32_batch8_eplb0_mtp3.yaml new file mode 100644 index 00000000..bd1cb583 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/MTP/ctx9dep2_gen1dep32_batch8_eplb0_mtp3.yaml @@ -0,0 +1,132 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx9dep2_gen1dep32_batch8_eplb0_mtp3" + +# ctx: 9 prefill workers, TP2/EP2 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=8 +# concurrency: 333 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 5 + prefill_workers: 9 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: MTP + num_nextn_predict_layers: 3 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "333" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: 
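+ # Worst-case startup wait is max_attempts x interval_seconds = 360 x 10 s,
+ # i.e. up to one hour of endpoint polling before the health check gives up.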
+ max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx12dep2_gen1dep16_batch64_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx12dep2_gen1dep16_batch64_eplb0_mtp0.yaml new file mode 100644 index 00000000..611aebb6 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx12dep2_gen1dep16_batch64_eplb0_mtp0.yaml @@ -0,0 +1,133 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx12dep2_gen1dep16_batch64_eplb0_mtp0" + +# ctx: 12 prefill workers, TP2/EP2 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=64 +# concurrency: 1127 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 6 + prefill_workers: 12 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "1127" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx15dep2_gen1dep32_batch32_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx15dep2_gen1dep32_batch32_eplb0_mtp0.yaml new file mode 100644 index 00000000..831e703d --- /dev/null +++ 
b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx15dep2_gen1dep32_batch32_eplb0_mtp0.yaml @@ -0,0 +1,129 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx15dep2_gen1dep32_batch32_eplb0_mtp0" + +# ctx: 15 prefill workers, TP2/EP2 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=32 +# concurrency: 1229 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 8 + prefill_workers: 15 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "1229" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx1dep2_gen2tep8_batch16_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx1dep2_gen2tep8_batch16_eplb0_mtp0.yaml new file mode 100644 index 00000000..8ff2f420 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx1dep2_gen2tep8_batch16_eplb0_mtp0.yaml @@ -0,0 +1,127 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep2_gen2tep8_batch16_eplb0_mtp0" + +# ctx: 1 prefill worker, TP2/EP2 +# gen: 2 decode workers, TP8/EP8 (MNNVL), max_batch=16 +# concurrency: 42 +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 1 + 
prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 2 + decode_nodes: 4 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + allreduce_strategy: MNNVL + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "42" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx1dep2_gen4tep8_batch1_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx1dep2_gen4tep8_batch1_eplb0_mtp0.yaml new file mode 100644 index 00000000..cc8faa11 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx1dep2_gen4tep8_batch1_eplb0_mtp0.yaml @@ -0,0 +1,126 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep2_gen4tep8_batch1_eplb0_mtp0" + +# ctx: 1 prefill worker, TP2/EP2 +# gen: 4 decode workers, TP8/EP8 (MNNVL), max_batch=1 +# concurrency: 4 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + 
TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + allreduce_strategy: MNNVL + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "4" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx1dep2_gen5tep4_batch4_allconc_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx1dep2_gen5tep4_batch4_allconc_eplb0_mtp0.yaml new file mode 100644 index 00000000..06d02024 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx1dep2_gen5tep4_batch4_allconc_eplb0_mtp0.yaml @@ -0,0 +1,125 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx1dep2_gen5tep4_batch4_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP2/EP2 +# gen: 5 decode workers, TP4/EP4, max_batch=4 +# concurrencies: 5 (batch1), 10 (batch2), 25 (batch4) — merged as 5x10x25 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 2 + + decode_workers: 5 + decode_nodes: 5 + gpus_per_decode: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + 
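+ # CUTEDSL is read here as the CuTe-DSL-based MoE kernel backend (an
+ # assumption from the name; these recipes only pin the choice per stage).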
backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 4 + max_num_tokens: 4 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "5x10x25" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx20dep2_gen1dep16_batch128_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx20dep2_gen1dep16_batch128_eplb0_mtp0.yaml new file mode 100644 index 00000000..ead937c9 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx20dep2_gen1dep16_batch128_eplb0_mtp0.yaml @@ -0,0 +1,141 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx20dep2_gen1dep16_batch128_eplb0_mtp0" + +# ctx: 20 prefill workers, TP2/EP2 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=128 +# concurrency: 2151 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 10 + prefill_workers: 20 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + 
num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "2151" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx2dep2_gen3tep8_batch32_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx2dep2_gen3tep8_batch32_eplb0_mtp0.yaml new file mode 100644 index 00000000..e06ea268 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx2dep2_gen3tep8_batch32_eplb0_mtp0.yaml @@ -0,0 +1,130 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx2dep2_gen3tep8_batch32_eplb0_mtp0" + +# ctx: 2 prefill workers, TP2/EP2 +# gen: 3 decode workers, TP8/EP8 (MNNVL), max_batch=32 +# concurrency: 117 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 1 + prefill_workers: 2 + gpus_per_prefill: 2 + + decode_workers: 3 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + allreduce_strategy: MNNVL + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - 
cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "117" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx4dep2_gen3tep8_batch64_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx4dep2_gen3tep8_batch64_eplb0_mtp0.yaml new file mode 100644 index 00000000..f4b3cc09 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx4dep2_gen3tep8_batch64_eplb0_mtp0.yaml @@ -0,0 +1,134 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx4dep2_gen3tep8_batch64_eplb0_mtp0" + +# ctx: 4 prefill workers, TP2/EP2 +# gen: 3 decode workers, TP8/EP8 (MNNVL), max_batch=64 +# concurrency: 231 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 2 + prefill_workers: 4 + gpus_per_prefill: 2 + + decode_workers: 3 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + allreduce_strategy: MNNVL + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "231" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git 
a/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx9dep2_gen1dep32_batch16_eplb0_mtp0.yaml b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx9dep2_gen1dep32_batch16_eplb0_mtp0.yaml new file mode 100644 index 00000000..75f56785 --- /dev/null +++ b/recipes/GLM5/disagg/trtllm_dynamo/gb300_nvfp4/ISL8K_OSL1K/STP/ctx9dep2_gen1dep32_batch16_eplb0_mtp0.yaml @@ -0,0 +1,127 @@ +name: "glm5_nvfp4_ISL8K_OSL1K_ctx9dep2_gen1dep32_batch16_eplb0_mtp0" + +# ctx: 9 prefill workers, TP2/EP2 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=16 +# concurrency: 615 + +model: + path: "nvidia/GLM5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.3" + precision: "fp4" + +resources: + gpu_type: "gb300" + + prefill_nodes: 5 + prefill_workers: 9 + gpus_per_prefill: 2 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + MIMALLOC_PURGE_DELAY: "0" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 2 + moe_expert_parallel_size: 2 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 2 + max_num_tokens: 16640 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: CUTEDSL + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + custom_tokenizer: "glm_moe_dsa" + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + moe_config: + backend: CUTEDSL + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "615" + req_rate: "inf" + custom_tokenizer: "glm_moe_dsa" + use_chat_template: false + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep16_batch32_eplb0_mtp3.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep16_batch32_eplb0_mtp3.yaml new file mode 100644 index 00000000..03462b07 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep16_batch32_eplb0_mtp3.yaml @@ -0,0 +1,136 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep16_batch32_eplb0_mtp3" + +# ctx: 1 prefill worker, 
TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=32 +# MTP (Eagle speculative decoding, max_draft_len=3) +# concurrency: 666 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + +extra_mount: + - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model" + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "666" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep32_batch16_eplb0_mtp3.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep32_batch16_eplb0_mtp3.yaml new file mode 100644 index 00000000..6a29059c --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep32_batch16_eplb0_mtp3.yaml @@ -0,0 +1,134 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep32_batch16_eplb0_mtp3" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=16 +# MTP (Eagle speculative decoding, max_draft_len=3) +# concurrency: 666 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + 
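+# Node accounting for this recipe: the TP4/EP4 prefill worker fits on one
+# gb200 node (gpus_per_node: 4), while the TP32/EP32 decode worker spans
+# 32 / 4 = 8 nodes, matching decode_nodes below.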
+resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 64 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + +extra_mount: + - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model" + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "666" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep8_batch512_eplb0_mtp1.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep8_batch512_eplb0_mtp1.yaml new file mode 100644 index 00000000..739bd487 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen1dep8_batch512_eplb0_mtp1.yaml @@ -0,0 +1,196 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep8_batch512_eplb0_mtp1" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 1 decode worker, TP8/EP8, enable_attention_dp=true, max_batch=512 +# MTP (Eagle speculative decoding, max_draft_len=1) +# concurrency: 4301 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" 
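+ # TRTLLM_ENABLE_PDL appears to enable programmatic dependent launch (PDL),
+ # letting dependent kernels overlap launch latency on recent GPUs (the
+ # acronym expansion is an assumption; the flag is passed through verbatim).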
+ TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: Eagle + max_draft_len: 1 + speculative_model_dir: "/eagle-model" + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + max_batch_size: 512 + max_num_tokens: 1024 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: Eagle + max_draft_len: 1 + speculative_model_dir: "/eagle-model" + +extra_mount: + - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model" + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "4301" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen4tep8_batch64_allconc_eplb0_mtp3.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen4tep8_batch64_allconc_eplb0_mtp3.yaml new file mode 100644 index 00000000..a768bec4 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen4tep8_batch64_allconc_eplb0_mtp3.yaml @@ -0,0 +1,141 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen4tep8_batch64_allconc_eplb0_mtp3" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 4 decode workers, TP8/EP8, allreduce_strategy=MNNVL, max_batch=64 +# MTP (Eagle speculative decoding, max_draft_len=3) +# Covers all gen4tep8 concurrencies: 8, 48, 92, 192, 336 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + 
gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + allreduce_strategy: MNNVL + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 64 + max_num_tokens: 256 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + +extra_mount: + - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model" + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "8x48x92x192x336" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen5tep4_batch2_allconc_eplb0_mtp3.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen5tep4_batch2_allconc_eplb0_mtp3.yaml new file mode 100644 index 00000000..c2e24b41 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx1dep4_gen5tep4_batch2_allconc_eplb0_mtp3.yaml @@ -0,0 +1,132 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen5tep4_batch2_allconc_eplb0_mtp3" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 5 decode workers, TP4/EP4, max_batch=2 +# MTP (Eagle speculative decoding, max_draft_len=3) +# Covers all gen5tep4 concurrencies: 10, 15 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + gpus_per_decode: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: 
"0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 8 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + +extra_mount: + - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model" + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "10x15" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen1dep16_batch128_eplb0_mtp1.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen1dep16_batch128_eplb0_mtp1.yaml new file mode 100644 index 00000000..68d7dd06 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen1dep16_batch128_eplb0_mtp1.yaml @@ -0,0 +1,148 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx2dep4_gen1dep16_batch128_eplb0_mtp1" + +# ctx: 2 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=128 +# MTP (Eagle speculative decoding, max_draft_len=1) +# concurrency: 2253 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + 
TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: Eagle + max_draft_len: 1 + speculative_model_dir: "/eagle-model" + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + max_batch_size: 128 + max_num_tokens: 256 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: Eagle + max_draft_len: 1 + speculative_model_dir: "/eagle-model" + +extra_mount: + - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model" + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "2253" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen1dep32_batch64_eplb0_mtp1.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen1dep32_batch64_eplb0_mtp1.yaml new file mode 100644 index 00000000..1cb17478 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen1dep32_batch64_eplb0_mtp1.yaml @@ -0,0 +1,140 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx2dep4_gen1dep32_batch64_eplb0_mtp1" + +# ctx: 2 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=64 +# MTP (Eagle speculative decoding, max_draft_len=1) +# concurrency: 2253 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + 
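+ # Prefill workers disable the overlap scheduler; overlapping scheduling
+ # with execution mainly pays off on decode, and context-only servers in
+ # these recipes consistently run without it (a pattern observation).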
disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: Eagle + max_draft_len: 1 + speculative_model_dir: "/eagle-model" + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + max_batch_size: 64 + max_num_tokens: 128 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.6 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: Eagle + max_draft_len: 1 + speculative_model_dir: "/eagle-model" + +extra_mount: + - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model" + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "2253" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen3dep8_batch256_eplb0_mtp1.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen3dep8_batch256_eplb0_mtp1.yaml new file mode 100644 index 00000000..eb43aab7 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/MTP/ctx2dep4_gen3dep8_batch256_eplb0_mtp1.yaml @@ -0,0 +1,164 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx2dep4_gen3dep8_batch256_eplb0_mtp1" + +# ctx: 2 prefill workers, TP4/EP4 +# gen: 3 decode workers, TP8/EP8, enable_attention_dp=true, max_batch=256 +# MTP (Eagle speculative decoding, max_draft_len=1) +# concurrency: 6759 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 3 + decode_nodes: 6 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + 
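# block reuse is kept off, presumably so prefix-cache hits cannot skew benchmark numbers; prefill also reserves less KV memory than decode +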
free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: Eagle + max_draft_len: 1 + speculative_model_dir: "/eagle-model" + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + max_batch_size: 256 + max_num_tokens: 512 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: Eagle + max_draft_len: 1 + speculative_model_dir: "/eagle-model" + +extra_mount: + - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model" + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "6759" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep16_batch32_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep16_batch32_eplb0_mtp0.yaml new file mode 100644 index 00000000..ce3eff43 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep16_batch32_eplb0_mtp0.yaml @@ -0,0 +1,125 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep16_batch32_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=32 +# STP (no speculative decoding) +# concurrency: 666 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 
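+ # STP decode schedules one token per request per step, so max_num_tokens matches max_batch_size throughout the STP recipes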
+ + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "666" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep32_batch64_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep32_batch64_eplb0_mtp0.yaml new file mode 100644 index 00000000..105b84bf --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep32_batch64_eplb0_mtp0.yaml @@ -0,0 +1,129 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep32_batch64_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=64 +# STP (no speculative decoding) +# concurrency: 2253 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 64 + max_num_tokens: 64 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + 
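# decode holds the full ISL+OSL KV cache, hence the larger fraction (0.6-0.9 across these recipes vs 0.4 for prefill) +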
free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "2253" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: true + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep8_batch768_allconc_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep8_batch768_allconc_eplb0_mtp0.yaml new file mode 100644 index 00000000..9fb194dd --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen1dep8_batch768_allconc_eplb0_mtp0.yaml @@ -0,0 +1,217 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen1dep8_batch768_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 1 decode worker, TP8/EP8, enable_attention_dp=true, max_batch=768 +# STP (no speculative decoding) +# Covers all dep8 concurrencies: 4301, 6452 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 768 + max_num_tokens: 768 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + - 264 + - 272 + - 280 + - 288 + - 296 + - 304 + - 312 + - 320 + - 328 + - 336 + - 344 + - 352 + - 360 + - 368 + - 376 + - 384 + - 392 + - 400 + - 408 + - 416 + - 424 + - 432 + - 440 + - 448 + - 456 + - 464 + - 472 + - 480 + - 488 + - 496 + - 504 + - 512 + - 520 + - 528 + - 536 + - 544 + - 552 + - 560 + - 568 + - 576 + - 584 + - 592 + - 600 + - 608 + - 616 + - 624 + - 632 + - 640 + - 648 + - 656 + - 664 + - 672 + - 680 + - 688 + - 696 + - 704 + - 712 
+ - 720 + - 728 + - 736 + - 744 + - 752 + - 760 + - 768 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "4301x6452" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0.yaml new file mode 100644 index 00000000..5639da41 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0.yaml @@ -0,0 +1,138 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen4tep8_batch128_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 4 decode workers, TP8/EP8, allreduce_strategy=MNNVL, max_batch=128 +# STP (no speculative decoding) +# Covers all gen4tep8 concurrencies: 4, 192, 360, 668 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + allreduce_strategy: MNNVL + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: 
"sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "4x192x360x668" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml new file mode 100644 index 00000000..f9496feb --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0.yaml @@ -0,0 +1,122 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 5 decode workers, TP4/EP4, max_batch=8 +# STP (no speculative decoding) +# Covers all gen5tep4 concurrencies: 5, 15, 30, 55 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + gpus_per_decode: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 8 + max_num_tokens: 8 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "5x15x30x55" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep16_batch256_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep16_batch256_eplb0_mtp0.yaml new file mode 100644 index 00000000..71b016c4 --- /dev/null +++ 
b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep16_batch256_eplb0_mtp0.yaml @@ -0,0 +1,153 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx2dep4_gen1dep16_batch256_eplb0_mtp0" + +# ctx: 2 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=256 +# STP (no speculative decoding) +# concurrency: 4301 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "4301" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep32_batch128_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep32_batch128_eplb0_mtp0.yaml new file mode 100644 index 00000000..52b75bb4 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL1K_OSL1K/STP/ctx2dep4_gen1dep32_batch128_eplb0_mtp0.yaml @@ -0,0 +1,137 @@ +name: "kimi_k25_nvfp4_ISL1K_OSL1K_ctx2dep4_gen1dep32_batch128_eplb0_mtp0" + +# ctx: 2 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=128 +# STP (no speculative decoding) +# concurrency: 
4301 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16384 + max_seq_len: 1064 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 2088 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.7 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "4301" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen2tep8_batch32_eplb0_mtp3.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen2tep8_batch32_eplb0_mtp3.yaml new file mode 100644 index 00000000..bb3f8d1e --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen2tep8_batch32_eplb0_mtp3.yaml @@ -0,0 +1,137 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx1dep4_gen2tep8_batch32_eplb0_mtp3" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 2 decode workers, TP8/EP8, allreduce_strategy=MNNVL, max_batch=32 +# MTP Eagle speculative decoding, max_draft_len=3 +# concurrency: 90 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 2 + decode_nodes: 4 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + 
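# the two DISABLE_GC flags below presumably switch off Python garbage collection in the server and worker processes to avoid pause spikes (name-based assumption) +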
TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + allreduce_strategy: MNNVL + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + +extra_mount: + - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model" + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "90" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen4tep8_batch1_eplb0_mtp3.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen4tep8_batch1_eplb0_mtp3.yaml new file mode 100644 index 00000000..8b7f02d6 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen4tep8_batch1_eplb0_mtp3.yaml @@ -0,0 +1,133 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx1dep4_gen4tep8_batch1_eplb0_mtp3" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 4 decode workers, TP8/EP8, allreduce_strategy=MNNVL, max_batch=1 +# MTP Eagle speculative decoding, max_draft_len=3 +# concurrency: 8 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + 
TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + allreduce_strategy: MNNVL + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 1 + max_num_tokens: 4 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + +extra_mount: + - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model" + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "8" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp3.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp3.yaml new file mode 100644 index 00000000..1883e739 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp3.yaml @@ -0,0 +1,133 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx1dep4_gen5tep4_batch8_allconc_eplb0_mtp3" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 5 decode workers, TP4/EP4, max_batch=8 +# MTP Eagle speculative decoding, max_draft_len=3 +# Covers all gen5tep4 concurrencies: 10, 15, 60 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + decode_workers: 5 + decode_nodes: 5 + gpus_per_decode: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + 
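# 8232 = ISL 8192 plus a 40-token margin; the decode side adds the 1024-token OSL on top (max_seq_len 9256) +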
max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.85 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + +extra_mount: + - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model" + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "10x15x60" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx2dep4_gen1dep16_batch8_eplb0_mtp3.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx2dep4_gen1dep16_batch8_eplb0_mtp3.yaml new file mode 100644 index 00000000..5aced422 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx2dep4_gen1dep16_batch8_eplb0_mtp3.yaml @@ -0,0 +1,133 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx2dep4_gen1dep16_batch8_eplb0_mtp3" + +# ctx: 2 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=8 +# MTP Eagle speculative decoding, max_draft_len=3 +# concurrency: 180 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: Eagle + 
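# MTP recipes pair Eagle drafting with a decode token budget of max_batch_size x (1 + max_draft_len), e.g. 8 x 4 = 32 here +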
max_draft_len: 3 + speculative_model_dir: "/eagle-model" + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + max_batch_size: 8 + max_num_tokens: 32 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + +extra_mount: + - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model" + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "180" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx5dep4_gen1dep32_batch16_eplb0_mtp3.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx5dep4_gen1dep32_batch16_eplb0_mtp3.yaml new file mode 100644 index 00000000..764f2d46 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx5dep4_gen1dep32_batch16_eplb0_mtp3.yaml @@ -0,0 +1,134 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx5dep4_gen1dep32_batch16_eplb0_mtp3" + +# ctx: 5 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=16 +# MTP Eagle speculative decoding, max_draft_len=3 +# concurrency: 666 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 5 + prefill_workers: 5 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 64 + max_seq_len: 9256 + print_iter_log: 
true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + +extra_mount: + - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model" + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "666" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp1.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp1.yaml new file mode 100644 index 00000000..31308fe6 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp1.yaml @@ -0,0 +1,164 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp1" + +# ctx: 5 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP8/EP8, enable_attention_dp=true, max_batch=256 +# MTP Eagle speculative decoding, max_draft_len=1 +# Covers all dep8 mtp1 concurrencies: 1229, 2253 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 5 + prefill_workers: 5 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: Eagle + max_draft_len: 1 + speculative_model_dir: "/eagle-model" + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + max_batch_size: 256 + max_num_tokens: 512 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 
136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: Eagle + max_draft_len: 1 + speculative_model_dir: "/eagle-model" + +extra_mount: + - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model" + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "1229x2253" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx8dep4_gen1dep32_batch32_eplb0_mtp3.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx8dep4_gen1dep32_batch32_eplb0_mtp3.yaml new file mode 100644 index 00000000..9bd03c05 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/MTP/ctx8dep4_gen1dep32_batch32_eplb0_mtp3.yaml @@ -0,0 +1,136 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx8dep4_gen1dep32_batch32_eplb0_mtp3" + +# ctx: 8 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP32/EP32, enable_attention_dp=true, max_batch=32 +# MTP Eagle speculative decoding, max_draft_len=3 +# concurrency: 1229 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + prefill_nodes: 8 + prefill_workers: 8 + gpus_per_prefill: 4 + + decode_workers: 1 + decode_nodes: 8 + gpus_per_decode: 32 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + + decode: + tensor_parallel_size: 32 + moe_expert_parallel_size: 32 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: true + trust_remote_code: true + max_batch_size: 32 + max_num_tokens: 128 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.75 + 
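# the UCX cache transceiver below streams KV blocks from prefill to decode workers; its 16384-token buffer matches the prefill max_num_tokens +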
cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: "/eagle-model" + +extra_mount: + - "nvidia/Kimi-K2.5-Thinking-Eagle3:/eagle-model" + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "1229" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen4tep4_batch32_allconc_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen4tep4_batch32_allconc_eplb0_mtp0.yaml new file mode 100644 index 00000000..8c1f0aa8 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen4tep4_batch32_allconc_eplb0_mtp0.yaml @@ -0,0 +1,126 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx1dep4_gen4tep4_batch32_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 4 decode workers, TP4/EP4, max_batch=32 +# Single concurrency point: 156 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + # Prefill: 1 worker x TP4 = 4 GPUs = 1 node + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + # Decode: 4 workers x TP4 = 16 GPUs = 4 nodes + decode_workers: 4 + decode_nodes: 4 + gpus_per_decode: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "156" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + 
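# 360 attempts x 10 s interval allows up to one hour for all workers to report healthy +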
max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen4tep8_batch1_allconc_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen4tep8_batch1_allconc_eplb0_mtp0.yaml new file mode 100644 index 00000000..d4c5086b --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen4tep8_batch1_allconc_eplb0_mtp0.yaml @@ -0,0 +1,123 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx1dep4_gen4tep8_batch1_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 4 decode workers, TP8/EP8, allreduce_strategy=MNNVL, max_batch=1 +# Single concurrency point: 4 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + # Prefill: 1 worker x TP4 = 4 GPUs = 1 node + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + # Decode: 4 workers x TP8 = 32 GPUs = 8 nodes + decode_workers: 4 + decode_nodes: 8 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + allreduce_strategy: MNNVL + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 1 + max_num_tokens: 1 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "4" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen5tep4_batch16_allconc_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen5tep4_batch16_allconc_eplb0_mtp0.yaml new file mode 100644 index 00000000..8f6ea063 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx1dep4_gen5tep4_batch16_allconc_eplb0_mtp0.yaml @@ -0,0 +1,126 
@@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx1dep4_gen5tep4_batch16_allconc_eplb0_mtp0" + +# ctx: 1 prefill worker, TP4/EP4 +# gen: 5 decode workers, TP4/EP4, max_batch=16 +# Covers all concurrencies: 5, 15, 30, 60, 105 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + # Prefill: 1 worker x TP4 = 4 GPUs = 1 node + prefill_nodes: 1 + prefill_workers: 1 + gpus_per_prefill: 4 + + # Decode: 5 workers x TP4 = 20 GPUs = 5 nodes + decode_workers: 5 + decode_nodes: 5 + gpus_per_decode: 4 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: false + enable_lm_head_tp_in_adp: false + trust_remote_code: true + # max_batch_size=16 covers all concs: 5, 15, 30, 60, 105 + # cuda_graph pre-compiles graphs for each batch size up to the max + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.9 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "5x15x30x60x105" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx2dep4_gen1dep16_batch16_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx2dep4_gen1dep16_batch16_eplb0_mtp0.yaml new file mode 100644 index 00000000..4bfaa0e2 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx2dep4_gen1dep16_batch16_eplb0_mtp0.yaml @@ -0,0 +1,124 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx2dep4_gen1dep16_batch16_eplb0_mtp0" + +# ctx: 2 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=16 +# concurrency: 333 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + # Prefill: 2 workers x 
TP4 = 8 GPUs = 2 nodes + prefill_nodes: 2 + prefill_workers: 2 + gpus_per_prefill: 4 + + # Decode: 1 worker x TP16 = 16 GPUs = 4 nodes + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 16 + max_num_tokens: 16 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "333" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx3dep4_gen1dep16_batch32_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx3dep4_gen1dep16_batch32_eplb0_mtp0.yaml new file mode 100644 index 00000000..d7d51627 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx3dep4_gen1dep16_batch32_eplb0_mtp0.yaml @@ -0,0 +1,126 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx3dep4_gen1dep16_batch32_eplb0_mtp0" + +# ctx: 3 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=32 +# concurrency: 615 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + # Prefill: 3 workers x TP4 = 12 GPUs = 3 nodes + prefill_nodes: 3 + prefill_workers: 3 + gpus_per_prefill: 4 + + # Decode: 1 worker x TP16 = 16 GPUs = 4 nodes + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + 
TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 16 + moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 32 + max_num_tokens: 32 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "615" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp0.yaml new file mode 100644 index 00000000..e8df1179 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp0.yaml @@ -0,0 +1,155 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx5dep4_gen1dep8_batch256_allconc_eplb0_mtp0" + +# ctx: 5 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP8/EP8, enable_attention_dp=true, max_batch=256 +# Single concurrency point: 2151 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + # Prefill: 5 workers x TP4 = 20 GPUs = 5 nodes + prefill_nodes: 5 + prefill_workers: 5 + gpus_per_prefill: 4 + + # Decode: 1 worker x TP8 = 8 GPUs = 2 nodes + decode_workers: 1 + decode_nodes: 2 + gpus_per_decode: 8 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + 
backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 8 + moe_expert_parallel_size: 8 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + # max_batch_size=256, cuda_graph pre-compiles graphs for all batch sizes up to 256 + max_batch_size: 256 + max_num_tokens: 256 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + - 136 + - 144 + - 152 + - 160 + - 168 + - 176 + - 184 + - 192 + - 200 + - 208 + - 216 + - 224 + - 232 + - 240 + - 248 + - 256 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "2151" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx7dep4_gen1dep16_batch128_eplb0_mtp0.yaml b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx7dep4_gen1dep16_batch128_eplb0_mtp0.yaml new file mode 100644 index 00000000..db177892 --- /dev/null +++ b/recipes/kimi2.5/trtllm_dynamo/disagg/gb200Nvfp4/ISL8K_OSL1K/STP/ctx7dep4_gen1dep16_batch128_eplb0_mtp0.yaml @@ -0,0 +1,138 @@ +name: "kimi_k25_nvfp4_ISL8K_OSL1K_ctx7dep4_gen1dep16_batch128_eplb0_mtp0" + +# ctx: 7 prefill workers, TP4/EP4 +# gen: 1 decode worker, TP16/EP16, enable_attention_dp=true, max_batch=128 +# concurrency: 2253 + +model: + path: "nvidia/Kimi-K2.5-NVFP4" + container: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.1.0-dev.2" + precision: "fp4" + +resources: + gpu_type: "gb200" + + # Prefill: 7 workers x TP4 = 28 GPUs = 7 nodes + prefill_nodes: 7 + prefill_workers: 7 + gpus_per_prefill: 4 + + # Decode: 1 worker x TP16 = 16 GPUs = 4 nodes + decode_workers: 1 + decode_nodes: 4 + gpus_per_decode: 16 + + gpus_per_node: 4 + +backend: + type: trtllm + + prefill_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + decode_environment: + ENROOT_ALLOW_DEV: "yes" + NCCL_GRAPH_MIXING_SUPPORT: "0" + TLLM_AUTOTUNER_LOG_LEVEL_DEBUG_TO_INFO: "1" + TLLM_LOG_LEVEL: "INFO" + TRTLLM_ENABLE_PDL: "1" + TRTLLM_SERVER_DISABLE_GC: "1" + TRTLLM_WORKER_DISABLE_GC: "1" + + trtllm_config: + prefill: + tensor_parallel_size: 4 + moe_expert_parallel_size: 4 + pipeline_parallel_size: 1 + enable_attention_dp: true + disable_overlap_scheduler: true + trust_remote_code: true + max_batch_size: 2 + max_num_tokens: 16384 + max_seq_len: 8232 + print_iter_log: true + cuda_graph_config: null + moe_config: + backend: TRTLLM + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.4 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + + decode: + tensor_parallel_size: 16 + 
moe_expert_parallel_size: 16 + pipeline_parallel_size: 1 + enable_attention_dp: true + enable_lm_head_tp_in_adp: false + trust_remote_code: true + max_batch_size: 128 + max_num_tokens: 128 + max_seq_len: 9256 + print_iter_log: true + stream_interval: 100 + num_postprocess_workers: 4 + cuda_graph_config: + enable_padding: true + batch_sizes: + - 1 + - 2 + - 4 + - 8 + - 16 + - 24 + - 32 + - 40 + - 48 + - 56 + - 64 + - 72 + - 80 + - 88 + - 96 + - 104 + - 112 + - 120 + - 128 + moe_config: + backend: TRTLLM + use_low_precision_moe_combine: true + kv_cache_config: + dtype: fp8 + enable_block_reuse: false + free_gpu_memory_fraction: 0.8 + cache_transceiver_config: + backend: UCX + max_tokens_in_buffer: 16384 + nvfp4_gemm_config: + allowed_backends: + - cutlass + - cublaslt + - cutedsl + - cuda_core + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "2253" + req_rate: "inf" + +frontend: + type: "dynamo" + enable_multiple_frontends: false + +health_check: + max_attempts: 360 + interval_seconds: 10 + +dynamo: + install: false diff --git a/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-1p1d-dep8-dep8.yaml b/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-1p1d-dep8-dep8.yaml new file mode 100644 index 00000000..10d038a5 --- /dev/null +++ b/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-1p1d-dep8-dep8.yaml @@ -0,0 +1,88 @@ +name: "svf-vllm-disagg-gb200-1p1d-dep8-dep8" +model: + path: "deepseekv4-fp4" + container: "vllm/vllm-openai:deepseekv4-cu130" + precision: "fp4" +dynamo: + hash: 6a159fedd8e4a1563aa647c31f622aedbf254b5b +setup_script: vllm-container-deps.sh +resources: + gpu_type: "gb200" + gpus_per_node: 4 + prefill_nodes: 2 + decode_nodes: 2 + prefill_workers: 1 + decode_workers: 1 + gpus_per_prefill: 8 + gpus_per_decode: 8 +frontend: + type: dynamo + enable_multiple_frontends: false +backend: + type: vllm + connector: null + prefill_environment: + TILELANG_CLEANUP_TEMP_FILES: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + VLLM_SERVER_DEV_MODE: "1" + decode_environment: + TILELANG_CLEANUP_TEMP_FILES: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + VLLM_SERVER_DEV_MODE: "1" + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "deepseek-ai/DeepSeek-V4-Pro" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 8 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + enforce-eager: true + max-model-len: auto + max-num-seqs: 4 + max-num-batched-tokens: 16384 + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-flashinfer-autotune: true + no-async-scheduling: true + block-size: 256 + gpu-memory-utilization: 0.9 + no-disable-hybrid-kv-cache-manager: true + enable-sleep-mode: true + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "deepseek-ai/DeepSeek-V4-Pro" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 8 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: auto + max-num-seqs: 64 + max-cudagraph-capture-size: 64 + max-num-batched-tokens: 64 + trust-remote-code: true + no-enable-prefix-caching: true + block-size: 256 + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","mode":0}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + no-disable-hybrid-kv-cache-manager: 
true
+      enable-sleep-mode: true
+benchmark:
+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  concurrencies: "4x8x16x32x64x256"
+  req_rate: "inf"
+  custom_tokenizer: "deepseek_v4"
+  use_chat_template: false
diff --git a/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-2p1d-dep8-dep16.yaml b/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-2p1d-dep8-dep16.yaml
new file mode 100644
index 00000000..a46d9bf7
--- /dev/null
+++ b/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-2p1d-dep8-dep16.yaml
@@ -0,0 +1,88 @@
+name: "svf-vllm-disagg-gb200-2p1d-dep8-dep16"
+model:
+  path: "deepseekv4-fp4"
+  container: "vllm/vllm-openai:deepseekv4-cu130"
+  precision: "fp4"
+dynamo:
+  hash: 6a159fedd8e4a1563aa647c31f622aedbf254b5b
+setup_script: vllm-container-deps.sh
+resources:
+  gpu_type: "gb200"
+  gpus_per_node: 4
+  prefill_nodes: 4
+  decode_nodes: 4
+  prefill_workers: 2
+  decode_workers: 1
+  gpus_per_prefill: 8
+  gpus_per_decode: 16
+frontend:
+  type: dynamo
+  enable_multiple_frontends: false
+backend:
+  type: vllm
+  connector: null
+  prefill_environment:
+    TILELANG_CLEANUP_TEMP_FILES: "1"
+    VLLM_USE_NCCL_SYMM_MEM: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_NVLS_ENABLE: "1"
+    VLLM_SERVER_DEV_MODE: "1"
+  decode_environment:
+    TILELANG_CLEANUP_TEMP_FILES: "1"
+    VLLM_USE_NCCL_SYMM_MEM: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_NVLS_ENABLE: "1"
+    VLLM_SERVER_DEV_MODE: "1"
+  vllm_config:
+    prefill:
+      kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
+      served-model-name: "deepseek-ai/DeepSeek-V4-Pro"
+      kv-cache-dtype: "fp8"
+      tensor-parallel-size: 1
+      pipeline-parallel-size: 1
+      data-parallel-size: 8
+      data-parallel-rpc-port: 13345
+      enable-expert-parallel: true
+      enforce-eager: true
+      max-model-len: auto
+      max-num-seqs: 4
+      max-num-batched-tokens: 16384
+      trust-remote-code: true
+      no-enable-prefix-caching: true
+      no-enable-flashinfer-autotune: true
+      no-async-scheduling: true
+      block-size: 256
+      gpu-memory-utilization: 0.9
+      no-disable-hybrid-kv-cache-manager: true
+      enable-sleep-mode: true
+    decode:
+      kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
+      served-model-name: "deepseek-ai/DeepSeek-V4-Pro"
+      kv-cache-dtype: "fp8"
+      tensor-parallel-size: 1
+      pipeline-parallel-size: 1
+      data-parallel-size: 16
+      data-parallel-rpc-port: 13345
+      enable-expert-parallel: true
+      max-model-len: auto
+      max-num-seqs: 64
+      max-cudagraph-capture-size: 64
+      max-num-batched-tokens: 64
+      trust-remote-code: true
+      no-enable-prefix-caching: true
+      block-size: 256
+      compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","mode":0}'
+      gpu-memory-utilization: 0.9
+      stream-interval: 50
+      no-disable-hybrid-kv-cache-manager: true
+      enable-sleep-mode: true
+benchmark:
+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  concurrencies: "1024"
+  req_rate: "inf"
+  custom_tokenizer: "deepseek_v4"
+  use_chat_template: false
diff --git a/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-4p1d-dep8-dep16.yaml b/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-4p1d-dep8-dep16.yaml
new file mode 100644
index 00000000..32089c84
--- /dev/null
+++ b/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-4p1d-dep8-dep16.yaml
@@ -0,0 +1,88 @@
+name: "svf-vllm-disagg-gb200-4p1d-dep8-dep16"
+model:
+  path: "deepseekv4-fp4"
+  container: "vllm/vllm-openai:deepseekv4-cu130"
+  precision: "fp4"
+dynamo:
+  hash: 6a159fedd8e4a1563aa647c31f622aedbf254b5b
+setup_script: vllm-container-deps.sh
+resources:
+  gpu_type: "gb200"
+  gpus_per_node: 4
+  prefill_nodes: 8
+  decode_nodes: 4
+  prefill_workers: 4
+  decode_workers: 1
+  gpus_per_prefill: 8
+  gpus_per_decode: 16
+frontend:
+  type: dynamo
+  enable_multiple_frontends: false
+backend:
+  type: vllm
+  connector: null
+  prefill_environment:
+    TILELANG_CLEANUP_TEMP_FILES: "1"
+    VLLM_USE_NCCL_SYMM_MEM: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_NVLS_ENABLE: "1"
+    VLLM_SERVER_DEV_MODE: "1"
+  decode_environment:
+    TILELANG_CLEANUP_TEMP_FILES: "1"
+    VLLM_USE_NCCL_SYMM_MEM: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_NVLS_ENABLE: "1"
+    VLLM_SERVER_DEV_MODE: "1"
+  vllm_config:
+    prefill:
+      kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
+      served-model-name: "deepseek-ai/DeepSeek-V4-Pro"
+      kv-cache-dtype: "fp8"
+      tensor-parallel-size: 1
+      pipeline-parallel-size: 1
+      data-parallel-size: 8
+      data-parallel-rpc-port: 13345
+      enable-expert-parallel: true
+      enforce-eager: true
+      max-model-len: auto
+      max-num-seqs: 4
+      max-num-batched-tokens: 16384
+      trust-remote-code: true
+      no-enable-prefix-caching: true
+      no-enable-flashinfer-autotune: true
+      no-async-scheduling: true
+      block-size: 256
+      gpu-memory-utilization: 0.9
+      no-disable-hybrid-kv-cache-manager: true
+      enable-sleep-mode: true
+    decode:
+      kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
+      served-model-name: "deepseek-ai/DeepSeek-V4-Pro"
+      kv-cache-dtype: "fp8"
+      tensor-parallel-size: 1
+      pipeline-parallel-size: 1
+      data-parallel-size: 16
+      data-parallel-rpc-port: 13345
+      enable-expert-parallel: true
+      max-model-len: auto
+      max-num-seqs: 256
+      max-cudagraph-capture-size: 256
+      max-num-batched-tokens: 256
+      trust-remote-code: true
+      no-enable-prefix-caching: true
+      block-size: 256
+      compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","mode":0}'
+      gpu-memory-utilization: 0.9
+      stream-interval: 50
+      no-disable-hybrid-kv-cache-manager: true
+      enable-sleep-mode: true
+benchmark:
+  type: "sa-bench"
+  isl: 8192
+  osl: 1024
+  concurrencies: "2048"
+  req_rate: "inf"
+  custom_tokenizer: "deepseek_v4"
+  use_chat_template: false
diff --git a/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-7p1d-dep8-dep16.yaml b/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-7p1d-dep8-dep16.yaml
new file mode 100644
index 00000000..1568e492
--- /dev/null
+++ b/recipes/vllm/deepseek-v4-pro/8k1k/disagg-gb200-7p1d-dep8-dep16.yaml
@@ -0,0 +1,87 @@
+name: "svf-vllm-disagg-gb200-7p1d-dep8-dep16"
+model:
+  path: "deepseekv4-fp4"
+  container: "vllm/vllm-openai:deepseekv4-cu130"
+  precision: "fp4"
+dynamo:
+  hash: 6a159fedd8e4a1563aa647c31f622aedbf254b5b
+setup_script: vllm-container-deps.sh
+resources:
+  gpu_type: "gb200"
+  gpus_per_node: 4
+  prefill_nodes: 14
+  decode_nodes: 4
+  prefill_workers: 7
+  decode_workers: 1
+  gpus_per_prefill: 8
+  gpus_per_decode: 16
+frontend:
+  type: dynamo
+  enable_multiple_frontends: false
+backend:
+  type: vllm
+  connector: null
+  prefill_environment:
+    TILELANG_CLEANUP_TEMP_FILES: "1"
+    VLLM_USE_NCCL_SYMM_MEM: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_NVLS_ENABLE: "1"
+    VLLM_SERVER_DEV_MODE: "1"
+  decode_environment:
+    TILELANG_CLEANUP_TEMP_FILES: "1"
+    VLLM_USE_NCCL_SYMM_MEM: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_NVLS_ENABLE: "1"
+    VLLM_SERVER_DEV_MODE: "1"
+  vllm_config:
+    prefill:
+      kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}'
+      served-model-name: "deepseek-ai/DeepSeek-V4-Pro"
+      kv-cache-dtype: "fp8"
+      tensor-parallel-size: 1
+      pipeline-parallel-size: 1
+      data-parallel-size: 8
+      data-parallel-rpc-port: 13345
+
enable-expert-parallel: true + enforce-eager: true + max-model-len: auto + max-num-seqs: 2 + max-num-batched-tokens: 16384 + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-flashinfer-autotune: true + block-size: 256 + gpu-memory-utilization: 0.88 + no-disable-hybrid-kv-cache-manager: true + enable-sleep-mode: true + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "deepseek-ai/DeepSeek-V4-Pro" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 16 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: auto + max-num-seqs: 256 + max-cudagraph-capture-size: 256 + max-num-batched-tokens: 256 + trust-remote-code: true + no-enable-prefix-caching: true + block-size: 256 + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","mode":0}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + no-disable-hybrid-kv-cache-manager: true + enable-sleep-mode: true +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "4096" + req_rate: "inf" + custom_tokenizer: "deepseek_v4" + use_chat_template: false diff --git a/recipes/vllm/kimi-k2.5/1k1k/disagg-gb200-1p1d-dep4-dep16.yaml b/recipes/vllm/kimi-k2.5/1k1k/disagg-gb200-1p1d-dep4-dep16.yaml new file mode 100644 index 00000000..ecdc9233 --- /dev/null +++ b/recipes/vllm/kimi-k2.5/1k1k/disagg-gb200-1p1d-dep4-dep16.yaml @@ -0,0 +1,101 @@ +name: "kimi-vllm-disagg-gb200-1p1d-dep4-dep16" + +model: + path: "kimi-k2.5-nvfp4" + container: "vllm/vllm-openai:v0.18.0-cu130" + precision: "fp4" + +dynamo: + version: 1.0.1 + install: true + +setup_script: vllm-container-deps.sh + +resources: + gpu_type: "gb200" + gpus_per_node: 4 + prefill_nodes: 1 + decode_nodes: 4 + prefill_workers: 1 + decode_workers: 1 + gpus_per_prefill: 4 + gpus_per_decode: 16 + +frontend: + type: dynamo + enable_multiple_frontends: false + +backend: + type: vllm + connector: null + + prefill_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + decode_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 4 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 3072 + max-num-seqs: 4096 + enforce-eager: true + compilation-config: '{"custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + max-num-batched-tokens: 16384 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + attention-config: '{"use_trtllm_ragged_deepseek_prefill": true}' + all2all-backend: "flashinfer_nvlink_one_sided" + gpu-memory-utilization: 0.9 + + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 16 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 3072 + max-num-seqs: 4096 + 
max-num-batched-tokens: 10240 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + async-scheduling: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + all2all-backend: "flashinfer_nvlink_one_sided" + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + max-cudagraph-capture-size: 512 + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + concurrencies: "256x512x1024x2048x3072x4096" + req_rate: "inf" diff --git a/recipes/vllm/kimi-k2.5/1k1k/disagg-gb200-1p4d-dep4-tep4.yaml b/recipes/vllm/kimi-k2.5/1k1k/disagg-gb200-1p4d-dep4-tep4.yaml new file mode 100644 index 00000000..43167b5f --- /dev/null +++ b/recipes/vllm/kimi-k2.5/1k1k/disagg-gb200-1p4d-dep4-tep4.yaml @@ -0,0 +1,98 @@ +name: "kimi-vllm-disagg-gb200-1p4d-dep4-tep4" + +model: + path: "kimi-k2.5-nvfp4" + container: "vllm/vllm-openai:v0.18.0-cu130" + precision: "fp4" + +dynamo: + version: 1.0.1 + install: true + +setup_script: vllm-container-deps.sh + +resources: + gpu_type: "gb200" + gpus_per_node: 4 + prefill_nodes: 1 + decode_nodes: 4 + prefill_workers: 1 + decode_workers: 4 + gpus_per_prefill: 4 + gpus_per_decode: 4 + +frontend: + type: dynamo + enable_multiple_frontends: false + +backend: + type: vllm + connector: null + + prefill_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + decode_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 4 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 3072 + max-num-seqs: 1024 + enforce-eager: true + compilation-config: '{"custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + max-num-batched-tokens: 16384 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + attention-config: '{"use_trtllm_ragged_deepseek_prefill": true}' + all2all-backend: "flashinfer_nvlink_one_sided" + gpu-memory-utilization: 0.9 + + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 4 + pipeline-parallel-size: 1 + enable-expert-parallel: true + max-model-len: 3072 + max-num-seqs: 1024 + max-num-batched-tokens: 10240 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + async-scheduling: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + max-cudagraph-capture-size: 1024 + +benchmark: + type: "sa-bench" + 
isl: 1024 + osl: 1024 + concurrencies: "4x8x16x32x64x128" + req_rate: "inf" diff --git a/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-1p4d-dep4-tep4.yaml b/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-1p4d-dep4-tep4.yaml new file mode 100644 index 00000000..1ab6ca27 --- /dev/null +++ b/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-1p4d-dep4-tep4.yaml @@ -0,0 +1,98 @@ +name: "kimi-vllm-disagg-gb200-1p4d-dep4-tep4" + +model: + path: "kimi-k2.5-nvfp4" + container: "vllm/vllm-openai:v0.18.0-cu130" + precision: "fp4" + +dynamo: + version: 1.0.1 + install: true + +setup_script: vllm-container-deps.sh + +resources: + gpu_type: "gb200" + gpus_per_node: 4 + prefill_nodes: 1 + decode_nodes: 4 + prefill_workers: 1 + decode_workers: 4 + gpus_per_prefill: 4 + gpus_per_decode: 4 + +frontend: + type: dynamo + enable_multiple_frontends: false + +backend: + type: vllm + connector: null + + prefill_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + decode_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 4 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 64 + enforce-eager: true + compilation-config: '{"custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + max-num-batched-tokens: 16384 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + attention-config: '{"use_trtllm_ragged_deepseek_prefill": true}' + all2all-backend: "flashinfer_nvlink_one_sided" + gpu-memory-utilization: 0.9 + + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 4 + pipeline-parallel-size: 1 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 16 + max-num-batched-tokens: 10240 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + async-scheduling: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + max-cudagraph-capture-size: 16 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "4x8x16x32x128" + req_rate: "inf" diff --git a/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-3p1d-dep4-dep16.yaml b/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-3p1d-dep4-dep16.yaml new file mode 100644 index 00000000..ca4e9813 --- /dev/null +++ b/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-3p1d-dep4-dep16.yaml @@ -0,0 +1,101 @@ +name: "kimi-vllm-disagg-gb200-3p1d-dep4-dep16" + +model: + path: "kimi-k2.5-nvfp4" + container: "vllm/vllm-openai:v0.18.0-cu130" + precision: "fp4" + +dynamo: + version: 1.0.1 + install: true + +setup_script: vllm-container-deps.sh + +resources: + gpu_type: 
"gb200" + gpus_per_node: 4 + prefill_nodes: 3 + decode_nodes: 4 + prefill_workers: 3 + decode_workers: 1 + gpus_per_prefill: 4 + gpus_per_decode: 16 + +frontend: + type: dynamo + enable_multiple_frontends: false + +backend: + type: vllm + connector: null + + prefill_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + decode_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 4 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 64 + enforce-eager: true + compilation-config: '{"custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + max-num-batched-tokens: 16384 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + attention-config: '{"use_trtllm_ragged_deepseek_prefill": true}' + all2all-backend: "flashinfer_nvlink_one_sided" + gpu-memory-utilization: 0.9 + + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 16 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 256 + max-num-batched-tokens: 10240 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + async-scheduling: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + all2all-backend: "flashinfer_nvlink_one_sided" + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + max-cudagraph-capture-size: 256 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "512x1024" + req_rate: "inf" diff --git a/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-5p1d-dep4-dep8.yaml b/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-5p1d-dep4-dep8.yaml new file mode 100644 index 00000000..cd9f94a9 --- /dev/null +++ b/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-5p1d-dep4-dep8.yaml @@ -0,0 +1,101 @@ +name: "kimi-vllm-disagg-gb200-5p1d-dep4-dep8" + +model: + path: "kimi-k2.5-nvfp4" + container: "vllm/vllm-openai:v0.18.0-cu130" + precision: "fp4" + +dynamo: + version: 1.0.1 + install: true + +setup_script: vllm-container-deps.sh + +resources: + gpu_type: "gb200" + gpus_per_node: 4 + prefill_nodes: 5 + decode_nodes: 2 + prefill_workers: 5 + decode_workers: 1 + gpus_per_prefill: 4 + gpus_per_decode: 8 + +frontend: + type: dynamo + enable_multiple_frontends: false + +backend: + type: vllm + connector: null + + prefill_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + decode_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + 
NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 4 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 64 + enforce-eager: true + compilation-config: '{"custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + max-num-batched-tokens: 16384 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + attention-config: '{"use_trtllm_ragged_deepseek_prefill": true}' + all2all-backend: "flashinfer_nvlink_one_sided" + gpu-memory-utilization: 0.9 + + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 8 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 512 + max-num-batched-tokens: 10240 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + async-scheduling: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + all2all-backend: "flashinfer_nvlink_one_sided" + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + max-cudagraph-capture-size: 512 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "2048" + req_rate: "inf" diff --git a/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-6p1d-dep4-dep16.yaml b/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-6p1d-dep4-dep16.yaml new file mode 100644 index 00000000..47d3d7ee --- /dev/null +++ b/recipes/vllm/kimi-k2.5/8k1k/disagg-gb200-6p1d-dep4-dep16.yaml @@ -0,0 +1,101 @@ +name: "kimi-vllm-disagg-gb200-6p1d-dep4-dep16" + +model: + path: "kimi-k2.5-nvfp4" + container: "vllm/vllm-openai:v0.18.0-cu130" + precision: "fp4" + +dynamo: + version: 1.0.1 + install: true + +setup_script: vllm-container-deps.sh + +resources: + gpu_type: "gb200" + gpus_per_node: 4 + prefill_nodes: 6 + decode_nodes: 4 + prefill_workers: 6 + decode_workers: 1 + gpus_per_prefill: 4 + gpus_per_decode: 16 + +frontend: + type: dynamo + enable_multiple_frontends: false + +backend: + type: vllm + connector: null + + prefill_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + decode_environment: + VLLM_USE_FLASHINFER_MOE_FP4: "1" + VLLM_USE_NCCL_SYMM_MEM: "1" + NCCL_CUMEM_ENABLE: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_NVLS_ENABLE: "1" + + vllm_config: + prefill: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 4 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 64 + enforce-eager: true + compilation-config: 
'{"custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + max-num-batched-tokens: 16384 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + attention-config: '{"use_trtllm_ragged_deepseek_prefill": true}' + all2all-backend: "flashinfer_nvlink_one_sided" + gpu-memory-utilization: 0.9 + + decode: + kv-transfer-config: '{"kv_connector": "NixlConnector", "kv_role": "kv_both"}' + served-model-name: "nvidia/Kimi-K2.5-NVFP4" + kv-cache-dtype: "fp8" + tensor-parallel-size: 1 + pipeline-parallel-size: 1 + data-parallel-size: 16 + data-parallel-rpc-port: 13345 + enable-expert-parallel: true + max-model-len: 10240 + max-num-seqs: 512 + max-num-batched-tokens: 10240 + safetensors-load-strategy: "prefetch" + trust-remote-code: true + no-enable-prefix-caching: true + no-enable-chunked-prefill: true + async-scheduling: true + attention-backend: "FLASHINFER_MLA" + block-size: 64 + all2all-backend: "flashinfer_nvlink_one_sided" + compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}' + gpu-memory-utilization: 0.9 + stream-interval: 50 + max-cudagraph-capture-size: 512 + +benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + concurrencies: "3072x4096" + req_rate: "inf" diff --git a/recipes/vllm/minimax-m2.5/b200-fp4/1k1k.yaml b/recipes/vllm/minimax-m2.5/b200-fp4/1k1k.yaml new file mode 100644 index 00000000..daef7b0d --- /dev/null +++ b/recipes/vllm/minimax-m2.5/b200-fp4/1k1k.yaml @@ -0,0 +1,103 @@ +# MiniMax-M2.5 NVFP4 B200 — 1K/1K ISL/OSL +# Aggregated vLLM, single-node +# requires github.com/NVIDIA/srt-slurm, branch sa-submission-q2-2026 +# usage examples: +# srtctl apply -f 1k1k.yaml # run all variants +# srtctl apply -f 1k1k.yaml:zip_override_lowlat # full lowlat sweep +# srtctl apply -f 1k1k.yaml:zip_override_lowlat[2] # lowlat, tep2 variant only +# srtctl apply -f 1k1k.yaml:zip_override_hightput # full high tput sweep +# srtctl dry-run -f 1k1k.yaml # preview the variants + +base: + name: "minimax-m2.5-nvfp4-b200-1k1k" + + model: + path: "minimax_m2.5_fp4" + container: "vllm/vllm-openai:v0.19.0-cu130" + precision: "fp4" + + resources: + gpu_type: "b200" + gpus_per_node: 8 + agg_nodes: 1 + agg_workers: 1 + gpus_per_agg: 1 + + frontend: + type: dynamo + enable_multiple_frontends: false + + dynamo: + install: true + top_of_tree: true # currently need ToT for vllm 0.19.0 + + setup_script: vllm-container-deps.sh + + backend: + type: vllm + + aggregated_environment: + DYN_HEALTH_CHECK_ENABLED: "false" + PYTHONUNBUFFERED: "1" + + vllm_config: + aggregated: + tensor-parallel-size: 1 + gpu-memory-utilization: 0.90 + max-model-len: 2248 + max-num-batched-tokens: 2048 + kv-cache-dtype: fp8 + max-cudagraph-capture-size: 2048 + stream-interval: 20 + no-enable-prefix-caching: true + trust-remote-code: true + + benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + req_rate: "inf" + + +zip_override_lowlat: + name: + - "minimax-m2.5-nvfp4-b200-1k1k-lowlat-tp1" + - "minimax-m2.5-nvfp4-b200-1k1k-lowlat-tp2" + - "minimax-m2.5-nvfp4-b200-1k1k-lowlat-tep2" + resources: + gpus_per_agg: [1, 2, 2] + backend: + vllm_config: + aggregated: + tensor-parallel-size: [1, 2, 2] + enable-expert-parallel: [false, false, true] + benchmark: + concurrencies: ["4","4x8x16x32x64x128x256x512","128x256"] 
+ +override_maxtput: + name: "minimax-m2.5-nvfp4-b200-1k1k-maxtput-dep2" + resources: + gpus_per_agg: 2 + backend: + vllm_config: + aggregated: + tensor-parallel-size: 1 + enable-expert-parallel: true + data-parallel-size: 2 + benchmark: + concurrencies: "512" + +zip_override_hightput: + name: + - "minimax-m2.5-nvfp4-b200-1k1k-hightput-tp4" + - "minimax-m2.5-nvfp4-b200-1k1k-hightput-tep4" + - "minimax-m2.5-nvfp4-b200-1k1k-hightput-tp8" + resources: + gpus_per_agg: [4, 4, 8] + backend: + vllm_config: + aggregated: + tensor-parallel-size: [4, 4, 8] + enable-expert-parallel: [false, true, false] + benchmark: + concurrencies: ["4x8x16x32x64x128x256x512", "32x64x128", "4"] diff --git a/recipes/vllm/minimax-m2.5/b200-fp4/8k1k.yaml b/recipes/vllm/minimax-m2.5/b200-fp4/8k1k.yaml new file mode 100644 index 00000000..7d817e73 --- /dev/null +++ b/recipes/vllm/minimax-m2.5/b200-fp4/8k1k.yaml @@ -0,0 +1,88 @@ +# MiniMax-M2.5 NVFP4 B200 — 8K/1K ISL/OSL +# Aggregated vLLM, single-node +# requires github.com/NVIDIA/srt-slurm, branch sa-submission-q2-2026 +# usage examples: +# srtctl apply -f 8k1k.yaml # run all variants +# srtctl apply -f 8k1k.yaml:zip_override_lowlat # full lowlat sweep +# srtctl apply -f 8k1k.yaml:zip_override_lowlat[2] # lowlat, tep2 variant only +# srtctl apply -f 8k1k.yaml:zip_override_maxtput # full max tput sweep +# srtctl dry-run -f 8k1k.yaml # preview the variants + +base: + name: "minimax-m2.5-nvfp4-b200-8k1k" + + model: + path: "minimax_m2.5_fp4" + container: "vllm/vllm-openai:v0.19.0-cu130" + precision: "fp4" + + resources: + gpu_type: "b200" + gpus_per_node: 8 + agg_nodes: 1 + agg_workers: 1 + gpus_per_agg: 1 + + frontend: + type: dynamo + enable_multiple_frontends: false + + dynamo: + install: true + top_of_tree: true # currently need ToT for vllm 0.19.0 + + setup_script: vllm-container-deps.sh + + backend: + type: vllm + + aggregated_environment: + DYN_HEALTH_CHECK_ENABLED: "false" + PYTHONUNBUFFERED: "1" + + vllm_config: + aggregated: + tensor-parallel-size: 1 + gpu-memory-utilization: 0.90 + max-model-len: 9416 + max-num-batched-tokens: 16384 + kv-cache-dtype: fp8 + max-cudagraph-capture-size: 2048 + stream-interval: 20 + no-enable-prefix-caching: true + trust-remote-code: true + + benchmark: + type: "sa-bench" + isl: 8192 + osl: 1024 + req_rate: "inf" + +zip_override_lowlat: + name: + - "minimax-m2.5-nvfp4-b200-8k1k-lowlat-tp1" + - "minimax-m2.5-nvfp4-b200-8k1k-lowlat-tp2" + - "minimax-m2.5-nvfp4-b200-8k1k-lowlat-tep2" + resources: + gpus_per_agg: [1, 2, 2] + backend: + vllm_config: + aggregated: + tensor-parallel-size: [1, 2, 2] + enable-expert-parallel: [false, false, true] + benchmark: + concurrencies: ["4x8x16x32x256x512", "4x8x16x32x64x128x256x512", "128x256x512"] + +zip_override_maxtput: + name: + - "minimax-m2.5-nvfp4-b200-8k1k-maxtput-tp4" + - "minimax-m2.5-nvfp4-b200-8k1k-maxtput-tp8" + resources: + gpus_per_agg: [4, 8] + backend: + vllm_config: + aggregated: + tensor-parallel-size: [4, 8] + enable-expert-parallel: false + benchmark: + concurrencies: ["4x8x16x32x64x128x256x512", "4"] diff --git a/src/srtctl/backends/vllm.py b/src/srtctl/backends/vllm.py index ff20cb40..1acbd50c 100644 --- a/src/srtctl/backends/vllm.py +++ b/src/srtctl/backends/vllm.py @@ -132,12 +132,16 @@ def get_process_environment(self, process: Process) -> dict[str, str]: vLLM with dynamo requires unique ports for each worker: - DYN_VLLM_KV_EVENT_PORT: ZMQ port for KV events publishing - VLLM_NIXL_SIDE_CHANNEL_PORT: Port for NIXL side channel transfers + - VLLM_NIXL_SIDE_CHANNEL_HOST: 
Routable IP for NIXL side channel (not 0.0.0.0/localhost) """ + from srtctl.core.slurm import get_hostname_ip + env: dict[str, str] = {} if process.kv_events_port is not None: env["DYN_VLLM_KV_EVENT_PORT"] = str(process.kv_events_port) if process.nixl_port is not None: env["VLLM_NIXL_SIDE_CHANNEL_PORT"] = str(process.nixl_port) + env["VLLM_NIXL_SIDE_CHANNEL_HOST"] = get_hostname_ip(process.node) return env def get_served_model_name(self, default: str) -> str: diff --git a/src/srtctl/benchmarks/__init__.py b/src/srtctl/benchmarks/__init__.py index 3a2d6449..088617a6 100644 --- a/src/srtctl/benchmarks/__init__.py +++ b/src/srtctl/benchmarks/__init__.py @@ -4,7 +4,7 @@ """Benchmark runners for srtctl.""" # Import runners to trigger registration -from srtctl.benchmarks import gpqa, gsm8k, longbenchv2, mmlu, mooncake_router, router, sa_bench, sglang_bench +from srtctl.benchmarks import gpqa, gsm8k, lm_eval, longbenchv2, mmlu, mooncake_router, router, sa_bench, sglang_bench from srtctl.benchmarks.base import ( BenchmarkRunner, get_runner, @@ -18,6 +18,7 @@ "list_benchmarks", "register_benchmark", # Runners + "lm_eval", "sa_bench", "sglang_bench", "mmlu", diff --git a/src/srtctl/benchmarks/lm_eval.py b/src/srtctl/benchmarks/lm_eval.py new file mode 100644 index 00000000..c63ec097 --- /dev/null +++ b/src/srtctl/benchmarks/lm_eval.py @@ -0,0 +1,58 @@ +# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-FileCopyrightText: Copyright (c) 2026 SemiAnalysis LLC. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +"""lm-eval benchmark runner for InferenceX evals.""" + +from __future__ import annotations + +from typing import TYPE_CHECKING + +from srtctl.benchmarks.base import SCRIPTS_DIR, BenchmarkRunner, register_benchmark + +if TYPE_CHECKING: + from srtctl.core.runtime import RuntimeContext + from srtctl.core.schema import SrtConfig + + +@register_benchmark("lm-eval") +class LMEvalRunner(BenchmarkRunner): + """lm-eval accuracy evaluation using InferenceX benchmark_lib. + + Runs lm-eval via the InferenceX benchmark_lib.sh harness, + which handles task selection, result collection, and summary generation. + """ + + @property + def name(self) -> str: + return "lm-eval" + + @property + def script_path(self) -> str: + return "/srtctl-benchmarks/lm-eval/bench.sh" + + @property + def local_script_dir(self) -> str: + return str(SCRIPTS_DIR / "lm-eval") + + def validate_config(self, config: SrtConfig) -> list[str]: + # lm-eval has sensible defaults + return [] + + def build_command( + self, + config: SrtConfig, + runtime: RuntimeContext, + ) -> list[str]: + endpoint = f"http://localhost:{runtime.frontend_port}" + # Always use the container mount path, not the host path. + # INFMAX_WORKSPACE env var contains the host path (used for mount setup + # in runtime.py), but inside the container it's at /infmax-workspace. 
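+        # Illustrative sketch (the port value is hypothetical): with
+        # runtime.frontend_port == 8000, build_command returns
+        #   ["bash", "/srtctl-benchmarks/lm-eval/bench.sh",
+        #    "http://localhost:8000", "/infmax-workspace"]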
+ infmax_workspace = "/infmax-workspace" + + return [ + "bash", + self.script_path, + endpoint, + infmax_workspace, + ] diff --git a/src/srtctl/benchmarks/sa_bench.py b/src/srtctl/benchmarks/sa_bench.py index 9adc6678..5f220393 100644 --- a/src/srtctl/benchmarks/sa_bench.py +++ b/src/srtctl/benchmarks/sa_bench.py @@ -97,5 +97,9 @@ def build_command( str(prefill_gpus), str(decode_gpus), str(b.random_range_ratio) if b.random_range_ratio is not None else "0.8", + str(b.num_prompts_mult) if b.num_prompts_mult is not None else "10", + str(b.num_warmup_mult) if b.num_warmup_mult is not None else "2", + b.custom_tokenizer or "", + str(b.use_chat_template).lower(), ] return cmd diff --git a/src/srtctl/benchmarks/scripts/lm-eval/bench.sh b/src/srtctl/benchmarks/scripts/lm-eval/bench.sh new file mode 100755 index 00000000..a10e4e7d --- /dev/null +++ b/src/srtctl/benchmarks/scripts/lm-eval/bench.sh @@ -0,0 +1,77 @@ +#!/bin/bash +# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-FileCopyrightText: Copyright (c) 2026 SemiAnalysis LLC. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +# lm-eval accuracy evaluation using InferenceX benchmark_lib +# Expects: endpoint [infmax_workspace] + +set -e + +ENDPOINT=$1 +INFMAX_WORKSPACE=${2:-/infmax-workspace} + +# Extract HOST and PORT from endpoint (e.g., http://localhost:8000) +HOST=$(echo "$ENDPOINT" | sed -E 's|https?://||; s|:.*||') +PORT=$(echo "$ENDPOINT" | sed -E 's|.*:([0-9]+).*|\1|') + +echo "lm-eval Config: endpoint=${ENDPOINT}; host=${HOST}; port=${PORT}; workspace=${INFMAX_WORKSPACE}" + +# Auto-discover the served model name from /v1/models if MODEL_NAME is not set. +# This ensures we use the exact name the server recognizes, regardless of what +# $MODEL (the HuggingFace ID from the workflow) is set to. +if [[ -z "${MODEL_NAME:-}" ]]; then + DISCOVERED_MODEL=$(curl -sf "${ENDPOINT}/v1/models" 2>/dev/null \ + | python3 -c "import sys,json; d=json.load(sys.stdin); print(d['data'][0]['id'])" 2>/dev/null || true) + if [[ -n "$DISCOVERED_MODEL" ]]; then + export MODEL_NAME="$DISCOVERED_MODEL" + echo "Auto-discovered MODEL_NAME from /v1/models: ${MODEL_NAME}" + else + echo "WARNING: Could not discover model name from /v1/models, using MODEL_NAME=${MODEL_NAME:-$MODEL}" + fi +else + echo "Using MODEL_NAME from environment: ${MODEL_NAME}" +fi + +# cd to workspace so that relative paths (e.g., utils/evals/*.yaml) resolve +cd "${INFMAX_WORKSPACE}" + +# Source the InferenceX benchmark library +source "${INFMAX_WORKSPACE}/benchmarks/benchmark_lib.sh" + +# Run lm-eval via benchmark_lib +# EVAL_CONC is set by the InferenceX workflow (median of conc list). +# benchmark_lib reads concurrency from EVAL_CONCURRENT_REQUESTS env var. +export EVAL_CONCURRENT_REQUESTS="${EVAL_CONC:-${EVAL_CONCURRENT_REQUESTS:-256}}" +echo "Running lm-eval with concurrent-requests=${EVAL_CONCURRENT_REQUESTS}..." +eval_rc=0 +run_eval --framework lm-eval --port "$PORT" || eval_rc=$? + +# Derive metadata env vars that append_lm_eval_summary needs but do_sweep.py +# does not pass directly (it passes PREFILL_TP/EP/etc, not TP/EP_SIZE/CONC). 
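+# Worked example (hypothetical values): if do_sweep.py exported
+# PREFILL_TP=4, PREFILL_EP=4 and EVAL_CONC=64, the fallback chains below
+# resolve to TP=4, EP_SIZE=4, CONC=64 and DP_ATTENTION=false, unless the
+# caller already set those variables explicitly.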
+export IS_MULTINODE="${IS_MULTINODE:-true}" +export TP="${TP:-${PREFILL_TP:-1}}" +export CONC="${CONC:-${EVAL_CONC:-${EVAL_CONCURRENT_REQUESTS:-1}}}" +export EP_SIZE="${EP_SIZE:-${PREFILL_EP:-1}}" +export DP_ATTENTION="${DP_ATTENTION:-${PREFILL_DP_ATTN:-false}}" +# Remap srt-slurm's DP_ATTN names to InferenceX's DP_ATTENTION names +export PREFILL_DP_ATTENTION="${PREFILL_DP_ATTENTION:-${PREFILL_DP_ATTN:-${DP_ATTENTION:-false}}}" +export DECODE_DP_ATTENTION="${DECODE_DP_ATTENTION:-${DECODE_DP_ATTN:-${DP_ATTENTION:-false}}}" + +# Generate the lm-eval summary +echo "Generating lm-eval summary..." +append_lm_eval_summary || true + +# Copy eval artifacts to /logs/eval_results/ +mkdir -p /logs/eval_results +echo "Copying eval artifacts to /logs/eval_results/..." +cp -v meta_env.json /logs/eval_results/ 2>/dev/null || true +cp -v results*.json /logs/eval_results/ 2>/dev/null || true +cp -v sample*.jsonl /logs/eval_results/ 2>/dev/null || true + +if [[ "$eval_rc" -ne 0 ]]; then + echo "lm-eval evaluation failed with exit code ${eval_rc}" + exit "$eval_rc" +fi + +echo "lm-eval evaluation complete" diff --git a/src/srtctl/benchmarks/scripts/sa-bench/backend_request_func.py b/src/srtctl/benchmarks/scripts/sa-bench/backend_request_func.py index dd2cac44..ded56a80 100644 --- a/src/srtctl/benchmarks/scripts/sa-bench/backend_request_func.py +++ b/src/srtctl/benchmarks/scripts/sa-bench/backend_request_func.py @@ -511,10 +511,107 @@ def get_model(pretrained_model_name_or_path: str) -> str: return pretrained_model_name_or_path +def _resolve_tokenizer_file(model_name_or_path): + """Resolve tokenizer.json from a local directory or HF hub cache.""" + from pathlib import Path + + local_path = Path(model_name_or_path) / "tokenizer.json" + if local_path.is_file(): + return str(local_path) + try: + from huggingface_hub import hf_hub_download + + return hf_hub_download(model_name_or_path, "tokenizer.json", local_files_only=True) + except Exception: + return None + + +def _fix_v5_tokenizer_components(tokenizer, model_name_or_path): + """Fix pre_tokenizer/decoder when transformers v5 LlamaTokenizerFast overwrites them. + + In transformers v5, LlamaTokenizerFast.__init__ rebuilds the pre_tokenizer + and decoder from scratch, discarding the originals from tokenizer.json. + This breaks models like DeepSeek-R1 that declare LlamaTokenizerFast but + actually use a ByteLevel pre_tokenizer. + + Ported from sglang/python/sglang/srt/utils/hf_transformers_utils.py. + """ + backend = getattr(tokenizer, "_tokenizer", None) + if backend is None: + return + + try: + from tokenizers import Tokenizer as RawTokenizer + + tok_file = _resolve_tokenizer_file(model_name_or_path) + if tok_file is None: + return + raw = RawTokenizer.from_file(tok_file) + except Exception: + return + + raw_pre = type(raw.pre_tokenizer).__name__ if raw.pre_tokenizer else None + loaded_pre = type(backend.pre_tokenizer).__name__ if backend.pre_tokenizer else None + + if raw_pre and loaded_pre and raw_pre != loaded_pre: + print( + f"[sa-bench] Fixing v5 tokenizer component mismatch for {model_name_or_path}: " + f"pre_tokenizer {loaded_pre} -> {raw_pre}, " + f"decoder {type(backend.decoder).__name__ if backend.decoder else None} " + f"-> {type(raw.decoder).__name__ if raw.decoder else None}", + flush=True, + ) + backend.pre_tokenizer = raw.pre_tokenizer + backend.decoder = raw.decoder + + +def _load_glm_moe_dsa_tokenizer(pretrained_model_name_or_path: str) -> "PreTrainedTokenizerFast": + """Load GLM-Moe-Dsa / GLM-5 tokenizer directly from tokenizer.json. 
+ + Works around incompatibilities when the checkpoint was saved with + transformers 5.x (TokenizersBackend / list-style extra_special_tokens). + """ + import json + from pathlib import Path + + from tokenizers import Tokenizer as RustTokenizer + from transformers import PreTrainedTokenizerFast + + _SAFE_CONFIG_KEYS = ( + "pad_token", "pad_token_id", "eos_token", "eos_token_id", + "bos_token", "bos_token_id", "unk_token", "unk_token_id", + "model_max_length", "padding_side", "truncation_side", + ) + + path = Path(pretrained_model_name_or_path) + tokenizer_json = path / "tokenizer.json" + if not tokenizer_json.exists(): + raise FileNotFoundError( + f"Expected tokenizer.json at {tokenizer_json}. " + "GlmMoeDsaTokenizer loads from tokenizer.json only." + ) + + rust_tok = RustTokenizer.from_file(str(tokenizer_json)) + init_kwargs = {} + config_path = path / "tokenizer_config.json" + if config_path.exists(): + with open(config_path, encoding="utf-8") as f: + config = json.load(f) + for key in _SAFE_CONFIG_KEYS: + if key in config: + init_kwargs[key] = config[key] + if "extra_special_tokens" in config: + init_kwargs["additional_special_tokens"] = config["extra_special_tokens"] + + return PreTrainedTokenizerFast(tokenizer_object=rust_tok, **init_kwargs) + + def get_tokenizer( pretrained_model_name_or_path: str, tokenizer_mode: str = "auto", trust_remote_code: bool = False, + custom_tokenizer: str | None = None, + backend: str | None = None, **kwargs, ) -> PreTrainedTokenizer | PreTrainedTokenizerFast: if pretrained_model_name_or_path is not None and not os.path.exists(pretrained_model_name_or_path): @@ -533,12 +630,60 @@ def get_tokenizer( "to use mistral tokenizer mode." ) from e return MistralTokenizer.from_pretrained(str(pretrained_model_name_or_path)) - else: - return AutoTokenizer.from_pretrained( - pretrained_model_name_or_path, - trust_remote_code=trust_remote_code, - **kwargs, - ) + + if custom_tokenizer: + if custom_tokenizer == "glm_moe_dsa": + return _load_glm_moe_dsa_tokenizer(pretrained_model_name_or_path) + if custom_tokenizer == "deepseek_v4": + if backend == "sglang": + # SGLang has no client-side DeepseekV4Tokenizer package; we + # vendor sglang's own server-side encoder (encoding_dsv4.py) + # under ./tokenizers/ so the sa-bench client renders the + # exact same DSML prompt the sglang server builds. + from tokenizers.sglang_deepseek_v4 import ( + SGLangDeepseekV4Tokenizer, + ) + return SGLangDeepseekV4Tokenizer.from_pretrained( + str(pretrained_model_name_or_path) + ) + if backend in (None, "vllm"): + try: + from vllm.tokenizers.deepseek_v4 import DeepseekV4Tokenizer + except ImportError as e: + raise ImportError( + "DeepseekV4Tokenizer requires vllm package.\n" + "Please install it with `pip install vllm` " + "to use deepseek_v4 tokenizer." + ) from e + return DeepseekV4Tokenizer.from_pretrained( + str(pretrained_model_name_or_path) + ) + raise ValueError( + f"custom_tokenizer='deepseek_v4' does not support backend={backend!r}; " + "expected 'vllm' or 'sglang'." + ) + from importlib import import_module + try: + module_path, class_name = custom_tokenizer.rsplit('.', 1) + module = import_module(module_path) + tokenizer_class = getattr(module, class_name) + return tokenizer_class.from_pretrained( + pretrained_model_name_or_path, + trust_remote_code=trust_remote_code, + **kwargs, + ) + except (ValueError, ImportError, AttributeError) as e: + raise ValueError( + f"Failed to load custom_tokenizer '{custom_tokenizer}'. 
" + "Expected 'glm_moe_dsa' or 'module.path.ClassName'.") from e + + tokenizer = AutoTokenizer.from_pretrained( + pretrained_model_name_or_path, + trust_remote_code=trust_remote_code, + **kwargs, + ) + _fix_v5_tokenizer_components(tokenizer, pretrained_model_name_or_path) + return tokenizer ASYNC_REQUEST_FUNCS = { diff --git a/src/srtctl/benchmarks/scripts/sa-bench/bench.sh b/src/srtctl/benchmarks/scripts/sa-bench/bench.sh index ed907308..acddf754 100644 --- a/src/srtctl/benchmarks/scripts/sa-bench/bench.sh +++ b/src/srtctl/benchmarks/scripts/sa-bench/bench.sh @@ -60,6 +60,22 @@ TOTAL_GPUS=${9:-0} PREFILL_GPUS=${10:-0} DECODE_GPUS=${11:-0} RANDOM_RANGE_RATIO=${12:-0.8} +NUM_PROMPTS_MULT=${13:-10} +NUM_WARMUP_MULT=${14:-2} +CUSTOM_TOKENIZER=${15:-} +USE_CHAT_TEMPLATE=${16:-true} + +# Build optional custom tokenizer args +CUSTOM_TOKENIZER_ARGS=() +if [ -n "$CUSTOM_TOKENIZER" ]; then + CUSTOM_TOKENIZER_ARGS=(--custom-tokenizer "$CUSTOM_TOKENIZER") +fi + +# Build optional chat template args +CHAT_TEMPLATE_ARGS=() +if [ "$USE_CHAT_TEMPLATE" = "true" ]; then + CHAT_TEMPLATE_ARGS=(--use-chat-template) +fi # Parse endpoint into host:port HOST=$(echo "$ENDPOINT" | sed 's|http://||' | cut -d: -f1) @@ -119,7 +135,8 @@ for concurrency in "${CONCURRENCY_LIST[@]}"; do --request-rate 250 \ --percentile-metrics ttft,tpot,itl,e2el \ --max-concurrency "$concurrency" \ - --trust-remote-code + --trust-remote-code \ + "${CUSTOM_TOKENIZER_ARGS[@]}" num_prompts=$((concurrency * 10)) @@ -149,7 +166,8 @@ for concurrency in "${CONCURRENCY_LIST[@]}"; do --percentile-metrics ttft,tpot,itl,e2el \ --max-concurrency "$concurrency" \ --trust-remote-code \ - --use-chat-template \ + "${CHAT_TEMPLATE_ARGS[@]}" \ + "${CUSTOM_TOKENIZER_ARGS[@]}" \ --save-result --result-dir "$result_dir" --result-filename "$result_filename" set +x diff --git a/src/srtctl/benchmarks/scripts/sa-bench/benchmark_serving.py b/src/srtctl/benchmarks/scripts/sa-bench/benchmark_serving.py index 4363ef6e..75b3a97f 100644 --- a/src/srtctl/benchmarks/scripts/sa-bench/benchmark_serving.py +++ b/src/srtctl/benchmarks/scripts/sa-bench/benchmark_serving.py @@ -837,6 +837,8 @@ def main(args: argparse.Namespace): tokenizer_id, tokenizer_mode=tokenizer_mode, trust_remote_code=args.trust_remote_code, + custom_tokenizer=args.custom_tokenizer, + backend=backend, ) if args.dataset is not None: @@ -1279,6 +1281,14 @@ def main(args: argparse.Namespace): '"custom" will use --tokenizer to select the preregistered tokenizer.', ) + parser.add_argument( + "--custom-tokenizer", + type=str, + default=None, + help="Custom tokenizer to use (e.g., 'glm_moe_dsa' or 'module.path.ClassName'). " + "When set, overrides the default tokenizer loading.", + ) + parser.add_argument( "--served-model-name", type=str, diff --git a/src/srtctl/benchmarks/scripts/sa-bench/tokenizers/__init__.py b/src/srtctl/benchmarks/scripts/sa-bench/tokenizers/__init__.py new file mode 100644 index 00000000..42d334ba --- /dev/null +++ b/src/srtctl/benchmarks/scripts/sa-bench/tokenizers/__init__.py @@ -0,0 +1 @@ +"""Custom tokenizers bundled with sa-bench.""" diff --git a/src/srtctl/benchmarks/scripts/sa-bench/tokenizers/_sglang_encoding_dsv4.py b/src/srtctl/benchmarks/scripts/sa-bench/tokenizers/_sglang_encoding_dsv4.py new file mode 100644 index 00000000..2212e090 --- /dev/null +++ b/src/srtctl/benchmarks/scripts/sa-bench/tokenizers/_sglang_encoding_dsv4.py @@ -0,0 +1,856 @@ +# SPDX-License-Identifier: Apache-2.0 +# +# Vendored from sgl-project/sglang PR #23600 (currently unmerged). 
+# Source: https://github.com/sgl-project/sglang/blob/f5d03db853862c8fb0e805df591bed883a71868b/python/sglang/srt/entrypoints/openai/encoding_dsv4.py
+# Upstream SHA-256: 106b471e559153d93c4af34a4865b2a68b205b72ddd688dbed93dfd86e4b92cb
+#
+# This file is vendored because sglang does not ship a client-side
+# tokenizer package equivalent to vllm.tokenizers.deepseek_v4. Keeping
+# a byte-identical copy here lets the sa-bench client render the exact
+# DeepSeek-V4 DSML prompt that sglang server builds internally, so
+# input_tokens reported by the client match the server's #new-token.
+#
+# When sglang upstream merges an official client-side tokenizer package,
+# this vendored copy can be removed in favor of that import.
+#
+# -------------------- Original sglang file begins below --------------------
+# Adapted from the DeepSeek-V4 release reference implementation.
+"""
+DeepSeek-V4 Encoding
+
+A self-contained implementation for encoding/decoding DeepSeek-V4 chat messages
+with tool calling, thinking mode, and quick instruction task support.
+"""
+
+import copy
+import json
+import re
+from typing import Any, Dict, List, Optional, Tuple, Union
+
+# ============================================================
+# Special Tokens
+# ============================================================
+
+bos_token: str = "<|begin▁of▁sentence|>"
+eos_token: str = "<|end▁of▁sentence|>"
+thinking_start_token: str = "<think>"
+thinking_end_token: str = "</think>"
+dsml_token: str = "|DSML|"
+
+USER_SP_TOKEN = "<|User|>"
+ASSISTANT_SP_TOKEN = "<|Assistant|>"
+LATEST_REMINDER_SP_TOKEN = "<|latest_reminder|>"
+
+# Task special tokens for internal classification tasks
+DS_TASK_SP_TOKENS = {
+    "action": "<|action|>",
+    "query": "<|query|>",
+    "authority": "<|authority|>",
+    "domain": "<|domain|>",
+    "title": "<|title|>",
+    "read_url": "<|read_url|>",
+}
+VALID_TASKS = set(DS_TASK_SP_TOKENS.keys())
+
+# ============================================================
+# Templates
+# ============================================================
+
+system_msg_template: str = "{content}"
+user_msg_template: str = "{content}"
+latest_reminder_msg_template: str = "{content}"
+assistant_msg_template: str = "{reasoning}{content}{tool_calls}" + eos_token
+assistant_msg_wo_eos_template: str = "{reasoning}{content}{tool_calls}"
+thinking_template: str = "{reasoning_content}"
+
+response_format_template: str = (
+    "## Response Format:\n\nYou MUST strictly adhere to the following schema to reply:\n{schema}"
+)
+tool_call_template: str = (
+    '<{dsml_token}invoke name="{name}">\n{arguments}\n</{dsml_token}invoke>'
+)
+tool_calls_template = (
+    "<{dsml_token}{tc_block_name}>\n{tool_calls}\n</{dsml_token}{tc_block_name}>"
+)
+tool_calls_block_name: str = "tool_calls"
+
+tool_output_template: str = "{content}"
+
+REASONING_EFFORT_MAX = (
+    "Reasoning Effort: Absolute maximum with no shortcuts permitted.\n"
+    "You MUST be very thorough in your thinking and comprehensively decompose the problem to resolve the root cause, rigorously stress-testing your logic against all potential paths, edge cases, and adversarial scenarios.\n"
+    "Explicitly write out your entire deliberation process, documenting every intermediate step, considered alternative, and rejected hypothesis to ensure absolutely no assumption is left unchecked.\n\n"
+)
+
+TOOLS_TEMPLATE = """## Tools
+
+You have access to a set of tools to help answer the user's question.
You can invoke tools by writing a "<{dsml_token}tool_calls>" block like the following:
+
+<{dsml_token}tool_calls>
+<{dsml_token}invoke name="$TOOL_NAME">
+<{dsml_token}parameter name="$PARAMETER_NAME" string="true|false">$PARAMETER_VALUE</{dsml_token}parameter>
+...
+</{dsml_token}invoke>
+<{dsml_token}invoke name="$TOOL_NAME2">
+...
+</{dsml_token}invoke>
+</{dsml_token}tool_calls>
+
+String parameters should be specified as is and set `string="true"`. For all other types (numbers, booleans, arrays, objects), pass the value in JSON format and set `string="false"`.
+
+If thinking_mode is enabled (triggered by {thinking_start_token}), you MUST output your complete reasoning inside {thinking_start_token}...{thinking_end_token} BEFORE any tool calls or final response.
+
+Otherwise, output directly after {thinking_end_token} with tool calls or final response.
+
+### Available Tool Schemas
+
+{tool_schemas}
+
+You MUST strictly follow the above defined tool name and parameter schemas to invoke tool calls.
+"""
+
+# ============================================================
+# Utility Functions
+# ============================================================
+
+
+def to_json(value: Any) -> str:
+    """Serialize a value to JSON string."""
+    try:
+        return json.dumps(value, ensure_ascii=False)
+    except:
+        return json.dumps(value, ensure_ascii=True)
+
+
+def tools_from_openai_format(tools):
+    """Extract function definitions from OpenAI-format tool list."""
+    return [tool["function"] for tool in tools]
+
+
+def tool_calls_from_openai_format(tool_calls):
+    """Convert OpenAI-format tool calls to internal format."""
+    return [
+        {
+            "name": tool_call["function"]["name"],
+            "arguments": tool_call["function"]["arguments"],
+        }
+        for tool_call in tool_calls
+    ]
+
+
+def tool_calls_to_openai_format(tool_calls):
+    """Convert internal tool calls to OpenAI format."""
+    return [
+        {
+            "type": "function",
+            "function": {
+                "name": tool_call["name"],
+                "arguments": tool_call["arguments"],
+            },
+        }
+        for tool_call in tool_calls
+    ]
+
+
+def encode_arguments_to_dsml(tool_call: Dict[str, str]) -> str:
+    """
+    Encode tool call arguments into DSML parameter format.
+
+    Args:
+        tool_call: Dict with "name" and "arguments" (JSON string) keys.
+
+    Returns:
+        DSML-formatted parameter string.
+    """
+    p_dsml_template = '<{dsml_token}parameter name="{key}" string="{is_str}">{value}</{dsml_token}parameter>'
+    P_dsml_strs = []
+
+    try:
+        arguments = json.loads(tool_call["arguments"])
+    except Exception as err:
+        arguments = {"arguments": tool_call["arguments"]}
+
+    for k, v in arguments.items():
+        p_dsml_str = p_dsml_template.format(
+            dsml_token=dsml_token,
+            key=k,
+            is_str="true" if isinstance(v, str) else "false",
+            value=v if isinstance(v, str) else to_json(v),
+        )
+        P_dsml_strs.append(p_dsml_str)
+
+    return "\n".join(P_dsml_strs)
+
+
+def decode_dsml_to_arguments(
+    tool_name: str, tool_args: Dict[str, Tuple[str, str]]
+) -> Dict[str, str]:
+    """
+    Decode DSML parameters back to a tool call dict.
+
+    Args:
+        tool_name: Name of the tool.
+        tool_args: Dict mapping param_name -> (value, is_string_flag).
+
+    Returns:
+        Dict with "name" and "arguments" (JSON string) keys.
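+
+    Illustrative example (derived from the decoding logic below):
+        decode_dsml_to_arguments("get_weather", {"city": ("Paris", "true")})
+        -> {"name": "get_weather", "arguments": '{"city": "Paris"}'}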
+ """ + + def _decode_value(key: str, value: str, string: str): + if string == "true": + value = to_json(value) + return f"{to_json(key)}: {value}" + + tool_args_json = ( + "{" + + ", ".join( + [_decode_value(k, v, string=is_str) for k, (v, is_str) in tool_args.items()] + ) + + "}" + ) + return dict(name=tool_name, arguments=tool_args_json) + + +def render_tools(tools: List[Dict[str, Union[str, Dict[str, Any]]]]) -> str: + """ + Render tool schemas into the system prompt format. + + Args: + tools: List of tool schema dicts (each with name, description, parameters). + + Returns: + Formatted tools section string. + """ + tools_json = [to_json(t) for t in tools] + + return TOOLS_TEMPLATE.format( + tool_schemas="\n".join(tools_json), + dsml_token=dsml_token, + thinking_start_token=thinking_start_token, + thinking_end_token=thinking_end_token, + ) + + +def find_last_user_index(messages: List[Dict[str, Any]]) -> int: + """Find the index of the last user/developer message.""" + last_user_index = -1 + for idx in range(len(messages) - 1, -1, -1): + if messages[idx].get("role") in ["user", "developer"]: + last_user_index = idx + break + return last_user_index + + +# ============================================================ +# Message Rendering +# ============================================================ + + +def render_message( + index: int, + messages: List[Dict[str, Any]], + thinking_mode: str, + drop_thinking: bool = True, + reasoning_effort: Optional[str] = None, +) -> str: + """ + Render a single message at the given index into its encoded string form. + + This is the core function that converts each message in the conversation + into the DeepSeek-V4 format. + + Args: + index: Index of the message to render. + messages: Full list of messages in the conversation. + thinking_mode: Either "chat" or "thinking". + drop_thinking: Whether to drop reasoning content from earlier turns. + reasoning_effort: Optional reasoning effort level ("max", "high", or None). + + Returns: + Encoded string for this message. 
+ """ + assert 0 <= index < len(messages) + assert thinking_mode in [ + "chat", + "thinking", + ], f"Invalid thinking_mode `{thinking_mode}`" + + prompt = "" + msg = messages[index] + last_user_idx = find_last_user_index(messages) + + role = msg.get("role") + content = msg.get("content") + tools = msg.get("tools") + response_format = msg.get("response_format") + tool_calls = msg.get("tool_calls") + reasoning_content = msg.get("reasoning_content") + wo_eos = msg.get("wo_eos", False) + + if tools: + tools = tools_from_openai_format(tools) + if tool_calls: + tool_calls = tool_calls_from_openai_format(tool_calls) + + # Reasoning effort prefix (only at index 0 in thinking mode with max effort) + assert reasoning_effort in [ + "max", + None, + "high", + ], f"Invalid reasoning effort: {reasoning_effort}" + if index == 0 and thinking_mode == "thinking" and reasoning_effort == "max": + prompt += REASONING_EFFORT_MAX + + if role == "system": + prompt += system_msg_template.format(content=content or "") + if tools: + prompt += "\n\n" + render_tools(tools) + if response_format: + prompt += "\n\n" + response_format_template.format( + schema=to_json(response_format) + ) + + elif role == "developer": + assert content, f"Invalid message for role `{role}`: {msg}" + + content_developer = USER_SP_TOKEN + content_developer += content + + if tools: + content_developer += "\n\n" + render_tools(tools) + if response_format: + content_developer += "\n\n" + response_format_template.format( + schema=to_json(response_format) + ) + + prompt += user_msg_template.format(content=content_developer) + + elif role == "user": + prompt += USER_SP_TOKEN + + # Handle content blocks (tool results mixed with text) + content_blocks = msg.get("content_blocks") + if content_blocks: + parts = [] + for block in content_blocks: + block_type = block.get("type") + if block_type == "text": + parts.append(block.get("text", "")) + elif block_type == "tool_result": + tool_content = block.get("content", "") + if isinstance(tool_content, list): + text_parts = [] + for b in tool_content: + if b.get("type") == "text": + text_parts.append(b.get("text", "")) + else: + text_parts.append(f"[Unsupported {b.get('type')}]") + tool_content = "\n\n".join(text_parts) + parts.append(tool_output_template.format(content=tool_content)) + else: + parts.append(f"[Unsupported {block_type}]") + prompt += "\n\n".join(parts) + else: + prompt += content or "" + + elif role == "latest_reminder": + prompt += LATEST_REMINDER_SP_TOKEN + latest_reminder_msg_template.format( + content=content + ) + + elif role == "tool": + raise NotImplementedError( + "deepseek_v4 merges tool messages into user; please preprocess with merge_tool_messages()" + ) + + elif role == "assistant": + thinking_part = "" + tc_content = "" + + if tool_calls: + tc_list = [ + tool_call_template.format( + dsml_token=dsml_token, + name=tc.get("name"), + arguments=encode_arguments_to_dsml(tc), + ) + for tc in tool_calls + ] + tc_content += "\n\n" + tool_calls_template.format( + dsml_token=dsml_token, + tool_calls="\n".join(tc_list), + tc_block_name=tool_calls_block_name, + ) + + summary_content = content or "" + rc = reasoning_content or "" + + # Check if previous message has a task - if so, this is a task output (no thinking) + prev_has_task = index - 1 >= 0 and messages[index - 1].get("task") is not None + + if thinking_mode == "thinking" and not prev_has_task: + if not drop_thinking or index > last_user_idx: + thinking_part = ( + thinking_template.format(reasoning_content=rc) + thinking_end_token + ) 
+ else: + thinking_part = "" + + if wo_eos: + prompt += assistant_msg_wo_eos_template.format( + reasoning=thinking_part, + content=summary_content, + tool_calls=tc_content, + ) + else: + prompt += assistant_msg_template.format( + reasoning=thinking_part, + content=summary_content, + tool_calls=tc_content, + ) + else: + raise NotImplementedError(f"Unknown role: {role}") + + # Append transition tokens based on what follows + if index + 1 < len(messages) and messages[index + 1].get("role") not in [ + "assistant", + "latest_reminder", + ]: + return prompt + + task = messages[index].get("task") + if task is not None: + # Task special token for internal classification tasks + assert ( + task in VALID_TASKS + ), f"Invalid task: '{task}'. Valid tasks are: {list(VALID_TASKS)}" + task_sp_token = DS_TASK_SP_TOKENS[task] + + if task != "action": + # Non-action tasks: append task sp token directly after the message + prompt += task_sp_token + else: + # Action task: append Assistant + thinking token + action sp token + prompt += ASSISTANT_SP_TOKEN + prompt += ( + thinking_end_token + if thinking_mode != "thinking" + else thinking_start_token + ) + prompt += task_sp_token + + elif messages[index].get("role") in ["user", "developer"]: + # Normal generation: append Assistant + thinking token + prompt += ASSISTANT_SP_TOKEN + if not drop_thinking and thinking_mode == "thinking": + prompt += thinking_start_token + elif drop_thinking and thinking_mode == "thinking" and index >= last_user_idx: + prompt += thinking_start_token + else: + prompt += thinking_end_token + + return prompt + + +# ============================================================ +# Preprocessing +# ============================================================ + + +def merge_tool_messages(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]: + """ + Merge tool messages into the preceding user message using content_blocks format. + + DeepSeek-V4 does not have a standalone "tool" role; instead, tool results + are encoded as blocks within user messages. + + This function converts a standard OpenAI-format conversation (with separate + "tool" role messages) into V4 format where tool results are merged into + user messages. + + Args: + messages: List of message dicts in OpenAI format. + + Returns: + Processed message list with tool messages merged into user messages. + """ + merged: List[Dict[str, Any]] = [] + + for msg in messages: + msg = copy.deepcopy(msg) + role = msg.get("role") + + if role == "tool": + # Convert tool message to a user message with tool_result block + tool_block = { + "type": "tool_result", + "tool_use_id": msg.get("tool_call_id", ""), + "content": msg.get("content", ""), + } + # Merge into previous message if it's already a user (merged tool) + if ( + merged + and merged[-1].get("role") == "user" + and "content_blocks" in merged[-1] + ): + merged[-1]["content_blocks"].append(tool_block) + else: + merged.append( + { + "role": "user", + "content_blocks": [tool_block], + } + ) + elif role == "user": + text_block = {"type": "text", "text": msg.get("content", "")} + if ( + merged + and merged[-1].get("role") == "user" + and "content_blocks" in merged[-1] + and merged[-1].get("task") is None + ): + merged[-1]["content_blocks"].append(text_block) + else: + new_msg = { + "role": "user", + "content": msg.get("content", ""), + "content_blocks": [text_block], + } + # Preserve extra fields (task, wo_eos, mask, etc.) 
+ for key in ("task", "wo_eos", "mask"): + if key in msg: + new_msg[key] = msg[key] + merged.append(new_msg) + else: + merged.append(msg) + + return merged + + +def sort_tool_results_by_call_order( + messages: List[Dict[str, Any]] +) -> List[Dict[str, Any]]: + """ + Sort tool_result blocks within user messages by the order of tool_calls + in the preceding assistant message. + + Args: + messages: Preprocessed message list (after merge_tool_messages). + + Returns: + Message list with sorted tool result blocks. + """ + last_tool_call_order: Dict[str, int] = {} + + for msg in messages: + role = msg.get("role") + if role == "assistant" and msg.get("tool_calls"): + last_tool_call_order = {} + for idx, tc in enumerate(msg["tool_calls"]): + tc_id = tc.get("id") or tc.get("function", {}).get("id", "") + if tc_id: + last_tool_call_order[tc_id] = idx + + elif role == "user" and msg.get("content_blocks"): + tool_blocks = [ + b for b in msg["content_blocks"] if b.get("type") == "tool_result" + ] + if len(tool_blocks) > 1 and last_tool_call_order: + sorted_blocks = sorted( + tool_blocks, + key=lambda b: last_tool_call_order.get(b.get("tool_use_id", ""), 0), + ) + sorted_idx = 0 + new_blocks = [] + for block in msg["content_blocks"]: + if block.get("type") == "tool_result": + new_blocks.append(sorted_blocks[sorted_idx]) + sorted_idx += 1 + else: + new_blocks.append(block) + msg["content_blocks"] = new_blocks + + return messages + + +# ============================================================ +# Main Encoding Function +# ============================================================ + + +def encode_messages( + messages: List[Dict[str, Any]], + thinking_mode: str, + context: Optional[List[Dict[str, Any]]] = None, + drop_thinking: bool = True, + add_default_bos_token: bool = True, + reasoning_effort: Optional[str] = None, +) -> str: + """ + Encode a list of messages into the DeepSeek-V4 prompt format. + + This is the main entry point for encoding conversations. It handles: + - BOS token insertion + - Thinking mode with optional reasoning content dropping + - Tool message merging into user messages + - Multi-turn conversation context + + Args: + messages: List of message dicts to encode. + thinking_mode: Either "chat" or "thinking". + context: Optional preceding context messages (already encoded prefix). + drop_thinking: If True, drop reasoning_content from earlier assistant turns + (only keep reasoning for messages after the last user message). + add_default_bos_token: Whether to prepend BOS token at conversation start. + reasoning_effort: Optional reasoning effort level ("max", "high", or None). + + Returns: + The encoded prompt string. 
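+
+    Illustrative example (single user turn, chat mode):
+        encode_messages([{"role": "user", "content": "Hi"}], "chat")
+        -> "<|begin▁of▁sentence|><|User|>Hi<|Assistant|></think>"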
+ """ + context = context if context else [] + + # Preprocess: merge tool messages and sort tool results + messages = merge_tool_messages(messages) + messages = sort_tool_results_by_call_order(context + messages)[len(context) :] + if context: + context = merge_tool_messages(context) + context = sort_tool_results_by_call_order(context) + + full_messages = context + messages + + prompt = bos_token if add_default_bos_token and len(context) == 0 else "" + + # Resolve drop_thinking: if any message has tools defined, don't drop thinking + effective_drop_thinking = drop_thinking + if any(m.get("tools") for m in full_messages): + effective_drop_thinking = False + + if thinking_mode == "thinking" and effective_drop_thinking: + full_messages = _drop_thinking_messages(full_messages) + # After dropping, recalculate how many messages to render + # (context may have shrunk too) + num_to_render = len(full_messages) - len(_drop_thinking_messages(context)) + context_len = len(full_messages) - num_to_render + else: + num_to_render = len(messages) + context_len = len(context) + + for idx in range(num_to_render): + prompt += render_message( + idx + context_len, + full_messages, + thinking_mode=thinking_mode, + drop_thinking=effective_drop_thinking, + reasoning_effort=reasoning_effort, + ) + + return prompt + + +def _drop_thinking_messages(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]: + """ + Drop reasoning_content and non-essential messages before the last user message. + + Behavior: + - Messages with role in ["user", "system", "tool", "latest_reminder"] are always kept. + - Messages at or after the last user index are always kept. + - Assistant messages before the last user get reasoning_content removed. + - Developer messages before the last user are dropped entirely. + """ + last_user_idx = find_last_user_index(messages) + result = [] + keep_roles = {"user", "system", "tool", "latest_reminder", "direct_search_results"} + + for idx, msg in enumerate(messages): + role = msg.get("role") + if role in keep_roles or idx >= last_user_idx: + result.append(msg) + elif role == "assistant": + msg = copy.copy(msg) + msg.pop("reasoning_content", None) + result.append(msg) + # developer and other roles before last_user_idx are dropped + + return result + + +# ============================================================ +# Parsing (Decoding model output) +# ============================================================ + + +def _read_until_stop( + index: int, text: str, stop: List[str] +) -> Tuple[int, str, Optional[str]]: + """ + Read text from index until one of the stop strings is found. + + Returns: + Tuple of (new_index, content_before_stop, matched_stop_string_or_None). + """ + min_pos = len(text) + matched_stop = None + + for s in stop: + pos = text.find(s, index) + if pos != -1 and pos < min_pos: + min_pos = pos + matched_stop = s + + if matched_stop: + content = text[index:min_pos] + return min_pos + len(matched_stop), content, matched_stop + else: + content = text[index:] + return len(text), content, None + + +def parse_tool_calls( + index: int, text: str +) -> Tuple[int, Optional[str], List[Dict[str, str]]]: + """ + Parse DSML tool calls from text starting at the given index. + + Args: + index: Starting position in text. + text: The full text to parse. + + Returns: + Tuple of (new_index, last_stop_token, list_of_tool_call_dicts). + Each tool call dict has "name" and "arguments" keys. 
+ """ + tool_calls: List[Dict[str, Any]] = [] + stop_token = None + tool_calls_end_token = f"" + + while index < len(text): + index, _, stop_token = _read_until_stop( + index, text, [f"<{dsml_token}invoke", tool_calls_end_token] + ) + if _ != ">\n": + raise ValueError(f"Tool call format error: expected '>\\n' but got '{_}'") + + if stop_token == tool_calls_end_token: + break + + if stop_token is None: + raise ValueError("Missing special token in tool calls") + + index, tool_name_content, stop_token = _read_until_stop( + index, text, [f"<{dsml_token}parameter", f"\n$', tool_name_content, flags=re.DOTALL + ) + if len(p_tool_name) != 1: + raise ValueError(f"Tool name format error: '{tool_name_content}'") + tool_name = p_tool_name[0] + + tool_args: Dict[str, Tuple[str, str]] = {} + while stop_token == f"<{dsml_token}parameter": + index, param_content, stop_token = _read_until_stop( + index, text, [f"/{dsml_token}parameter"] + ) + + param_kv = re.findall( + r'^ name="(.*?)" string="(true|false)">(.*?)<$', + param_content, + flags=re.DOTALL, + ) + if len(param_kv) != 1: + raise ValueError(f"Parameter format error: '{param_content}'") + param_name, string, param_value = param_kv[0] + + if param_name in tool_args: + raise ValueError(f"Duplicate parameter name: '{param_name}'") + tool_args[param_name] = (param_value, string) + + index, content, stop_token = _read_until_stop( + index, text, [f"<{dsml_token}parameter", f"\n": + raise ValueError( + f"Parameter format error: expected '>\\n' but got '{content}'" + ) + + tool_call = decode_dsml_to_arguments(tool_name=tool_name, tool_args=tool_args) + tool_calls.append(tool_call) + + return index, stop_token, tool_calls + + +def parse_message_from_completion_text(text: str, thinking_mode: str) -> Dict[str, Any]: + """ + Parse a model completion text into a structured assistant message. + + This function takes the raw text output from the model (a single assistant turn) + and extracts: + - reasoning_content (thinking block) + - content (summary/response) + - tool_calls (if any) + + NOTE: This function is designed to parse only correctly formatted strings and + will raise ValueError for malformed output. + + Args: + text: The raw completion text (including EOS token). + thinking_mode: Either "chat" or "thinking". + + Returns: + Dict with keys: "role", "content", "reasoning_content", "tool_calls". + tool_calls are in OpenAI format. 
+ """ + summary_content, reasoning_content, tool_calls = "", "", [] + index, stop_token = 0, None + tool_calls_start_token = f"\n\n<{dsml_token}{tool_calls_block_name}" + + is_thinking = thinking_mode == "thinking" + is_tool_calling = False + + if is_thinking: + index, content_delta, stop_token = _read_until_stop( + index, text, [thinking_end_token, tool_calls_start_token] + ) + reasoning_content = content_delta + assert ( + stop_token == thinking_end_token + ), "Invalid thinking format: missing " + + index, content_delta, stop_token = _read_until_stop( + index, text, [eos_token, tool_calls_start_token] + ) + summary_content = content_delta + if stop_token == tool_calls_start_token: + is_tool_calling = True + else: + assert stop_token == eos_token, "Invalid format: missing EOS token" + + if is_tool_calling: + index, stop_token, tool_calls = parse_tool_calls(index, text) + + index, tool_ends_text, stop_token = _read_until_stop(index, text, [eos_token]) + assert not tool_ends_text, "Unexpected content after tool calls" + + assert len(text) == index and stop_token in [ + eos_token, + None, + ], "Unexpected content at end" + + for sp_token in [ + bos_token, + eos_token, + thinking_start_token, + thinking_end_token, + dsml_token, + ]: + assert ( + sp_token not in summary_content and sp_token not in reasoning_content + ), f"Unexpected special token '{sp_token}' in content" + + return { + "role": "assistant", + "content": summary_content, + "reasoning_content": reasoning_content, + "tool_calls": tool_calls_to_openai_format(tool_calls), + } diff --git a/src/srtctl/benchmarks/scripts/sa-bench/tokenizers/sglang_deepseek_v4.py b/src/srtctl/benchmarks/scripts/sa-bench/tokenizers/sglang_deepseek_v4.py new file mode 100644 index 00000000..595e7b2f --- /dev/null +++ b/src/srtctl/benchmarks/scripts/sa-bench/tokenizers/sglang_deepseek_v4.py @@ -0,0 +1,125 @@ + +# SPDX-License-Identifier: Apache-2.0 +""" +SGLang-side DeepSeek-V4 tokenizer for sa-bench. + +Mirrors what sglang's ``serving_chat._apply_jinja_template`` does +when ``chat_encoding_spec == "dsv4"`` (see +sgl-project/sglang PR #23600), so that the tokens counted on the +sa-bench client side match the tokens the sglang server actually +feeds into the model. + +The vllm counterpart lives in ``vllm.tokenizers.deepseek_v4``; sglang +has no equivalent client-side package, so we vendor the rendering +logic from ``encoding_dsv4.py`` in ``_sglang_encoding_dsv4.py``. +""" +from __future__ import annotations + +from typing import Any, Dict, List, Optional + +from transformers import AutoTokenizer + +from ._sglang_encoding_dsv4 import encode_messages as _encode_messages + + +class SGLangDeepseekV4Tokenizer: + """Client-side DeepSeek-V4 tokenizer matching sglang server behavior. + + The server-side call chain (sglang PR #23600) is: + + messages = request.messages # OpenAI-style + if messages[0]["role"] != "system": + messages.insert(0, {"role": "system", "content": ""}) + real_input = encoding_dsv4.encode_messages( + messages, + thinking_mode="chat", # default + reasoning_effort=None, # "medium" dropped + ) + prompt_ids = tokenizer.encode(real_input) + + We reproduce the exact same steps here. 
+ """ + + def __init__(self, hf_tokenizer): + self._hf = hf_tokenizer + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path: str, **kwargs): + kwargs.setdefault("trust_remote_code", True) + hf = AutoTokenizer.from_pretrained( + pretrained_model_name_or_path, **kwargs + ) + return cls(hf) + + def _render_prompt( + self, + messages: List[Dict[str, Any]], + thinking_mode: str = "chat", + reasoning_effort: Optional[str] = None, + ) -> str: + msgs = [dict(m) for m in messages] + if not msgs or msgs[0].get("role") != "system": + msgs.insert(0, {"role": "system", "content": ""}) + + if reasoning_effort not in ("max", "high"): + reasoning_effort = None + + return _encode_messages( + msgs, + thinking_mode=thinking_mode, + reasoning_effort=reasoning_effort, + ) + + def apply_chat_template( + self, + messages: List[Dict[str, Any]], + tokenize: bool = True, + add_generation_prompt: bool = True, # noqa: ARG002 (encoder always adds the <|Assistant|>... tail) + tools: Optional[List[Dict[str, Any]]] = None, + thinking: bool = False, + reasoning_effort: Optional[str] = None, + **_: Any, + ): + msgs = [dict(m) for m in messages] + if tools: + if not msgs or msgs[0].get("role") != "system": + msgs.insert(0, {"role": "system", "content": ""}) + msgs[0]["tools"] = list(tools) + + thinking_mode = "thinking" if thinking else "chat" + prompt = self._render_prompt( + msgs, + thinking_mode=thinking_mode, + reasoning_effort=reasoning_effort, + ) + if not tokenize: + return prompt + return self._hf.encode(prompt, add_special_tokens=False) + + def encode(self, text, **kwargs): + return self._hf.encode(text, **kwargs) + + def decode(self, token_ids, **kwargs): + return self._hf.decode(token_ids, **kwargs) + + def __len__(self): + return len(self._hf) + + @property + def vocab_size(self): + return self._hf.vocab_size + + @property + def eos_token_id(self): + return self._hf.eos_token_id + + @property + def bos_token_id(self): + return self._hf.bos_token_id + + @property + def pad_token_id(self): + return self._hf.pad_token_id + + def __getattr__(self, name): + return getattr(self._hf, name) diff --git a/src/srtctl/cli/do_sweep.py b/src/srtctl/cli/do_sweep.py index ff6eaa91..77b79ac5 100644 --- a/src/srtctl/cli/do_sweep.py +++ b/src/srtctl/cli/do_sweep.py @@ -18,6 +18,7 @@ import os import sys import threading +import time from dataclasses import dataclass from pathlib import Path @@ -179,6 +180,118 @@ def _print_connection_info(self) -> None: logger.info("=" * 60) logger.info("") + def _run_post_eval(self, stop_event: threading.Event) -> int: + """Run lm-eval after the main benchmark completes (or directly in eval-only mode).""" + from srtctl.benchmarks import get_runner + from srtctl.core.health import wait_for_model + + # In eval-only mode the benchmark health check was skipped, so do the + # full model-ready wait here. In post-benchmark mode a quick port + # check is sufficient since the server already served traffic. 
+ if os.environ.get("EVAL_ONLY", "false").lower() == "true": + r = self.config.resources + n_prefill = 0 if r.num_agg > 0 else r.num_prefill + n_decode = r.num_agg if r.num_agg > 0 else r.num_decode + hc = self.config.health_check + logger.info("EVAL_ONLY: Waiting for server health before eval...") + if not wait_for_model( + host=self.runtime.nodes.head, + port=8000, + n_prefill=n_prefill, + n_decode=n_decode, + poll_interval=float(hc.interval_seconds), + timeout=float(hc.max_attempts * hc.interval_seconds), + report_every=60.0, + frontend_type=self.config.frontend.type, + stop_event=stop_event, + ): + logger.error("Server did not become healthy for eval") + return 1 + else: + if not wait_for_port(self.runtime.nodes.head, 8000, timeout=30): + logger.error("Server health check failed before eval - skipping") + return 1 + + try: + runner = get_runner("lm-eval") + except ValueError as e: + logger.error("lm-eval runner not available: %s", e) + return 1 + + eval_log = self.runtime.log_dir / "eval.out" + cmd = runner.build_command(self.config, self.runtime) + + logger.info("Eval command: %s", " ".join(cmd)) + logger.info("Eval log: %s", eval_log) + + # Pass through eval-related env vars. InferenceX writes multi-node + # metadata from these variables in append_lm_eval_summary(). + env_to_set = {} + for var in [ + "RUN_EVAL", + "EVAL_ONLY", + "IS_MULTINODE", + "FRAMEWORK", + "PRECISION", + "MODEL_PREFIX", + "RUNNER_TYPE", + "RESULT_FILENAME", + "SPEC_DECODING", + "ISL", + "OSL", + "MODEL", + "MODEL_PATH", + "MAX_MODEL_LEN", + "EVAL_MAX_MODEL_LEN", + "PREFILL_TP", + "PREFILL_EP", + "PREFILL_DP_ATTN", + "PREFILL_NUM_WORKERS", + "DECODE_TP", + "DECODE_EP", + "DECODE_DP_ATTN", + "DECODE_NUM_WORKERS", + ]: + val = os.environ.get(var) + if val: + env_to_set[var] = val + + # Set MODEL_NAME to the served model name so lm-eval uses the correct + # name for API requests. Without this, benchmark_lib.sh falls back to + # $MODEL (the HuggingFace ID) which the server doesn't recognize. + env_to_set["MODEL_NAME"] = self.config.served_model_name + logger.info("Eval MODEL_NAME: %s", env_to_set["MODEL_NAME"]) + + # Use EVAL_CONC from workflow (median chosen by InferenceX mark_eval_entries), + # falling back to max of benchmark concurrency list. 
+ eval_conc = os.environ.get("EVAL_CONC") + if eval_conc: + env_to_set["EVAL_CONC"] = eval_conc + logger.info("Eval concurrency (from workflow): %s", eval_conc) + else: + conc_list = self.config.benchmark.get_concurrency_list() + if conc_list: + env_to_set["EVAL_CONC"] = str(max(conc_list)) + logger.info("Eval concurrency (max of %s): %s", conc_list, env_to_set["EVAL_CONC"]) + + proc = start_srun_process( + command=cmd, + nodelist=[self.runtime.nodes.head], + output=str(eval_log), + container_image=str(self.runtime.container_image), + container_mounts=self.runtime.container_mounts, + env_to_set=env_to_set, + ) + + while proc.poll() is None: + if stop_event.is_set(): + logger.info("Stop requested, terminating eval") + proc.terminate() + return 1 + time.sleep(1) + + return proc.returncode or 0 + def run(self) -> int: """Run the complete sweep.""" # Create status reporter (fire-and-forget, no-op if not configured) @@ -221,8 +334,27 @@ def run(self) -> int: self._print_connection_info() - # Stage 4: Benchmark (status reported AFTER health check passes) - exit_code = self.run_benchmark(registry, stop_event, reporter) + if os.environ.get("EVAL_ONLY", "false").lower() == "true": + reporter.report(JobStatus.BENCHMARK, JobStage.BENCHMARK, "Running eval-only evaluation") + logger.info("EVAL_ONLY=true: Skipping benchmark stage and running lm-eval evaluation...") + exit_code = self._run_post_eval(stop_event) + if exit_code != 0: + logger.error("Eval-only evaluation failed with exit code %d", exit_code) + else: + logger.info("Eval-only evaluation completed successfully") + else: + # Stage 4: Benchmark (status reported AFTER health check passes) + exit_code = self.run_benchmark(registry, stop_event, reporter) + + # Stage 5: Post-benchmark eval (optional, non-fatal) + if os.environ.get("RUN_EVAL", "false").lower() == "true" and exit_code == 0: + reporter.report(JobStatus.BENCHMARK, JobStage.BENCHMARK, "Running post-benchmark evaluation") + logger.info("RUN_EVAL=true: Running post-benchmark lm-eval evaluation...") + eval_exit = self._run_post_eval(stop_event) + if eval_exit != 0: + logger.warning("Eval failed with exit code %d (benchmark result is still valid)", eval_exit) + else: + logger.info("Post-benchmark eval completed successfully") except Exception as e: logger.exception("Error during sweep: %s", e) diff --git a/src/srtctl/core/config.py b/src/srtctl/core/config.py index 8cea4e17..f30fc7fc 100644 --- a/src/srtctl/core/config.py +++ b/src/srtctl/core/config.py @@ -141,6 +141,20 @@ def resolve_config_with_defaults(user_config: dict[str, Any], cluster_config: di config["reporting"] = cluster_config["reporting"] logger.debug("Applied cluster reporting config") + # Resolve extra_mount host path aliases through model_paths + extra_mounts = config.get("extra_mount", []) + if model_paths and extra_mounts: + resolved_mounts = [] + for mount_spec in extra_mounts: + host_path, container_path = mount_spec.split(":", 1) + if host_path in model_paths: + resolved_host = model_paths[host_path] + resolved_mounts.append(f"{resolved_host}:{container_path}") + logger.debug(f"Resolved extra_mount alias '{host_path}' -> '{resolved_host}'") + else: + resolved_mounts.append(mount_spec) + config["extra_mount"] = resolved_mounts + # Resolve frontend nginx_container alias frontend = config.get("frontend", {}) nginx_container = frontend.get("nginx_container", "") diff --git a/src/srtctl/core/runtime.py b/src/srtctl/core/runtime.py index 3e68bdd5..31195ed3 100644 --- a/src/srtctl/core/runtime.py +++ 
b/src/srtctl/core/runtime.py @@ -231,6 +231,14 @@ def from_config( host_path, container_path = mount_spec.split(":", 1) container_mounts[Path(host_path).resolve()] = Path(container_path) + # Mount InferenceX workspace if available (for lm-eval support). + # Skip exists() check: the orchestrator runs on the SLURM head node + # where the GH Actions workspace path may not be directly accessible, + # but it IS accessible from compute nodes via shared filesystem. + infmax_ws = os.environ.get("INFMAX_WORKSPACE") + if infmax_ws: + container_mounts[Path(infmax_ws)] = Path("/infmax-workspace") + # Add FormattablePath mounts from config.container_mounts # These need to be expanded with the runtime context, so we create a # temporary context first and then update diff --git a/src/srtctl/core/schema.py b/src/srtctl/core/schema.py index 97547fec..c535be39 100644 --- a/src/srtctl/core/schema.py +++ b/src/srtctl/core/schema.py @@ -539,6 +539,12 @@ class BenchmarkConfig: ttft_threshold_ms: int | None = None # Goodput TTFT threshold in ms (default: 2000) itl_threshold_ms: int | None = None # Goodput ITL threshold in ms (default: 25) random_range_ratio: float | None = None # Random input/output length range ratio (default: 0.8) + num_prompts_mult: int | None = None # Multiplier for num_prompts = concurrency * mult (default: 10) + num_warmup_mult: int | None = None # Multiplier for warmup prompts = concurrency * mult (default: 2) + # Trace replay benchmark fields (uses aiperf with mooncake_trace dataset type) + trace_file: str | None = None # Path to trace JSONL file (container path, e.g., /traces/dataset.jsonl) + custom_tokenizer: str | None = None # Custom tokenizer class (e.g., "module.path.ClassName") + use_chat_template: bool = True # Pass --use-chat-template to benchmark (default: true) def get_concurrency_list(self) -> list[int]: if self.concurrencies is None: @@ -711,7 +717,7 @@ def get_install_commands(self) -> str: if self.version is not None: return ( f"echo 'Installing dynamo {self.version}...' && " - f"pip install --break-system-packages --quiet ai-dynamo-runtime=={self.version} ai-dynamo=={self.version} && " + f"pip install --break-system-packages --quiet --extra-index-url https://pypi.nvidia.com ai-dynamo-runtime=={self.version} ai-dynamo=={self.version} && " f"echo 'Dynamo {self.version} installed'" ) @@ -719,8 +725,8 @@ def get_install_commands(self) -> str: git_ref = self.hash if self.hash else "HEAD" checkout_cmd = f"git checkout {self.hash}" if self.hash else "" - return ( - f"echo 'Installing dynamo from source ({git_ref})...' && " + # Original SGLang container path, UNCHANGED + sglang = ( "apt-get update -qq && apt-get install -y -qq libclang-dev > /dev/null 2>&1 && " "cd /sgl-workspace/ && " "git clone https://github.com/ai-dynamo/dynamo.git && " @@ -736,6 +742,34 @@ def get_install_commands(self) -> str: f"echo 'Dynamo installed from source ({git_ref})'" ) + # Portable path for non-SGLang containers (vLLM, etc.) + portable = ( + "if ! command -v cargo &> /dev/null || ! command -v maturin &> /dev/null; then " + "apt-get update -qq && apt-get install -y -qq git curl libclang-dev protobuf-compiler > /dev/null 2>&1 && " + "if ! command -v cargo &> /dev/null; then " + "curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y && source $HOME/.cargo/env; fi && " + "if ! 
command -v maturin &> /dev/null; then " + "pip install --break-system-packages maturin; fi; fi && " + "ORIG_DIR=$(pwd) && rm -rf /tmp/dynamo_build && mkdir -p /tmp/dynamo_build && cd /tmp/dynamo_build && " + "git clone https://github.com/ai-dynamo/dynamo.git && " + "cd dynamo && " + f"{checkout_cmd + ' && ' if checkout_cmd else ''}" + "cd lib/bindings/python/ && " + 'export RUSTFLAGS="${RUSTFLAGS:-} -C target-cpu=native --cfg tokio_unstable" && ' + "rm -f /tmp/ai_dynamo_runtime*.whl && " + "maturin build -o /tmp && " + "pip install --break-system-packages /tmp/ai_dynamo_runtime*.whl --force-reinstall && " + "cd /tmp/dynamo_build/dynamo/ && " + "pip install --break-system-packages -e . && " + "cd $ORIG_DIR && " + f"echo 'Dynamo installed from source ({git_ref})'" + ) + + return ( + f"echo 'Installing dynamo from source ({git_ref})...' && " + f"if [ -d /sgl-workspace ]; then {sglang}; else {portable}; fi" + ) + Schema: ClassVar[type[Schema]] = Schema diff --git a/tests/test_benchmarks.py b/tests/test_benchmarks.py index 261020c7..c15759b2 100644 --- a/tests/test_benchmarks.py +++ b/tests/test_benchmarks.py @@ -193,6 +193,62 @@ def test_build_command_includes_tokenizer_path(self): assert cmd[7] == "/model" # tokenizer path +class TestLMEvalRunner: + """Test LM-Eval runner.""" + + def test_registry_includes_lm_eval(self): + """lm-eval is in the benchmark registry.""" + assert "lm-eval" in list_benchmarks() + + def test_get_runner(self): + """Can get lm-eval runner.""" + runner = get_runner("lm-eval") + assert runner.name == "lm-eval" + + def test_script_path(self): + """Script path points to lm-eval bench.sh.""" + runner = get_runner("lm-eval") + assert "lm-eval/bench.sh" in runner.script_path + + def test_local_script_dir(self): + """Local script dir points to lm-eval scripts.""" + runner = get_runner("lm-eval") + assert runner.local_script_dir.endswith("lm-eval") + + def test_validate_config_always_valid(self): + """lm-eval accepts any config.""" + from srtctl.benchmarks.lm_eval import LMEvalRunner + from srtctl.core.schema import BenchmarkConfig, ModelConfig, ResourceConfig, SrtConfig + + runner = LMEvalRunner() + config = SrtConfig( + name="test", + model=ModelConfig(path="/model", container="/image", precision="fp4"), + resources=ResourceConfig(gpu_type="h100"), + benchmark=BenchmarkConfig(type="sa-bench"), + ) + assert runner.validate_config(config) == [] + + def test_build_command(self): + """build_command returns correct bash command.""" + from unittest.mock import MagicMock + + from srtctl.benchmarks.lm_eval import LMEvalRunner + + runner = LMEvalRunner() + runtime = MagicMock() + runtime.frontend_port = 8000 + + config = MagicMock() + cmd = runner.build_command(config, runtime) + assert cmd == [ + "bash", + "/srtctl-benchmarks/lm-eval/bench.sh", + "http://localhost:8000", + "/infmax-workspace", + ] + + class TestScriptsExist: """Test that benchmark scripts exist.""" @@ -209,3 +265,365 @@ def test_mmlu_script_exists(self): """MMLU script exists.""" script = SCRIPTS_DIR / "mmlu" / "bench.sh" assert script.exists() + + +class TestRunPostEval: + """Test SweepOrchestrator._run_post_eval method.""" + + @staticmethod + def _make_orchestrator(): + """Create a SweepOrchestrator with mocked config/runtime.""" + from pathlib import Path + + from srtctl.cli.do_sweep import SweepOrchestrator + from srtctl.core.runtime import Nodes, RuntimeContext + from srtctl.core.schema import ( + BenchmarkConfig, + FrontendConfig, + HealthCheckConfig, + ModelConfig, + ResourceConfig, + SrtConfig, + ) + + 
config = SrtConfig( + name="test", + model=ModelConfig(path="/model/test-model", container="/image", precision="fp4"), + resources=ResourceConfig( + gpu_type="h100", + gpus_per_node=8, + prefill_nodes=1, + decode_nodes=2, + prefill_workers=1, + decode_workers=2, + ), + benchmark=BenchmarkConfig(type="sa-bench", isl=1024, osl=1024, concurrencies="128x256x512"), + health_check=HealthCheckConfig(max_attempts=3, interval_seconds=1), + frontend=FrontendConfig(type="dynamo"), + ) + runtime = RuntimeContext( + job_id="12345", + run_name="test-run", + nodes=Nodes(head="node0", bench="node0", infra="node0", worker=("node0", "node1", "node2")), + head_node_ip="10.0.0.1", + infra_node_ip="10.0.0.1", + log_dir=Path("/tmp/logs"), + model_path=Path("/model/test-model"), + container_image=Path("/path/to/container.sqsh"), + gpus_per_node=8, + network_interface=None, + container_mounts={}, + environment={}, + ) + return SweepOrchestrator(config=config, runtime=runtime) + + def test_post_benchmark_port_check_fails(self): + """Returns 1 when port check fails in post-benchmark mode.""" + import os + import threading + from unittest.mock import patch + + orch = self._make_orchestrator() + stop = threading.Event() + with patch.dict(os.environ, {"EVAL_ONLY": "false"}, clear=False): + with patch("srtctl.cli.do_sweep.wait_for_port", return_value=False): + result = orch._run_post_eval(stop) + assert result == 1 + + def test_eval_only_health_check_fails(self): + """Returns 1 when health check fails in eval-only mode.""" + import os + import threading + from unittest.mock import patch + + orch = self._make_orchestrator() + stop = threading.Event() + with patch.dict(os.environ, {"EVAL_ONLY": "true"}, clear=False): + with patch("srtctl.core.health.wait_for_model", return_value=False): + result = orch._run_post_eval(stop) + assert result == 1 + + def test_runner_not_available(self): + """Returns 1 when lm-eval runner is not registered.""" + import os + import threading + from unittest.mock import patch + + orch = self._make_orchestrator() + stop = threading.Event() + with patch.dict(os.environ, {"EVAL_ONLY": "false"}, clear=False): + with patch("srtctl.cli.do_sweep.wait_for_port", return_value=True): + with patch("srtctl.benchmarks.get_runner", side_effect=ValueError("not found")): + result = orch._run_post_eval(stop) + assert result == 1 + + def test_successful_eval(self): + """Returns 0 when eval completes successfully.""" + import os + import threading + from unittest.mock import MagicMock, patch + + orch = self._make_orchestrator() + stop = threading.Event() + + mock_proc = MagicMock() + mock_proc.poll.side_effect = [None, 0] + mock_proc.returncode = 0 + + with patch.dict(os.environ, {"EVAL_ONLY": "false"}, clear=False): + with patch("srtctl.cli.do_sweep.wait_for_port", return_value=True): + with patch("srtctl.cli.do_sweep.start_srun_process", return_value=mock_proc): + result = orch._run_post_eval(stop) + assert result == 0 + + def test_eval_only_successful(self): + """Returns 0 in eval-only mode when health check and eval succeed.""" + import os + import threading + from unittest.mock import MagicMock, patch + + orch = self._make_orchestrator() + stop = threading.Event() + + mock_proc = MagicMock() + mock_proc.poll.side_effect = [None, 0] + mock_proc.returncode = 0 + + with patch.dict(os.environ, {"EVAL_ONLY": "true"}, clear=False): + with patch("srtctl.core.health.wait_for_model", return_value=True): + with patch("srtctl.cli.do_sweep.start_srun_process", return_value=mock_proc): + result = 
orch._run_post_eval(stop) + assert result == 0 + + def test_env_var_passthrough(self): + """Eval env vars are passed through to srun.""" + import os + import threading + from unittest.mock import MagicMock, patch + + orch = self._make_orchestrator() + stop = threading.Event() + + mock_proc = MagicMock() + mock_proc.poll.return_value = 0 + mock_proc.returncode = 0 + + env_vars = { + "EVAL_ONLY": "false", + "RUN_EVAL": "true", + "FRAMEWORK": "sglang", + "PRECISION": "fp4", + "MODEL": "test-model", + } + + captured_kwargs = {} + + def capture_srun(**kwargs): + captured_kwargs.update(kwargs) + return mock_proc + + with patch.dict(os.environ, env_vars, clear=False): + with patch("srtctl.cli.do_sweep.wait_for_port", return_value=True): + with patch("srtctl.cli.do_sweep.start_srun_process", side_effect=capture_srun): + orch._run_post_eval(stop) + + env_to_set = captured_kwargs["env_to_set"] + assert env_to_set["RUN_EVAL"] == "true" + assert env_to_set["FRAMEWORK"] == "sglang" + assert env_to_set["PRECISION"] == "fp4" + assert env_to_set["MODEL"] == "test-model" + assert env_to_set["MODEL_NAME"] == "test-model" + + def test_eval_conc_from_env(self): + """EVAL_CONC from env takes priority over benchmark concurrencies.""" + import os + import threading + from unittest.mock import MagicMock, patch + + orch = self._make_orchestrator() + stop = threading.Event() + + mock_proc = MagicMock() + mock_proc.poll.return_value = 0 + mock_proc.returncode = 0 + + captured_kwargs = {} + + def capture_srun(**kwargs): + captured_kwargs.update(kwargs) + return mock_proc + + with patch.dict(os.environ, {"EVAL_ONLY": "false", "EVAL_CONC": "64"}, clear=False): + with patch("srtctl.cli.do_sweep.wait_for_port", return_value=True): + with patch("srtctl.cli.do_sweep.start_srun_process", side_effect=capture_srun): + orch._run_post_eval(stop) + + assert captured_kwargs["env_to_set"]["EVAL_CONC"] == "64" + + def test_eval_conc_fallback_to_max_concurrency(self): + """EVAL_CONC falls back to max of benchmark concurrencies.""" + import os + import threading + from unittest.mock import MagicMock, patch + + orch = self._make_orchestrator() + stop = threading.Event() + + mock_proc = MagicMock() + mock_proc.poll.return_value = 0 + mock_proc.returncode = 0 + + captured_kwargs = {} + + def capture_srun(**kwargs): + captured_kwargs.update(kwargs) + return mock_proc + + env = {"EVAL_ONLY": "false"} + # Remove EVAL_CONC if present + with patch.dict(os.environ, env, clear=False): + os.environ.pop("EVAL_CONC", None) + with patch("srtctl.cli.do_sweep.wait_for_port", return_value=True): + with patch("srtctl.cli.do_sweep.start_srun_process", side_effect=capture_srun): + orch._run_post_eval(stop) + + # concurrencies="128x256x512", max is 512 + assert captured_kwargs["env_to_set"]["EVAL_CONC"] == "512" + + def test_stop_event_terminates_eval(self): + """Stop event terminates the eval process.""" + import os + import threading + from unittest.mock import MagicMock, patch + + orch = self._make_orchestrator() + stop = threading.Event() + stop.set() + + mock_proc = MagicMock() + mock_proc.poll.return_value = None + + with patch.dict(os.environ, {"EVAL_ONLY": "false"}, clear=False): + with patch("srtctl.cli.do_sweep.wait_for_port", return_value=True): + with patch("srtctl.cli.do_sweep.start_srun_process", return_value=mock_proc): + result = orch._run_post_eval(stop) + + assert result == 1 + mock_proc.terminate.assert_called_once() + + +class TestSweepRunEvalIntegration: + """Test eval-related branches in SweepOrchestrator.run().""" + + @staticmethod 
+    def _make_orchestrator():
+        return TestRunPostEval._make_orchestrator()
+
+    def test_run_eval_only_mode(self):
+        """EVAL_ONLY=true skips benchmark and runs _run_post_eval."""
+        import os
+        from unittest.mock import MagicMock, patch
+
+        orch = self._make_orchestrator()
+
+        with patch.dict(os.environ, {"EVAL_ONLY": "true"}, clear=False):
+            with patch.object(orch, "start_head_infrastructure") as mock_head:
+                mock_head.return_value = MagicMock()
+                with patch.object(orch, "start_all_workers", return_value={}):
+                    with patch.object(orch, "start_frontend", return_value=[]):
+                        with patch.object(orch, "_run_post_eval", return_value=0) as mock_eval:
+                            with patch.object(orch, "run_benchmark") as mock_bench:
+                                with patch.object(orch, "run_postprocess"):
+                                    with patch("srtctl.cli.do_sweep.StatusReporter") as mock_reporter_cls:
+                                        mock_reporter_cls.from_config.return_value = MagicMock()
+                                        exit_code = orch.run()
+
+        mock_eval.assert_called_once()
+        mock_bench.assert_not_called()
+        assert exit_code == 0
+
+    def test_run_with_post_benchmark_eval(self):
+        """RUN_EVAL=true runs benchmark then _run_post_eval."""
+        import os
+        from unittest.mock import MagicMock, patch
+
+        orch = self._make_orchestrator()
+
+        with patch.dict(os.environ, {"EVAL_ONLY": "false", "RUN_EVAL": "true"}, clear=False):
+            with patch.object(orch, "start_head_infrastructure") as mock_head:
+                mock_head.return_value = MagicMock()
+                with patch.object(orch, "start_all_workers", return_value={}):
+                    with patch.object(orch, "start_frontend", return_value=[]):
+                        with patch.object(orch, "run_benchmark", return_value=0) as mock_bench:
+                            with patch.object(orch, "_run_post_eval", return_value=0) as mock_eval:
+                                with patch.object(orch, "run_postprocess"):
+                                    with patch("srtctl.cli.do_sweep.StatusReporter") as mock_reporter_cls:
+                                        mock_reporter_cls.from_config.return_value = MagicMock()
+                                        exit_code = orch.run()
+
+        mock_bench.assert_called_once()
+        mock_eval.assert_called_once()
+        assert exit_code == 0
+
+    def test_run_eval_only_failure(self):
+        """EVAL_ONLY=true with eval failure returns non-zero exit code."""
+        import os
+        from unittest.mock import MagicMock, patch
+
+        orch = self._make_orchestrator()
+
+        with patch.dict(os.environ, {"EVAL_ONLY": "true"}, clear=False):
+            with patch.object(orch, "start_head_infrastructure") as mock_head:
+                mock_head.return_value = MagicMock()
+                with patch.object(orch, "start_all_workers", return_value={}):
+                    with patch.object(orch, "start_frontend", return_value=[]):
+                        with patch.object(orch, "_run_post_eval", return_value=1):
+                            with patch.object(orch, "run_postprocess"):
+                                with patch("srtctl.cli.do_sweep.StatusReporter") as mock_reporter_cls:
+                                    mock_reporter_cls.from_config.return_value = MagicMock()
+                                    exit_code = orch.run()
+
+        assert exit_code == 1
+
+    def test_run_post_benchmark_eval_failure_nonfatal(self):
+        """RUN_EVAL=true with eval failure still returns benchmark exit code 0."""
+        import os
+        from unittest.mock import MagicMock, patch
+
+        orch = self._make_orchestrator()
+
+        with patch.dict(os.environ, {"EVAL_ONLY": "false", "RUN_EVAL": "true"}, clear=False):
+            with patch.object(orch, "start_head_infrastructure") as mock_head:
+                mock_head.return_value = MagicMock()
+                with patch.object(orch, "start_all_workers", return_value={}):
+                    with patch.object(orch, "start_frontend", return_value=[]):
+                        with patch.object(orch, "run_benchmark", return_value=0):
+                            with patch.object(orch, "_run_post_eval", return_value=1):
+                                with patch.object(orch, "run_postprocess"):
+                                    with patch("srtctl.cli.do_sweep.StatusReporter") as mock_reporter_cls:
+                                        mock_reporter_cls.from_config.return_value = MagicMock()
+                                        exit_code = orch.run()
+
+        assert exit_code == 0
+
+    def test_run_eval_skipped_when_benchmark_fails(self):
+        """RUN_EVAL=true but benchmark fails: eval is skipped."""
+        import os
+        from unittest.mock import MagicMock, patch
+
+        orch = self._make_orchestrator()
+
+        with patch.dict(os.environ, {"EVAL_ONLY": "false", "RUN_EVAL": "true"}, clear=False):
+            with patch.object(orch, "start_head_infrastructure") as mock_head:
+                mock_head.return_value = MagicMock()
+                with patch.object(orch, "start_all_workers", return_value={}):
+                    with patch.object(orch, "start_frontend", return_value=[]):
+                        with patch.object(orch, "run_benchmark", return_value=1):
+                            with patch.object(orch, "_run_post_eval") as mock_eval:
+                                with patch.object(orch, "run_postprocess"):
+                                    with patch("srtctl.cli.do_sweep.StatusReporter") as mock_reporter_cls:
+                                        mock_reporter_cls.from_config.return_value = MagicMock()
+                                        exit_code = orch.run()
+
+        mock_eval.assert_not_called()
+        assert exit_code == 1
diff --git a/tests/test_configs.py b/tests/test_configs.py
index 1c23fb30..0b4138d5 100644
--- a/tests/test_configs.py
+++ b/tests/test_configs.py
@@ -127,7 +127,11 @@ def test_hash_install_command(self):
         assert "git clone" in cmd
         assert "git checkout abc123" in cmd
         assert "maturin build" in cmd
-        assert "pip install -e" in cmd
+        assert "if [ -d /sgl-workspace ]" in cmd
+        assert "/tmp/dynamo_build" in cmd
+        assert "protobuf-compiler" in cmd
+        assert "if ! command -v cargo" in cmd
+        assert "if ! command -v maturin" in cmd
 
     def test_top_of_tree_install_command(self):
         """Top-of-tree config generates source install without checkout."""
@@ -140,6 +144,10 @@ def test_top_of_tree_install_command(self):
         assert "git clone" in cmd
         assert "git checkout" not in cmd
         assert "maturin build" in cmd
+        assert "if [ -d /sgl-workspace ]" in cmd
+        assert "/tmp/dynamo_build" in cmd
+        assert "--break-system-packages" in cmd
+        assert "--force-reinstall" in cmd
 
     def test_hash_and_top_of_tree_not_allowed(self):
         """Cannot specify both hash and top_of_tree."""
@@ -1072,6 +1080,8 @@ def test_standard_tp_mode_still_works(self):
 
     def test_vllm_get_process_environment(self):
         """Test vLLM sets port environment variables from process."""
+        from unittest.mock import patch
+
         from srtctl.backends import VLLMProtocol
         from srtctl.core.topology import Process
 
@@ -1090,10 +1100,12 @@ def test_vllm_get_process_environment(self):
             nixl_port=6550,
         )
 
-        env = backend.get_process_environment(process)
+        with patch("srtctl.core.slurm.get_hostname_ip", return_value="10.0.0.1"):
+            env = backend.get_process_environment(process)
 
         assert env["DYN_VLLM_KV_EVENT_PORT"] == "5550"
         assert env["VLLM_NIXL_SIDE_CHANNEL_PORT"] == "6550"
+        assert env["VLLM_NIXL_SIDE_CHANNEL_HOST"] == "10.0.0.1"
 
     def test_vllm_get_process_environment_none_ports(self):
         """Test vLLM handles None ports gracefully."""
@@ -1370,3 +1382,113 @@ def test_agg_mode_no_disaggregation_flag(self):
         assert "--disaggregation-mode" not in cmd
         assert "--is-prefill-worker" not in cmd
         assert "--is-decode-worker" not in cmd
+
+
+class TestInfmaxWorkspaceMount:
+    """Test that INFMAX_WORKSPACE env var creates a container mount."""
+
+    def test_infmax_workspace_mount_added(self, tmp_path):
+        """RuntimeContext includes /infmax-workspace mount when env var is set."""
+        import os
+        import subprocess
+        from pathlib import Path
+        from unittest.mock import MagicMock, patch
+
+        from srtctl.core.runtime import RuntimeContext
+        from srtctl.core.schema import ModelConfig, ResourceConfig, SrtConfig
+
+        model_path = tmp_path / "model"
+        model_path.mkdir()
+        container_path = tmp_path / "container.sqsh"
+        container_path.touch()
+
+        slurm_env = {
+            "SLURM_JOB_ID": "12345",
+            "SLURM_JOBID": "12345",
+            "SLURM_NODELIST": "gpu-[01-02]",
+            "SLURM_JOB_NUM_NODES": "2",
+            "SRTCTL_SOURCE_DIR": str(Path(__file__).parent.parent),
+            "INFMAX_WORKSPACE": "/actions/runner/workspace",
+        }
+
+        def mock_scontrol(cmd, **kwargs):
+            if cmd[0] == "scontrol" and "hostnames" in cmd:
+                result = MagicMock()
+                result.stdout = "gpu-01\ngpu-02"
+                result.returncode = 0
+                return result
+            raise subprocess.CalledProcessError(1, cmd)
+
+        with patch.dict(os.environ, slurm_env):
+            with patch("subprocess.run", mock_scontrol):
+                with patch("srtctl.core.slurm.get_hostname_ip", return_value="10.0.0.1"):
+                    config = SrtConfig(
+                        name="test",
+                        model=ModelConfig(
+                            path=str(model_path),
+                            container=str(container_path),
+                            precision="fp8",
+                        ),
+                        resources=ResourceConfig(
+                            gpu_type="h100",
+                            gpus_per_node=8,
+                            prefill_nodes=1,
+                            decode_nodes=1,
+                        ),
+                    )
+                    runtime = RuntimeContext.from_config(config, job_id="12345")
+
+        assert Path("/infmax-workspace") in runtime.container_mounts.values()
+
+    def test_infmax_workspace_mount_not_added_without_env(self, tmp_path):
+        """RuntimeContext does not include /infmax-workspace without env var."""
+        import os
+        import subprocess
+        from pathlib import Path
+        from unittest.mock import MagicMock, patch
+
+        from srtctl.core.runtime import RuntimeContext
+        from srtctl.core.schema import ModelConfig, ResourceConfig, SrtConfig
+
+        model_path = tmp_path / "model"
+        model_path.mkdir()
+        container_path = tmp_path / "container.sqsh"
+        container_path.touch()
+
+        slurm_env = {
+            "SLURM_JOB_ID": "12345",
+            "SLURM_JOBID": "12345",
+            "SLURM_NODELIST": "gpu-[01-02]",
+            "SLURM_JOB_NUM_NODES": "2",
+            "SRTCTL_SOURCE_DIR": str(Path(__file__).parent.parent),
+        }
+
+        def mock_scontrol(cmd, **kwargs):
+            if cmd[0] == "scontrol" and "hostnames" in cmd:
+                result = MagicMock()
+                result.stdout = "gpu-01\ngpu-02"
+                result.returncode = 0
+                return result
+            raise subprocess.CalledProcessError(1, cmd)
+
+        with patch.dict(os.environ, slurm_env):
+            os.environ.pop("INFMAX_WORKSPACE", None)
+            with patch("subprocess.run", mock_scontrol):
+                with patch("srtctl.core.slurm.get_hostname_ip", return_value="10.0.0.1"):
+                    config = SrtConfig(
+                        name="test",
+                        model=ModelConfig(
+                            path=str(model_path),
+                            container=str(container_path),
+                            precision="fp8",
+                        ),
+                        resources=ResourceConfig(
+                            gpu_type="h100",
+                            gpus_per_node=8,
+                            prefill_nodes=1,
+                            decode_nodes=1,
+                        ),
+                    )
+                    runtime = RuntimeContext.from_config(config, job_id="12345")
+
+        assert Path("/infmax-workspace") not in runtime.container_mounts.values()