
Update Qwen3.5 recipes#240

Merged
YAMY1234 merged 12 commits into ishandhanani:main from YAMY1234:qwen3.5-recipes
Apr 7, 2026

Conversation

Collaborator

YAMY1234 commented Apr 3, 2026

Summary by CodeRabbit

  • New Features

    • Added new benchmark recipe configurations supporting tensor parallel, data parallel, and expert parallel inference strategies
    • Introduced profiling capabilities for performance optimization across multiple hardware configurations
    • Added support for new precision formats and quantization strategies
  • Chores

    • Updated model serving configurations with enhanced performance tuning parameters
    • Expanded benchmark workload definitions with additional evaluation datasets and concurrency profiles

YAMY1234 added 4 commits April 3, 2026 01:32
…layout

- Reorganize flat qwen3.5/ directory into structured hierarchy:
  qwen3.5/{fp8,nvfp4}/{agg,disagg}/
- Split disagg recipes by transfer backend: mooncake/ and nixl/
- Add ACC (accuracy) recipes for gsm8k validation
- Add profile recipes for performance analysis
- Remove duplicate/debug files from old layout
- Remove extra_mount (local sglang source mount)
- Remove dynamo hash configs
- Remove trust-remote-code (not needed)
- Remove random_range_ratio from benchmarks
- Remove commented-out options (# mamba-scheduler-strategy: "extra_buffer")
- Remove SGLANG_DEBUG_MEMORY_POOL env vars
- Keep # enable-symm-mem: true comments; leave enable-symm-mem commented out,
  with a note that it may improve performance in some scenarios and should be
  benchmarked
- Add kv-cache-dtype: fp8_e4m3 to all files that were missing it
- Rename tp-size/dp-size/ep-size to full names (tensor-parallel-size etc)
- Remove max-mamba-cache-size and related verbose comments
- Remove commented-out DeepEP mounts and debug env vars
- Fix chunked-prefill-size: 2048 -> 16384
- Fix mamba-track-interval: 128 -> 2048 (or 10000 for 8k1k recipes)
  with annotation to keep it > isl+osl to avoid checkpointing
- Fix context-length: 262144 -> 2200 to match workload
- Remove misc commented-out options (trace envs, benchmark params)
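Taken together, the renames and value fixes above amount to a server-args fragment like the following. This is a hedged sketch: the keys and values come from the commit notes, but the surrounding recipe structure (the hypothetical `server_args` nesting) is assumed, not taken from the PR.

```yaml
# Illustrative fragment only; nesting under a server_args key is assumed.
server_args:
  tensor-parallel-size: 4        # renamed from tp-size
  data-parallel-size: 1          # renamed from dp-size
  expert-parallel-size: 1        # renamed from ep-size
  kv-cache-dtype: fp8_e4m3       # added to files that were missing it
  chunked-prefill-size: 16384    # fixed from 2048
  context-length: 2200           # fixed from 262144 to match the workload
  mamba-track-interval: 2048     # keep > isl+osl to avoid checkpointing
  # enable-symm-mem: true        # left commented; may help in some scenarios, benchmark first
```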
Contributor

coderabbitai bot commented Apr 3, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9dbdf531-9088-4dd6-9161-015a7d8c3789

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

Introduces 100+ SGLang benchmark recipe YAML files for Qwen3.5 models across FP8 and NVFP4 precision modes, spanning aggregated, disaggregated (Mooncake/NIXL), and profiling configurations. Changes include new MTP speculative decoding (EAGLE), staging buffers, heterogeneous parallel topologies (TP/DP/EP variants), and runtime tuning parameters.

Changes

Cohort / File(s) Summary
Aggregated FP8 MTP Recipes
recipes/qwen3.5/fp8/agg/mtp_radix_off/tp4-mtp-acc.yaml, recipes/qwen3.5/fp8/agg/mtp_radix_on/tp4-mtp-acc.yaml
New recipes configuring TP4 with EAGLE speculative decoding (2 steps, topk=1, 3 draft tokens), FP8 KV cache, radix cache control (disabled/enabled), and GSM8K benchmark.
Aggregated FP8 Profiling
recipes/qwen3.5/fp8/agg/profile/agg-dep4-trtllm-symmem-profile.yaml, recipes/qwen3.5/fp8/agg/profile/agg-tep4-trtllm-symmem-profile.yaml, recipes/qwen3.5/fp8/agg/profile/agg-tp4-trtllm-symmem-profile.yaml
Enabled symmetric memory (enable-symm-mem: true) and added torch profiling with aggregated window (steps 10–20).
Aggregated FP8 STP Prefix Off
recipes/qwen3.5/fp8/agg/stp_prefix_off/dep4-acc.yaml, recipes/qwen3.5/fp8/agg/stp_prefix_off/dep4.yaml, recipes/qwen3.5/fp8/agg/stp_prefix_off/tep4.yaml, recipes/qwen3.5/fp8/agg/stp_prefix_off/tp4-acc.yaml, recipes/qwen3.5/fp8/agg/stp_prefix_off/tp4.yaml
New/updated recipes with DEP4/TEP4/TP4 parallelism, disabled radix cache, FP8 KV cache, and GSM8K accuracy benchmarks; minor env var cleanups.
Aggregated FP8 STP Radix On
recipes/qwen3.5/fp8/agg/stp_radix_on/dep4-acc.yaml, recipes/qwen3.5/fp8/agg/stp_radix_on/tp4-acc.yaml
New accuracy recipes with radix cache enabled (disable-radix-cache: false), TP4/DEP4 configs, and GSM8K benchmarks.
Disaggregated Mooncake MTP Configs
recipes/qwen3.5/fp8/disagg/mooncake/mtp_radix_off/1p1d-mtp-acc.yaml, recipes/qwen3.5/fp8/disagg/mooncake/mtp_radix_off/1p1d-tp4-mtp.yaml, recipes/qwen3.5/fp8/disagg/mooncake/mtp_radix_on/1p1d-mtp-acc-prefixcache-retraction.yaml, recipes/qwen3.5/fp8/disagg/mooncake/mtp_radix_on/1p1d-mtp-acc-prefixcache.yaml
New/updated 1P1D recipes with EAGLE MTP, radix/prefix cache control, GPQA benchmarks, and prefill/decode staging buffer configs.
Disaggregated Mooncake Profiling
recipes/qwen3.5/fp8/disagg/mooncake/profile/1p1d-dep4-profile.yaml, recipes/qwen3.5/fp8/disagg/mooncake/profile/1p1d-dep4-staging-nsys-1k1k-v3.yaml, recipes/qwen3.5/fp8/disagg/mooncake/profile/1p1d-dep4-staging-torch-1k1k.yaml, recipes/qwen3.5/fp8/disagg/mooncake/profile/1p1d-dep4dep4-profile.yaml, recipes/qwen3.5/fp8/disagg/mooncake/profile/1p1d-tep4-profile.yaml, recipes/qwen3.5/fp8/disagg/mooncake/profile/1p1d-tp4-mtp-profile.yaml, recipes/qwen3.5/fp8/disagg/mooncake/profile/1p1d-tp4-profile.yaml
New profiling recipes with torch/nsys profiling (steps 10–20), sa-bench concurrency sweeps, staging buffer control, and updated memory fractions (0.75→0.80).
Disaggregated Mooncake STP Prefix Off (Homogeneous)
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-deepep-deepgemm.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-gpqa.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-gsm8k-bench.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-longbenchv2.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-staging-gpqa.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-staging-gsm8k-bench.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-staging-longbenchv2.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-acc.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-longbenchv2.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-mooncake-gpqa.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-staging-acc.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-staging-gsm8k-bench.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-staging-gsm8k.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-staging-longbenchv2.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-staging.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4dep4-8k1k.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4dep4-staging-8k1k.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4dep4-staging.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4dep4.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4tp4-gpqa.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4tp4-gsm8k-bench.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4tp4-longbenchv2.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4tp4-staging-gpqa.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4tp4-staging-gsm8k-bench.yaml, 
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4tp4-staging-longbenchv2.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-tep4.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-tp4-acc.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-tp4-baseline.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-tp4.yaml
New/updated 1P1D recipes with TP4/DEP4/TEP4/DEP2/DEP4TP4 parallelism, FP8 KV cache, staging buffer configs (32MB–256MB prefill, 4GB decode), diverse benchmarks (GPQA, GSM8K, LongBenchV2), and updated memory tuning (mem-fraction-static 0.75–0.80, cuda-graph-max-bs 1024).
Disaggregated Mooncake STP Prefix Off (Heterogeneous)
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/2p1d-hetero-nostaging-longbenchv2.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/2p1d-hetero-staging-dep4-gpqa.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/2p1d-hetero-staging-dep4-longbenchv2.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/2p1d-hetero-staging-gpqa.yaml, recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/2p1d-hetero-staging-longbenchv2.yaml
New 2P1D heterogeneous recipes with multi-worker prefill groups (TP4 and DEP2 variants), single decode (DEP4), staging buffer control, and LongBenchV2/GPQA benchmarks.
Disaggregated NIXL STP Prefix Off
recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-nixl-acc.yaml, recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-nixl-gpqa-20req.yaml, recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-nixl-gpqa.yaml, recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-nixl-gsm8k.yaml, recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-staging-nixl-gpqa-20req.yaml, recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-staging-nixl-gpqa.yaml, recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-staging-nixl-longbenchv2.yaml, recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4dep4-nixl.yaml, recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-tep4-nixl.yaml, recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-tp4-nixl-acc.yaml, recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-tp4-nixl.yaml
New/updated 1P1D recipes using NIXL transfer backend, DEP4/TEP4/TP4 parallelism, FP8 KV cache, staging buffer configs (256MB prefill, 4GB decode), diverse benchmarks, and radix cache control.
NVFP4 Aggregated Validation
recipes/qwen3.5/nvfp4/agg/mtp_radix_on/validate-nvfp4-mtp-ladfi-c128.yaml, recipes/qwen3.5/nvfp4/agg/mtp_radix_on/validate-nvfp4-mtp-ladfi-c512.yaml
New NVFP4 MTP validation recipes with EAGLE speculative decoding (NEXTN algorithm), FlashInfer LADFI (linear-attention decode), radix cache enabled, and high concurrency (128/512).
NVFP4 Aggregated Profiling & STP Prefix Off
recipes/qwen3.5/nvfp4/agg/profile/profile-torch-conc128.yaml, recipes/qwen3.5/nvfp4/agg/stp_prefix_off/dep4-acc.yaml, recipes/qwen3.5/nvfp4/agg/stp_prefix_off/dep4.yaml, recipes/qwen3.5/nvfp4/agg/stp_prefix_off/tep4.yaml, recipes/qwen3.5/nvfp4/agg/stp_prefix_off/tp4-acc.yaml, recipes/qwen3.5/nvfp4/agg/stp_prefix_off/tp4.yaml
New NVFP4 FP4 quantization recipes with TP4/DEP4/TEP4 parallelism, modelopt_fp4 quantization, FP8 E4M3 KV cache, FlashInfer GEMM/MoE backends, torch profiling, GSM8K benchmarks, and concurrency tuning (64–512).
NVFP4 Aggregated STP Radix On
recipes/qwen3.5/nvfp4/agg/stp_radix_on/validate-nvfp4-extrabuf-ladfi-c128.yaml, recipes/qwen3.5/nvfp4/agg/stp_radix_on/validate-nvfp4-extrabuf-ladfi-c512.yaml
New NVFP4 validation recipes with extra_buffer scheduler strategy, FlashInfer LADFI backend, radix cache enabled, and high concurrency.
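The MTP cohorts summarized above share a small speculative-decoding block. A hedged sketch of the FP8 EAGLE variant, using only values stated in the cohort summaries (placement of these keys within a recipe is assumed):

```yaml
speculative-algorithm: EAGLE
speculative-num-steps: 2
speculative-eagle-topk: 1
speculative-num-draft-tokens: 3
kv-cache-dtype: fp8_e4m3
disable-radix-cache: true    # the mtp_radix_on variants flip this to false
```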

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested reviewers

  • ishandhanani
  • kyleliang-nv

🐰 Whiskers twitch with joy—such YAML abundance!
A hundred configs dance in perfect cadence,
Parallel pathways bloom with FP8 and NVFP4,
MTP dreams flutter, staging buffers swoon,
The recipes are many, the tuning is wise! 🌙✨


YAMY1234 marked this pull request as draft April 3, 2026 18:36
YAMY1234 added 5 commits April 3, 2026 11:43
profile/: keep tp4, tp4-mtp, dep4-nsys; remove dep4, dep4dep4, tep4, torch
stp_prefix_off/: keep tp4, dep4dep4, dep4tp4-staging (+gpqa/gsm8k); remove
  dep2/dep4 variants, longbenchv2, hetero, baseline, deepep-deepgemm, 8k1k
nixl/: keep tp4, dep4 (+acc, gpqa, gsm8k); remove 20req, tep4, dep4dep4, staging
nvfp4/: rename validate-* to tp4-acc/tp4-mtp-acc style, keep c512, drop c128
- speculative-algorithm: EAGLE -> NEXTN (native MTP head)
- speculative-num-steps: 2 -> 3
- speculative-num-draft-tokens: 3 -> 4
- Remove speculative-eagle-topk (not needed for NEXTN)
- Add SGLANG_ENABLE_SPEC_V2=1 env var to all MTP recipes
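The EAGLE-to-NEXTN migration in the bullets above can be sketched as follows (values from the commit notes; surrounding keys and the environment block structure are assumed):

```yaml
speculative-algorithm: NEXTN        # was EAGLE; native MTP head
speculative-num-steps: 3            # was 2
speculative-num-draft-tokens: 4     # was 3
# speculative-eagle-topk removed    # not needed for NEXTN

# added to every MTP recipe's environment block:
environment:
  SGLANG_ENABLE_SPEC_V2: "1"
```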
Contributor

coderabbitai bot left a comment


Actionable comments posted: 19

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

🟡 Minor comments (6)
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-tp4-baseline.yaml-3-3 (1)

3-3: ⚠️ Potential issue | 🟡 Minor

Typo in comment: "recipies" should be "recipes".

Proposed fix
-# Server config is EXACTLY the same as recipies/qwen3.5/1p1d.yaml (949062)
+# Server config is EXACTLY the same as recipes/qwen3.5/1p1d.yaml (949062)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-tp4-baseline.yaml` at
line 3, Fix the typo in the top-line comment of the YAML: change
"recipies/qwen3.5/1p1d.yaml" to "recipes/qwen3.5/1p1d.yaml" so the comment reads
"Server config is EXACTLY the same as recipes/qwen3.5/1p1d.yaml (949062)";
update the string in the file comment accordingly.
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-staging-gsm8k.yaml-1-6 (1)

1-6: ⚠️ Potential issue | 🟡 Minor

Inconsistent header comment: mentions GPQA but benchmark is GSM8K.

The header comment on line 2 says "with GPQA accuracy benchmark" but the actual benchmark type on line 115 is gsm8k. Update the comment to accurately reflect the benchmark.

Proposed fix
 # Qwen3.5-397B-A17B-FP8 Disaggregated 1P1D: TP4 Prefill + DEP4 Decode
-# Same as 1p1d-dep4-staging.yaml but with GPQA accuracy benchmark
+# Same as 1p1d-dep4-staging.yaml but with GSM8K accuracy benchmark
 # GPU staging buffer enabled (bulk RDMA, ~1000x fewer RDMA WRs)
-# Purpose: verify staging buffer correctness via GPQA accuracy
+# Purpose: verify staging buffer correctness via GSM8K accuracy
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-staging-gsm8k.yaml`
around lines 1 - 6, The file header comment incorrectly states "with GPQA
accuracy benchmark" while this recipe uses the GSM8K benchmark; update the top
comment to accurately say GSM8K (or remove the GPQA mention) so it matches the
actual benchmark used, and verify the name field
("qwen3.5-1p1d-dep4-staging-acc") and any other descriptive lines reflect GSM8K
consistently.
recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-staging-nixl-gpqa-20req.yaml-54-55 (1)

54-55: ⚠️ Potential issue | 🟡 Minor

Decode environment may be missing SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB.

Consistent with the other NIXL staging recipe (1p1d-dep4-staging-nixl-gpqa.yaml), the decode environment has SGLANG_DISAGG_STAGING_POOL_SIZE_MB but lacks SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB. Consider adding for consistency with the Mooncake staging reference implementation.

Suggested fix
     SGLANG_DISAGG_STAGING_BUFFER: "1"
+    SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB: "64"
     SGLANG_DISAGG_STAGING_POOL_SIZE_MB: "4096"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-staging-nixl-gpqa-20req.yaml`
around lines 54 - 55, The decode environment is missing the
SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB variable; add
SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB with an appropriate value (e.g., "4096" to
match SGLANG_DISAGG_STAGING_POOL_SIZE_MB) alongside
SGLANG_DISAGG_STAGING_POOL_SIZE_MB in the same env block so the staging recipe
matches the Mooncake/1p1d-dep4-staging-nixl-gpqa reference implementation.
recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-staging-nixl-gpqa.yaml-54-55 (1)

54-55: ⚠️ Potential issue | 🟡 Minor

Decode environment may be missing SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB.

The prefill environment sets SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB: "256", but the decode environment only has SGLANG_DISAGG_STAGING_POOL_SIZE_MB: "4096" without an explicit buffer size. Comparing with 1p1d-dep4-staging.yaml (Mooncake variant), the decode environment includes both SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB: "64" and SGLANG_DISAGG_STAGING_POOL_SIZE_MB: "4096". If this omission is intentional (relying on defaults), consider adding a comment; otherwise, add the missing parameter.

Suggested fix
     SGLANG_DISAGG_STAGING_BUFFER: "1"
+    SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB: "64"
     SGLANG_DISAGG_STAGING_POOL_SIZE_MB: "4096"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-staging-nixl-gpqa.yaml`
around lines 54 - 55, The decode environment is missing the explicit
SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB entry; update the decode env block to add
SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB with the intended value (e.g., match the
prefill value "256" or the Mooncake variant "64") alongside
SGLANG_DISAGG_STAGING_BUFFER and SGLANG_DISAGG_STAGING_POOL_SIZE_MB, or if
omission is intentional, add an explanatory comment next to
SGLANG_DISAGG_STAGING_POOL_SIZE_MB to document reliance on a default.
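For comparison, a decode environment matching the Mooncake staging reference the reviewer cites would carry all three variables. The values are taken from the comments above; their exact placement inside the recipe's decode_environment block is assumed:

```yaml
decode_environment:
  SGLANG_DISAGG_STAGING_BUFFER: "1"
  SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB: "64"    # prefill block uses "256"
  SGLANG_DISAGG_STAGING_POOL_SIZE_MB: "4096"
```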
recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-nixl-acc.yaml-26-27 (1)

26-27: ⚠️ Potential issue | 🟡 Minor

Mooncake environment variables in NIXL recipe may be unnecessary.

This recipe uses disaggregation-transfer-backend: nixl but includes Mooncake-specific environment variables (MC_FORCE_MNNVL: "1" and SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True") in both prefill and decode environments. Comparing with the sibling recipe 1p1d-dep4-nixl-gpqa.yaml, those Mooncake variables are absent. These may be remnants from copying a Mooncake template and could be safely removed for clarity.

Suggested cleanup in prefill_environment
     NCCL_MNNVL_ENABLE: "1"
     NCCL_CUMEM_ENABLE: "1"
-    MC_FORCE_MNNVL: "1"
     SGLANG_DG_CACHE_DIR: "/configs/deepgemm-cache"
     ...
-    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
Suggested cleanup in decode_environment
     NCCL_MNNVL_ENABLE: "1"
     NCCL_CUMEM_ENABLE: "1"
-    MC_FORCE_MNNVL: "1"
     SGLANG_DG_CACHE_DIR: "/configs/deepgemm-cache"
     ...
-    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"

Also applies to: 32-33, 41-42, 49-50

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-nixl-acc.yaml`
around lines 26 - 27, This recipe includes Mooncake-specific environment
variables that are unnecessary for the NIXL transfer backend; remove the
MC_FORCE_MNNVL and SGLANG_MOONCAKE_CUSTOM_MEM_POOL entries from both
prefill_environment and decode_environment blocks in the 1p1d-dep4-nixl-acc.yaml
recipe (leave other vars like SGLANG_DG_CACHE_DIR intact) so the NIXL recipe
matches the sibling 1p1d-dep4-nixl-gpqa.yaml; search for the exact keys
"MC_FORCE_MNNVL" and "SGLANG_MOONCAKE_CUSTOM_MEM_POOL" and delete those lines
wherever present in this file.
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4tp4-staging-longbenchv2.yaml-58-59 (1)

58-59: ⚠️ Potential issue | 🟡 Minor

Decode environment may be missing SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB.

Similar to the NIXL staging GPQA recipe, the prefill environment sets SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB: "256", but the decode environment only has SGLANG_DISAGG_STAGING_POOL_SIZE_MB: "4096" without an explicit buffer size. The staging recipes in the Mooncake 1p1d-dep4-staging.yaml include both parameters for decode. Consider adding the buffer size for consistency, or document why it's intentionally omitted.

Suggested fix
     SGLANG_DISAGG_STAGING_BUFFER: "1"
+    SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB: "256"
     SGLANG_DISAGG_STAGING_POOL_SIZE_MB: "4096"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4tp4-staging-longbenchv2.yaml`
around lines 58 - 59, The decode environment is missing the
SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB variable while the prefill env sets it
(e.g., "256"); update the decode env to include
SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB with the same or appropriate value to match
SGLANG_DISAGG_STAGING_POOL_SIZE_MB ("4096") for consistency, or add a brief
comment documenting why the buffer size is intentionally omitted; look for the
decode env block containing SGLANG_DISAGG_STAGING_POOL_SIZE_MB and add
SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB alongside it (or a comment explaining
omission).
🧹 Nitpick comments (14)
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-tp4-acc.yaml (2)

85-86: Make decode parallelism explicit to avoid default-dependent behavior.

Decode currently declares only tensor-parallel-size. Add explicit data-parallel-size: 1 and expert-parallel-size: 1 for parity with prefill and future-proofing against default changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-tp4-acc.yaml` around
lines 85 - 86, In the decode configuration where only tensor-parallel-size is
declared, explicitly add data-parallel-size: 1 and expert-parallel-size: 1 to
the same block to avoid relying on defaults; update the decode section (the
entry containing tensor-parallel-size) to include the two new fields so the
decode parallelism matches prefill and is future-proof against default changes.

21-50: Reduce config drift by factoring shared env vars.

prefill_environment and decode_environment are mostly duplicated. Consider YAML anchors/merge keys so common keys are defined once and stage-specific overrides stay minimal.

Proposed refactor (YAML anchors)
 backend:
+  common_environment: &common_environment
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    MC_FORCE_MNNVL: "1"
+    SGLANG_DG_CACHE_DIR: "/configs/deepgemm-cache"
+    FLASHINFER_WORKSPACE_BASE: "/configs/flashinfer-cache"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
 
   prefill_environment:
-    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
-    PYTHONUNBUFFERED: "1"
-    NCCL_MNNVL_ENABLE: "1"
-    NCCL_CUMEM_ENABLE: "1"
-    MC_FORCE_MNNVL: "1"
-    SGLANG_DG_CACHE_DIR: "/configs/deepgemm-cache"
-    FLASHINFER_WORKSPACE_BASE: "/configs/flashinfer-cache"
-    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
-    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
-    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
-    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
-    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    <<: *common_environment
 
   decode_environment:
-    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
-    PYTHONUNBUFFERED: "1"
-    NCCL_MNNVL_ENABLE: "1"
-    NCCL_CUMEM_ENABLE: "1"
-    MC_FORCE_MNNVL: "1"
-    SGLANG_DG_CACHE_DIR: "/configs/deepgemm-cache"
-    FLASHINFER_WORKSPACE_BASE: "/configs/flashinfer-cache"
-    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
-    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
-    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    <<: *common_environment
     SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
     SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1"
-    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
-    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-tp4-acc.yaml` around
lines 21 - 50, prefill_environment and decode_environment share nearly identical
keys; factor the common environment variables into a single YAML anchor (e.g.,
&common_env) and then merge it into each section using the YAML merge key (<<:
*common_env), leaving only stage-specific overrides (e.g.,
SGLANG_DECODE_BOOTSTRAP_TIMEOUT, SGLANG_HACK_SEQ_BOOTSTRAP_ROOM) in
decode_environment and no-ops in prefill_environment; ensure all values remain
quoted strings and that unique keys like SGLANG_MOONCAKE_CUSTOM_MEM_POOL and
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER are preserved after the merge.
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-deepep-deepgemm.yaml (2)

8-8: Pin the container image to an immutable tag/digest.

container: "dev" on Line 8 is mutable and can change benchmark behavior over time. Prefer a versioned tag or digest for reproducibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-deepep-deepgemm.yaml`
at line 8, The recipe currently uses a mutable container reference container:
"dev"; change this to a pinned image tag or digest (e.g., container:
"your-image:1.2.3" or container: "your-image@sha256:...") to ensure reproducible
benchmarks—update the container field in the YAML (look for the container
key/value "dev") to a specific versioned tag or digest.

111-116: Decode tuning comment block is partially stale.

The note references max_mamba_cache_size: 3200/8, but that setting is no longer configured in this file. Please update/remove that line so future tuning is based on active knobs only.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-deepep-deepgemm.yaml`
around lines 111 - 116, The comment block about decode tuning is stale because
it references the inactive setting max_mamba_cache_size (shown as "3200/8");
update the comment to remove or rephrase that specific line so it only documents
currently used knobs (e.g., max_running_requests, extra_buffer, dp_size,
enable_dp_attention and pool), or replace the max_mamba_cache_size calculation
with the actual active parameter if intended.
recipes/qwen3.5/nvfp4/agg/stp_prefix_off/tp4.yaml (1)

1-5: Align the header with the actual symm-mem setting.

Line 2 says symmetric memory is enabled, but Line 53 still leaves enable-symm-mem commented out. Please make the header match the real config so benchmark comparisons are not mislabeled.

Also applies to: 53-53

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3.5/nvfp4/agg/stp_prefix_off/tp4.yaml` around lines 1 - 5, The
header comment claiming "symmetric memory enabled" is inconsistent with the
actual config where the enable-symm-mem flag remains commented out; either
uncomment and set enable-symm-mem to true in the TP4 config or change the header
text to state symmetric memory is disabled. Edit the top comment line that
currently says "Pure tensor parallel, no expert parallel, symmetric memory
enabled" and/or uncomment and set the enable-symm-mem setting so the header and
the enable-symm-mem flag match.
recipes/qwen3.5/fp8/disagg/mooncake/mtp_radix_on/1p1d-mtp-acc-prefixcache-retraction.yaml (1)

9-9: Recipe name doesn't match filename suffix.

The filename includes -retraction suffix but the name field is qwen3.5-1p1d-tp4-mtp-acc-prefixcache. Consider adding -retraction to the name for consistency and easier identification in logs/results.

Proposed fix
-name: "qwen3.5-1p1d-tp4-mtp-acc-prefixcache"
+name: "qwen3.5-1p1d-tp4-mtp-acc-prefixcache-retraction"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@recipes/qwen3.5/fp8/disagg/mooncake/mtp_radix_on/1p1d-mtp-acc-prefixcache-retraction.yaml`
at line 9, The recipe's name field value "qwen3.5-1p1d-tp4-mtp-acc-prefixcache"
does not match the filename suffix; update the name field (the value assigned to
name) to include the "-retraction" suffix (e.g.
"qwen3.5-1p1d-tp4-mtp-acc-prefixcache-retraction") so logs and results match the
file naming convention.
recipes/qwen3.5/fp8/disagg/mooncake/mtp_radix_off/1p1d-mtp-acc.yaml (1)

114-115: Consider removing multimodal flag for text-only benchmark.

enable-multimodal: true is set but the benchmark is GPQA (text-only). This may add unnecessary overhead. Consider removing it unless there's a specific reason to keep it enabled.

Proposed fix
       watchdog-timeout: 1000000

-      enable-multimodal: true
       reasoning-parser: qwen3
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3.5/fp8/disagg/mooncake/mtp_radix_off/1p1d-mtp-acc.yaml` around
lines 114 - 115, The recipe currently sets enable-multimodal: true for a GPQA
text-only benchmark; remove that key or set enable-multimodal: false in this
YAML so the pipeline doesn't enable unnecessary multimodal code paths, keeping
reasoning-parser: qwen3 as-is; if there is a specific multimodal requirement,
add a brief comment next to enable-multimodal to justify keeping it.
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-staging-gpqa.yaml (1)

10-10: Different container version used.

This recipe uses container: "dev-0318" while all other recipes in this PR use container: "dev". This is likely intentional to access staging buffer support in a newer build, but should be documented in the comments.

Proposed documentation addition
 # GPU staging buffer enabled (bulk RDMA, ~1000x fewer RDMA WRs)
 # GPQA accuracy benchmark
+# Note: Uses dev-0318 container for staging buffer support

 name: "qwen3.5-1p1d-dep2-staging-gpqa"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-staging-gpqa.yaml`
at line 10, The recipe sets container: "dev-0318" which differs from the other
recipes using container: "dev"; update the file to document this intentional
divergence by adding a brief comment next to the container key explaining why
"dev-0318" is required (e.g., to access staging buffer support or a newer build)
and the expected scope/duration of the deviation; ensure the comment references
the exact container string ("dev-0318") so reviewers understand it's intentional
and not a typo.
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-acc.yaml (1)

67-68: Consider increasing mamba-track-interval for GSM8K max_tokens.

The benchmark specifies max_tokens: 16000, but mamba-track-interval: 2048 is set for both prefill and decode. Per the comments in other files (e.g., 1p1d-dep4dep4-8k1k.yaml), mamba-track-interval should be greater than isl+osl to avoid checkpointing. While GSM8K inputs are typically short, the combination with max_tokens: 16000 could potentially exceed 2048 tokens total in some cases.

💡 Consider adjusting mamba-track-interval

If the total sequence length (input + output) can exceed 2048 tokens, consider increasing the interval:

       mamba-scheduler-strategy: "no_buffer"
       disable-radix-cache: true
-      mamba-track-interval: 2048
+      mamba-track-interval: 18000  # > max expected isl+osl
       mamba-ssm-dtype: "bfloat16"

Apply to both prefill (Line 67) and decode (Line 97) sections.

Also applies to: 97-98

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-acc.yaml` around
lines 67 - 68, The mamba-track-interval (currently 2048) is too small given
max_tokens: 16000 and may trigger unwanted checkpointing; update the
mamba-track-interval entries for both the prefill and decode sections (the
mamba-track-interval keys shown) to a value greater than the expected total
sequence length (e.g., exceed isl+osl or set >16000) so it safely surpasses
input+output token maxima.
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-gpqa.yaml (1)

67-70: Consider increasing mamba-track-interval for GPQA max_tokens.

The GPQA benchmark specifies max_tokens: 65536, but mamba-track-interval: 2048 is set. If total sequence lengths approach or exceed 2048 tokens, this could trigger unnecessary checkpointing overhead. Other long-context recipes in this PR use higher values (e.g., mamba-track-interval: 10000 for 8k1k workloads).

💡 Consider adjusting mamba-track-interval
       mamba-scheduler-strategy: "no_buffer"
       disable-radix-cache: true
-      mamba-track-interval: 2048
+      mamba-track-interval: 70000  # > max expected isl+osl for GPQA
       mamba-ssm-dtype: "bfloat16"

Apply to both prefill (Line 69) and decode (Line 98) sections if GPQA sequences can be long.

Also applies to: 96-99

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-gpqa.yaml`
around lines 67 - 70, The mamba-track-interval is set to 2048 but GPQA uses
max_tokens: 65536, which may cause excessive checkpointing; update the
mamba-track-interval entries (both the prefill block and the decode block where
"mamba-track-interval" appears) to a much larger value consistent with other
long-context recipes (e.g., 10000 or a value close to expected max sequence
lengths) to avoid unnecessary overhead while keeping other mamba settings
unchanged.
recipes/qwen3.5/fp8/agg/mtp_radix_off/tp4-mtp-acc.yaml (1)

4-4: Consider making the run name explicitly radix-off for symmetry.

This will make side-by-side result filtering (radix-on vs radix-off) cleaner in dashboards/artifacts.

Suggested rename
-name: "qwen3.5-agg-tp4-mtp-acc"
+name: "qwen3.5-agg-tp4-mtp-radix-off-acc"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3.5/fp8/agg/mtp_radix_off/tp4-mtp-acc.yaml` at line 4, The run
name currently set as the name field value "qwen3.5-agg-tp4-mtp-acc" should be
changed to include an explicit "radix-off" token for symmetry; update the YAML
name value (the name: key) to something like "qwen3.5-agg-tp4-mtp-acc-radix-off"
so dashboards and artifact filters can clearly distinguish radix-on vs radix-off
runs.
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-longbenchv2.yaml (1)

71-71: Consider explicitly specifying disaggregation-transfer-backend for Mooncake.

Unlike the NIXL recipes that explicitly set disaggregation-transfer-backend: nixl, this Mooncake recipe does not specify the transfer backend in either the prefill or decode sglang_config. While it may default correctly based on the environment variables, explicitly specifying the backend would improve clarity and consistency with the NIXL recipes in this PR.

Suggested addition for prefill
       disaggregation-mode: "prefill"
+      disaggregation-transfer-backend: mooncake
Suggested addition for decode
       disaggregation-mode: "decode"
+      disaggregation-transfer-backend: mooncake

Also applies to: 100-100

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-longbenchv2.yaml`
at line 71, Add an explicit disaggregation-transfer-backend entry set to
"mooncake" in the sglang_config blocks for both the prefill and decode sections
so the Mooncake recipe matches the NIXL recipes' explicit backend setting;
specifically, add disaggregation-transfer-backend: "mooncake" alongside
disaggregation-mode under the prefill and decode sglang_config entries.
recipes/qwen3.5/nvfp4/agg/stp_prefix_off/tep4.yaml (1)

47-48: Consider adding mamba-scheduler-strategy and mamba-track-interval for consistency.

Other Qwen3.5 recipes in this PR include mamba-scheduler-strategy: "no_buffer" and mamba-track-interval: 2048 alongside mamba-ssm-dtype. If this was intentionally omitted for aggregated mode, consider adding a comment explaining why.

Suggested addition for consistency
+      mamba-scheduler-strategy: "no_buffer"
+      mamba-track-interval: 2048
       mamba-ssm-dtype: "bfloat16"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3.5/nvfp4/agg/stp_prefix_off/tep4.yaml` around lines 47 - 48, The
YAML currently sets mamba-ssm-dtype and moe-runner-backend but omits the
scheduler and tracking settings used in other Qwen3.5 recipes; add
mamba-scheduler-strategy: "no_buffer" and mamba-track-interval: 2048 alongside
mamba-ssm-dtype (or, if aggregated mode intentionally differs, add a brief
comment next to moe-runner-backend or at the top of the recipe explaining why
mamba-scheduler-strategy and mamba-track-interval are omitted for aggregated
mode) so the recipe matches the others or documents the deviation.
recipes/qwen3.5/nvfp4/agg/stp_prefix_off/dep4.yaml (1)

50-52: Set full mamba-* knobs explicitly for Qwen3.5 consistency.

Line 50 currently sets only mamba-ssm-dtype. Add mamba-scheduler-strategy and mamba-track-interval to avoid relying on implicit defaults.

Suggested refactor
+      mamba-scheduler-strategy: "no_buffer"
+      mamba-track-interval: 2048
       mamba-ssm-dtype: "bfloat16"
       moe-runner-backend: "flashinfer_trtllm"
Based on learnings: Qwen3.5 uses SGLang mamba-* scheduler infrastructure, and these settings are applicable/needed for this model family.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@recipes/qwen3.5/nvfp4/agg/stp_prefix_off/dep4.yaml` around lines 50 - 52, The
recipe currently sets only mamba-ssm-dtype ("bfloat16") which can leave Qwen3.5
runs relying on implicit defaults; update the YAML to explicitly add
mamba-scheduler-strategy and mamba-track-interval alongside mamba-ssm-dtype so
the mamba scheduler configuration is fully specified for Qwen3.5 (add suitable
values for mamba-scheduler-strategy and mamba-track-interval consistent with
SGLang mamba-* conventions used across this model family).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@recipes/qwen3.5/fp8/disagg/mooncake/mtp_radix_off/1p1d-tp4-mtp.yaml`:
- Around line 77-84: The decode stage is pointing at a different served model
than the prefill/serving block; update the decode block (the entries under
"decode" including served-model-name and model-path) to use the exact same
served-model-name and model-path used by the top-level model/prefill (so
sglang.deep_gemm_precompile and the serving run see identical model identity),
ensuring served-model-name and model-path strings match across prefill and
decode.

In `@recipes/qwen3.5/fp8/disagg/mooncake/profile/1p1d-dep4dep4-profile.yaml`:
- Line 12: The YAML includes an unsupported key model.version which breaks CI
schema validation; remove the model.version entry from the document (delete the
model.version line and any value or nested structure under it) so the file
conforms to the current schema and rerun validation (look for the top-level
model object and remove only its version field).

In `@recipes/qwen3.5/fp8/disagg/mooncake/profile/1p1d-tp4-mtp-profile.yaml`:
- Around line 77-79: The decode block's served-model-name is set to
nvidia/DeepSeek-R1-0528-NVFP4-v2 which mismatches the recipe's Qwen3.5 FP8
prefill stage; update decode.served-model-name to the same model identifier used
in the prefill configuration (i.e., the Qwen3.5 FP8 served-model-name) so
routing/attribution remain consistent between prefill and decode stages (locate
the decode block and the prefill served-model-name and make them identical).

In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-deepep-deepgemm.yaml`:
- Line 70: The mamba-track-interval value (mamba-track-interval) contradicts its
safety comment ("must be > isl+osl") because context-length is 2200; update both
mamba-track-interval occurrences to be strictly greater than isl+osl (e.g., set
mamba-track-interval to a safe value like 4096 or compute context-length+1) so
checkpointing is avoided; ensure you change every instance of
mamba-track-interval in this file so it consistently satisfies the isl+osl
constraint relative to context-length.

In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-gsm8k-bench.yaml`:
- Around line 110-116: The recipe's benchmark block is using an unsupported
field "use_chat_api" for benchmark type "gsm8k-bench", causing schema validation
to fail; remove the "use_chat_api: true" entry from the benchmark section (the
block that includes type: "gsm8k-bench", num_examples, num_shots, max_tokens,
num_threads) so the YAML only contains supported keys, or alternatively update
the gsm8k-bench schema to accept use_chat_api if that behavior is required.

In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-staging-gsm8k-bench.yaml`:
- Around line 114-120: The benchmark YAML includes an unsupported field
`use_chat_api` in the `benchmark` block which violates the `BenchmarkConfig`
schema; remove the `use_chat_api` entry from the `benchmark` mapping (the block
containing type: "gsm8k-bench", num_examples, num_shots, max_tokens,
num_threads) so the config conforms to `BenchmarkConfig` and the GSM8K benchmark
runner accepts it.

In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4tp4-gsm8k-bench.yaml`:
- Line 118: Remove the unsupported benchmark.use_chat_api field from the YAML by
deleting the "use_chat_api: true" entry under the benchmark config (the field
shown as use_chat_api in the diff); also search for any other occurrences of
use_chat_api or benchmark.use_chat_api in this recipe and remove them or replace
them with a supported counterpart, then re-run schema validation to confirm the
pipeline is no longer blocked.

In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4tp4-staging-gsm8k-bench.yaml`:
- Around line 116-122: The benchmark config for type "gsm8k-bench" includes an
unsupported field use_chat_api which fails validation; remove the use_chat_api:
true line from the benchmark block (the YAML stanza containing type:
"gsm8k-bench", num_examples, num_shots, max_tokens, num_threads) so the schema
validates, or alternatively update the benchmark schema to explicitly allow
use_chat_api if that behavior is required.
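
For the use_chat_api removals above, the fix is a pure key deletion. A minimal sketch of a conforming benchmark block (field names come from the schema cited in the comments; the values are placeholders, not taken from any recipe):

```yaml
benchmark:
  type: "gsm8k-bench"
  num_examples: 200   # placeholder value
  num_shots: 8        # placeholder value
  max_tokens: 512     # placeholder value
  num_threads: 16     # placeholder value
  # use_chat_api removed: not a supported BenchmarkConfig field
```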

In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/2p1d-hetero-nostaging-longbenchv2.yaml`:
- Around line 56-108: The config uses a non‑existent key prefill_groups under
sglang_config; update the block to match SGLangServerConfig by replacing
prefill_groups with the singular fields (prefill / decode / aggregated) and move
the intended prefill settings into a single prefill dict (or consolidate into
prefill and use load balancing/worker specifics inside that dict) instead of an
array, ensuring keys like disaggregation-mode, tensor-parallel-size,
chunked-prefill-size, mem-fraction-static, etc. are placed under
sglang_config.prefill; do not add multiple entries unless the schema is extended
— simply convert the two array items into a single prefill configuration that
merges/chooses the correct values for the deployment.

In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/2p1d-hetero-staging-dep4-gpqa.yaml`:
- Around line 64-115: The YAML uses the unsupported key prefill_groups which
will fail SGLangServerConfig validation; replace the top-level prefill_groups
block with sglang_config and move each worker's configuration under
sglang_config.prefill (and add sglang_config.decode or sglang_config.aggregated
sections if required later). Concretely, remove the prefill_groups key and nest
each worker entry under sglang_config:prefill while preserving all per-worker
fields (e.g., served-model-name, model-path, attention-backend, kv-cache-dtype,
tensor-parallel-size, data-parallel-size, disaggregation-mode,
chunked-prefill-size, etc.), and ensure the final YAML keys match the supported
schema names sglang_config, prefill, decode, aggregated.

In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/2p1d-hetero-staging-dep4-longbenchv2.yaml`:
- Around line 63-64: The CI fails because the config contains an unknown key
sglang_config.prefill_groups; either remove/rename this key to match the
existing schema or add schema/validator support for the new contract. Locate the
config block named sglang_config (the recipe file referencing prefill_groups)
and either delete or replace prefill_groups with the accepted field name, or
update the config validator/schema (the schema definition that validates
sglang_config) to declare prefill_groups with its expected type/structure and
add a small unit/validation test to cover the new field.

In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/2p1d-hetero-staging-gpqa.yaml`:
- Around line 63-64: The recipe uses the unknown field prefill_groups under
backend.sglang_config which fails schema validation; update the schema validator
for backend.sglang_config to include a prefill_groups property (with the correct
type/shape expected by the recipe) so CI accepts this key, or alternatively
remove/rename prefill_groups in the YAML to match the existing schema; locate
references to backend.sglang_config in the validator and add the new property
definition consistent with other group/list fields.

In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/2p1d-hetero-staging-longbenchv2.yaml`:
- Around line 62-114: The YAML uses a non-existent sglang_config.prefill_groups
field causing CI failure; either add support for heterogeneous groups in the
SGLangServerConfig schema or convert these entries to the existing schema fields
(prefill/decode/aggregated). Fix option A: extend the SGLangServerConfig class
to include a prefill_groups array type (with the same per-group keys used here:
served-model-name, model-path, attention-backend, kv-cache-dtype,
tensor-parallel-size, data-parallel-size, expert-parallel-size,
disaggregation-mode, mem-fraction-static, chunked-prefill-size,
load-balance-method, watchdog-timeout, etc.), update its validation/parsing
logic to handle heterogeneous group entries, and update any code that reads
sglang_config (search for SGLangServerConfig) to accept and iterate
prefill_groups. Fix option B: restructure this YAML to use the existing
sglang_config.prefill entries by splitting each group into separate prefill
blocks (or into decode/aggregated as appropriate) so only supported fields
(prefill/decode/aggregated) are present and remove prefill_groups.

In `@recipes/qwen3.5/nvfp4/agg/mtp_radix_on/validate-nvfp4-mtp-ladfi-c128.yaml`:
- Around line 13-16: The CI is failing because the YAML includes an invalid
field under frontend; remove the invalid install field so the frontend block
only contains valid keys (e.g., keep frontend.type: "sglang" and delete
frontend.install). Locate the frontend block (the keys 'frontend', 'type:
"sglang"', and 'install') and remove the install entry entirely so the file
matches the valid schema used by the c512 variant.

In `@recipes/qwen3.5/nvfp4/agg/mtp_radix_on/validate-nvfp4-mtp-ladfi-c512.yaml`:
- Around line 13-16: The frontend block contains an invalid key "install" that
causes CI validation to fail; remove the install: false line from the frontend
section (the block with type: "sglang") so the frontend only declares supported
fields such as type, ensuring the YAML validates correctly.

In `@recipes/qwen3.5/nvfp4/agg/profile/profile-torch-conc128.yaml`:
- Around line 1-5: The filename declares conc128 but the name field
("coreai_comparch_trtllm-yangminl.nvfp4-torch-profile-nobuf-conc64") and the
concurrency settings use conc64; either rename the file (or the name field) so
both say conc64, or raise the concurrency settings from 64 to 128 to match the
intended conc128 profile; update every "conc" occurrence in the body (the other
occurrences referenced in the diff) and any related fields so the recipe name
and concurrency configuration are consistent.
- Line 17: Remove the unsupported frontend.install field (the line with
"install: false") from the profile configuration so CI validation no longer
fails; locate the frontend block referencing frontend.install in
profile-torch-conc128.yaml and delete that key (or, if the install flag is
required for runtime logic, move it to a supported location or documented field
name instead of frontend.install).

In
`@recipes/qwen3.5/nvfp4/agg/stp_radix_on/validate-nvfp4-extrabuf-ladfi-c128.yaml`:
- Around line 13-16: Remove the invalid "install" field under the frontend block
in the YAML (the frontend: type: "sglang" section) because the config schema
does not recognize it; locate the frontend mapping in
validate-nvfp4-extrabuf-ladfi-c128.yaml and delete the line "install: false" so
the frontend section only contains valid keys (e.g., type).

In
`@recipes/qwen3.5/nvfp4/agg/stp_radix_on/validate-nvfp4-extrabuf-ladfi-c512.yaml`:
- Around line 13-16: Remove the invalid install field from the frontend block in
the YAML: locate the frontend section (frontend: with type: "sglang") and delete
the install: false entry so the file conforms to the schema and CI validation
passes.

---

Minor comments:
In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-staging-gsm8k.yaml`:
- Around line 1-6: The file header comment incorrectly states "with GPQA
accuracy benchmark" while this recipe uses the GSM8K benchmark; update the top
comment to accurately say GSM8K (or remove the GPQA mention) so it matches the
actual benchmark used, and verify the name field
("qwen3.5-1p1d-dep4-staging-acc") and any other descriptive lines reflect GSM8K
consistently.

In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4tp4-staging-longbenchv2.yaml`:
- Around line 58-59: The decode environment is missing the
SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB variable while the prefill env sets it
(e.g., "256"); update the decode env to include
SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB with the same or appropriate value to match
SGLANG_DISAGG_STAGING_POOL_SIZE_MB ("4096") for consistency, or add a brief
comment documenting why the buffer size is intentionally omitted; look for the
decode env block containing SGLANG_DISAGG_STAGING_POOL_SIZE_MB and add
SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB alongside it (or a comment explaining
omission).

In `@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-tp4-baseline.yaml`:
- Line 3: Fix the typo in the top-line comment of the YAML: change
"recipies/qwen3.5/1p1d.yaml" to "recipes/qwen3.5/1p1d.yaml" so the comment reads
"Server config is EXACTLY the same as recipes/qwen3.5/1p1d.yaml (949062)";
update the string in the file comment accordingly.

In `@recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-nixl-acc.yaml`:
- Around line 26-27: This recipe includes Mooncake-specific environment
variables that are unnecessary for the NIXL transfer backend; remove the
MC_FORCE_MNNVL and SGLANG_MOONCAKE_CUSTOM_MEM_POOL entries from both
prefill_environment and decode_environment blocks in the 1p1d-dep4-nixl-acc.yaml
recipe (leave other vars like SGLANG_DG_CACHE_DIR intact) so the NIXL recipe
matches the sibling 1p1d-dep4-nixl-gpqa.yaml; search for the exact keys
"MC_FORCE_MNNVL" and "SGLANG_MOONCAKE_CUSTOM_MEM_POOL" and delete those lines
wherever present in this file.

In
`@recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-staging-nixl-gpqa-20req.yaml`:
- Around line 54-55: The decode environment is missing the
SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB variable; add
SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB with an appropriate value (e.g., "4096" to
match SGLANG_DISAGG_STAGING_POOL_SIZE_MB) alongside
SGLANG_DISAGG_STAGING_POOL_SIZE_MB in the same env block so the staging recipe
matches the Mooncake/1p1d-dep4-staging-nixl-gpqa reference implementation.
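
A minimal sketch of the suggested decode env block (the variable names and the "4096" value are quoted from the comment above; treat them as a template, not the final values):

```yaml
decode_environment:
  SGLANG_DISAGG_STAGING_POOL_SIZE_MB: "4096"
  SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB: "4096"  # added explicitly to match the pool size
```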

In
`@recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-staging-nixl-gpqa.yaml`:
- Around line 54-55: The decode environment is missing the explicit
SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB entry; update the decode env block to add
SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB with the intended value (e.g., match the
prefill value "256" or the Mooncake variant "64") alongside
SGLANG_DISAGG_STAGING_BUFFER and SGLANG_DISAGG_STAGING_POOL_SIZE_MB, or if
omission is intentional, add an explanatory comment next to
SGLANG_DISAGG_STAGING_POOL_SIZE_MB to document reliance on a default.

---

Nitpick comments:
In `@recipes/qwen3.5/fp8/agg/mtp_radix_off/tp4-mtp-acc.yaml`:
- Line 4: The run name currently set as the name field value
"qwen3.5-agg-tp4-mtp-acc" should be changed to include an explicit "radix-off"
token for symmetry; update the YAML name value (the name: key) to something like
"qwen3.5-agg-tp4-mtp-acc-radix-off" so dashboards and artifact filters can
clearly distinguish radix-on vs radix-off runs.

In `@recipes/qwen3.5/fp8/disagg/mooncake/mtp_radix_off/1p1d-mtp-acc.yaml`:
- Around line 114-115: The recipe currently sets enable-multimodal: true for a
GPQA text-only benchmark; remove that key or set enable-multimodal: false in
this YAML so the pipeline doesn't enable unnecessary multimodal code paths,
keeping reasoning-parser: qwen3 as-is; if there is a specific multimodal
requirement, add a brief comment next to enable-multimodal to justify keeping
it.

In
`@recipes/qwen3.5/fp8/disagg/mooncake/mtp_radix_on/1p1d-mtp-acc-prefixcache-retraction.yaml`:
- Line 9: The recipe's name field value "qwen3.5-1p1d-tp4-mtp-acc-prefixcache"
does not match the filename suffix; update the name field (the value assigned to
name) to include the "-retraction" suffix (e.g.
"qwen3.5-1p1d-tp4-mtp-acc-prefixcache-retraction") so logs and results match the
file naming convention.

In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-deepep-deepgemm.yaml`:
- Line 8: The recipe currently uses a mutable container reference container:
"dev"; change this to a pinned image tag or digest (e.g., container:
"your-image:1.2.3" or container: "your-image@sha256:...") to ensure reproducible
benchmarks—update the container field in the YAML (look for the container
key/value "dev") to a specific versioned tag or digest.
- Around line 111-116: The comment block about decode tuning is stale because it
references the inactive setting max_mamba_cache_size (shown as "3200/8"); update
the comment to remove or rephrase that specific line so it only documents
currently used knobs (e.g., max_running_requests, extra_buffer, dp_size,
enable_dp_attention and pool), or replace the max_mamba_cache_size calculation
with the actual active parameter if intended.

In `@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-gpqa.yaml`:
- Around line 67-70: The mamba-track-interval is set to 2048 but GPQA uses
max_tokens: 65536, which may cause excessive checkpointing; update the
mamba-track-interval entries (both the prefill block and the decode block where
"mamba-track-interval" appears) to a much larger value consistent with other
long-context recipes (e.g., 10000 or a value close to expected max sequence
lengths) to avoid unnecessary overhead while keeping other mamba settings
unchanged.

In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-longbenchv2.yaml`:
- Line 71: Add an explicit disaggregation-transfer-backend entry set to
"mooncake" in the sglang_config blocks for both the prefill and decode sections
so the Mooncake recipe matches the NIXL recipes' explicit backend setting;
specifically, add disaggregation-transfer-backend: "mooncake" alongside
disaggregation-mode under the prefill and decode sglang_config entries.

In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-staging-gpqa.yaml`:
- Line 10: The recipe sets container: "dev-0318" which differs from the other
recipes using container: "dev"; update the file to document this intentional
divergence by adding a brief comment next to the container key explaining why
"dev-0318" is required (e.g., to access staging buffer support or a newer build)
and the expected scope/duration of the deviation; ensure the comment references
the exact container string ("dev-0318") so reviewers understand it's intentional
and not a typo.

In `@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-acc.yaml`:
- Around line 67-68: The mamba-track-interval (currently 2048) is too small
given max_tokens: 16000 and may trigger unwanted checkpointing; update the
mamba-track-interval entries for both the prefill and decode sections (the
mamba-track-interval keys shown) to a value greater than the expected total
sequence length (e.g., exceed isl+osl or set >16000) so it safely surpasses
input+output token maxima.

In `@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-tp4-acc.yaml`:
- Around line 85-86: In the decode configuration where only tensor-parallel-size
is declared, explicitly add data-parallel-size: 1 and expert-parallel-size: 1 to
the same block to avoid relying on defaults; update the decode section (the
entry containing tensor-parallel-size) to include the two new fields so the
decode parallelism matches prefill and is future-proof against default changes.
- Around line 21-50: prefill_environment and decode_environment share nearly
identical keys; factor the common environment variables into a single YAML
anchor (e.g., &common_env) and then merge it into each section using the YAML
merge key (<<: *common_env), leaving only stage-specific overrides (e.g.,
SGLANG_DECODE_BOOTSTRAP_TIMEOUT, SGLANG_HACK_SEQ_BOOTSTRAP_ROOM) in
decode_environment and no-ops in prefill_environment; ensure all values remain
quoted strings and that unique keys like SGLANG_MOONCAKE_CUSTOM_MEM_POOL and
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER are preserved after the merge.
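
A minimal sketch of the suggested anchor/merge refactor, assuming the recipe loader's YAML parser supports the YAML 1.1 merge key (`<<`); the variable names follow the comment above, and all values are placeholders:

```yaml
common_env: &common_env
  SGLANG_DG_CACHE_DIR: "/tmp/dg_cache"       # placeholder value
  SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "true"    # placeholder value
  SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "1"  # placeholder value

prefill_environment:
  <<: *common_env

decode_environment:
  <<: *common_env
  # stage-specific overrides stay local to decode
  SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "600"     # placeholder value
  SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1"        # placeholder value
```

Note that all merged values remain quoted strings, so environment-variable typing is unchanged after the refactor.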

In `@recipes/qwen3.5/nvfp4/agg/stp_prefix_off/dep4.yaml`:
- Around line 50-52: The recipe currently sets only mamba-ssm-dtype ("bfloat16")
which can leave Qwen3.5 runs relying on implicit defaults; update the YAML to
explicitly add mamba-scheduler-strategy and mamba-track-interval alongside
mamba-ssm-dtype so the mamba scheduler configuration is fully specified for
Qwen3.5 (add suitable values for mamba-scheduler-strategy and
mamba-track-interval consistent with SGLang mamba-* conventions used across this
model family).

In `@recipes/qwen3.5/nvfp4/agg/stp_prefix_off/tep4.yaml`:
- Around line 47-48: The YAML currently sets mamba-ssm-dtype and
moe-runner-backend but omits the scheduler and tracking settings used in other
Qwen3.5 recipes; add mamba-scheduler-strategy: "no_buffer" and
mamba-track-interval: 2048 alongside mamba-ssm-dtype (or, if aggregated mode
intentionally differs, add a brief comment next to moe-runner-backend or at the
top of the recipe explaining why mamba-scheduler-strategy and
mamba-track-interval are omitted for aggregated mode) so the recipe matches the
others or documents the deviation.

In `@recipes/qwen3.5/nvfp4/agg/stp_prefix_off/tp4.yaml`:
- Around line 1-5: The header comment claiming "symmetric memory enabled" is
inconsistent with the actual config where the enable-symm-mem flag remains
commented out; either uncomment and set enable-symm-mem to true in the TP4
config or change the header text to state symmetric memory is disabled. Edit the
top comment line that currently says "Pure tensor parallel, no expert parallel,
symmetric memory enabled" and/or uncomment and set the enable-symm-mem setting
so the header and the enable-symm-mem flag match.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6f5d5264-f7f5-4c52-a5c7-c492fb77fbec

📥 Commits

Reviewing files that changed from the base of the PR and between 81d9757 and 95c026e.

📒 Files selected for processing (85)
  • recipes/qwen3.5/fp8/agg/mtp_radix_off/tp4-mtp-acc.yaml
  • recipes/qwen3.5/fp8/agg/mtp_radix_on/tp4-mtp-acc.yaml
  • recipes/qwen3.5/fp8/agg/profile/agg-dep4-trtllm-symmem-profile.yaml
  • recipes/qwen3.5/fp8/agg/profile/agg-tep4-trtllm-symmem-profile.yaml
  • recipes/qwen3.5/fp8/agg/profile/agg-tp4-trtllm-symmem-profile.yaml
  • recipes/qwen3.5/fp8/agg/stp_prefix_off/dep4-acc.yaml
  • recipes/qwen3.5/fp8/agg/stp_prefix_off/dep4.yaml
  • recipes/qwen3.5/fp8/agg/stp_prefix_off/tep4.yaml
  • recipes/qwen3.5/fp8/agg/stp_prefix_off/tp4-acc.yaml
  • recipes/qwen3.5/fp8/agg/stp_prefix_off/tp4.yaml
  • recipes/qwen3.5/fp8/agg/stp_radix_on/dep4-acc.yaml
  • recipes/qwen3.5/fp8/agg/stp_radix_on/tp4-acc.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/mtp_radix_off/1p1d-mtp-acc.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/mtp_radix_off/1p1d-tp4-mtp.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/mtp_radix_on/1p1d-mtp-acc-prefixcache-retraction.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/mtp_radix_on/1p1d-mtp-acc-prefixcache.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/profile/1p1d-dep4-profile.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/profile/1p1d-dep4-staging-nsys-1k1k-v3.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/profile/1p1d-dep4-staging-torch-1k1k.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/profile/1p1d-dep4dep4-profile.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/profile/1p1d-tep4-profile.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/profile/1p1d-tp4-mtp-profile.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/profile/1p1d-tp4-profile.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-deepep-deepgemm.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-gpqa.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-gsm8k-bench.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-longbenchv2.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-staging-gpqa.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-staging-gsm8k-bench.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-staging-longbenchv2.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-acc.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-longbenchv2.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-mooncake-gpqa.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-staging-acc.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-staging-gsm8k-bench.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-staging-gsm8k.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-staging-longbenchv2.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-staging.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4dep4-8k1k.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4dep4-staging-8k1k.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4dep4-staging.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4dep4.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4tp4-gpqa.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4tp4-gsm8k-bench.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4tp4-longbenchv2.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4tp4-staging-gpqa.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4tp4-staging-gsm8k-bench.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4tp4-staging-longbenchv2.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-tep4.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-tp4-acc.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-tp4-baseline.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-tp4.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/2p1d-hetero-nostaging-longbenchv2.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/2p1d-hetero-staging-dep4-gpqa.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/2p1d-hetero-staging-dep4-longbenchv2.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/2p1d-hetero-staging-gpqa.yaml
  • recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/2p1d-hetero-staging-longbenchv2.yaml
  • recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-nixl-acc.yaml
  • recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-nixl-gpqa-20req.yaml
  • recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-nixl-gpqa.yaml
  • recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-nixl-gsm8k.yaml
  • recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-staging-nixl-gpqa-20req.yaml
  • recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-staging-nixl-gpqa.yaml
  • recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-staging-nixl-longbenchv2.yaml
  • recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4dep4-nixl.yaml
  • recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-tep4-nixl.yaml
  • recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-tp4-nixl-acc.yaml
  • recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-tp4-nixl.yaml
  • recipes/qwen3.5/nvfp4/agg/mtp_radix_off/.gitkeep
  • recipes/qwen3.5/nvfp4/agg/mtp_radix_on/validate-nvfp4-mtp-ladfi-c128.yaml
  • recipes/qwen3.5/nvfp4/agg/mtp_radix_on/validate-nvfp4-mtp-ladfi-c512.yaml
  • recipes/qwen3.5/nvfp4/agg/profile/profile-torch-conc128.yaml
  • recipes/qwen3.5/nvfp4/agg/stp_prefix_off/dep4-acc.yaml
  • recipes/qwen3.5/nvfp4/agg/stp_prefix_off/dep4.yaml
  • recipes/qwen3.5/nvfp4/agg/stp_prefix_off/tep4.yaml
  • recipes/qwen3.5/nvfp4/agg/stp_prefix_off/tp4-acc.yaml
  • recipes/qwen3.5/nvfp4/agg/stp_prefix_off/tp4.yaml
  • recipes/qwen3.5/nvfp4/agg/stp_radix_on/validate-nvfp4-extrabuf-ladfi-c128.yaml
  • recipes/qwen3.5/nvfp4/agg/stp_radix_on/validate-nvfp4-extrabuf-ladfi-c512.yaml
  • recipes/qwen3.5/nvfp4/disagg/mtp_radix_off/.gitkeep
  • recipes/qwen3.5/nvfp4/disagg/mtp_radix_on/.gitkeep
  • recipes/qwen3.5/nvfp4/disagg/profile/.gitkeep
  • recipes/qwen3.5/nvfp4/disagg/stp_prefix_off/.gitkeep
  • recipes/qwen3.5/nvfp4/disagg/stp_radix_on/.gitkeep

YAMY1234 added 3 commits April 3, 2026 13:43
Per the SGLang official docs and test code (PR #19391), NEXTN speculative
decoding requires speculative-eagle-topk: 1, since NEXTN is internally
converted to EAGLE. Without it, server_args crashes with a TypeError
when the trtllm_mha backend is used.
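The fix described above amounts to pinning the EAGLE top-k alongside the other speculative-decoding flags. A minimal sketch of the relevant worker-args block (the field names follow SGLang's server arguments; the num-steps and draft-token values are illustrative, not taken from the recipes):

```yaml
# Hypothetical excerpt of a decode worker's args in an MTP recipe.
# NEXTN is internally converted to EAGLE, so eagle-topk must be 1;
# omitting it makes server_args raise a TypeError under trtllm_mha.
speculative-algorithm: NEXTN
speculative-num-steps: 3          # illustrative value
speculative-eagle-topk: 1        # required: NEXTN maps onto EAGLE
speculative-num-draft-tokens: 4   # illustrative value
attention-backend: trtllm_mha
```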
…refill-size

- Add explicit quantization: fp8 to 3 disagg MTP acc files that were missing it
- Add moe-runner-backend: flashinfer_trtllm to same 3 files
- Change chunked-prefill-size from -1 to 16384 in 1p1d-tp4-profile and 1p1d-tp4
- Remove unsupported `frontend.install` field from NVFP4 recipes
- Remove unsupported `benchmark.use_chat_api` field from gsm8k-bench recipes
- Fix decode served-model-name mismatch (DeepSeek → Qwen3.5) in MTP disagg recipes
- Align profile comment with actual concurrency config (conc128 → conc64)
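Taken together, the bullet fixes above shape the affected worker configs roughly as follows. This is a hedged sketch, not a verbatim recipe: only the fields touched by this commit are shown, key names follow the SGLang-style conventions used elsewhere in these recipes, and the values are the ones stated in the bullets:

```yaml
# Sketch of the corrected disagg MTP acc worker settings.
quantization: fp8                       # added to the 3 files missing it
moe-runner-backend: flashinfer_trtllm   # added to the same 3 files
chunked-prefill-size: 16384             # was -1 in 1p1d-tp4-profile / 1p1d-tp4
served-model-name: Qwen3.5              # decode side previously said DeepSeek
```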
@YAMY1234 YAMY1234 marked this pull request as ready for review April 6, 2026 22:51
@YAMY1234 YAMY1234 merged commit 6b2c3ef into ishandhanani:main Apr 7, 2026
5 checks passed