Conversation
…layout
- Reorganize flat qwen3.5/ directory into structured hierarchy:
qwen3.5/{fp8,nvfp4}/{agg,disagg}/
- Split disagg recipes by transfer backend: mooncake/ and nixl/
- Add ACC (accuracy) recipes for gsm8k validation
- Add profile recipes for performance analysis
- Remove duplicate/debug files from old layout
- Remove extra_mount (local sglang source mount)
- Remove dynamo hash configs
- Remove trust-remote-code (not needed)
- Remove random_range_ratio from benchmarks
- Remove commented-out options (# mamba-scheduler-strategy: "extra_buffer")
- Remove SGLANG_DEBUG_MEMORY_POOL env vars
- Keep # enable-symm-mem: true comments
- Keep enable-symm-mem commented out; note that it may improve performance in some scenarios and should be benchmarked
- Add kv-cache-dtype: fp8_e4m3 to all files that were missing it
- Rename tp-size/dp-size/ep-size to full names (tensor-parallel-size etc.)
- Remove max-mamba-cache-size and related verbose comments
- Remove commented-out DeepEP mounts and debug env vars
- Fix chunked-prefill-size: 2048 -> 16384
- Fix mamba-track-interval: 128 -> 2048 (or 10000 for 8k1k recipes), with annotation to keep it > isl+osl to avoid checkpointing
- Fix context-length: 262144 -> 2200 to match workload
- Remove misc commented-out options (trace envs, benchmark params)
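The mamba-track-interval rule above (keep it strictly greater than isl+osl, or the scheduler starts checkpointing) is easy to get wrong when workloads change. A minimal sketch of how a recipe generator might pick a safe value; the helper name and candidate values are illustrative, not part of the recipes:

```python
def safe_mamba_track_interval(isl: int, osl: int,
                              candidates=(2048, 10000, 16384)) -> int:
    """Pick the smallest candidate interval strictly greater than isl + osl.

    Per the recipe annotations, mamba-track-interval must exceed the total
    sequence length (input + output) or the scheduler starts checkpointing.
    """
    total = isl + osl
    for interval in sorted(candidates):
        if interval > total:
            return interval
    # No candidate is large enough; fall back to the next power of two above total.
    interval = 1
    while interval <= total:
        interval *= 2
    return interval

# 1k1k-style workload: 2048 is NOT safe (not strictly greater), so 10000 is chosen
print(safe_mamba_track_interval(1024, 1024))
# 8k1k workload: isl + osl = 9216, so 10000 is chosen
print(safe_mamba_track_interval(8192, 1024))
```

This mirrors the fixed values in the recipes (2048 for the 1k workloads, 10000 for 8k1k) while making the constraint explicit.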
Important: Review skipped (draft detected). Please check the settings in the CodeRabbit UI or the ⚙️ Run configuration. Configuration used: defaults. Review profile: CHILL. Plan: Pro.
📝 Walkthrough

Introduces 100+ SGLang benchmark recipe YAML files for Qwen3.5 models across FP8 and NVFP4 precision modes, spanning aggregated, disaggregated (Mooncake/NIXL), and profiling configurations. Changes include new MTP speculative decoding (EAGLE), staging buffers, heterogeneous parallel topologies (TP/DP/EP variants), and runtime tuning parameters.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
- profile/: keep tp4, tp4-mtp, dep4-nsys; remove dep4, dep4dep4, tep4, torch
- stp_prefix_off/: keep tp4, dep4dep4, dep4tp4-staging (+gpqa/gsm8k); remove dep2/dep4 variants, longbenchv2, hetero, baseline, deepep-deepgemm, 8k1k
- nixl/: keep tp4, dep4 (+acc, gpqa, gsm8k); remove 20req, tep4, dep4dep4, staging
- nvfp4/: rename validate-* to tp4-acc/tp4-mtp-acc style, keep c512, drop c128
- speculative-algorithm: EAGLE -> NEXTN (native MTP head)
- speculative-num-steps: 2 -> 3
- speculative-num-draft-tokens: 3 -> 4
- Remove speculative-eagle-topk (not needed for NEXTN)
- Add SGLANG_ENABLE_SPEC_V2=1 env var to all MTP recipes
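Applied together, the bullets above amount to this shape of change in each MTP recipe. A hedged sketch; the sglang_config/environment nesting is illustrative and the exact section names may differ per recipe:

```yaml
sglang_config:
  speculative-algorithm: NEXTN        # was: EAGLE
  speculative-num-steps: 3            # was: 2
  speculative-num-draft-tokens: 4     # was: 3
  # speculative-eagle-topk removed (not needed for NEXTN)
environment:
  SGLANG_ENABLE_SPEC_V2: "1"          # added to all MTP recipes
```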
Actionable comments posted: 19
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
🟡 Minor comments (6)
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-tp4-baseline.yaml-3-3 (1)
3-3: ⚠️ Potential issue | 🟡 Minor
Typo in comment: "recipies" should be "recipes".
Proposed fix
```diff
-# Server config is EXACTLY the same as recipies/qwen3.5/1p1d.yaml (949062)
+# Server config is EXACTLY the same as recipes/qwen3.5/1p1d.yaml (949062)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-tp4-baseline.yaml` at line 3, Fix the typo in the top-line comment of the YAML: change "recipies/qwen3.5/1p1d.yaml" to "recipes/qwen3.5/1p1d.yaml" so the comment reads "Server config is EXACTLY the same as recipes/qwen3.5/1p1d.yaml (949062)"; update the string in the file comment accordingly.

recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-staging-gsm8k.yaml-1-6 (1)
1-6: ⚠️ Potential issue | 🟡 Minor
Inconsistent header comment: mentions GPQA but benchmark is GSM8K.
The header comment on line 2 says "with GPQA accuracy benchmark" but the actual benchmark type on line 115 is `gsm8k`. Update the comment to accurately reflect the benchmark.

Proposed fix
```diff
 # Qwen3.5-397B-A17B-FP8 Disaggregated 1P1D: TP4 Prefill + DEP4 Decode
-# Same as 1p1d-dep4-staging.yaml but with GPQA accuracy benchmark
+# Same as 1p1d-dep4-staging.yaml but with GSM8K accuracy benchmark
 # GPU staging buffer enabled (bulk RDMA, ~1000x fewer RDMA WRs)
-# Purpose: verify staging buffer correctness via GPQA accuracy
+# Purpose: verify staging buffer correctness via GSM8K accuracy
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-staging-gsm8k.yaml` around lines 1 - 6, The file header comment incorrectly states "with GPQA accuracy benchmark" while this recipe uses the GSM8K benchmark; update the top comment to accurately say GSM8K (or remove the GPQA mention) so it matches the actual benchmark used, and verify the name field ("qwen3.5-1p1d-dep4-staging-acc") and any other descriptive lines reflect GSM8K consistently.

recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-staging-nixl-gpqa-20req.yaml-54-55 (1)
54-55: ⚠️ Potential issue | 🟡 Minor
Decode environment may be missing `SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB`.
Consistent with the other NIXL staging recipe (`1p1d-dep4-staging-nixl-gpqa.yaml`), the decode environment has `SGLANG_DISAGG_STAGING_POOL_SIZE_MB` but lacks `SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB`. Consider adding for consistency with the Mooncake staging reference implementation.

Suggested fix
```diff
   SGLANG_DISAGG_STAGING_BUFFER: "1"
+  SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB: "64"
   SGLANG_DISAGG_STAGING_POOL_SIZE_MB: "4096"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-staging-nixl-gpqa-20req.yaml` around lines 54 - 55, The decode environment is missing the SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB variable; add SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB with an appropriate value (e.g., "4096" to match SGLANG_DISAGG_STAGING_POOL_SIZE_MB) alongside SGLANG_DISAGG_STAGING_POOL_SIZE_MB in the same env block so the staging recipe matches the Mooncake/1p1d-dep4-staging-nixl-gpqa reference implementation.

recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-staging-nixl-gpqa.yaml-54-55 (1)
54-55: ⚠️ Potential issue | 🟡 Minor
Decode environment may be missing `SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB`.
The prefill environment sets `SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB: "256"`, but the decode environment only has `SGLANG_DISAGG_STAGING_POOL_SIZE_MB: "4096"` without an explicit buffer size. Comparing with `1p1d-dep4-staging.yaml` (Mooncake variant), the decode environment includes both `SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB: "64"` and `SGLANG_DISAGG_STAGING_POOL_SIZE_MB: "4096"`. If this omission is intentional (relying on defaults), consider adding a comment; otherwise, add the missing parameter.

Suggested fix
```diff
   SGLANG_DISAGG_STAGING_BUFFER: "1"
+  SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB: "64"
   SGLANG_DISAGG_STAGING_POOL_SIZE_MB: "4096"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-staging-nixl-gpqa.yaml` around lines 54 - 55, The decode environment is missing the explicit SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB entry; update the decode env block to add SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB with the intended value (e.g., match the prefill value "256" or the Mooncake variant "64") alongside SGLANG_DISAGG_STAGING_BUFFER and SGLANG_DISAGG_STAGING_POOL_SIZE_MB, or if omission is intentional, add an explanatory comment next to SGLANG_DISAGG_STAGING_POOL_SIZE_MB to document reliance on a default.

recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-nixl-acc.yaml-26-27 (1)
26-27: ⚠️ Potential issue | 🟡 Minor
Mooncake environment variables in NIXL recipe may be unnecessary.
This recipe uses `disaggregation-transfer-backend: nixl` but includes Mooncake-specific environment variables (`MC_FORCE_MNNVL: "1"` and `SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"`) in both prefill and decode environments. Comparing with the sibling recipe `1p1d-dep4-nixl-gpqa.yaml`, those Mooncake variables are absent. These may be remnants from copying a Mooncake template and could be safely removed for clarity.

Suggested cleanup in prefill_environment
```diff
   NCCL_MNNVL_ENABLE: "1"
   NCCL_CUMEM_ENABLE: "1"
-  MC_FORCE_MNNVL: "1"
   SGLANG_DG_CACHE_DIR: "/configs/deepgemm-cache"
   ...
-  SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
```
Suggested cleanup in decode_environment
```diff
   NCCL_MNNVL_ENABLE: "1"
   NCCL_CUMEM_ENABLE: "1"
-  MC_FORCE_MNNVL: "1"
   SGLANG_DG_CACHE_DIR: "/configs/deepgemm-cache"
   ...
-  SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
```
Also applies to: 32-33, 41-42, 49-50
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-nixl-acc.yaml` around lines 26 - 27, This recipe includes Mooncake-specific environment variables that are unnecessary for the NIXL transfer backend; remove the MC_FORCE_MNNVL and SGLANG_MOONCAKE_CUSTOM_MEM_POOL entries from both prefill_environment and decode_environment blocks in the 1p1d-dep4-nixl-acc.yaml recipe (leave other vars like SGLANG_DG_CACHE_DIR intact) so the NIXL recipe matches the sibling 1p1d-dep4-nixl-gpqa.yaml; search for the exact keys "MC_FORCE_MNNVL" and "SGLANG_MOONCAKE_CUSTOM_MEM_POOL" and delete those lines wherever present in this file.

recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4tp4-staging-longbenchv2.yaml-58-59 (1)
58-59: ⚠️ Potential issue | 🟡 Minor
Decode environment may be missing `SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB`.
Similar to the NIXL staging GPQA recipe, the prefill environment sets `SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB: "256"`, but the decode environment only has `SGLANG_DISAGG_STAGING_POOL_SIZE_MB: "4096"` without an explicit buffer size. The staging recipes in the Mooncake `1p1d-dep4-staging.yaml` include both parameters for decode. Consider adding the buffer size for consistency, or document why it's intentionally omitted.

Suggested fix
```diff
   SGLANG_DISAGG_STAGING_BUFFER: "1"
+  SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB: "256"
   SGLANG_DISAGG_STAGING_POOL_SIZE_MB: "4096"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4tp4-staging-longbenchv2.yaml` around lines 58 - 59, The decode environment is missing the SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB variable while the prefill env sets it (e.g., "256"); update the decode env to include SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB with the same or appropriate value to match SGLANG_DISAGG_STAGING_POOL_SIZE_MB ("4096") for consistency, or add a brief comment documenting why the buffer size is intentionally omitted; look for the decode env block containing SGLANG_DISAGG_STAGING_POOL_SIZE_MB and add SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB alongside it (or a comment explaining omission).
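The minor comments above share two recurring patterns: staging-buffer env vars present in prefill but absent in decode, and Mooncake-only vars left behind in NIXL recipes. Both could be caught mechanically before review. A sketch, assuming each recipe is already parsed into a dict; the function and key lists are hypothetical, derived only from the findings above:

```python
MOONCAKE_ONLY = {"MC_FORCE_MNNVL", "SGLANG_MOONCAKE_CUSTOM_MEM_POOL"}
STAGING_PAIR = ("SGLANG_DISAGG_STAGING_BUFFER", "SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB")

def lint_recipe(recipe: dict) -> list[str]:
    """Flag the two env-var inconsistencies called out in the review."""
    issues = []
    backend = recipe.get("disaggregation-transfer-backend", "mooncake")
    for stage in ("prefill_environment", "decode_environment"):
        env = recipe.get(stage, {})
        if backend == "nixl":
            # Mooncake-specific vars have no effect under the NIXL backend.
            for key in sorted(MOONCAKE_ONLY & env.keys()):
                issues.append(f"{stage}: Mooncake-only var {key} in NIXL recipe")
        # Staging buffer enabled without an explicit buffer size.
        if STAGING_PAIR[0] in env and STAGING_PAIR[1] not in env:
            issues.append(f"{stage}: staging buffer enabled but {STAGING_PAIR[1]} unset")
    return issues

# Example: a NIXL recipe shaped like the ones reviewed above
recipe = {
    "disaggregation-transfer-backend": "nixl",
    "prefill_environment": {
        "SGLANG_DISAGG_STAGING_BUFFER": "1",
        "SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB": "256",
    },
    "decode_environment": {
        "MC_FORCE_MNNVL": "1",
        "SGLANG_DISAGG_STAGING_BUFFER": "1",
        "SGLANG_DISAGG_STAGING_POOL_SIZE_MB": "4096",
    },
}
for issue in lint_recipe(recipe):
    print(issue)
```

With 100+ recipe files in this PR, a check like this in CI would keep prefill/decode environment blocks from drifting apart.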
🧹 Nitpick comments (14)
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-tp4-acc.yaml (2)
85-86: Make decode parallelism explicit to avoid default-dependent behavior.
Decode currently declares only `tensor-parallel-size`. Add explicit `data-parallel-size: 1` and `expert-parallel-size: 1` for parity with prefill and future-proofing against default changes.
Verify each finding against the current code and only fix it if needed. In `@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-tp4-acc.yaml` around lines 85 - 86, In the decode configuration where only tensor-parallel-size is declared, explicitly add data-parallel-size: 1 and expert-parallel-size: 1 to the same block to avoid relying on defaults; update the decode section (the entry containing tensor-parallel-size) to include the two new fields so the decode parallelism matches prefill and is future-proof against default changes.
21-50: Reduce config drift by factoring shared env vars.
`prefill_environment` and `decode_environment` are mostly duplicated. Consider YAML anchors/merge keys so common keys are defined once and stage-specific overrides stay minimal.

Proposed refactor (YAML anchors)
```diff
 backend:
+  common_environment: &common_environment
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    MC_FORCE_MNNVL: "1"
+    SGLANG_DG_CACHE_DIR: "/configs/deepgemm-cache"
+    FLASHINFER_WORKSPACE_BASE: "/configs/flashinfer-cache"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
   prefill_environment:
-    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
-    PYTHONUNBUFFERED: "1"
-    NCCL_MNNVL_ENABLE: "1"
-    NCCL_CUMEM_ENABLE: "1"
-    MC_FORCE_MNNVL: "1"
-    SGLANG_DG_CACHE_DIR: "/configs/deepgemm-cache"
-    FLASHINFER_WORKSPACE_BASE: "/configs/flashinfer-cache"
-    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
-    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
-    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
-    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
-    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    <<: *common_environment
   decode_environment:
-    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
-    PYTHONUNBUFFERED: "1"
-    NCCL_MNNVL_ENABLE: "1"
-    NCCL_CUMEM_ENABLE: "1"
-    MC_FORCE_MNNVL: "1"
-    SGLANG_DG_CACHE_DIR: "/configs/deepgemm-cache"
-    FLASHINFER_WORKSPACE_BASE: "/configs/flashinfer-cache"
-    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
-    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
-    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    <<: *common_environment
     SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
     SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1"
-    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
-    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-tp4-acc.yaml` around lines 21 - 50, prefill_environment and decode_environment share nearly identical keys; factor the common environment variables into a single YAML anchor (e.g., &common_env) and then merge it into each section using the YAML merge key (<<: *common_env), leaving only stage-specific overrides (e.g., SGLANG_DECODE_BOOTSTRAP_TIMEOUT, SGLANG_HACK_SEQ_BOOTSTRAP_ROOM) in decode_environment and no-ops in prefill_environment; ensure all values remain quoted strings and that unique keys like SGLANG_MOONCAKE_CUSTOM_MEM_POOL and SGLANG_USE_MESSAGE_QUEUE_BROADCASTER are preserved after the merge.

recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-deepep-deepgemm.yaml (2)
8-8: Pin the container image to an immutable tag/digest.
`container: "dev"` on Line 8 is mutable and can change benchmark behavior over time. Prefer a versioned tag or digest for reproducibility.
Verify each finding against the current code and only fix it if needed. In `@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-deepep-deepgemm.yaml` at line 8, The recipe currently uses a mutable container reference container: "dev"; change this to a pinned image tag or digest (e.g., container: "your-image:1.2.3" or container: "your-image@sha256:...") to ensure reproducible benchmarks—update the container field in the YAML (look for the container key/value "dev") to a specific versioned tag or digest.
111-116: Decode tuning comment block is partially stale.
The note references `max_mamba_cache_size: 3200/8`, but that setting is no longer configured in this file. Please update/remove that line so future tuning is based on active knobs only.
Verify each finding against the current code and only fix it if needed. In `@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-deepep-deepgemm.yaml` around lines 111 - 116, The comment block about decode tuning is stale because it references the inactive setting max_mamba_cache_size (shown as "3200/8"); update the comment to remove or rephrase that specific line so it only documents currently used knobs (e.g., max_running_requests, extra_buffer, dp_size, enable_dp_attention and pool), or replace the max_mamba_cache_size calculation with the actual active parameter if intended.

recipes/qwen3.5/nvfp4/agg/stp_prefix_off/tp4.yaml (1)
1-5: Align the header with the actual symm-mem setting.
Line 2 says symmetric memory is enabled, but Line 53 still leaves `enable-symm-mem` commented out. Please make the header match the real config so benchmark comparisons are not mislabeled.

Also applies to: 53-53
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@recipes/qwen3.5/nvfp4/agg/stp_prefix_off/tp4.yaml` around lines 1 - 5, The header comment claiming "symmetric memory enabled" is inconsistent with the actual config where the enable-symm-mem flag remains commented out; either uncomment and set enable-symm-mem to true in the TP4 config or change the header text to state symmetric memory is disabled. Edit the top comment line that currently says "Pure tensor parallel, no expert parallel, symmetric memory enabled" and/or uncomment and set the enable-symm-mem setting so the header and the enable-symm-mem flag match.

recipes/qwen3.5/fp8/disagg/mooncake/mtp_radix_on/1p1d-mtp-acc-prefixcache-retraction.yaml (1)
9-9: Recipe name doesn't match filename suffix.
The filename includes a `-retraction` suffix but the `name` field is `qwen3.5-1p1d-tp4-mtp-acc-prefixcache`. Consider adding `-retraction` to the name for consistency and easier identification in logs/results.

Proposed fix
```diff
-name: "qwen3.5-1p1d-tp4-mtp-acc-prefixcache"
+name: "qwen3.5-1p1d-tp4-mtp-acc-prefixcache-retraction"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@recipes/qwen3.5/fp8/disagg/mooncake/mtp_radix_on/1p1d-mtp-acc-prefixcache-retraction.yaml` at line 9, The recipe's name field value "qwen3.5-1p1d-tp4-mtp-acc-prefixcache" does not match the filename suffix; update the name field (the value assigned to name) to include the "-retraction" suffix (e.g. "qwen3.5-1p1d-tp4-mtp-acc-prefixcache-retraction") so logs and results match the file naming convention.

recipes/qwen3.5/fp8/disagg/mooncake/mtp_radix_off/1p1d-mtp-acc.yaml (1)
114-115: Consider removing multimodal flag for text-only benchmark.
`enable-multimodal: true` is set but the benchmark is GPQA (text-only). This may add unnecessary overhead. Consider removing it unless there's a specific reason to keep it enabled.

Proposed fix
```diff
   watchdog-timeout: 1000000
-  enable-multimodal: true
   reasoning-parser: qwen3
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@recipes/qwen3.5/fp8/disagg/mooncake/mtp_radix_off/1p1d-mtp-acc.yaml` around lines 114 - 115, The recipe currently sets enable-multimodal: true for a GPQA text-only benchmark; remove that key or set enable-multimodal: false in this YAML so the pipeline doesn't enable unnecessary multimodal code paths, keeping reasoning-parser: qwen3 as-is; if there is a specific multimodal requirement, add a brief comment next to enable-multimodal to justify keeping it.

recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-staging-gpqa.yaml (1)
10-10: Different container version used.
This recipe uses `container: "dev-0318"` while all other recipes in this PR use `container: "dev"`. This is likely intentional to access staging buffer support in a newer build, but should be documented in the comments.

Proposed documentation addition
```diff
 # GPU staging buffer enabled (bulk RDMA, ~1000x fewer RDMA WRs)
 # GPQA accuracy benchmark
+# Note: Uses dev-0318 container for staging buffer support
 name: "qwen3.5-1p1d-dep2-staging-gpqa"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-staging-gpqa.yaml` at line 10, The recipe sets container: "dev-0318" which differs from the other recipes using container: "dev"; update the file to document this intentional divergence by adding a brief comment next to the container key explaining why "dev-0318" is required (e.g., to access staging buffer support or a newer build) and the expected scope/duration of the deviation; ensure the comment references the exact container string ("dev-0318") so reviewers understand it's intentional and not a typo.

recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-acc.yaml (1)
67-68: Consider increasing `mamba-track-interval` for GSM8K max_tokens.
The benchmark specifies `max_tokens: 16000`, but `mamba-track-interval: 2048` is set for both prefill and decode. Per the comments in other files (e.g., `1p1d-dep4dep4-8k1k.yaml`), `mamba-track-interval` should be greater than isl+osl to avoid checkpointing. While GSM8K inputs are typically short, the combination with `max_tokens: 16000` could potentially exceed 2048 tokens total in some cases.

💡 Consider adjusting mamba-track-interval
If the total sequence length (input + output) can exceed 2048 tokens, consider increasing the interval:
```diff
   mamba-scheduler-strategy: "no_buffer"
   disable-radix-cache: true
-  mamba-track-interval: 2048
+  mamba-track-interval: 18000  # > max expected isl+osl
   mamba-ssm-dtype: "bfloat16"
```
Apply to both prefill (Line 67) and decode (Line 97) sections.
Also applies to: 97-98
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-acc.yaml` around lines 67 - 68, The mamba-track-interval (currently 2048) is too small given max_tokens: 16000 and may trigger unwanted checkpointing; update the mamba-track-interval entries for both the prefill and decode sections (the mamba-track-interval keys shown) to a value greater than the expected total sequence length (e.g., exceed isl+osl or set >16000) so it safely surpasses input+output token maxima.

recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-gpqa.yaml (1)
67-70: Consider increasing `mamba-track-interval` for GPQA max_tokens.
The GPQA benchmark specifies `max_tokens: 65536`, but `mamba-track-interval: 2048` is set. If total sequence lengths approach or exceed 2048 tokens, this could trigger unnecessary checkpointing overhead. Other long-context recipes in this PR use higher values (e.g., `mamba-track-interval: 10000` for 8k1k workloads).

💡 Consider adjusting mamba-track-interval
```diff
   mamba-scheduler-strategy: "no_buffer"
   disable-radix-cache: true
-  mamba-track-interval: 2048
+  mamba-track-interval: 70000  # > max expected isl+osl for GPQA
   mamba-ssm-dtype: "bfloat16"
```
Apply to both prefill (Line 69) and decode (Line 98) sections if GPQA sequences can be long.
Also applies to: 96-99
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-gpqa.yaml` around lines 67 - 70, The mamba-track-interval is set to 2048 but GPQA uses max_tokens: 65536, which may cause excessive checkpointing; update the mamba-track-interval entries (both the prefill block and the decode block where "mamba-track-interval" appears) to a much larger value consistent with other long-context recipes (e.g., 10000 or a value close to expected max sequence lengths) to avoid unnecessary overhead while keeping other mamba settings unchanged.

recipes/qwen3.5/fp8/agg/mtp_radix_off/tp4-mtp-acc.yaml (1)
4-4: Consider making the run name explicitly `radix-off` for symmetry.
This will make side-by-side result filtering (radix-on vs radix-off) cleaner in dashboards/artifacts.
Suggested rename
```diff
-name: "qwen3.5-agg-tp4-mtp-acc"
+name: "qwen3.5-agg-tp4-mtp-radix-off-acc"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@recipes/qwen3.5/fp8/agg/mtp_radix_off/tp4-mtp-acc.yaml` at line 4, The run name currently set as the name field value "qwen3.5-agg-tp4-mtp-acc" should be changed to include an explicit "radix-off" token for symmetry; update the YAML name value (the name: key) to something like "qwen3.5-agg-tp4-mtp-acc-radix-off" so dashboards and artifact filters can clearly distinguish radix-on vs radix-off runs.

recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-longbenchv2.yaml (1)
71-71: Consider explicitly specifying `disaggregation-transfer-backend` for Mooncake.
Unlike the NIXL recipes that explicitly set `disaggregation-transfer-backend: nixl`, this Mooncake recipe does not specify the transfer backend in either the prefill or decode `sglang_config`. While it may default correctly based on the environment variables, explicitly specifying the backend would improve clarity and consistency with the NIXL recipes in this PR.

Suggested addition for prefill
```diff
   disaggregation-mode: "prefill"
+  disaggregation-transfer-backend: mooncake
```
Suggested addition for decode
```diff
   disaggregation-mode: "decode"
+  disaggregation-transfer-backend: mooncake
```
Also applies to: 100-100
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-longbenchv2.yaml` at line 71, Add an explicit disaggregation-transfer-backend entry set to "mooncake" in the sglang_config blocks for both the prefill and decode sections so the Mooncake recipe matches the NIXL recipes' explicit backend setting; specifically, add disaggregation-transfer-backend: "mooncake" alongside disaggregation-mode under the prefill and decode sglang_config entries.

recipes/qwen3.5/nvfp4/agg/stp_prefix_off/tep4.yaml (1)
47-48: Consider adding `mamba-scheduler-strategy` and `mamba-track-interval` for consistency.
Other Qwen3.5 recipes in this PR include `mamba-scheduler-strategy: "no_buffer"` and `mamba-track-interval: 2048` alongside `mamba-ssm-dtype`. If this was intentionally omitted for aggregated mode, consider adding a comment explaining why.

Suggested addition for consistency
```diff
+  mamba-scheduler-strategy: "no_buffer"
+  mamba-track-interval: 2048
   mamba-ssm-dtype: "bfloat16"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@recipes/qwen3.5/nvfp4/agg/stp_prefix_off/tep4.yaml` around lines 47 - 48, The YAML currently sets mamba-ssm-dtype and moe-runner-backend but omits the scheduler and tracking settings used in other Qwen3.5 recipes; add mamba-scheduler-strategy: "no_buffer" and mamba-track-interval: 2048 alongside mamba-ssm-dtype (or, if aggregated mode intentionally differs, add a brief comment next to moe-runner-backend or at the top of the recipe explaining why mamba-scheduler-strategy and mamba-track-interval are omitted for aggregated mode) so the recipe matches the others or documents the deviation.

recipes/qwen3.5/nvfp4/agg/stp_prefix_off/dep4.yaml (1)
50-52: Set full `mamba-*` knobs explicitly for Qwen3.5 consistency.
Line 50 currently sets only `mamba-ssm-dtype`. Add `mamba-scheduler-strategy` and `mamba-track-interval` to avoid relying on implicit defaults. Based on learnings: Qwen3.5 uses SGLang mamba-* scheduler infrastructure, and these settings are applicable/needed for this model family.

Suggested refactor
```diff
+  mamba-scheduler-strategy: "no_buffer"
+  mamba-track-interval: 2048
   mamba-ssm-dtype: "bfloat16"
   moe-runner-backend: "flashinfer_trtllm"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@recipes/qwen3.5/nvfp4/agg/stp_prefix_off/dep4.yaml` around lines 50 - 52, The recipe currently sets only mamba-ssm-dtype ("bfloat16") which can leave Qwen3.5 runs relying on implicit defaults; update the YAML to explicitly add mamba-scheduler-strategy and mamba-track-interval alongside mamba-ssm-dtype so the mamba scheduler configuration is fully specified for Qwen3.5 (add suitable values for mamba-scheduler-strategy and mamba-track-interval consistent with SGLang mamba-* conventions used across this model family).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@recipes/qwen3.5/fp8/disagg/mooncake/mtp_radix_off/1p1d-tp4-mtp.yaml`:
- Around line 77-84: The decode stage is pointing at a different served model
than the prefill/serving block; update the decode block (the entries under
"decode" including served-model-name and model-path) to use the exact same
served-model-name and model-path used by the top-level model/prefill (so
sglang.deep_gemm_precompile and the serving run see identical model identity),
ensuring served-model-name and model-path strings match across prefill and
decode.
In `@recipes/qwen3.5/fp8/disagg/mooncake/profile/1p1d-dep4dep4-profile.yaml`:
- Line 12: The YAML includes an unsupported key model.version which breaks CI
schema validation; remove the model.version entry from the document (delete the
model.version line and any value or nested structure under it) so the file
conforms to the current schema and rerun validation (look for the top-level
model object and remove only its version field).
In `@recipes/qwen3.5/fp8/disagg/mooncake/profile/1p1d-tp4-mtp-profile.yaml`:
- Around line 77-79: The decode block's served-model-name is set to
nvidia/DeepSeek-R1-0528-NVFP4-v2 which mismatches the recipe's Qwen3.5 FP8
prefill stage; update decode.served-model-name to the same model identifier used
in the prefill configuration (i.e., the Qwen3.5 FP8 served-model-name) so
routing/attribution remain consistent between prefill and decode stages (locate
the decode block and the prefill served-model-name and make them identical).
In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-deepep-deepgemm.yaml`:
- Line 70: The mamba-track-interval value (mamba-track-interval) contradicts its
safety comment ("must be > isl+osl") because context-length is 2200; update both
mamba-track-interval occurrences to be strictly greater than isl+osl (e.g., set
mamba-track-interval to a safe value like 4096 or compute context-length+1) so
checkpointing is avoided; ensure you change every instance of
mamba-track-interval in this file so it consistently satisfies the isl+osl
constraint relative to context-length.
In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-gsm8k-bench.yaml`:
- Around line 110-116: The recipe's benchmark block is using an unsupported
field "use_chat_api" for benchmark type "gsm8k-bench", causing schema validation
to fail; remove the "use_chat_api: true" entry from the benchmark section (the
block that includes type: "gsm8k-bench", num_examples, num_shots, max_tokens,
num_threads) so the YAML only contains supported keys, or alternatively update
the gsm8k-bench schema to accept use_chat_api if that behavior is required.
In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-staging-gsm8k-bench.yaml`:
- Around line 114-120: The benchmark YAML includes an unsupported field
`use_chat_api` in the `benchmark` block which violates the `BenchmarkConfig`
schema; remove the `use_chat_api` entry from the `benchmark` mapping (the block
containing type: "gsm8k-bench", num_examples, num_shots, max_tokens,
num_threads) so the config conforms to `BenchmarkConfig` and the GSM8K benchmark
runner accepts it.
In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4tp4-gsm8k-bench.yaml`:
- Line 118: Remove the unsupported benchmark.use_chat_api field from the YAML by
deleting the "use_chat_api: true" entry under the benchmark config (the field
shown as use_chat_api in the diff); also search for any other occurrences of
use_chat_api or benchmark.use_chat_api in this recipe and remove them or replace
them with a supported counterpart, then re-run schema validation to confirm the
pipeline is no longer blocked.
In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4tp4-staging-gsm8k-bench.yaml`:
- Around line 116-122: The benchmark config for type "gsm8k-bench" includes an
unsupported field use_chat_api which fails validation; remove the use_chat_api:
true line from the benchmark block (the YAML stanza containing type:
"gsm8k-bench", num_examples, num_shots, max_tokens, num_threads) so the schema
validates, or alternatively update the benchmark schema to explicitly allow
use_chat_api if that behavior is required.
In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/2p1d-hetero-nostaging-longbenchv2.yaml`:
- Around line 56-108: The config uses a non‑existent key prefill_groups under
sglang_config; update the block to match SGLangServerConfig by replacing
prefill_groups with the singular fields (prefill / decode / aggregated) and move
the intended prefill settings into a single prefill dict (or consolidate into
prefill and use load balancing/worker specifics inside that dict) instead of an
array, ensuring keys like disaggregation-mode, tensor-parallel-size,
chunked-prefill-size, mem-fraction-static, etc. are placed under
sglang_config.prefill; do not add multiple entries unless the schema is extended
— simply convert the two array items into a single prefill configuration that
merges/chooses the correct values for the deployment.
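Assuming `SGLangServerConfig` accepts only the singular `prefill` / `decode` / `aggregated` dicts, the corrected shape would look roughly like this (key names copied from the comment; values are illustrative placeholders):

```yaml
sglang_config:
  # One prefill dict instead of the unsupported prefill_groups array;
  # merge the two former array entries into the values that fit the deployment.
  prefill:
    disaggregation-mode: "prefill"
    tensor-parallel-size: 4       # illustrative value
    chunked-prefill-size: 16384   # illustrative value
    mem-fraction-static: 0.85     # illustrative value
  decode:
    disaggregation-mode: "decode"
    tensor-parallel-size: 4       # illustrative value
```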
In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/2p1d-hetero-staging-dep4-gpqa.yaml`:
- Around line 64-115: The YAML uses the unsupported key prefill_groups which
will fail SGLangServerConfig validation; replace the top-level prefill_groups
block with sglang_config and move each worker's configuration under
sglang_config.prefill (and add sglang_config.decode or sglang_config.aggregated
sections if required later). Concretely, remove the prefill_groups key and nest
each worker entry under sglang_config:prefill while preserving all per-worker
fields (e.g., served-model-name, model-path, attention-backend, kv-cache-dtype,
tensor-parallel-size, data-parallel-size, disaggregation-mode,
chunked-prefill-size, etc.), and ensure the final YAML keys match the supported
schema names sglang_config, prefill, decode, aggregated.
In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/2p1d-hetero-staging-dep4-longbenchv2.yaml`:
- Around line 63-64: The CI fails because the config contains an unknown key
sglang_config.prefill_groups; either remove/rename this key to match the
existing schema or add schema/validator support for the new contract. Locate the
config block named sglang_config (the recipe file referencing prefill_groups)
and either delete or replace prefill_groups with the accepted field name, or
update the config validator/schema (the schema definition that validates
sglang_config) to declare prefill_groups with its expected type/structure and
add a small unit/validation test to cover the new field.
In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/2p1d-hetero-staging-gpqa.yaml`:
- Around line 63-64: The recipe uses the unknown field prefill_groups under
backend.sglang_config which fails schema validation; update the schema validator
for backend.sglang_config to include a prefill_groups property (with the correct
type/shape expected by the recipe) so CI accepts this key, or alternatively
remove/rename prefill_groups in the YAML to match the existing schema; locate
references to backend.sglang_config in the validator and add the new property
definition consistent with other group/list fields.
In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/2p1d-hetero-staging-longbenchv2.yaml`:
- Around line 62-114: The YAML uses a non-existent sglang_config.prefill_groups
field causing CI failure; either add support for heterogeneous groups in the
SGLangServerConfig schema or convert these entries to the existing schema fields
(prefill/decode/aggregated). Fix option A: extend the SGLangServerConfig class
to include a prefill_groups array type (with the same per-group keys used here:
served-model-name, model-path, attention-backend, kv-cache-dtype,
tensor-parallel-size, data-parallel-size, expert-parallel-size,
disaggregation-mode, mem-fraction-static, chunked-prefill-size,
load-balance-method, watchdog-timeout, etc.), update its validation/parsing
logic to handle heterogeneous group entries, and update any code that reads
sglang_config (search for SGLangServerConfig) to accept and iterate
prefill_groups. Fix option B: restructure this YAML to use the existing
sglang_config.prefill entries by splitting each group into separate prefill
blocks (or into decode/aggregated as appropriate) so only supported fields
(prefill/decode/aggregated) are present and remove prefill_groups.
In `@recipes/qwen3.5/nvfp4/agg/mtp_radix_on/validate-nvfp4-mtp-ladfi-c128.yaml`:
- Around line 13-16: The CI is failing because the YAML includes an invalid
field under frontend; remove the invalid install field so the frontend block
only contains valid keys (e.g., keep frontend.type: "sglang" and delete
frontend.install). Locate the frontend block (the keys 'frontend', 'type:
"sglang"', and 'install') and remove the install entry entirely so the file
matches the valid schema used by the c512 variant.
In `@recipes/qwen3.5/nvfp4/agg/mtp_radix_on/validate-nvfp4-mtp-ladfi-c512.yaml`:
- Around line 13-16: The frontend block contains an invalid key "install" that
causes CI validation to fail; remove the install: false line from the frontend
section (the block with type: "sglang") so the frontend only declares supported
fields such as type, ensuring the YAML validates correctly.
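After the fix described above, the frontend block reduces to the schema-supported keys only:

```yaml
frontend:
  type: "sglang"
  # install: false   <- removed: "install" is not a supported frontend key
```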
In `@recipes/qwen3.5/nvfp4/agg/profile/profile-torch-conc128.yaml`:
- Around line 1-5: The profile name declares conc128 but the config actually
uses conc64; either rename the "name" field value
("coreai_comparch_trtllm-yangminl.nvfp4-torch-profile-nobuf-conc64") to reflect
conc64 or change all concurrency settings currently set to 64 to 128 so they
match the intended conc128 profile; update the same/congruent "conc" entries
found in the body (the other occurrences referenced in the diff) and any related
fields so the recipe name and concurrency configuration are consistent.
- Line 17: Remove the unsupported frontend.install field (the line with
"install: false") from the profile configuration so CI validation no longer
fails; locate the frontend block referencing frontend.install in
profile-torch-conc128.yaml and delete that key (or, if the install flag is
required for runtime logic, move it to a supported location or documented field
name instead of frontend.install).
In
`@recipes/qwen3.5/nvfp4/agg/stp_radix_on/validate-nvfp4-extrabuf-ladfi-c128.yaml`:
- Around line 13-16: Remove the invalid "install" field under the frontend block
in the YAML (the frontend: type: "sglang" section) because the config schema
does not recognize it; locate the frontend mapping in
validate-nvfp4-extrabuf-ladfi-c128.yaml and delete the line "install: false" so
the frontend section only contains valid keys (e.g., type).
In
`@recipes/qwen3.5/nvfp4/agg/stp_radix_on/validate-nvfp4-extrabuf-ladfi-c512.yaml`:
- Around line 13-16: Remove the invalid install field from the frontend block in
the YAML: locate the frontend section (frontend: with type: "sglang") and delete
the install: false entry so the file conforms to the schema and CI validation
passes.
---
Minor comments:
In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-staging-gsm8k.yaml`:
- Around line 1-6: The file header comment incorrectly states "with GPQA
accuracy benchmark" while this recipe uses the GSM8K benchmark; update the top
comment to accurately say GSM8K (or remove the GPQA mention) so it matches the
actual benchmark used, and verify the name field
("qwen3.5-1p1d-dep4-staging-acc") and any other descriptive lines reflect GSM8K
consistently.
In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4tp4-staging-longbenchv2.yaml`:
- Around line 58-59: The decode environment is missing the
SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB variable while the prefill env sets it
(e.g., "256"); update the decode env to include
SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB with the same or appropriate value to match
SGLANG_DISAGG_STAGING_POOL_SIZE_MB ("4096") for consistency, or add a brief
comment documenting why the buffer size is intentionally omitted; look for the
decode env block containing SGLANG_DISAGG_STAGING_POOL_SIZE_MB and add
SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB alongside it (or a comment explaining
omission).
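A sketch of the suggested decode env block; `"256"` here copies the prefill value quoted in the comment and is an assumption, adjust it (or replace it with a documenting comment) if decode should intentionally differ:

```yaml
decode_environment:
  SGLANG_DISAGG_STAGING_POOL_SIZE_MB: "4096"
  # Added for parity with the prefill side; "256" mirrors the prefill setting.
  SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB: "256"
```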
In `@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-tp4-baseline.yaml`:
- Line 3: Fix the typo in the top-line comment of the YAML: change
"recipies/qwen3.5/1p1d.yaml" to "recipes/qwen3.5/1p1d.yaml" so the comment reads
"Server config is EXACTLY the same as recipes/qwen3.5/1p1d.yaml (949062)";
update the string in the file comment accordingly.
In `@recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-nixl-acc.yaml`:
- Around line 26-27: This recipe includes Mooncake-specific environment
variables that are unnecessary for the NIXL transfer backend; remove the
MC_FORCE_MNNVL and SGLANG_MOONCAKE_CUSTOM_MEM_POOL entries from both
prefill_environment and decode_environment blocks in the 1p1d-dep4-nixl-acc.yaml
recipe (leave other vars like SGLANG_DG_CACHE_DIR intact) so the NIXL recipe
matches the sibling 1p1d-dep4-nixl-gpqa.yaml; search for the exact keys
"MC_FORCE_MNNVL" and "SGLANG_MOONCAKE_CUSTOM_MEM_POOL" and delete those lines
wherever present in this file.
In
`@recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-staging-nixl-gpqa-20req.yaml`:
- Around line 54-55: The decode environment is missing the
SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB variable; add
SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB with an appropriate value (e.g., "4096" to
match SGLANG_DISAGG_STAGING_POOL_SIZE_MB) alongside
SGLANG_DISAGG_STAGING_POOL_SIZE_MB in the same env block so the staging recipe
matches the Mooncake/1p1d-dep4-staging-nixl-gpqa reference implementation.
In
`@recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-staging-nixl-gpqa.yaml`:
- Around line 54-55: The decode environment is missing the explicit
SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB entry; update the decode env block to add
SGLANG_DISAGG_STAGING_BUFFER_SIZE_MB with the intended value (e.g., match the
prefill value "256" or the Mooncake variant "64") alongside
SGLANG_DISAGG_STAGING_BUFFER and SGLANG_DISAGG_STAGING_POOL_SIZE_MB, or if
omission is intentional, add an explanatory comment next to
SGLANG_DISAGG_STAGING_POOL_SIZE_MB to document reliance on a default.
---
Nitpick comments:
In `@recipes/qwen3.5/fp8/agg/mtp_radix_off/tp4-mtp-acc.yaml`:
- Line 4: The run name currently set as the name field value
"qwen3.5-agg-tp4-mtp-acc" should be changed to include an explicit "radix-off"
token for symmetry; update the YAML name value (the name: key) to something like
"qwen3.5-agg-tp4-mtp-acc-radix-off" so dashboards and artifact filters can
clearly distinguish radix-on vs radix-off runs.
In `@recipes/qwen3.5/fp8/disagg/mooncake/mtp_radix_off/1p1d-mtp-acc.yaml`:
- Around line 114-115: The recipe currently sets enable-multimodal: true for a
GPQA text-only benchmark; remove that key or set enable-multimodal: false in
this YAML so the pipeline doesn't enable unnecessary multimodal code paths,
keeping reasoning-parser: qwen3 as-is; if there is a specific multimodal
requirement, add a brief comment next to enable-multimodal to justify keeping
it.
In
`@recipes/qwen3.5/fp8/disagg/mooncake/mtp_radix_on/1p1d-mtp-acc-prefixcache-retraction.yaml`:
- Line 9: The recipe's name field value "qwen3.5-1p1d-tp4-mtp-acc-prefixcache"
does not match the filename suffix; update the name field (the value assigned to
name) to include the "-retraction" suffix (e.g.
"qwen3.5-1p1d-tp4-mtp-acc-prefixcache-retraction") so logs and results match the
file naming convention.
In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-deepep-deepgemm.yaml`:
- Line 8: The recipe currently uses a mutable container reference container:
"dev"; change this to a pinned image tag or digest (e.g., container:
"your-image:1.2.3" or container: "your-image@sha256:...") to ensure reproducible
benchmarks—update the container field in the YAML (look for the container
key/value "dev") to a specific versioned tag or digest.
- Around line 111-116: The comment block about decode tuning is stale because it
references the inactive setting max_mamba_cache_size (shown as "3200/8"); update
the comment to remove or rephrase that specific line so it only documents
currently used knobs (e.g., max_running_requests, extra_buffer, dp_size,
enable_dp_attention and pool), or replace the max_mamba_cache_size calculation
with the actual active parameter if intended.
In `@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-gpqa.yaml`:
- Around line 67-70: The mamba-track-interval is set to 2048 but GPQA uses
max_tokens: 65536, which may cause excessive checkpointing; update the
mamba-track-interval entries (both the prefill block and the decode block where
"mamba-track-interval" appears) to a much larger value consistent with other
long-context recipes (e.g., 10000 or a value close to expected max sequence
lengths) to avoid unnecessary overhead while keeping other mamba settings
unchanged.
In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-longbenchv2.yaml`:
- Line 71: Add an explicit disaggregation-transfer-backend entry set to
"mooncake" in the sglang_config blocks for both the prefill and decode sections
so the Mooncake recipe matches the NIXL recipes' explicit backend setting;
specifically, add disaggregation-transfer-backend: "mooncake" alongside
disaggregation-mode under the prefill and decode sglang_config entries.
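A sketch of the suggested addition, with surrounding keys elided; placement follows the comment above:

```yaml
sglang_config:
  prefill:
    disaggregation-mode: "prefill"
    disaggregation-transfer-backend: "mooncake"  # explicit, matching the NIXL recipes
  decode:
    disaggregation-mode: "decode"
    disaggregation-transfer-backend: "mooncake"
```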
In
`@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-staging-gpqa.yaml`:
- Line 10: The recipe sets container: "dev-0318" which differs from the other
recipes using container: "dev"; update the file to document this intentional
divergence by adding a brief comment next to the container key explaining why
"dev-0318" is required (e.g., to access staging buffer support or a newer build)
and the expected scope/duration of the deviation; ensure the comment references
the exact container string ("dev-0318") so reviewers understand it's intentional
and not a typo.
In `@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-acc.yaml`:
- Around line 67-68: The mamba-track-interval (currently 2048) is too small
given max_tokens: 16000 and may trigger unwanted checkpointing; update the
mamba-track-interval entries for both the prefill and decode sections (the
mamba-track-interval keys shown) to a value greater than the expected total
sequence length (e.g., exceed isl+osl or set >16000) so it safely surpasses
input+output token maxima.
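For example (a sketch; 20000 is an illustrative value chosen only to exceed the 16000-token max, not a tuned recommendation):

```yaml
# Set in both the prefill and decode blocks.
# Keep the interval above isl + osl so no mamba checkpoint fires mid-request.
mamba-track-interval: 20000
```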
In `@recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-tp4-acc.yaml`:
- Around line 85-86: In the decode configuration where only tensor-parallel-size
is declared, explicitly add data-parallel-size: 1 and expert-parallel-size: 1 to
the same block to avoid relying on defaults; update the decode section (the
entry containing tensor-parallel-size) to include the two new fields so the
decode parallelism matches prefill and is future-proof against default changes.
- Around line 21-50: prefill_environment and decode_environment share nearly
identical keys; factor the common environment variables into a single YAML
anchor (e.g., &common_env) and then merge it into each section using the YAML
merge key (<<: *common_env), leaving only stage-specific overrides (e.g.,
SGLANG_DECODE_BOOTSTRAP_TIMEOUT, SGLANG_HACK_SEQ_BOOTSTRAP_ROOM) in
decode_environment and no-ops in prefill_environment; ensure all values remain
quoted strings and that unique keys like SGLANG_MOONCAKE_CUSTOM_MEM_POOL and
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER are preserved after the merge.
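A sketch of the anchor/merge pattern; the variable values below are hypothetical stand-ins (the real values come from the recipe):

```yaml
common_env: &common_env
  SGLANG_DG_CACHE_DIR: "/tmp/dg_cache"            # illustrative value
  SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "true"         # illustrative value
  SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "false"   # illustrative value

prefill_environment:
  <<: *common_env

decode_environment:
  <<: *common_env
  # Stage-specific keys stay local to decode:
  SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "600"          # illustrative value
  SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1"             # illustrative value
```

Note that `<<:` merge keys are a YAML 1.1 feature: PyYAML resolves them, but strict YAML 1.2 parsers may not, so verify the recipe loader supports them before adopting this pattern.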
In `@recipes/qwen3.5/nvfp4/agg/stp_prefix_off/dep4.yaml`:
- Around line 50-52: The recipe currently sets only mamba-ssm-dtype ("bfloat16")
which can leave Qwen3.5 runs relying on implicit defaults; update the YAML to
explicitly add mamba-scheduler-strategy and mamba-track-interval alongside
mamba-ssm-dtype so the mamba scheduler configuration is fully specified for
Qwen3.5 (add suitable values for mamba-scheduler-strategy and
mamba-track-interval consistent with SGLang mamba-* conventions used across this
model family).
In `@recipes/qwen3.5/nvfp4/agg/stp_prefix_off/tep4.yaml`:
- Around line 47-48: The YAML currently sets mamba-ssm-dtype and
moe-runner-backend but omits the scheduler and tracking settings used in other
Qwen3.5 recipes; add mamba-scheduler-strategy: "no_buffer" and
mamba-track-interval: 2048 alongside mamba-ssm-dtype (or, if aggregated mode
intentionally differs, add a brief comment next to moe-runner-backend or at the
top of the recipe explaining why mamba-scheduler-strategy and
mamba-track-interval are omitted for aggregated mode) so the recipe matches the
others or documents the deviation.
In `@recipes/qwen3.5/nvfp4/agg/stp_prefix_off/tp4.yaml`:
- Around line 1-5: The header comment claiming "symmetric memory enabled" is inconsistent with the actual config, where the enable-symm-mem flag remains commented out. Either uncomment enable-symm-mem and set it to true in the TP4 config, or change the header comment ("Pure tensor parallel, no expert parallel, symmetric memory enabled") to state that symmetric memory is disabled, so the header and the flag agree.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6f5d5264-f7f5-4c52-a5c7-c492fb77fbec
📒 Files selected for processing (85)
recipes/qwen3.5/fp8/agg/mtp_radix_off/tp4-mtp-acc.yaml
recipes/qwen3.5/fp8/agg/mtp_radix_on/tp4-mtp-acc.yaml
recipes/qwen3.5/fp8/agg/profile/agg-dep4-trtllm-symmem-profile.yaml
recipes/qwen3.5/fp8/agg/profile/agg-tep4-trtllm-symmem-profile.yaml
recipes/qwen3.5/fp8/agg/profile/agg-tp4-trtllm-symmem-profile.yaml
recipes/qwen3.5/fp8/agg/stp_prefix_off/dep4-acc.yaml
recipes/qwen3.5/fp8/agg/stp_prefix_off/dep4.yaml
recipes/qwen3.5/fp8/agg/stp_prefix_off/tep4.yaml
recipes/qwen3.5/fp8/agg/stp_prefix_off/tp4-acc.yaml
recipes/qwen3.5/fp8/agg/stp_prefix_off/tp4.yaml
recipes/qwen3.5/fp8/agg/stp_radix_on/dep4-acc.yaml
recipes/qwen3.5/fp8/agg/stp_radix_on/tp4-acc.yaml
recipes/qwen3.5/fp8/disagg/mooncake/mtp_radix_off/1p1d-mtp-acc.yaml
recipes/qwen3.5/fp8/disagg/mooncake/mtp_radix_off/1p1d-tp4-mtp.yaml
recipes/qwen3.5/fp8/disagg/mooncake/mtp_radix_on/1p1d-mtp-acc-prefixcache-retraction.yaml
recipes/qwen3.5/fp8/disagg/mooncake/mtp_radix_on/1p1d-mtp-acc-prefixcache.yaml
recipes/qwen3.5/fp8/disagg/mooncake/profile/1p1d-dep4-profile.yaml
recipes/qwen3.5/fp8/disagg/mooncake/profile/1p1d-dep4-staging-nsys-1k1k-v3.yaml
recipes/qwen3.5/fp8/disagg/mooncake/profile/1p1d-dep4-staging-torch-1k1k.yaml
recipes/qwen3.5/fp8/disagg/mooncake/profile/1p1d-dep4dep4-profile.yaml
recipes/qwen3.5/fp8/disagg/mooncake/profile/1p1d-tep4-profile.yaml
recipes/qwen3.5/fp8/disagg/mooncake/profile/1p1d-tp4-mtp-profile.yaml
recipes/qwen3.5/fp8/disagg/mooncake/profile/1p1d-tp4-profile.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-deepep-deepgemm.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-gpqa.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-gsm8k-bench.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-longbenchv2.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-staging-gpqa.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-staging-gsm8k-bench.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep2-staging-longbenchv2.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-acc.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-longbenchv2.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-mooncake-gpqa.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-staging-acc.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-staging-gsm8k-bench.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-staging-gsm8k.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-staging-longbenchv2.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4-staging.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4dep4-8k1k.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4dep4-staging-8k1k.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4dep4-staging.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4dep4.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4tp4-gpqa.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4tp4-gsm8k-bench.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4tp4-longbenchv2.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4tp4-staging-gpqa.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4tp4-staging-gsm8k-bench.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-dep4tp4-staging-longbenchv2.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-tep4.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-tp4-acc.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-tp4-baseline.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/1p1d-tp4.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/2p1d-hetero-nostaging-longbenchv2.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/2p1d-hetero-staging-dep4-gpqa.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/2p1d-hetero-staging-dep4-longbenchv2.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/2p1d-hetero-staging-gpqa.yaml
recipes/qwen3.5/fp8/disagg/mooncake/stp_prefix_off/2p1d-hetero-staging-longbenchv2.yaml
recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-nixl-acc.yaml
recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-nixl-gpqa-20req.yaml
recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-nixl-gpqa.yaml
recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-nixl-gsm8k.yaml
recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-staging-nixl-gpqa-20req.yaml
recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-staging-nixl-gpqa.yaml
recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4-staging-nixl-longbenchv2.yaml
recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-dep4dep4-nixl.yaml
recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-tep4-nixl.yaml
recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-tp4-nixl-acc.yaml
recipes/qwen3.5/fp8/disagg/nixl/stp_prefix_off/1p1d-tp4-nixl.yaml
recipes/qwen3.5/nvfp4/agg/mtp_radix_off/.gitkeep
recipes/qwen3.5/nvfp4/agg/mtp_radix_on/validate-nvfp4-mtp-ladfi-c128.yaml
recipes/qwen3.5/nvfp4/agg/mtp_radix_on/validate-nvfp4-mtp-ladfi-c512.yaml
recipes/qwen3.5/nvfp4/agg/profile/profile-torch-conc128.yaml
recipes/qwen3.5/nvfp4/agg/stp_prefix_off/dep4-acc.yaml
recipes/qwen3.5/nvfp4/agg/stp_prefix_off/dep4.yaml
recipes/qwen3.5/nvfp4/agg/stp_prefix_off/tep4.yaml
recipes/qwen3.5/nvfp4/agg/stp_prefix_off/tp4-acc.yaml
recipes/qwen3.5/nvfp4/agg/stp_prefix_off/tp4.yaml
recipes/qwen3.5/nvfp4/agg/stp_radix_on/validate-nvfp4-extrabuf-ladfi-c128.yaml
recipes/qwen3.5/nvfp4/agg/stp_radix_on/validate-nvfp4-extrabuf-ladfi-c512.yaml
recipes/qwen3.5/nvfp4/disagg/mtp_radix_off/.gitkeep
recipes/qwen3.5/nvfp4/disagg/mtp_radix_on/.gitkeep
recipes/qwen3.5/nvfp4/disagg/profile/.gitkeep
recipes/qwen3.5/nvfp4/disagg/stp_prefix_off/.gitkeep
recipes/qwen3.5/nvfp4/disagg/stp_radix_on/.gitkeep
Per SGLang official docs and test code (PR #19391), NEXTN speculative decoding requires speculative-eagle-topk: 1 as NEXTN is internally converted to EAGLE. Without it, server_args crashes with TypeError when trtllm_mha backend is used.
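Based on that constraint, a NEXTN speculative-decoding block would pin the topk explicitly. A sketch only: the step and draft-token counts below are illustrative, not values taken from these recipes:

```yaml
speculative-algorithm: "NEXTN"
speculative-num-steps: 3           # illustrative value
speculative-eagle-topk: 1          # required: NEXTN is converted to EAGLE internally
speculative-num-draft-tokens: 4    # illustrative value
```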
…refill-size
- Add explicit quantization: fp8 to 3 disagg MTP acc files that were missing it
- Add moe-runner-backend: flashinfer_trtllm to the same 3 files
- Change chunked-prefill-size from -1 to 16384 in 1p1d-tp4-profile and 1p1d-tp4
- Remove unsupported `frontend.install` field from NVFP4 recipes
- Remove unsupported `benchmark.use_chat_api` field from gsm8k-bench recipes
- Fix decode served-model-name mismatch (DeepSeek → Qwen3.5) in MTP disagg recipes
- Align profile comment with actual concurrency config (conc128 → conc64)