Skip to content
12 changes: 5 additions & 7 deletions .github/configs/nvidia-master.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -2477,17 +2477,15 @@
- isl: 1024
osl: 1024
search-space:
- { tp: 8, conc-start: 4, conc-end: 4 }
- { tp: 4, conc-start: 4, conc-end: 128 }
- { tp: 8, conc-start: 128, conc-end: 128 }
- { tp: 4, dp-attn: true, conc-start: 256, conc-end: 512 }
- { tp: 4, dp-attn: true, conc-start: 256, conc-end: 4096 }
- { tp: 8, dp-attn: true, conc-start: 2048, conc-end: 8192 }
- isl: 8192
osl: 1024
search-space:
- { tp: 8, conc-start: 4, conc-end: 4 }
- { tp: 4, conc-start: 4, conc-end: 128 }
- { tp: 8, conc-start: 128, conc-end: 128 }
- { tp: 4, dp-attn: true, conc-start: 256, conc-end: 512 }
- { tp: 4, conc-start: 4, conc-end: 64 }
- { tp: 4, dp-attn: true, conc-start: 128, conc-end: 1024 }
- { tp: 8, dp-attn: true, conc-start: 1024, conc-end: 8192 }

Check warning on line 2488 in .github/configs/nvidia-master.yaml

View check run for this annotation

Claude / Claude Code Review

Missing explicit ep: 8 on new TP=8/dp-attn:true rows

The two new TP=8/dp-attn:true rows added to `dsv4-fp4-b300-vllm` (lines 2482 and 2488) omit the `ep` field, so `generate_sweep_configs.py` defaults the metadata to `ep=1`. But `dsv4_fp4_b300_vllm.sh` unconditionally passes `--enable-expert-parallel` and sets `--data-parallel-size 8` for these rows, so the actual run is EP=8 — the result-filename template (`tp${TP}-ep${EP_SIZE}-dpa${DP_ATTENTION}`) and downstream group-by tooling will tag these B300 rows as `ep=1` while the underlying run is EP=8

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The two new TP=8/dp-attn:true rows added to dsv4-fp4-b300-vllm (lines 2482 and 2488) omit the ep field, so generate_sweep_configs.py defaults the metadata to ep=1. But dsv4_fp4_b300_vllm.sh unconditionally passes --enable-expert-parallel and sets --data-parallel-size 8 for these rows, so the actual run is EP=8 — the result-filename template (tp${TP}-ep${EP_SIZE}-dpa${DP_ATTENTION}) and downstream group-by tooling will tag these B300 rows as ep=1 while the underlying run is EP=8. Sister config dsv4-fp8-h200-vllm at line 2458/2462 explicitly tags the analogous TP=8/dp-attn:true row as { tp: 8, ep: 8, dp-attn: true, ... }. Suggest adding explicit ep: 8 to both new TP=8 entries to match the convention. (Note: the existing TP=4 dp-attn:true rows on this same config also omit ep, but that pattern was inherited from PR #1144 — this PR extends the issue to TP=8.)

Extended reasoning...

What the bug is

The two new search-space rows added to dsv4-fp4-b300-vllm omit the ep field:

# .github/configs/nvidia-master.yaml line 2482
- { tp: 8, dp-attn: true, conc-start: 2048, conc-end: 8192 }
# line 2488
- { tp: 8, dp-attn: true, conc-start: 1024, conc-end: 8192 }

In utils/matrix_logic/generate_sweep_configs.py:354, the matrix-entry default for ep is 1 (set unconditionally), and lines 362-363 only override ep if the YAML key was present (bmk.get(Fields.EP.value) returns None when omitted). So these matrix entries are tagged with ep=1 in the generated metadata.

Why the actual runtime is EP=8

benchmarks/single_node/dsv4_fp4_b300_vllm.sh does not consult the metadata ep value at all. At line 38 the parallel block sets --data-parallel-size "$TP" (i.e. --data-parallel-size 8 when TP=8 and DP_ATTENTION=true), and at line 78 it unconditionally passes --enable-expert-parallel. Under vLLM, --enable-expert-parallel with --data-parallel-size 8 runs with effective expert-parallel world size 8 (each rank holds 1/8 of the experts). So the runtime is EP=8 while the metadata says EP=1.

Why this is a metadata mismatch worth flagging

The sister config dsv4-fp8-h200-vllm at lines 2458 and 2462 explicitly tags the analogous TP=8/dp-attn:true rows as { tp: 8, ep: 8, dp-attn: true, ... } — confirming that ep: 8 is the established convention for this scenario across the dsv4 family. The metadata flows into RESULT_FILENAME (via .github/workflows/benchmark-tmpl.yml:146, template tp${TP}-ep${EP_SIZE}-dpa${DP_ATTENTION}), so the new B300 rows will be saved as tp8-ep1-dpaTrue while the actual run is EP=8. Downstream group-by tooling (compare_results.py, summarize.py, collect_eval_results.py) keys on ep, so cross-config analysis across the dsv4 family will misclassify these B300 rows.

Step-by-step proof

  1. Harness picks the new YAML row { tp: 8, dp-attn: true, conc-start: 2048, conc-end: 8192 } (line 2482).
  2. generate_sweep_configs.py:354 assigns ep=1 (default). Line 362 checks bmk.get('ep') which is None, so the default is kept. Metadata: tp=8, ep=1, dp-attn=true.
  3. Workflow exports EP_SIZE=1, TP=8, DP_ATTENTION=true. Result-filename template at benchmark-tmpl.yml:146 resolves to tp8-ep1-dpaTrue-....
  4. The script dsv4_fp4_b300_vllm.sh runs with --tensor-parallel-size 1 --data-parallel-size 8 --enable-expert-parallel. vLLM's effective EP world size is 8.
  5. The result file is tagged ep1 while the run is EP=8. The h200 sister config tags the same scenario ep=8.

Impact and fix

Metadata-only — runtime behavior is correct because the script hardcodes parallelism. Severity is nit. Fix: add ep: 8 to both new TP=8 entries (lines 2482 and 2488) to match dsv4-fp8-h200-vllm. Ideally ep: 4 would also be added to the kept TP=4/dp-attn:true rows for full consistency, but that pattern was inherited from PR #1144 and is outside the scope of this PR.


qwen3.5-fp8-h200-sglang:
image: lmsysorg/sglang:v0.5.9-cu129-amd64
Comment thread
wzhao18 marked this conversation as resolved.
Comment thread
wzhao18 marked this conversation as resolved.
Outdated
Expand Down
9 changes: 8 additions & 1 deletion benchmarks/single_node/dsv4_fp4_b300_vllm.sh
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,13 @@ if [ "${DP_ATTENTION}" = "true" ]; then
PARALLEL_ARGS=(--tensor-parallel-size 1 --data-parallel-size "$TP")
fi

# DP mode: mbt=ISL; TP mode: mbt=2*ISL; floor at 2048
if [ "${DP_ATTENTION}" = "true" ]; then
MAX_NUM_BATCHED_TOKENS=$(( ISL < 2048 ? 2048 : ISL ))
else
MAX_NUM_BATCHED_TOKENS=$(( ISL * 2 < 2048 ? 2048 : ISL * 2 ))
fi

BENCHMARK_MAX_MODEL_LEN="$MAX_MODEL_LEN"
if [ "$ISL" -eq 1024 ] && [ "$OSL" -eq 1024 ]; then
BENCHMARK_MAX_MODEL_LEN=4096
Expand Down Expand Up @@ -71,7 +78,7 @@ vllm serve "$MODEL" --host 0.0.0.0 --port "$PORT" \
--reasoning-parser deepseek_v4 \
--max-cudagraph-capture-size 2048 \
--max-model-len "$SERVE_MAX_MODEL_LEN" \
--max-num-batched-tokens 2048 > "$SERVER_LOG" 2>&1 &
--max-num-batched-tokens "$MAX_NUM_BATCHED_TOKENS" > "$SERVER_LOG" 2>&1 &

SERVER_PID=$!

Expand Down
9 changes: 9 additions & 0 deletions perf-changelog.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -1819,3 +1819,12 @@
- "Restore the recipe-per-CONC split (low-latency / balanced / max-throughput) on top of the low-latency-only fallback from #1143; the DeepEP FP8 weight-postprocess path is fixed, so the high-throughput scenario runs again"
- "Recipes from https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1132

- config-keys:
- dsv4-fp4-b300-vllm
description:
- "Update search space based on B300 pareto sweep results"
- "ISL=1024: TP4 conc 4-128; DP4 (dp-attn) conc 256-4096; DP8 (dp-attn) conc 2048-8192"
- "ISL=8192: TP4 conc 4-64; DP4 (dp-attn) conc 128-1024; DP8 (dp-attn) conc 1024-8192"

Check warning on line 1828 in perf-changelog.yaml

View check run for this annotation

Claude / Claude Code Review

perf-changelog entry omits the new MAX_NUM_BATCHED_TOKENS formula

The new perf-changelog entry at perf-changelog.yaml:1822-1830 documents only the search-space reshape, but the diff also rewrites `--max-num-batched-tokens` in benchmarks/single_node/dsv4_fp4_b300_vllm.sh from a constant 2048 to an ISL/DP-conditional formula (DP: max(ISL, 2048); TP: max(2*ISL, 2048)). PR #1144's prior entry explicitly recorded "max-num-batched-tokens 2048" as part of the launch-args summary, so omitting the rewrite leaves that prior statement stale and the new formula invisible
Comment on lines +1865 to +1903

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The new perf-changelog entry at perf-changelog.yaml:1822-1830 documents only the search-space reshape, but the diff also rewrites --max-num-batched-tokens in benchmarks/single_node/dsv4_fp4_b300_vllm.sh from a constant 2048 to an ISL/DP-conditional formula (DP: max(ISL, 2048); TP: max(2ISL, 2048)). PR #1144's prior entry explicitly recorded "max-num-batched-tokens 2048" as part of the launch-args summary, so omitting the rewrite leaves that prior statement stale and the new formula invisible from the changelog audit trail. Suggest appending a description line such as: "Set --max-num-batched-tokens = max(ISL, 2048) for DP and max(2ISL, 2048) for TP, replacing the previous constant 2048".

Extended reasoning...

What the bug is

This PR makes two substantive changes to dsv4-fp4-b300-vllm, but the new perf-changelog.yaml entry (lines 1822-1830) only describes one of them:

  1. Documented: search-space reshape (TP4/DP4/DP8 concurrency ranges for ISL=1024 and ISL=8192).
  2. Undocumented: in benchmarks/single_node/dsv4_fp4_b300_vllm.sh:41-46, 81, --max-num-batched-tokens is changed from a constant 2048 to a function of ISL and DP_ATTENTION:
    • DP mode: max(ISL, 2048)
    • TP mode: max(2*ISL, 2048)

Why this matters for the audit trail

The predecessor entry (PR #1144, the introducing PR for this config) explicitly records "max-num-batched-tokens 2048" as part of the launch-arg summary. After this PR that statement is stale, but no follow-up entry replaces it. AGENTS.md describes perf-changelog.yaml as the chronological record of changes that affect benchmarks; the mbt rewrite materially affects request-batching behavior, so it belongs in the description.

Step-by-step proof of impact

For an ISL=8192 run in TP mode under the new formula:

  1. DP_ATTENTION != "true", so the TP branch is taken.
  2. MAX_NUM_BATCHED_TOKENS = 2 * 8192 = 16384 (the floor at 2048 is not binding).
  3. vllm serve --max-num-batched-tokens 16384 — 8x the previous constant 2048.

For ISL=8192 in DP mode ({ tp: 4|8, dp-attn: true }):

  1. DP branch is taken.
  2. MAX_NUM_BATCHED_TOKENS = 8192 — 4x the previous 2048.

These are not cosmetic deltas: they reshape the prefill-batching capacity and will alter throughput / TTFT characteristics relative to anything benchmarked under PR #1144.

Addressing the refutations

One refuter argued this is subjective wording and that perf-changelog descriptions vary in detail (some are one-liners). That is true in general, but here the new entry is specifically a successor to PR #1144's entry for the same config-keys: [dsv4-fp4-b300-vllm], which deliberately enumerated the launch args including max-num-batched-tokens 2048. Following that precedent, the value's replacement deserves a line. A second refuter called this a duplicate of bug_003 — the synthesis agent has merged them; this single comment now stands for both.

Severity and fix

This is a documentation-completeness issue with no runtime impact, so it is filed at nit severity (matching all three independent confirmations). Fix: append one description line to the new entry, e.g.

Set --max-num-batched-tokens = max(ISL, 2048) for DP and max(2*ISL, 2048) for TP, replacing the previous constant 2048.

pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1155

Loading