Merged
24 changes: 24 additions & 0 deletions .github/configs/nvidia-master.yaml
@@ -2628,6 +2628,30 @@ dsv4-fp8-h200-sglang:
- { tp: 8, ep: 1, conc-start: 1, conc-end: 1 }
- { tp: 8, ep: 1, conc-start: 4, conc-end: 64 }

# MTP variant of dsv4-fp8-h200-sglang. Mirrors the non-MTP recipe (same image,
# runner pool, and search space) and adds EAGLE speculative decoding via
# --speculative-algorithm EAGLE with the (steps=3, topk=1, draft-tokens=4)
# chain matching dsv4-fp4-b300-sglang-mtp.
dsv4-fp8-h200-sglang-mtp:
image: lmsysorg/sglang:deepseek-v4-hopper@sha256:7f19c6dc092e47a10fac2e41f47eab78970280d06648b8e50d312a82f0ae722f
model: deepseek-ai/DeepSeek-V4-Pro
model-prefix: dsv4
runner: h200-dgxc
precision: fp8
framework: sglang
multinode: false
scenarios:
fixed-seq-len:
- isl: 1024
osl: 1024
search-space:
- { tp: 8, ep: 1, conc-start: 1, conc-end: 1, spec-decoding: mtp }
- { tp: 8, ep: 1, conc-start: 4, conc-end: 64, spec-decoding: mtp }
- isl: 8192
osl: 1024
search-space:
- { tp: 8, ep: 1, conc-start: 1, conc-end: 1, spec-decoding: mtp }
- { tp: 8, ep: 1, conc-start: 4, conc-end: 64, spec-decoding: mtp }
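The `spec-decoding: mtp` tag on the search-space entries above presumably tells the harness to append the EAGLE flags when launching the server. A minimal sketch of that mapping, assuming a hypothetical `spec_decoding_args` helper (the flag values come from this PR's benchmark script; the function itself is illustrative, not part of the repo):

```shell
# Hypothetical mapping from a search-space spec-decoding value to extra
# server flags. The EAGLE chain below matches the script added in this PR;
# spec_decoding_args itself is a sketch, not a real harness function.
spec_decoding_args() {
  case "$1" in
    mtp)
      echo "--speculative-algorithm EAGLE" \
           "--speculative-num-steps 3" \
           "--speculative-eagle-topk 1" \
           "--speculative-num-draft-tokens 4"
      ;;
    *)
      echo ""
      ;;
  esac
}
```

Any other (or missing) `spec-decoding` value would leave the serve command unchanged, which keeps this entry consistent with the non-MTP recipe it mirrors.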

# DeepSeek-V4-Pro B300 single-node aggregate recipe from the submitted B300
# pareto sweep. The single-node schema has no explicit data-parallel-size
# field, so dp-attn=true is used as the existing vLLM script switch for DP4
85 changes: 85 additions & 0 deletions benchmarks/single_node/dsv4_fp8_h200_sglang_mtp.sh
@@ -0,0 +1,85 @@
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
MODEL \
TP \
CONC \
ISL \
OSL \
RANDOM_RANGE_RATIO \
RESULT_FILENAME

if [[ -n "${SLURM_JOB_ID:-}" ]]; then
echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

hf download "$MODEL"

nvidia-smi

SERVER_LOG="$PWD/server.log"
PORT=${PORT:-8888}

echo "TP: $TP, CONC: $CONC, ISL: $ISL, OSL: $OSL"

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY:-}" = "true" ]; then
setup_eval_context
EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
fi

start_gpu_monitor --output "$PWD/gpu_metrics.csv"

set -x
# --max-running-requests is max(CONC * 3 / 2, 8): 1.5x headroom over the
# sweep concurrency, floored at 8 slots. $EVAL_CONTEXT_ARGS is left unquoted
# on purpose so the optional "--context-length N" splits into two words.
PYTHONNOUSERSITE=1 sglang serve \
  --model-path "$MODEL" \
  --host 0.0.0.0 \
  --port "$PORT" \
  --trust-remote-code \
  --tp "$TP" \
  --moe-runner-backend marlin \
  --chunked-prefill-size 4096 \
  --disable-flashinfer-autotune \
  --disable-radix-cache \
  --mem-fraction-static 0.88 \
  --max-running-requests "$(( CONC * 3 / 2 > 8 ? CONC * 3 / 2 : 8 ))" \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  $EVAL_CONTEXT_ARGS >> "$SERVER_LOG" 2>&1 &

SERVER_PID=$!

wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

# --dsv4 routes prompts through encoding_dsv4.py (PR #1153), which emits the
# <bos><User>...<Assistant><think> framing DeepSeek-V4-Pro expects. The DSv4-Pro
# tokenizer ships without a jinja chat_template, so plain --use-chat-template
# would crash; --dsv4 sidesteps that and satisfies the AGENTS.md rule that all
# MTP scripts must benchmark against chat-formatted inputs (EAGLE acceptance
# silently regresses on raw random tokens).
run_benchmark_serving \
--model "$MODEL" \
--port "$PORT" \
--backend vllm \
--input-len "$ISL" \
--output-len "$OSL" \
--random-range-ratio "$RANDOM_RANGE_RATIO" \
--num-prompts $((CONC * 10)) \
--max-concurrency "$CONC" \
--result-filename "$RESULT_FILENAME" \
--result-dir "$PWD/" \
--dsv4
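As a rough illustration of the framing described in the comment above (the literal token strings are taken from that comment; `format_dsv4_prompt` is a hypothetical stand-in for `encoding_dsv4.py`, not its real API):

```shell
# Toy sketch of the <bos><User>...<Assistant><think> framing that --dsv4
# is said to produce. Illustrative only; the actual logic lives in
# encoding_dsv4.py (PR #1153).
format_dsv4_prompt() {
  printf '<bos><User>%s<Assistant><think>' "$1"
}
```

A raw random-token prompt skips this framing entirely, which is why the AGENTS.md rule requires chat-formatted inputs for MTP runs.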

if [ "${RUN_EVAL:-}" = "true" ]; then
run_eval --framework lm-eval --port "$PORT"
append_lm_eval_summary
fi

stop_gpu_monitor
set +x
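The `--max-running-requests` formula used when launching the server, max of `CONC * 3 / 2` and 8, can be checked in isolation with a quick shell loop over the swept concurrency values:

```shell
# max(CONC * 3 / 2, 8): 1.5x headroom over the sweep concurrency,
# floored at 8 running-request slots (bash integer arithmetic).
for CONC in 1 4 8 64; do
  echo "CONC=$CONC -> $(( CONC * 3 / 2 > 8 ? CONC * 3 / 2 : 8 ))"
done
```

At the sweep endpoints this yields 8 slots at conc=1 (the floor dominates) and 96 slots at conc=64.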
9 changes: 9 additions & 0 deletions perf-changelog.yaml
@@ -2113,3 +2113,12 @@
- "Search space: TP=8 EP=1, conc 1 and 4-64 for both 1k1k and 8k1k"
- "Pinned to the h200-dgxc runner pool (new runners.yaml group); launch_h200-dgxc-slurm.sh extended to support framework-tagged script names and mount /ix instead of /workspace for the deepseek-v4-hopper image"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1264

- config-keys:
- dsv4-fp8-h200-sglang-mtp
description:
- "Add DeepSeek-V4-Pro FP8 H200 single-node SGLang MTP variant (mirrors dsv4-fp8-h200-sglang)"
- "EAGLE speculative decoding chain: --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4"
- "run_benchmark_serving uses --dsv4 (chat-formatted prompts) per the AGENTS.md MTP rule, since EAGLE acceptance regresses on raw random tokens"
- "Search space mirrors the non-MTP H200 SGLang entry: TP=8 EP=1, conc 1 and 4-64 for both 1k1k and 8k1k, with spec-decoding: mtp"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1265
Contributor

🟡 The new dsv4-fp8-h200-sglang-mtp entry (perf-changelog.yaml:2124) has pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX — the "XXX" placeholder was never replaced with this PR's real number (#1265). Despite the PR description claiming the link was backfilled, the committed file still has the placeholder; please update it to /pull/1265 to match the convention used by every other entry in the file.

Extended reasoning...

What the bug is. The new entry added by this PR for the dsv4-fp8-h200-sglang-mtp config in perf-changelog.yaml ends with pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX. "XXX" is clearly a templated placeholder — every one of the ~150 other entries in this same file uses a concrete PR number, and the PR's own description even claims "PR-link backfilled to #1265". The backfill never happened.

How it manifests. Anything that consumes perf-changelog.yaml and follows pr-link will hit https://github.com/SemiAnalysisAI/InferenceX/pull/XXX, which is not a valid PR. GitHub renders this as a 404. Any internal changelog tooling, dashboard, or script that crawls these links to surface release notes will silently produce a broken hyperlink for this one entry.

Step-by-step proof. (1) The PR description states "perf-changelog.yaml updated; PR-link backfilled to #1265." (2) The pre-loaded modified-files content for perf-changelog.yaml literally ends with the line pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX. (3) Independently confirmed by running git show HEAD:perf-changelog.yaml | tail -1 against commit 2f28e59 — it returns pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX. (4) The PR's own number is #1265 (per the metadata at the top of the timeline), and the immediately-prior entry in the same file correctly uses /pull/1264. The intended value is unambiguously 1265.

Addressing the refutation. A verifier objected that get_pr_diff shows + pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1265 and concluded the merged result will be correct. That is contradicted by directly inspecting the committed tree: git show HEAD:perf-changelog.yaml on the merge candidate (2f28e59) shows /pull/XXX, not /pull/1265. Whatever the diff-fetching tool returned does not match what is actually on the branch — the on-disk file and the committed object both carry the placeholder. Since GitHub merges what's in the tree, not a synthesized diff, the placeholder is what will land on main if this PR is merged as-is.

Why existing review didn't catch it. It's a one-line change at the very tail of a 2000+ line YAML file, and the surrounding lines look intentional and well-formed. The PR description even asserts the backfill was done, which discourages a closer look. There's no schema check on pr-link values, so no CI signal.

Impact and severity. No runtime impact — perf-changelog.yaml is documentation, not consumed by the benchmark pipeline. The blast radius is limited to whatever tooling renders this changelog. This is a trivial one-line fix (XXX → 1265), and easy to make before merging.

How to fix. Replace the last line of perf-changelog.yaml with:

```yaml
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1265
```
