[NVIDIA] chore: B300 single node DeepSeek v4 SGLang LOW LATENCY ONLY #1143
Merged
Commits (21)
26e540d feat: add DeepSeek-V4-Flash FP4 B300 SGLang benchmark (cquil11)
efdc8ba fix: switch dsv4-fp4-b300-sglang to Pro + Max-Throughput recipe (cquil11)
cc35a12 chore: sync launch_b200-dgxc-slurm.sh cache mount from claude/add-dsv… (cquil11)
404a097 fix: restore trailing whitespace stripped from glm5.1 changelog entry (cquil11)
97a488e chore: add flock-guarded squash import to B300 runner (cquil11)
106deea fix: drop ENROOT_CACHE_PATH override from B300 runner (cquil11)
4bb1f1a chore: point B300 runner at shared gharunners/{squash,hf-hub-cache} (cquil11)
744c5a0 fix: move enroot import out of srun to avoid pyxis namespace collision (cquil11)
d003c59 fix: wipe stale pyxis scratch dirs for this JOB_ID before benchmark srun (cquil11)
f00629f Revert: drop all B300 runner changes, mirror #1128's approach (cquil11)
570b0eb runner: add head-node flock-guarded squash import on B300 (cquil11)
864419d fix: mount at /ix and clear baked-in CUDA_VISIBLE_DEVICES (cquil11)
5d93913 Merge branch 'main' into chore/dsv4-sgl-b300 (cquil11)
9453676 runner: use /data/models pre-staged path for dsv4 on B300 (cquil11)
5db43b8 fix: switch B300 dsv4 sglang to bw-ultra-compiled image (cquil11)
c060c58 fix: switch B300 dsv4 sglang image to yhyang201/sglang-b300:v3 (cquil11)
08edf26 update b300 (cquil11)
a699ca0 feat(dsv4-fp4-b300-sglang): pick recipe by CONC; split search-space (cquil11)
d35696c update b300 (cquil11)
c3b562c feat(dsv4-fp4-b300-sglang): low-latency recipe at every CONC (fallback) (cquil11)
410df74 fix: align perf-changelog and config comments with low-latency fallback (github-actions[bot])
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -0,0 +1,103 @@ |
| | | #!/usr/bin/env bash |
| | | |
| | | source "$(dirname "$0")/../benchmark_lib.sh" |
| | | |
| | | check_env_vars \ |
| | | MODEL \ |
| | | TP \ |
| | | CONC \ |
| | | ISL \ |
| | | OSL \ |
| | | RANDOM_RANGE_RATIO \ |
| | | RESULT_FILENAME |
| | | |
| | | if [[ -n "$SLURM_JOB_ID" ]]; then |
| | | echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME" |
| | | fi |
| | | |
| | | # The B300 runner overrides MODEL to a pre-staged /data/models path, so skip |
| | | # `hf download`. Only fetch when MODEL looks like a HF repo ID. |
| | | if [[ "$MODEL" != /* ]]; then |
| | | hf download "$MODEL" |
| | | fi |
| | | |
| | | nvidia-smi |
| | | |
| | | export SGLANG_JIT_DEEPGEMM_PRECOMPILE=0 |
| | | |
| | | # The deepseek-v4 sglang images (lmsysorg/sglang:deepseek-v4-blackwell and its |
| | | # B300 forks) bake CUDA_VISIBLE_DEVICES=4,5,6,7 into their ENV, which masks half |
| | | # of the 8 GPUs Slurm allocates us. Clear it so TP=8 can bind to all ranks. |
| | | unset CUDA_VISIBLE_DEVICES |
| | | |
| | | # TODO(Cam): the deepseek-v4 sglang images install sglang editable at |
| | | # /workspace/sglang/python; prior sglang tags used /sgl-workspace/sglang. |
| | | # The runner mounts our repo at a non-/workspace path for these images so the |
| | | # editable install stays visible. Paths in this script are $PWD-relative for |
| | | # that reason. Drop the runner conditional once lmsys moves sglang back out of |
| | | # /workspace. |
| | | |
| | | SERVER_LOG="$PWD/server.log" |
| | | PORT=${PORT:-8888} |
| | | |
| | | echo "TP: $TP, CONC: $CONC, ISL: $ISL, OSL: $OSL" |
| | | |
| | | EVAL_CONTEXT_ARGS="" |
| | | if [ "${EVAL_ONLY}" = "true" ]; then |
| | | setup_eval_context |
| | | EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN" |
| | | fi |
| | | |
| | | start_gpu_monitor --output "$PWD/gpu_metrics.csv" |
| | | |
| | | # TODO(Cam): hardcoded to the low-latency recipe at every CONC until the |
| | | # DeepEP FP8 weight-postprocess path is fixed for this checkpoint on B300 |
| | | # (RuntimeError: Recipe must be a list/tuple of 3 integers. raised from |
| | | # sglang.srt.layers.quantization.fp8.process_weights_after_loading_block_quant). |
| | | # Restore the CONC-based low-latency / balanced / max-throughput dispatch |
| | | # on chore/dsv4-sgl-b300 once sglang can load the checkpoint under |
| | | # --moe-a2a-backend deepep. |
| | | RECIPE=low-latency |
| | | RECIPE_FLAGS=( |
| | | --moe-runner-backend flashinfer_mxfp4 |
| | | --chunked-prefill-size 4096 |
| | | --disable-flashinfer-autotune |
| | | --mem-fraction-static 0.82 |
| | | ) |
| | | echo "Recipe: $RECIPE (CONC=$CONC)" |
| | | |
| | | set -x |
| | | PYTHONNOUSERSITE=1 sglang serve \ |
| | | --model-path $MODEL \ |
| | | --host 0.0.0.0 \ |
| | | --port $PORT \ |
| | | --trust-remote-code \ |
| | | --tp $TP \ |
| | | --disable-radix-cache \ |
| | | "${RECIPE_FLAGS[@]}" $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 & |
| | | |
| | | SERVER_PID=$! |
| | | |
| | | wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID" |
| | | |
| | | pip install -q datasets pandas |
| | | |
| | | run_benchmark_serving \ |
| | | --model "$MODEL" \ |
| | | --port "$PORT" \ |
| | | --backend vllm \ |
| | | --input-len "$ISL" \ |
| | | --output-len "$OSL" \ |
| | | --random-range-ratio "$RANDOM_RANGE_RATIO" \ |
| | | --num-prompts $((CONC * 10)) \ |
| | | --max-concurrency "$CONC" \ |
| | | --result-filename "$RESULT_FILENAME" \ |
| | | --result-dir "$PWD/" |
| | | |
| | | if [ "${RUN_EVAL}" = "true" ]; then |
| | | run_eval --framework lm-eval --port "$PORT" |
| | | append_lm_eval_summary |
| | | fi |
| | | |
| | | stop_gpu_monitor |
| | | set +x |
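The TODO in the script above says the CONC-based low-latency / balanced / max-throughput dispatch should come back once the checkpoint loads under `--moe-a2a-backend deepep`. As a rough sketch of what such a dispatch could look like: the function name `pick_recipe` and the `BALANCED_MAX_CONC` cut-over are hypothetical, and the only flag assumed for the non-low-latency recipes is the `--moe-a2a-backend deepep` flag the TODO names. The real recipes on the parent branch may differ.

```shell
#!/usr/bin/env bash
# Sketch only. Thresholds are assumptions, not values taken from the PR.
LOW_LATENCY_MAX_CONC=32   # mirrors conc-end of the low-latency search space
BALANCED_MAX_CONC=256     # hypothetical cut-over to max-throughput

# Print the recipe name for a given concurrency level.
pick_recipe() {
  local conc="$1"
  if (( conc <= LOW_LATENCY_MAX_CONC )); then
    echo "low-latency"
  elif (( conc <= BALANCED_MAX_CONC )); then
    echo "balanced"
  else
    echo "max-throughput"
  fi
}

RECIPE="$(pick_recipe "${CONC:-4}")"
case "$RECIPE" in
  low-latency)
    # Same flags the script hardcodes today.
    RECIPE_FLAGS=(
      --moe-runner-backend flashinfer_mxfp4
      --chunked-prefill-size 4096
      --disable-flashinfer-autotune
      --mem-fraction-static 0.82
    )
    ;;
  balanced|max-throughput)
    # Placeholder: both depend on the DeepEP all-to-all backend that
    # currently fails to load this checkpoint.
    RECIPE_FLAGS=(--moe-a2a-backend deepep)
    ;;
esac
echo "Recipe: $RECIPE (CONC=${CONC:-4})"
```

Keeping the flags in an array means the existing `"${RECIPE_FLAGS[@]}"` expansion in the `sglang serve` call would not need to change when the dispatch is restored.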
🔴 The perf-changelog entry added at lines 1749-1758 (and the outer NOTE comment at nvidia-master.yaml:1799-1803) describes the max-throughput config from the sibling PR #1132, not the low-latency-only fallback this PR actually adds. Every field is wrong: the image is `deepseek-v4-blackwell` vs the actual `deepseek-v4-b300`; it claims `DP=8 + DeepEP` and `TP=8/EP=8/dp-attn=true` with concurrency 4-1024/4-512 vs the actual `tp:8, ep:1` with conc 4-32 and no DP-attn/DeepEP; and it references a nonexistent config `dsv4-fp4-b200-vllm` with `pr-link` pointing at #1132. Both blocks look like they were inherited from the parent branch and should be rewritten to describe the low-latency-only fallback (or the changelog entry deferred until #1132 lands); the in-block comment at nvidia-master.yaml:1812-1817 already has the correct description and directly contradicts the stale outer NOTE.
What the bug is
This PR adds a new `dsv4-fp4-b300-sglang` config to `.github/configs/nvidia-master.yaml` and a matching entry to `perf-changelog.yaml`. The PR title and description make clear that it is a low-latency-only fallback: it strips the balanced and max-throughput rows because `--moe-a2a-backend deepep` is broken on this image/checkpoint. But the new changelog entry (lines 1749-1758) and the outer NOTE comment in the yaml (lines 1799-1803) both describe the opposite: the balanced/max-throughput recipe that the sibling PR #1132 will add once DeepEP is fixed.
Field-by-field comparison
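Changelog entry at `perf-changelog.yaml:1749-1758` vs the actual yaml that this PR adds:
| Field | Changelog entry | Actual config |
|---|---|---|
| Image | `lmsysorg/sglang:deepseek-v4-blackwell` | `lmsysorg/sglang:deepseek-v4-b300` (line 1805) |
| Parallelism | `TP=8/EP=8/dp-attn=true` | `{ tp: 8, ep: 1 }`, no dp-attn (lines 1821-1826) |
| Concurrency | 4-1024/4-512 | `conc-start: 4, conc-end: 32` for both |
| Mirrored config | `dsv4-fp4-b200-vllm` | nonexistent ([…] `dsv4-fp8-h200-vllm`) |
| pr-link | `pull/1132` | this PR is #1143 |
Step-by-step proof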
1. In `.github/configs/nvidia-master.yaml`, the new config has exactly one search-space tuple per ISL: `{ tp: 8, ep: 1, conc-start: 4, conc-end: 32 }`. No `dp-attention`, no `--moe-a2a-backend deepep`, no EP>1.
2. […] `dsv4-fp4-b200-vllm`." These two comment blocks describe mutually exclusive recipes for the same config entry.
3. […] `dsv4-fp4-b300-sglang` search-space so only the low-latency (TP-only) recipe runs" and "Re-introduce the balanced and max-throughput rows on #1132 once the FP8+DeepEP weight-postprocess issue is fixed upstream." The outer NOTE and the changelog entry describe the #1132 version, not this PR.
4. Searching `.github/configs/nvidia-master.yaml` for `dsv4-fp4-b200-vllm` returns only the stale comment; the referenced config does not exist.
5. `pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1132` points at the wrong PR: this is #1143, and per the description #1132 is the follow-up.
Why existing code doesn't prevent it
`perf-changelog.yaml` and the yaml comments are free-form documentation: nothing validates that the changelog description matches the config it's documenting. The entry appears to have been inherited from the parent branch (`chore/dsv4-sgl-b300` → #1132), and the `search-space` was rewritten for the fallback without updating the surrounding comments or the changelog.
Impact
After this PR merges, `perf-changelog.yaml`, the authoritative record of what changed in each PR, will state that PR #1132 added a DP=8/DeepEP/EP=8 max-throughput recipe with concurrency up to 1024. In reality, #1132 has not merged, and PR #1143 added a TP=8/EP=1 low-latency recipe capped at concurrency 32. Future readers diffing historical performance numbers against the changelog will be materially misled about the recipe in effect. Inside the yaml, the two contradictory comment blocks make it hard to tell which one reflects the actual intent.
How to fix
Rewrite both blocks to describe what this PR actually does. Minimal fix:
1. Update the `perf-changelog.yaml` entry to: image `lmsysorg/sglang:deepseek-v4-b300`, "Low-latency only (TP=8, EP=1, no DP-attn, no DeepEP) - fallback while the FP8 + DeepEP weight-postprocess path is broken on this checkpoint", concurrency 4-32 for both ISLs, drop the `dsv4-fp4-b200-vllm` reference, and set `pr-link` to `pull/1143`.
2. Delete the outer NOTE at `nvidia-master.yaml:1799-1803` (the in-block comment at 1812-1817 already explains the situation correctly) or rewrite it to match.
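Since the review notes that nothing validates the changelog against the config file, a small check along these lines could catch a reference to a config that does not exist. This is only a sketch: the helper names `config_defined` and `report_missing_configs` are hypothetical, and it assumes each config in `nvidia-master.yaml` is a top-level key (`name:` at column 0), which should be confirmed against the real file.

```shell
#!/usr/bin/env bash
# Sketch only: assumes configs are top-level YAML keys ("name:" at column 0).

# Succeed if NAME is defined as a top-level key in FILE.
config_defined() {
  local name="$1" file="$2"
  # index() == 1 is a literal prefix match, so "." in a name is not a wildcard
  # and mentions inside comments or indented lines do not count.
  awk -v key="${name}:" 'index($0, key) == 1 { found = 1 } END { exit !found }' "$file"
}

# Print every name that FILE does not define; return non-zero if any is missing.
report_missing_configs() {
  local file="$1" name status=0
  shift
  for name in "$@"; do
    if ! config_defined "$name" "$file"; then
      echo "missing config: $name"
      status=1
    fi
  done
  return "$status"
}
```

For example, `report_missing_configs .github/configs/nvidia-master.yaml dsv4-fp4-b300-sglang dsv4-fp4-b200-vllm` would be expected to flag the second name if it appears only in a comment, which is the situation the review describes.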
@claude please fix this
Claude finished @cquil11's task in 1m 28s
Fixing stale perf-changelog and config comments
- Read `perf-changelog.yaml` and `nvidia-master.yaml` to identify exact content
- Rewrite `perf-changelog.yaml` entry to describe the low-latency fallback (correct image, parallelism, concurrency, pr-link)
- […] `nvidia-master.yaml`
Changes (410df74)
In `perf-changelog.yaml`, rewrote the `dsv4-fp4-b300-sglang` entry:
- Image: `deepseek-v4-blackwell` → `deepseek-v4-b300`
- Removed the `dsv4-fp4-b200-vllm` mirror reference
- pr-link: `#1132` → `#1143`
In `.github/configs/nvidia-master.yaml`, replaced the outer NOTE (lines 1799-1803) that described the B200 max-throughput recipe with a note matching the actual low-latency fallback. The in-block TODO comment at line 1812 was already correct and is unchanged.