DSv4 B300 TRT #1233
Merged
Changes from 39 commits
43 commits
ad9ac48 Add DSv4 TRT B200/300 test (Oseltamivir)
69720d2 Merge remote-tracking branch 'origin/main' into b200-300-test (Oseltamivir)
488ab3d fix: use TensorRT-LLM DeepSeek-V4 branch image (Oseltamivir)
e079fb7 fix: point DeepSeek V4 image to correct org (Oseltamivir)
6a949a6 Use runtime TensorRT-LLM DSv4 bootstrap (Oseltamivir)
0ae2019 Merge branch 'main' into b200-300-test (Oseltamivir)
9488f34 Fix TensorRT-LLM DSv4 runtime wheel build (Oseltamivir)
b0cc665 Use DeepSeek V4 TRTLLM image (Oseltamivir)
8ee56af Use anonymous GHCR pulls by default (Oseltamivir)
2d48f08 Fix DSv4 TRT launch env (Oseltamivir)
6e75819 Bypass mpirun for B300 DSv4 TRT (Oseltamivir)
e1e762d larger sweep + mpi (Oseltamivir)
58f7ac9 Merge branch 'main' into b200-300-test (Oseltamivir)
8220f0d mpi (Oseltamivir)
79383bb Merge branch 'main' into b200-300-test (Oseltamivir)
fb5d85f b200 perf ok (Oseltamivir)
ef7f42c OPAL (Oseltamivir)
5f409ed sweep (Oseltamivir)
2b96532 Update TRTLLM DeepSeek-V4 image (Oseltamivir)
8875f8a Merge main into b200-300-test (Oseltamivir)
ed4f53b Add B300 TRTLLM diagnostic (Oseltamivir)
920d99f Trigger B300 TRTLLM diagnostic (Oseltamivir)
1335e5e Retrigger B300 TRTLLM diagnostic (Oseltamivir)
7ec182f Merge main into b200-300-test (Oseltamivir)
26f6ff3 Test Hadamard and KV cache variants (Oseltamivir)
f0411ea diagnostics (Oseltamivir)
2afbd4b Update perf-changelog.yaml (Oseltamivir)
0fb9711 Merge branch 'main' into b200-300-test (Oseltamivir)
62de147 Merge branch 'main' into b200-300-test (Oseltamivir)
8707733 DIAG (Oseltamivir)
7bb2822 diag (Oseltamivir)
62a06ed diag (Oseltamivir)
b6eabaa Merge branch 'main' into b200-300-test (Oseltamivir)
34d0bb3 ppleas (Oseltamivir)
b802428 chore: clean up dsv4 trt diagnostics (Oseltamivir)
7fc41f7 Merge branch 'main' into b200-300-test (Oseltamivir)
5da58a8 not just evals (Oseltamivir)
44f9c98 Split B300 TRT recipe from B200 (Oseltamivir)
f815f19 Combine stale output cleanup steps (Oseltamivir)
652fefd Restore whitespace in perf-changelog.yaml and add AGENTS.md rule (github-actions[bot])
0c3054d narrow down recipes (Oseltamivir)
1036fae Merge branch 'main' into b200-300-test (Oseltamivir)
5451927 Update perf-changelog.yaml (Oseltamivir)
@@ -0,0 +1,163 @@

    #!/usr/bin/env bash

    # DeepSeek-V4-Pro single-node TRTLLM recipe for B300. The configured image
    # already contains NVIDIA/TensorRT-LLM@feat/deepseek_v4; do not build TRTLLM at
    # runtime from this benchmark path.

    source "$(dirname "$0")/../benchmark_lib.sh"

    check_env_vars \
        MODEL \
        TP \
        CONC \
        ISL \
        OSL \
        MAX_MODEL_LEN \
        RANDOM_RANGE_RATIO \
        RESULT_FILENAME \
        DP_ATTENTION \
        EP_SIZE

    if [[ -n "$SLURM_JOB_ID" ]]; then
        echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
    fi

    echo "TP: $TP, CONC: $CONC, ISL: $ISL, OSL: $OSL, EP_SIZE: $EP_SIZE, DP_ATTENTION: $DP_ATTENTION"

    export TRTLLM_DSV4_USE_MPIRUN="${TRTLLM_DSV4_USE_MPIRUN:-1}"
    export TRTLLM_DSV4_SANITIZE_SLURM_MPI_ENV="${TRTLLM_DSV4_SANITIZE_SLURM_MPI_ENV:-1}"

    sanitize_slurm_mpi_env_for_trtllm() {
        if [[ "${TRTLLM_DSV4_SANITIZE_SLURM_MPI_ENV:-0}" != "1" ]]; then
            return 0
        fi

        echo "Sanitizing Slurm/PMI environment for TensorRT-LLM launch"
        while IFS='=' read -r name _; do
            case "$name" in
                SLURM_*|PMIX*|PMI*|OMPI_*|ORTE_*)
                    unset "$name"
                    ;;
            esac
        done < <(env)
    }

    sanitize_slurm_mpi_env_for_trtllm

    export NCCL_NVLS_ENABLE="${NCCL_NVLS_ENABLE:-0}"
    echo "NCCL_NVLS_ENABLE: $NCCL_NVLS_ENABLE"

    if [[ "$MODEL" != /* ]]; then
        hf download "$MODEL"
    fi

    nvidia-smi

    SERVER_LOG="$PWD/server.log"
    PORT=${PORT:-8888}
    EXTRA_CONFIG_FILE="dsv4-fp4-trt.yml"

    MOE_BACKEND="TRTLLM"
    MAX_BATCH_SIZE=$(( CONC > 16 ? CONC : 16 ))
    CUDA_GRAPH_MAX_BATCH_SIZE="$MAX_BATCH_SIZE"
    KV_CACHE_FREE_MEM_FRACTION="${KV_CACHE_FREE_MEM_FRACTION:-0.50}"

    ATTENTION_DP_CONFIG=""
    if [[ "$DP_ATTENTION" == "true" ]]; then
        ATTENTION_DP_CONFIG="
    attention_dp_config:
        batching_wait_iters: 0
        enable_balance: true
        timeout_iters: 60"
    fi

    cat > "$EXTRA_CONFIG_FILE" << EOF
    cuda_graph_config:
        enable_padding: true
        max_batch_size: $CUDA_GRAPH_MAX_BATCH_SIZE
    enable_attention_dp: $DP_ATTENTION$ATTENTION_DP_CONFIG
    print_iter_log: true
    kv_cache_config:
        tokens_per_block: 128
        dtype: fp8
        free_gpu_memory_fraction: $KV_CACHE_FREE_MEM_FRACTION
        enable_block_reuse: false
    stream_interval: 10
    num_postprocess_workers: 4
    moe_config:
        backend: $MOE_BACKEND
    EOF

    echo "Generated config file contents:"
    cat "$EXTRA_CONFIG_FILE"

    MAX_MODEL_LEN=$(( MAX_MODEL_LEN > 8192 ? MAX_MODEL_LEN : 8192 ))
    MAX_NUM_TOKENS=$(( ISL + OSL + 256 ))
    MAX_NUM_TOKENS=$(( MAX_NUM_TOKENS > 8192 ? MAX_NUM_TOKENS : 8192 ))

    if [ "${EVAL_ONLY}" = "true" ]; then
        setup_eval_context
        MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
        MAX_NUM_TOKENS="$EVAL_MAX_MODEL_LEN"
    fi

    if [[ "${RUN_EVAL:-false}" == "true" || "${EVAL_ONLY:-false}" == "true" ]]; then
        # DeepSeek-V4-Pro has hidden size 7168. The current TRTLLM fused-HC MHC
        # path corrupts eval generations for this shape; keep eval servers on the
        # unfused path until the fused kernel is guarded or supports 7168.
        export TRTLLM_MHC_ENABLE_FUSED_HC="${TRTLLM_MHC_ENABLE_FUSED_HC:-0}"
        echo "TRTLLM_MHC_ENABLE_FUSED_HC: $TRTLLM_MHC_ENABLE_FUSED_HC"
    fi

    start_gpu_monitor --output "$PWD/gpu_metrics.csv"

    set -x
    SERVE_CMD=(
        trtllm-serve "$MODEL" \
        --host 0.0.0.0 \
        --port "$PORT" \
        --trust_remote_code \
        --backend pytorch \
        --max_batch_size "$MAX_BATCH_SIZE" \
        --max_seq_len "$MAX_MODEL_LEN" \
        --max_num_tokens "$MAX_NUM_TOKENS" \
        --tp_size "$TP" \
        --ep_size "$EP_SIZE" \
        --custom_tokenizer deepseek_v4 \
        --config "$EXTRA_CONFIG_FILE"
    )

    if [[ "${TRTLLM_DSV4_USE_MPIRUN:-1}" == "0" ]]; then
        "${SERVE_CMD[@]}" > "$SERVER_LOG" 2>&1 &
    else
        mpirun -n 1 --oversubscribe --allow-run-as-root \
            "${SERVE_CMD[@]}" \
            > "$SERVER_LOG" 2>&1 &
    fi

    SERVER_PID=$!

    wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

    run_benchmark_serving \
        --model "$MODEL" \
        --port "$PORT" \
        --backend openai-chat \
        --endpoint /v1/chat/completions \
        --input-len "$ISL" \
        --output-len "$OSL" \
        --random-range-ratio "$RANDOM_RANGE_RATIO" \
        --num-prompts "$(( CONC * 10 ))" \
        --max-concurrency "$CONC" \
        --result-filename "$RESULT_FILENAME" \
        --result-dir "$PWD/" \
        --trust-remote-code \
        --server-pid "$SERVER_PID"

    if [ "${RUN_EVAL}" = "true" ]; then
        run_eval --framework lm-eval --port "$PORT"
        append_lm_eval_summary
    fi

    stop_gpu_monitor
    set +x
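The recipe sizes the server from the sweep parameters with two floors: the batch size is never below 16, and the per-iteration token budget is never below 8192. As a rough, standalone sketch of that arithmetic (the `derive_limits` helper and the sample values are hypothetical and not part of the PR):

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the recipe's sizing arithmetic.
# Prints "<max_batch_size> <max_num_tokens>" for a given CONC, ISL, OSL.
derive_limits() {
    local conc=$1 isl=$2 osl=$3
    # Batch size follows concurrency but is floored at 16.
    local max_batch=$(( conc > 16 ? conc : 16 ))
    # Token budget is ISL + OSL plus 256 tokens of headroom, floored at 8192.
    local max_tokens=$(( isl + osl + 256 ))
    max_tokens=$(( max_tokens > 8192 ? max_tokens : 8192 ))
    echo "$max_batch $max_tokens"
}

derive_limits 4 1024 1024    # prints "16 8192"  (both floors apply)
derive_limits 64 8192 1024   # prints "64 9472"  (neither floor applies)
```

For short sequences and low concurrency both floors dominate, so several sweep points end up launching the server with identical limits.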
Please enable it for everything before merging. In general, apart from context length, there shouldn't be an
"if eval, then turn on the env var that fixes eval" branch.
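One way to read this suggestion is to drop the `RUN_EVAL`/`EVAL_ONLY` guard and set the default on every run, so benchmark and eval servers take the same kernel path. A minimal sketch, assuming the variable name and default from the diff stay as they are (this is an illustration, not the change that was actually merged):

```shell
#!/usr/bin/env bash
# Sketch: apply the unfused-HC default unconditionally instead of only for
# eval runs. A value already set by the caller still takes precedence.
export TRTLLM_MHC_ENABLE_FUSED_HC="${TRTLLM_MHC_ENABLE_FUSED_HC:-0}"
echo "TRTLLM_MHC_ENABLE_FUSED_HC: $TRTLLM_MHC_ENABLE_FUSED_HC"
```

Under this reading, the `EVAL_ONLY` block that overrides `MAX_MODEL_LEN` and `MAX_NUM_TOKENS` would remain the only eval-specific branch, which matches the reviewer's context-length exception.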