DSv4 B300 TRT #1233
Conversation
# Conflicts:
#	.github/configs/nvidia-master.yaml
#	perf-changelog.yaml
Additional findings (outside current diff — PR may have been updated during review):
- 🟡 `perf-changelog.yaml:2017` — The new perf-changelog.yaml entry at line 2017 has `pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX` — a literal placeholder rather than the actual PR number. Every other entry in this file uses a real numeric PR id (e.g. the immediately preceding entry uses `pull/1221`); this should be updated to `pull/1233` before merge so the changelog correctly traces back to its PR.

Extended reasoning...
**What's broken.** The newly added perf-changelog.yaml entry for `dsv4-fp4-b200-trt` and `dsv4-fp4-b300-trt` ends with:

```yaml
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
```

`XXX` is a literal placeholder string left over from authoring — not the PR number. The current PR (per the metadata in this review) is #1233, so this should read `https://github.com/SemiAnalysisAI/InferenceX/pull/1233`.

**Why this is wrong.** Every other entry in `perf-changelog.yaml` uses an actual numeric PR id. The three entries immediately preceding this one end with `pull/1209`, `pull/1218`, and `pull/1221` respectively, and a quick scan of the rest of the file confirms the pattern holds throughout. Leaving `XXX` in place breaks the changelog's only mechanism for tracing a config-change entry back to the PR that introduced it.

**Why existing tooling didn't catch it.** `perf-changelog.yaml` appears to be free-form YAML metadata; there is no schema validator that requires `pr-link` to be a real GitHub URL. YAML parsers happily accept the placeholder as a string, and PR review is the last line of defense.

**Impact.** Cosmetic / documentation only — no runtime behavior is affected. But once merged, anyone auditing the changelog (e.g., "which PR added the dsv4-fp4-b200-trt config?") will hit a 404 instead of landing on the correct PR. Fixing it post-merge requires another PR to touch the changelog, which is wasteful.

**How to fix.** Replace `XXX` with `1233` on line 2017 of `perf-changelog.yaml` before merging this PR.

**Step-by-step proof.**
1. Open `perf-changelog.yaml` at the bottom of the file (the new entry added by this PR).
2. Observe the entry's last line: `pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX`.
3. Compare with the immediately preceding entry (the dsv4-fp4-b300-vllm "Change image to vllm/vllm-openai:v0.20.0-cu130" entry): its `pr-link` is `https://github.com/SemiAnalysisAI/InferenceX/pull/1221` — a real numeric id.
4. Compare with the PR's own metadata: the current PR number is 1233.
5. Conclusion: the new entry should use `pull/1233`, matching the convention used by every other entry in the file.
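A check along these lines could catch the placeholder mechanically before merge. This is a sketch only — the `placeholder_pr_links` helper and the inline sample entries are illustrative, not existing repo tooling (a real check would read `perf-changelog.yaml` from disk):

```python
import re

# Sample entries transcribed from this review; a real check would read
# perf-changelog.yaml instead of these inline strings.
good = "pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1221"
bad = "pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX"

def placeholder_pr_links(text: str) -> list:
    """Return every pr-link whose trailing path segment is not a numeric PR id."""
    offenders = []
    for m in re.finditer(r"pr-link:\s*(\S*/pull/(\S+))", text):
        if not m.group(2).isdigit():
            offenders.append(m.group(1))
    return offenders

print(placeholder_pr_links(good))  # []
print(placeholder_pr_links(bad))   # ['https://github.com/SemiAnalysisAI/InferenceX/pull/XXX']
```

Wiring something like this into CI would make `pr-link` placeholders fail before human review, instead of relying on reviewers as the last line of defense.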
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25231354124
Additional findings (outside current diff — PR may have been updated during review):
- 🟡 `perf-changelog.yaml:2099-2101` — The new dsv4-fp4-b300-trt changelog entry says the diagnostic compares 'NVFP4 KV cache', but `benchmarks/single_node/dsv4_fp4_b300_trt_diag.sh` never invokes `run_variant` with an nvfp4/fp4 KV dtype — the actual KV dtypes exercised are fp8, unset (auto), bfloat16, and torch.bfloat16. Either replace 'NVFP4 KV cache' with 'BF16 KV cache' (or 'BF16 / torch.bfloat16 KV cache'), or add a real `run_variant 'nvfp4_kv_graph' 'nvfp4' ...` call so the description matches the JSONL/summary output. Nit only — the diagnostic intentionally exits 1 (`TRTLLM_DSV4_DIAG_FAIL_AFTER=1`) so this can't be confused with a perf result, but reviewers cross-checking the changelog against `dsv4_trt_b300_diag_summary.json` will see the contradiction.

Extended reasoning...
**What the bug is.** `perf-changelog.yaml:2101` (the new entry for `dsv4-fp4-b300-trt`) describes the diagnostic as: "Diagnostic installs fast-hadamard-transform, then compares explicit FP8 KV cache, default KV cache, NVFP4 KV cache, and FP8 KV cache with CUDA graph disabled"

The diagnostic script `benchmarks/single_node/dsv4_fp4_b300_trt_diag.sh` does not exercise an NVFP4/FP4 KV cache anywhere.

**Step-by-step proof.** Enumerating every `run_variant` invocation at the bottom of the diagnostic script and the second positional argument (the `kv_dtype`):

| Variant | `kv_dtype` arg |
| --- | --- |
| baseline_fp8_graph | fp8 |
| auto_kv_graph | unset (auto/default) |
| auto_kv_no_autotune | unset |
| auto_kv_vanilla_moe | unset |
| fp8_no_cuda_graph | fp8 |
| fp8_graph_mhc_fused_hc_off | fp8 |
| baseline_fp8_graph_pp1 | fp8 |
| auto_kv_graph_pp1 | unset |
| tp4_ep1_dpa_false_fp8_graph | fp8 |
| tp8_ep8_dpa_true_fp8_graph | fp8 |
| tp4_ep4_dpa_true_fp8_graph | fp8 |
| bfloat16_kv_graph | bfloat16 |
| torch_bfloat16_kv_graph | torch.bfloat16 |

The set of KV dtypes that ever reach `write_config` is `{fp8, unset, bfloat16, torch.bfloat16}`. There is no `nvfp4`, `fp4`, or `modelopt_fp4` value passed as `kv_dtype`, and no variant whose name implies an FP4 KV cache. Grepping the file for `nvfp4|fp4` only matches the unrelated `dsv4-fp4-trt-${variant}.yml` config-file template name.

**Why the existing knobs don't cover it.** `write_config` only emits `dtype: $kv_dtype` when `$kv_dtype != "unset"`, and even when set, the value is whatever was passed positionally. None of the call sites pass an FP4-flavored value, so there is no code path inside this diagnostic that produces an NVFP4 KV cache config. The fourth bullet — "FP8 KV cache with CUDA graph disabled" — does map correctly to `fp8_no_cuda_graph`. Only the third bullet (NVFP4) is fabricated.

**Impact.** Documentation drift. The diagnostic still produces correct results for the variants it actually has, and the script intentionally exits 1 (`TRTLLM_DSV4_DIAG_FAIL_AFTER=1`) so the row will never be charted as a benchmark result. But the changelog entry is the human-readable trail of why this PR exists, and it claims an ablation that does not appear in `dsv4_trt_b300_diag_summary.json`. A reviewer cross-referencing the bullet list against the summary JSON will be unable to find the NVFP4 row and may either chase a phantom variant or assume the summary is incomplete.

**How to fix.** Two equivalent fixes; pick one:

- Match the doc to the script (preferred, smallest diff): in `perf-changelog.yaml:2101`, replace `NVFP4 KV cache` with `BF16 KV cache` (or `BF16 / torch.bfloat16 KV cache`, since both `bfloat16_kv_graph` and `torch_bfloat16_kv_graph` exist). This accurately describes what the diagnostic actually exercises.
- Match the script to the doc: add a `run_variant 'nvfp4_kv_graph' 'nvfp4' 'on' "$((PORT_BASE + 13))" "TRTLLM" "default" "$TP" "$EP_SIZE" "$DP_ATTENTION" "4"` call alongside the existing `bfloat16_kv_graph`/`torch_bfloat16_kv_graph` invocations under the `TRTLLM_DSV4_DIAG_ENABLE_EXPLICIT_KV_DTYPES` block. Note this will only succeed if the pinned `feat/deepseek_v4` branch actually accepts `nvfp4` as a `kv_cache_config.dtype`; otherwise the variant will simply fail readiness and the diagnostic will still record that result.

Option 1 is the right call given the diagnostic is temporary and the run is already gated to fail (`TRTLLM_DSV4_DIAG_FAIL_AFTER=1`). Severity nit.
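For anyone who wants to double-check the enumeration above mechanically, the variant-to-kv_dtype mapping can be transcribed as data and scanned for fp4-flavored values. This is a sketch over the table in this review, not code from the repo (the actual script is bash):

```python
# Variant -> kv_dtype mapping, transcribed from the run_variant enumeration
# in this review (the diagnostic script itself is bash).
kv_dtypes = {
    "baseline_fp8_graph": "fp8",
    "auto_kv_graph": "unset",
    "auto_kv_no_autotune": "unset",
    "auto_kv_vanilla_moe": "unset",
    "fp8_no_cuda_graph": "fp8",
    "fp8_graph_mhc_fused_hc_off": "fp8",
    "baseline_fp8_graph_pp1": "fp8",
    "auto_kv_graph_pp1": "unset",
    "tp4_ep1_dpa_false_fp8_graph": "fp8",
    "tp8_ep8_dpa_true_fp8_graph": "fp8",
    "tp4_ep4_dpa_true_fp8_graph": "fp8",
    "bfloat16_kv_graph": "bfloat16",
    "torch_bfloat16_kv_graph": "torch.bfloat16",
}

# No fp4-flavored dtype ever reaches write_config.
fp4_like = [v for v, d in kv_dtypes.items() if "fp4" in d.lower()]
print(sorted(set(kv_dtypes.values())))  # ['bfloat16', 'fp8', 'torch.bfloat16', 'unset']
print(fp4_like)  # []
```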
| "dsv4_template_completion_ok": bool(probe.get("dsv4_template_completion_ok", False)), | ||
| "kvcache_nan_or_inf_warning": kvcache_nan, | ||
| "hadamard_missing_or_skipped_warning": hadamard_missing, | ||
| "kv_dtype_or_cache_log_lines": kv_dtype_lines, | ||
| "moe_warning": bool(moe_warning_lines), |
🟡 The `mhc` word-boundary alternative in the diagnostic regex (`benchmarks/single_node/dsv4_fp4_b300_trt_diag.sh:974`) is written as `r"...|\\bmhc\\b"` inside a single-quoted Python heredoc, so the regex engine compiles `\\b` as a literal backslash + `b` rather than a word boundary. The bare-`mhc` probe will therefore never fire and `mhc_warning` will be silently False for every variant; the fix is to use `r"\bmhc\b"` with a single backslash. This is a low-severity issue in a script that already exits 1 unconditionally via `TRTLLM_DSV4_DIAG_FAIL_AFTER`.
Extended reasoning...
**What the bug is.** Line 974 of `benchmarks/single_node/dsv4_fp4_b300_trt_diag.sh` builds the mhc-fused-HC ablation probe with:

```python
mhc_warning_lines = matching_log_lines(r"mhc_fused_hc|fused_hc|hyper.?connection|\\bmhc\\b", 40)
```

The enclosing heredoc opens with `<<'PY'` (single-quoted), so bash performs no substitution — Python receives the source verbatim. In a Python raw string, two backslashes stay as two literal backslashes, so the regex engine compiles `\\b` as escaped-backslash + `b`, i.e. it matches a literal backslash followed by `b`, not a word boundary. To get a word boundary in a raw string you need a single backslash: `r"\bmhc\b"`.
**Step-by-step proof.** Verified directly in Python:

```python
>>> import re
>>> buggy = re.compile(r'mhc_fused_hc|fused_hc|hyper.?connection|\\bmhc\\b', re.IGNORECASE)
>>> bool(buggy.search('foo mhc bar'))
False
>>> bool(buggy.search(r'\bmhc\b'))  # only the literal text matches
True
>>> fixed = re.compile(r'\bmhc\b')
>>> bool(fixed.search('foo mhc bar'))
True
```
So the buggy alternative only fires on the literal character sequence `\bmhc\b`, which never appears in TRT-LLM logs.

**Why existing code does not save us.** The other three alternatives (`mhc_fused_hc`, `fused_hc`, `hyper.?connection`) still match the obvious cases, so the diagnostic does not crash and `mhc_warning` is sometimes True. But any log line that mentions just bare `mhc` (e.g. an mhc disabled warning, an mhc layer norm info line) will be silently dropped, defeating the whole point of the bare-`mhc` probe in the first place.

**Impact.** Limited. The script intentionally calls `exit 1` at the end via `TRTLLM_DSV4_DIAG_FAIL_AFTER=1` (line 1257) and is explicitly labelled a temporary diagnostic in perf-changelog.yaml. So a misleading `mhc_warning` field in `dsv4_trt_b300_diag_summary.json` does not affect the run outcome — it just makes the mhc-fused-HC ablation summary less useful while the script lives.

**How to fix.** Replace `\\bmhc\\b` with `\bmhc\b` (single backslashes). Inside the single-quoted heredoc the regex source is passed verbatim, so a single backslash is what you actually want.
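Assembled into a standalone script for convenience (the log line is an invented example; the pattern strings come from the diagnostic as quoted above):

```python
import re

# Buggy pattern as it appears in the diagnostic: r"\\b" compiles to a
# literal backslash followed by 'b', not a word boundary.
buggy = re.compile(r"mhc_fused_hc|fused_hc|hyper.?connection|\\bmhc\\b", re.IGNORECASE)
# Fixed pattern: single backslash gives a real word boundary.
fixed = re.compile(r"mhc_fused_hc|fused_hc|hyper.?connection|\bmhc\b", re.IGNORECASE)

line = "WARNING: mhc disabled for this shape"  # invented example of a bare-mhc log line
print(bool(buggy.search(line)))        # False: bare 'mhc' is missed
print(bool(fixed.search(line)))        # True: word boundary matches
print(bool(buggy.search(r"\bmhc\b")))  # True: only the literal text matches
```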
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25263534314

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25265372369

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25266831448

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25266848432

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25266886145

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25268325388
| if [[ "${RUN_EVAL:-false}" == "true" || "${EVAL_ONLY:-false}" == "true" ]]; then | ||
| # DeepSeek-V4-Pro has hidden size 7168. The current TRTLLM fused-HC MHC | ||
| # path corrupts eval generations for this shape; keep eval servers on the | ||
| # unfused path until the fused kernel is guarded or supports 7168. | ||
| export TRTLLM_MHC_ENABLE_FUSED_HC="${TRTLLM_MHC_ENABLE_FUSED_HC:-0}" | ||
| echo "TRTLLM_MHC_ENABLE_FUSED_HC: $TRTLLM_MHC_ENABLE_FUSED_HC" | ||
| fi |
plz enable it for everything before merging. In general, besides ctx length, there shouldn't be an "if eval, then turn on the env var that fixes eval".
Revert accidental trailing-space deletions on separator lines 1768 and 1889 in perf-changelog.yaml — CI depends on exact whitespace. Add rule to AGENTS.md to never delete whitespace in perf-changelog.yaml.

Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25268432710

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25268600739

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25268675649

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25270014160

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25270043535
No description provided.