
DSv4 B300 TRT #1233

Merged
Oseltamivir merged 43 commits into main from b200-300-test
May 3, 2026

Conversation

@Oseltamivir (Collaborator)
No description provided.

@Oseltamivir Oseltamivir marked this pull request as ready for review April 30, 2026 00:37
@Oseltamivir Oseltamivir requested a review from a team April 30, 2026 00:37
# Conflicts:
#	.github/configs/nvidia-master.yaml
#	perf-changelog.yaml
@claude claude Bot left a comment (Contributor)

Additional findings (outside current diff — PR may have been updated during review):

  • 🟡 perf-changelog.yaml:2017 — The new perf-changelog.yaml entry at line 2017 has pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX — a literal placeholder rather than the actual PR number. Every other entry in this file uses a real numeric PR id (e.g. the immediately preceding entry uses pull/1221); this should be updated to pull/1233 before merge so the changelog correctly traces back to its PR.

    Extended reasoning...

    What's broken. The newly added perf-changelog.yaml entry for dsv4-fp4-b200-trt and dsv4-fp4-b300-trt ends with:

        pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX

    XXX is a literal placeholder string left over from authoring, not the PR number. The current PR (per the metadata in this review) is #1233, so this should read https://github.com/SemiAnalysisAI/InferenceX/pull/1233.

    Why this is wrong. Every other entry in perf-changelog.yaml uses an actual numeric PR id. The three entries immediately preceding this one end with pull/1209, pull/1218, and pull/1221 respectively, and a quick scan of the rest of the file confirms the pattern holds throughout. Leaving XXX in place breaks the changelog's only mechanism for tracing a config-change entry back to the PR that introduced it.

    Why existing tooling didn't catch it. perf-changelog.yaml appears to be free-form YAML metadata; there is no schema validator that requires pr-link to be a real GitHub URL. YAML parsers happily accept the placeholder as a string, and PR review is the last line of defense.

    Impact. Cosmetic / documentation only; no runtime behavior is affected. But once merged, anyone auditing the changelog (e.g., "which PR added the dsv4-fp4-b200-trt config?") will hit a 404 instead of landing on the correct PR. Fixing it post-merge requires another PR to touch the changelog, which is wasteful.

    How to fix. Replace XXX with 1233 on line 2017 of perf-changelog.yaml before merging this PR.

    Step-by-step proof.
    1. Open perf-changelog.yaml at the bottom of the file (the new entry added by this PR).
    2. Observe the entry's last line: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX.
    3. Compare with the immediately preceding entry (the dsv4-fp4-b300-vllm "Change image to vllm/vllm-openai:v0.20.0-cu130" entry): its pr-link is https://github.com/SemiAnalysisAI/InferenceX/pull/1221, a real numeric id.
    4. Compare with the PR's own metadata: the current PR number is 1233.
    5. Conclusion: the new entry should use pull/1233, matching the convention used by every other entry in the file.
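Since the root cause is that nothing validates pr-link values, a lightweight pre-merge check could catch placeholders like pull/XXX automatically. The scanner below is a hypothetical sketch, not existing repo tooling; the helper name and the sample entries are illustrative only.

```python
import re

def find_placeholder_pr_links(text: str) -> list[int]:
    """Return 1-based line numbers whose pr-link does not end
    in a numeric pull-request id (e.g. a leftover pull/XXX)."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if "pr-link:" in line and not re.search(r"/pull/\d+\s*$", line):
            hits.append(lineno)
    return hits

# Illustrative changelog fragment (entries are made up for the demo).
sample = """\
- name: dsv4-fp4-b300-vllm
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1221
- name: dsv4-fp4-b300-trt
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
"""
print(find_placeholder_pr_links(sample))  # → [4]
```

Run against the real perf-changelog.yaml in CI, a non-empty result would fail the job and surface the offending line numbers before merge.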

@github-actions github-actions Bot commented May 1, 2026

@claude claude Bot left a comment (Contributor)

Additional findings (outside current diff — PR may have been updated during review):

  • 🟡 perf-changelog.yaml:2099-2101 — The new dsv4-fp4-b300-trt changelog entry says the diagnostic compares 'NVFP4 KV cache', but benchmarks/single_node/dsv4_fp4_b300_trt_diag.sh never invokes run_variant with an nvfp4/fp4 KV dtype — the actual KV dtypes exercised are fp8, unset (auto), bfloat16, and torch.bfloat16. Either replace 'NVFP4 KV cache' with 'BF16 KV cache' (or 'BF16 / torch.bfloat16 KV cache'), or add a real run_variant 'nvfp4_kv_graph' 'nvfp4' ... call so the description matches the JSONL/summary output. Nit only — the diagnostic intentionally exits 1 (TRTLLM_DSV4_DIAG_FAIL_AFTER=1) so this can't be confused with a perf result, but reviewers cross-checking the changelog against dsv4_trt_b300_diag_summary.json will see the contradiction.

    Extended reasoning...

    What the bug is. perf-changelog.yaml:2101 (the new entry for dsv4-fp4-b300-trt) describes the diagnostic as:

    "Diagnostic installs fast-hadamard-transform, then compares explicit FP8 KV cache, default KV cache, NVFP4 KV cache, and FP8 KV cache with CUDA graph disabled"

    The diagnostic script benchmarks/single_node/dsv4_fp4_b300_trt_diag.sh does not exercise an NVFP4/FP4 KV cache anywhere.

    Step-by-step proof. Enumerating every run_variant invocation at the bottom of the diagnostic script and the second positional argument (the kv_dtype):

    | Variant | kv_dtype arg |
    | --- | --- |
    | baseline_fp8_graph | fp8 |
    | auto_kv_graph | unset (auto/default) |
    | auto_kv_no_autotune | unset |
    | auto_kv_vanilla_moe | unset |
    | fp8_no_cuda_graph | fp8 |
    | fp8_graph_mhc_fused_hc_off | fp8 |
    | baseline_fp8_graph_pp1 | fp8 |
    | auto_kv_graph_pp1 | unset |
    | tp4_ep1_dpa_false_fp8_graph | fp8 |
    | tp8_ep8_dpa_true_fp8_graph | fp8 |
    | tp4_ep4_dpa_true_fp8_graph | fp8 |
    | bfloat16_kv_graph | bfloat16 |
    | torch_bfloat16_kv_graph | torch.bfloat16 |
    The set of KV dtypes that ever reach write_config is {fp8, unset, bfloat16, torch.bfloat16}. There is no nvfp4, fp4, or modelopt_fp4 value passed as kv_dtype, and no variant whose name implies an FP4 KV cache. Grepping the file for nvfp4|fp4 only matches the unrelated dsv4-fp4-trt-${variant}.yml config-file template name.

    Why the existing knobs don't cover it. write_config only emits dtype: $kv_dtype when $kv_dtype != "unset", and even when set, the value is whatever was passed positionally. None of the call sites pass an FP4-flavored value, so there is no code path inside this diagnostic that produces an NVFP4 KV cache config. The fourth bullet — "FP8 KV cache with CUDA graph disabled" — does map correctly to fp8_no_cuda_graph. Only the third bullet (NVFP4) is fabricated.

    Impact. Documentation drift. The diagnostic still produces correct results for the variants it actually has, and the script intentionally exits 1 (TRTLLM_DSV4_DIAG_FAIL_AFTER=1) so the row will never be charted as a benchmark result. But the changelog entry is the human-readable trail of why this PR exists, and it claims an ablation that does not appear in dsv4_trt_b300_diag_summary.json. A reviewer cross-referencing the bullet list against the summary JSON will be unable to find the NVFP4 row and may either chase a phantom variant or assume the summary is incomplete.

    How to fix. Two equivalent fixes; pick one:

    1. Match the doc to the script (preferred, smallest diff): in perf-changelog.yaml:2101, replace NVFP4 KV cache with BF16 KV cache (or BF16 / torch.bfloat16 KV cache, since both bfloat16_kv_graph and torch_bfloat16_kv_graph exist). This accurately describes what the diagnostic actually exercises.
    2. Match the script to the doc: add a run_variant 'nvfp4_kv_graph' 'nvfp4' 'on' "$((PORT_BASE + 13))" "TRTLLM" "default" "$TP" "$EP_SIZE" "$DP_ATTENTION" "4" call alongside the existing bfloat16_kv_graph / torch_bfloat16_kv_graph invocations under the TRTLLM_DSV4_DIAG_ENABLE_EXPLICIT_KV_DTYPES block. Note this will only succeed if the pinned feat/deepseek_v4 branch actually accepts nvfp4 as a kv_cache_config.dtype; otherwise the variant will simply fail readiness and the diagnostic will still record that result.

    Option 1 is the right call given the diagnostic is temporary and the run is already gated to fail (TRTLLM_DSV4_DIAG_FAIL_AFTER=1). Severity nit.
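The dtype audit above can be made mechanical. The sketch below copies the variant/kv_dtype table from the reasoning above into a dict and derives the set of KV dtypes the diagnostic actually exercises; the audit script itself is hypothetical, not part of the repo.

```python
# Variant -> kv_dtype positional arg, transcribed from the
# run_variant invocations listed in the table above.
variants = {
    "baseline_fp8_graph": "fp8",
    "auto_kv_graph": "unset",
    "auto_kv_no_autotune": "unset",
    "auto_kv_vanilla_moe": "unset",
    "fp8_no_cuda_graph": "fp8",
    "fp8_graph_mhc_fused_hc_off": "fp8",
    "baseline_fp8_graph_pp1": "fp8",
    "auto_kv_graph_pp1": "unset",
    "tp4_ep1_dpa_false_fp8_graph": "fp8",
    "tp8_ep8_dpa_true_fp8_graph": "fp8",
    "tp4_ep4_dpa_true_fp8_graph": "fp8",
    "bfloat16_kv_graph": "bfloat16",
    "torch_bfloat16_kv_graph": "torch.bfloat16",
}

exercised = set(variants.values())
print(sorted(exercised))  # → ['bfloat16', 'fp8', 'torch.bfloat16', 'unset']

# No FP4-flavored KV dtype (nvfp4, fp4, modelopt_fp4) ever reaches write_config.
assert not any("fp4" in dtype for dtype in exercised)
```

This is exactly the check a reviewer would do by hand; the changelog bullet claiming an 'NVFP4 KV cache' ablation has no corresponding entry in the set.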

Comment on lines +998 to +1002
"dsv4_template_completion_ok": bool(probe.get("dsv4_template_completion_ok", False)),
"kvcache_nan_or_inf_warning": kvcache_nan,
"hadamard_missing_or_skipped_warning": hadamard_missing,
"kv_dtype_or_cache_log_lines": kv_dtype_lines,
"moe_warning": bool(moe_warning_lines),

🟡 The mhc word-boundary alternative in the diagnostic regex (benchmarks/single_node/dsv4_fp4_b300_trt_diag.sh:974) is written as r"...|\\bmhc\\b" inside a single-quoted Python heredoc, so the regex engine compiles \\b as a literal backslash + b rather than a word boundary. The bare-mhc probe will therefore never fire and mhc_warning will be silently False for every variant; fix is to use r"\bmhc\b" with a single backslash. This is a low-severity issue in a script that already exits 1 unconditionally via TRTLLM_DSV4_DIAG_FAIL_AFTER.

Extended reasoning...

What the bug is. Line 974 of benchmarks/single_node/dsv4_fp4_b300_trt_diag.sh builds the mhc-fused-HC ablation probe with:

mhc_warning_lines = matching_log_lines(r"mhc_fused_hc|fused_hc|hyper.?connection|\\bmhc\\b", 40)

The enclosing heredoc opens with <<'PY' (single-quoted), so bash performs no substitution; Python receives the source verbatim. In a Python raw string, two backslashes stay as two literal backslashes, so the regex engine compiles \\b as escaped-backslash + b, i.e. it matches a literal backslash followed by b, not a word boundary. To get a word boundary in a raw string you need a single backslash: r"\bmhc\b".

Step-by-step proof. Verified directly in Python:

>>> import re
>>> buggy = re.compile(r'mhc_fused_hc|fused_hc|hyper.?connection|\\bmhc\\b', re.IGNORECASE)
>>> bool(buggy.search('foo mhc bar'))
False
>>> bool(buggy.search(r'\bmhc\b'))   # only the literal text matches
True
>>> fixed = re.compile(r'\bmhc\b')
>>> bool(fixed.search('foo mhc bar'))
True

So the buggy alternative only fires on the literal four-character sequence \bmhc\b, which never appears in TRT-LLM logs.

Why existing code does not save us. The other three alternatives (mhc_fused_hc, fused_hc, hyper.?connection) still match the obvious cases, so the diagnostic does not crash and mhc_warning is sometimes True. But any log line that mentions just bare mhc (e.g. an mhc disabled warning, an mhc layer norm info line) will be silently dropped, defeating the whole point of the bare-mhc probe in the first place.

Impact. Limited. The script intentionally calls exit 1 at the end via TRTLLM_DSV4_DIAG_FAIL_AFTER=1 (line 1257) and is explicitly labelled a temporary diagnostic in perf-changelog.yaml. So a misleading mhc_warning field in dsv4_trt_b300_diag_summary.json does not affect the run outcome — it just makes the mhc-fused-HC ablation summary less useful while the script lives.

How to fix. Replace \\bmhc\\b with \bmhc\b (single backslashes). Inside the single-quoted heredoc the regex source is passed verbatim, so a single backslash is what you actually want.
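A regression check along these lines would have caught the over-escaped pattern before it shipped. The sample log line below is invented for illustration, not actual TRT-LLM output, and the test itself is a hypothetical addition:

```python
import re

# Buggy pattern as written in the heredoc: in a raw string, "\\b" is two
# characters (backslash + b), which the regex engine reads as an escaped
# literal backslash followed by "b" -- not a word boundary.
buggy = re.compile(r"mhc_fused_hc|fused_hc|hyper.?connection|\\bmhc\\b", re.IGNORECASE)
# Fixed pattern: a single backslash in the raw string yields a real \b.
fixed = re.compile(r"mhc_fused_hc|fused_hc|hyper.?connection|\bmhc\b", re.IGNORECASE)

# Hypothetical log line mentioning bare "mhc" (not from real TRT-LLM logs).
line = "[WARNING] mhc disabled for this layer"
print(bool(buggy.search(line)))  # → False: the bare-mhc alternative never fires
print(bool(fixed.search(line)))  # → True: \b is a genuine word boundary
```

Asserting that each probe regex matches a representative bare-keyword line keeps this class of escaping bug out of the diagnostic.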

@github-actions github-actions Bot commented May 3, 2026

Comment on lines +104 to +110
if [[ "${RUN_EVAL:-false}" == "true" || "${EVAL_ONLY:-false}" == "true" ]]; then
# DeepSeek-V4-Pro has hidden size 7168. The current TRTLLM fused-HC MHC
# path corrupts eval generations for this shape; keep eval servers on the
# unfused path until the fused kernel is guarded or supports 7168.
export TRTLLM_MHC_ENABLE_FUSED_HC="${TRTLLM_MHC_ENABLE_FUSED_HC:-0}"
echo "TRTLLM_MHC_ENABLE_FUSED_HC: $TRTLLM_MHC_ENABLE_FUSED_HC"
fi

Please enable it for everything before merging. In general, besides ctx length, there shouldn't be an "if eval, then turn on the env var that fixes eval" pattern.

Comment thread perf-changelog.yaml
Revert accidental trailing-space deletions on separator lines 1768 and
1889 in perf-changelog.yaml — CI depends on exact whitespace.

Add rule to AGENTS.md to never delete whitespace in perf-changelog.yaml.

Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>

@Oseltamivir Oseltamivir changed the title from "B200/300 test" to "DSv4 B300 TRT" May 3, 2026

@Oseltamivir Oseltamivir merged commit c7738cb into main May 3, 2026
11 of 37 checks passed
@Oseltamivir Oseltamivir deleted the b200-300-test branch May 3, 2026 04:44

xiaohuguo2023 pushed a commit to xiaohuguo2023/InferenceX that referenced this pull request May 6, 2026