[None][fix] Fix fused MHC for DeepSeek-V4-Pro hidden size by Oseltamivir · Pull Request #13710 · NVIDIA/TensorRT-LLM

Oseltamivir · 2026-05-03T05:34:56Z

Summary

This fixes the SM100 fused mHC hyper-connection path for DeepSeek-V4-Pro.

DeepSeek-V4-Pro uses hidden size 7168, but the fused-HC MMA launcher was still effectively wired for hidden size 4096. The Python runner could select trtllm::mhc_fused_hc for 7168 tensors, while the C++ MMA path used compile-time shape constants and TMA descriptors built around the previous 4096-only instantiation. This mismatch can run without an immediate crash, but it corrupts hidden states and produces invalid generations.

Issue

The fused-HC MMA kernels are statically instantiated. Before this change:

- mhcFusedHcKernel.cu had a single FHC_HIDDEN = 4096 constant.
- SHAPE_K, residual/x TMA descriptors, and the MMA kernel template instantiations were all tied to that hidden size.
- The Python autotuner treated the fused-HC runner as generic once SM100 MMA support was available.
- DeepSeek-V4-Pro requests with hidden size 7168 could therefore enter the fused-HC path even though the C++ MMA instantiation did not match the runtime shape.

A direct 7168 instantiation also cannot blindly compile every existing kNumSplits value. With BLOCK_K=64, hidden size 7168 has 7168 / 64 = 112 H tiles, so kNumSplits=32 and 64 violate the kernel's compile-time split constraints. Valid MMA split sizes for 7168 are 1, 2, 4, 8, 16.
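
The split arithmetic can be checked with a short sketch. It assumes the compile-time constraint amounts to "the number of K tiles must be divisible by kNumSplits", which is consistent with the numbers quoted here but is not spelled out in this description. The candidate list and the function name are hypothetical, not taken from the kernel source.

```python
BLOCK_K = 64  # K-tile width quoted in the PR description

# Assumed candidate split values; the description only names 1..16 as valid
# and 32 and 64 as invalid for hidden size 7168.
CANDIDATE_SPLITS = (1, 2, 4, 8, 16, 32, 64)


def valid_mma_splits(hidden_size, block_k=BLOCK_K, candidates=CANDIDATE_SPLITS):
    """Return the candidate split counts that evenly divide the K-tile count.

    Assumption: a split is valid exactly when (hidden_size / block_k) is
    divisible by it. The real kernel may impose further constraints.
    """
    if hidden_size % block_k != 0:
        raise ValueError(f"hidden_size {hidden_size} is not a multiple of {block_k}")
    num_tiles = hidden_size // block_k
    return [s for s in candidates if num_tiles % s == 0]


if __name__ == "__main__":
    print(7168 // BLOCK_K)          # 112 tiles
    print(valid_mma_splits(7168))   # [1, 2, 4, 8, 16]
    print(valid_mma_splits(4096))   # [1, 2, 4, 8, 16, 32, 64]
```

Under that assumption, 112 = 16 × 7 explains why 32 and 64 drop out for 7168, while 4096 (64 tiles) keeps every candidate.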

Run with failed evals: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25231354124/job/73987414168

Fix

- Replace the single 4096 fused-HC hidden constant with explicit supported hidden sizes: 4096 and 7168.
- Add runtime dispatch for mhcFusedHcLaunch and mhcFusedHcAllInOneLaunch based on hidden_size.
- Template the MMA fused-HC launch implementations on Hidden, so SHAPE_K and TMA descriptors use the runtime-matched compile-time hidden size.
- Add compile-time hidden_size / kNumSplits validation so unsupported specializations are not instantiated.
- Mirror that same split-size filtering in MhcFusedHcRunner.get_valid_tactics, so the autotuner does not emit invalid MMA tactics for 7168.
- Make fallback tactic selection hidden-size aware.
- Document the new shape contract in mhcKernels.h.
- Add explicit FMA-path guards requiring hidden_size % 64 == 0.
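
The dispatch-and-filter idea in the list above can be illustrated with a rough Python model. This is only a sketch of the pattern (a runtime value selecting one of a fixed set of pre-built specializations, with the tactic list filtered by the same rule), not the actual C++ or `mhc_cuda.py` code. All function names here are hypothetical, and the divisibility rule is an assumption inferred from the 7168 / 64 = 112 arithmetic.

```python
SUPPORTED_HIDDEN_SIZES = (4096, 7168)        # sizes named in this PR
BLOCK_K = 64                                 # K-tile width named in this PR
CANDIDATE_SPLITS = (1, 2, 4, 8, 16, 32, 64)  # assumed candidate list


def is_valid_specialization(hidden_size, num_splits):
    """Assumed rule: supported size, and the K-tile count divides evenly."""
    return (
        hidden_size in SUPPORTED_HIDDEN_SIZES
        and (hidden_size // BLOCK_K) % num_splits == 0
    )


def make_launcher(hidden_size):
    """Stand-in for one compile-time specialization (template on Hidden)."""
    def launch(num_splits):
        if not is_valid_specialization(hidden_size, num_splits):
            raise ValueError(
                f"no specialization for hidden={hidden_size}, splits={num_splits}"
            )
        # A real launcher would build TMA descriptors and launch the kernel.
        return f"fused_hc_mma<Hidden={hidden_size}, kNumSplits={num_splits}>"
    return launch


# One launcher per supported hidden size, chosen at runtime.
LAUNCHERS = {h: make_launcher(h) for h in SUPPORTED_HIDDEN_SIZES}


def dispatch_fused_hc(hidden_size, num_splits):
    """Runtime dispatch: reject shapes that have no matching specialization."""
    launcher = LAUNCHERS.get(hidden_size)
    if launcher is None:
        raise ValueError(f"unsupported hidden_size {hidden_size}")
    return launcher(num_splits)


def sketch_valid_tactics(hidden_size):
    """Mirror of the filtering: only offer splits the dispatcher accepts."""
    return [s for s in CANDIDATE_SPLITS if is_valid_specialization(hidden_size, s)]


def check_fma_hidden_size(hidden_size):
    """FMA-path guard from the list above: hidden_size must be a multiple of 64."""
    if hidden_size % 64 != 0:
        raise ValueError(f"FMA path requires hidden_size % 64 == 0, got {hidden_size}")
```

The point of keeping the tactic filter and the dispatcher on the same predicate is that the autotuner can never propose a combination that the launcher would then reject or, worse, silently run with the wrong shape.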

Image with build of forked trtllm: https://github.com/orgs/SemiAnalysisAI/packages/container/package/trtllm-deepseek-v4

Validation

- python3 -m py_compile tensorrt_llm/_torch/modules/mhc/mhc_cuda.py
- Built a TensorRT-LLM image from this branch: ghcr.io/semianalysisai/trtllm-deepseek-v4:fix-mhc7168-eb20e9e
- Ran DeepSeek-V4-Pro B300 evals with fused MHC enabled (TRTLLM_MHC_ENABLE_FUSED_HC=1): https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25270557269
- The run exercised trtllm::mhc_fused_hc on 7168 hidden-size tensors and completed successfully.
- Run with fix: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25270557269/job/74092057898

mikeiovine · 2026-05-04T16:54:37Z

/bot run

tensorrt-cicd · 2026-05-04T17:01:23Z

PR_Github #46665 [ run ] triggered by Bot. Commit: 23b1492 Link to invocation

tensorrt-cicd · 2026-05-04T21:46:31Z

PR_Github #46665 [ run ] completed with state SUCCESS. Commit: 23b1492
/LLM/main/L0_MergeRequest_PR pipeline #36706 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Oseltamivir · 2026-05-04T22:09:27Z

@mikeiovine

Can I get details of the CI failure?

juney-nvidia · 2026-05-05T01:49:48Z

Tagging @mingyangHao for vis on this PR.

Oseltamivir · 2026-05-05T03:40:49Z

@mingyangHao do you have info on CI failure?

mingyangHao · 2026-05-05T04:07:06Z

> @mingyangHao do you have info on CI failure?

Hi, I can see there is a build error, but I don't think it is related to your commit.

Oseltamivir · 2026-05-05T04:08:42Z

oof

Oseltamivir · 2026-05-05T04:11:21Z

@mingyangHao if you wanna test, I have an image at https://github.com/orgs/SemiAnalysisAI/packages/container/package/trtllm-deepseek-v4

Build script

#!/usr/bin/env bash

set -euo pipefail

TRTLLM_REPO="${TRTLLM_REPO:-https://github.com/NVIDIA/TensorRT-LLM.git}"
TRTLLM_REF="${TRTLLM_REF:-feat/deepseek_v4}"
TRTLLM_COMMIT="${TRTLLM_COMMIT:-HEAD}"
IMAGE_REPO="${IMAGE_REPO:-ghcr.io/semianalysisai/trtllm-deepseek-v4}"
IMAGE_WITH_TAG="${IMAGE_WITH_TAG:-}"
CUDA_ARCHS="${CUDA_ARCHS:-100-real;103-real}"
PUSH="${PUSH:-0}"
KEEP_SRC="${KEEP_SRC:-0}"

require_cmd() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "Missing required command: $1" >&2
        exit 1
    fi
}

to_enroot_image() {
    local image="$1"
    local registry="${image%%/*}"
    local rest="${image#*/}"

    if [[ "$image" == "$rest" ]]; then
        printf '%s\n' "$image"
    elif [[ "$registry" == *.* || "$registry" == *:* || "$registry" == "localhost" ]]; then
        printf '%s#%s\n' "$registry" "$rest"
    else
        printf '%s\n' "$image"
    fi
}

require_cmd docker
require_cmd git
require_cmd make

if ! docker buildx version >/dev/null 2>&1; then
    echo "docker buildx is required to build TensorRT-LLM release images." >&2
    exit 1
fi

if ! git lfs version >/dev/null 2>&1; then
    echo "git-lfs is required. Install it, then rerun this script." >&2
    exit 1
fi

WORKDIR=""
if [[ -n "${TRTLLM_SRC_DIR:-}" ]]; then
    SRC_DIR="$TRTLLM_SRC_DIR"
else
    WORKDIR="$(mktemp -d "${TMPDIR:-/tmp}/trtllm-dsv4-build.XXXXXX")"
    SRC_DIR="$WORKDIR/TensorRT-LLM"
fi

cleanup() {
    if [[ -n "$WORKDIR" && "$KEEP_SRC" != "1" ]]; then
        rm -rf "$WORKDIR"
    elif [[ -n "$WORKDIR" ]]; then
        echo "Keeping TensorRT-LLM checkout at $SRC_DIR"
    fi
}
trap cleanup EXIT

if [[ ! -d "$SRC_DIR/.git" ]]; then
    git clone --recurse-submodules --branch "$TRTLLM_REF" "$TRTLLM_REPO" "$SRC_DIR"
fi

cd "$SRC_DIR"
git fetch origin "$TRTLLM_REF"
git checkout -B "$TRTLLM_REF" "origin/$TRTLLM_REF" 2>/dev/null || git checkout "$TRTLLM_REF"
if [[ -n "$TRTLLM_COMMIT" ]]; then
    git checkout "$TRTLLM_COMMIT"
fi
git submodule update --init --recursive
git lfs install --local
git lfs pull

ACTUAL_COMMIT="$(git rev-parse HEAD)"
SHORT_COMMIT="$(git rev-parse --short=7 HEAD)"
REF_TAG="$(printf '%s' "$TRTLLM_REF" | tr '/:@' '-' | tr -c 'A-Za-z0-9_.-' '-')"

if [[ -z "$IMAGE_WITH_TAG" ]]; then
    IMAGE_WITH_TAG="${IMAGE_REPO}:${REF_TAG}-${SHORT_COMMIT}"
fi

echo "Building TensorRT-LLM DeepSeek-V4 image"
echo "  source: $TRTLLM_REPO"
echo "  ref:    $TRTLLM_REF"
echo "  commit: $ACTUAL_COMMIT"
echo "  image:  $IMAGE_WITH_TAG"
echo "  archs:  $CUDA_ARCHS"

make -C docker release_build \
    IMAGE_WITH_TAG="$IMAGE_WITH_TAG" \
    CUDA_ARCHS="$CUDA_ARCHS" \
    GIT_COMMIT="$ACTUAL_COMMIT"

if [[ "$PUSH" == "1" ]]; then
    docker push "$IMAGE_WITH_TAG"
fi

echo
echo "Docker image: $IMAGE_WITH_TAG"
echo "InferenceX/enroot image string: $(to_enroot_image "$IMAGE_WITH_TAG")"
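
For readers who want to see what the two string helpers in the script produce, here is a small Python rendering of `to_enroot_image` and the `REF_TAG` sanitization. It is an illustrative port written for this note, not part of the build script. The Python function names are hypothetical, and the single regex is assumed to be equivalent to the two chained `tr` calls because `/`, `:` and `@` are already outside the allowed character set.

```python
import re


def to_enroot_image(image: str) -> str:
    """Port of the bash helper: rewrite 'registry/rest' as 'registry#rest'
    when the first path component looks like a registry host."""
    registry, sep, rest = image.partition("/")
    if not sep:
        # No slash at all, e.g. 'ubuntu:22.04': leave unchanged.
        return image
    if "." in registry or ":" in registry or registry == "localhost":
        return f"{registry}#{rest}"
    # First component is a namespace such as 'nvidia', not a registry host.
    return image


def sanitize_ref(ref: str) -> str:
    """Port of the REF_TAG pipeline: every character outside
    [A-Za-z0-9_.-] becomes '-'."""
    return re.sub(r"[^A-Za-z0-9_.-]", "-", ref)


if __name__ == "__main__":
    print(to_enroot_image("ghcr.io/semianalysisai/trtllm-deepseek-v4:fix-mhc7168-eb20e9e"))
    print(sanitize_ref("feat/deepseek_v4"))
```

With the default `TRTLLM_REF`, the sanitized ref is `feat-deepseek_v4`, and a `ghcr.io/...` image is reported to enroot with `#` after the registry host.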

mingyangHao

LGTM. I tested it locally and all tests passed. Some test coverage has been added as well.

mingyangHao · 2026-05-05T08:06:32Z

Please make sure the pre-commit check passes, thank you.

pcastonguay · 2026-05-05T18:13:37Z

Fixed pre-commit in #13771 and merged it. Closing this one.

Oseltamivir requested a review from a team as a code owner May 3, 2026 05:34

Oseltamivir requested review from mikeiovine and removed request for a team May 3, 2026 05:34

github-actions Bot assigned Oseltamivir May 3, 2026

Oseltamivir changed the title ~~Fix fused MHC for DeepSeek-V4-Pro hidden size~~ [fix] Fused MHC for DSv4 hidden size May 3, 2026

svc-trtllm-gh-bot added the "Community want to contribute" label (PRs initiated from Community) May 3, 2026

Oseltamivir changed the title ~~[fix] Fused MHC for DSv4 hidden size~~ [None][fix] Fix fused MHC for DeepSeek-V4-Pro hidden size May 3, 2026

Oseltamivir force-pushed the fix/dsv4-pro-fused-mhc-hidden-7168 branch from eb20e9e to 23b1492 May 3, 2026 17:26

functionstackx mentioned this pull request May 3, 2026

Update DSV4 TRT fused MHC image SemiAnalysisAI/InferenceX#1270

Merged

mikeiovine approved these changes May 4, 2026

juney-nvidia requested a review from mingyangHao May 5, 2026 01:49

mingyangHao reviewed May 5, 2026

Comment thread cpp/tensorrt_llm/kernels/mhcKernels/mhcFusedHcKernel.cu Outdated

Comment thread cpp/tensorrt_llm/kernels/mhcKernels/mhcFusedHcKernel.cu

mingyangHao requested changes May 5, 2026

Comment thread tensorrt_llm/_torch/modules/mhc/mhc_cuda.py

Oseltamivir force-pushed the fix/dsv4-pro-fused-mhc-hidden-7168 branch from 23b1492 to 5e7c96f May 5, 2026 03:33

fix: support 7168 fused MHC hidden size

c43326b

Signed-off-by: Oseltamivir <bryansg2013@gmail.com>

Oseltamivir force-pushed the fix/dsv4-pro-fused-mhc-hidden-7168 branch from 5e7c96f to c43326b May 5, 2026 03:37

test: add DeepSeek-V4 Pro MHC coverage

c7cbaa4

Signed-off-by: Mingyang Hao <mingyangHao@users.noreply.github.com>

mingyangHao approved these changes May 5, 2026

lfr-0531 added the deepseek-v4 label May 5, 2026

pcastonguay mentioned this pull request May 5, 2026

[None][fix] Fix fused MHC for DeepSeek-V4-Pro hidden size #13771

Merged

pcastonguay closed this May 5, 2026
