
[None][fix] Fix fused MHC for DeepSeek-V4-Pro hidden size#13710

Closed
Oseltamivir wants to merge 2 commits into NVIDIA:feat/deepseek_v4 from Oseltamivir:fix/dsv4-pro-fused-mhc-hidden-7168

Conversation

@Oseltamivir

@Oseltamivir Oseltamivir commented May 3, 2026

Summary

This fixes the SM100 fused mHC hyper-connection path for DeepSeek-V4-Pro.

DeepSeek-V4-Pro uses hidden size 7168, but the fused-HC MMA launcher was still effectively wired for hidden size 4096. The Python runner could select trtllm::mhc_fused_hc for 7168 tensors, while the C++ MMA path used compile-time shape constants and TMA descriptors built around the previous 4096-only instantiation. That can run without an immediate crash, but it corrupts hidden states and produces invalid generations.

Issue

The fused-HC MMA kernels are statically instantiated. Before this change:

  • mhcFusedHcKernel.cu had a single FHC_HIDDEN = 4096 constant.
  • SHAPE_K, residual/x TMA descriptors, and the MMA kernel template instantiations were all tied to that hidden size.
  • The Python autotuner treated the fused-HC runner as generic once SM100 MMA support was available.
  • DeepSeek-V4-Pro requests with hidden size 7168 could therefore enter the fused-HC path even though the C++ MMA instantiation did not match the runtime shape.

A direct 7168 instantiation also cannot blindly compile every existing kNumSplits value. With BLOCK_K=64, hidden size 7168 has 7168 / 64 = 112 H tiles, so kNumSplits=32 and 64 violate the kernel's compile-time split constraints. Valid MMA split sizes for 7168 are 1, 2, 4, 8, 16.
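The arithmetic above can be sketched as a small check. This is an illustrative assumption, not the kernel source: it models the compile-time split constraint as "kNumSplits must evenly divide the number of BLOCK_K-sized H tiles", which reproduces the stated valid splits for both hidden sizes.

```python
# Sketch of the split-size constraint described above. The divisibility
# rule is an assumption inferred from the numbers in the PR description,
# not code taken from mhcFusedHcKernel.cu.
BLOCK_K = 64

def valid_splits(hidden_size, candidates=(1, 2, 4, 8, 16, 32, 64)):
    """Return the candidate kNumSplits values that evenly divide the H-tile count."""
    num_h_tiles = hidden_size // BLOCK_K  # 7168 // 64 == 112
    return [s for s in candidates if num_h_tiles % s == 0]

print(valid_splits(7168))  # [1, 2, 4, 8, 16] -- 32 and 64 drop out
print(valid_splits(4096))  # [1, 2, 4, 8, 16, 32, 64] -- all 64 tiles divide evenly
```

Under this model, 4096 (64 tiles) admits every candidate split, which is consistent with the old 4096-only instantiation compiling all kNumSplits values, while 7168 (112 tiles) rejects 32 and 64.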

Run with failed evals: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25231354124/job/73987414168

Fix

  • Replace the single 4096 fused-HC hidden constant with explicit supported hidden sizes: 4096 and 7168.
  • Add runtime dispatch for mhcFusedHcLaunch and mhcFusedHcAllInOneLaunch based on hidden_size.
  • Template the MMA fused-HC launch implementations on Hidden, so SHAPE_K and TMA descriptors use the runtime-matched compile-time hidden size.
  • Add compile-time hidden_size / kNumSplits validation so unsupported specializations are not instantiated.
  • Mirror that same split-size filtering in MhcFusedHcRunner.get_valid_tactics, so the autotuner does not emit invalid MMA tactics for 7168.
  • Make fallback tactic selection hidden-size aware.
  • Document the new shape contract in mhcKernels.h.
  • Add explicit FMA-path guards requiring hidden_size % 64 == 0.
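The Python-side filtering from the list above can be sketched as follows. The names (`SUPPORTED_HIDDEN_SIZES`, `get_valid_tactics`) mirror the PR description, but the exact signatures and constants are assumptions, not the actual `MhcFusedHcRunner` API:

```python
# Hypothetical sketch of hidden-size-aware tactic filtering, mirroring the
# get_valid_tactics change described in the fix. Constants and the
# divisibility rule are assumptions consistent with the PR description.
BLOCK_K = 64
SUPPORTED_HIDDEN_SIZES = (4096, 7168)
ALL_SPLITS = (1, 2, 4, 8, 16, 32, 64)

def get_valid_tactics(hidden_size):
    # Reject shapes the C++ side has no instantiation for, instead of
    # silently dispatching a mismatched 4096 kernel.
    if hidden_size not in SUPPORTED_HIDDEN_SIZES or hidden_size % BLOCK_K != 0:
        return []
    num_h_tiles = hidden_size // BLOCK_K
    # Only emit splits that satisfy the compile-time constraint, so the
    # autotuner never proposes an MMA tactic that was not instantiated.
    return [s for s in ALL_SPLITS if num_h_tiles % s == 0]
```

The point of filtering on the Python side as well as validating at compile time is that an invalid tactic is never even offered to the autotuner, rather than failing (or silently corrupting state) when launched.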

Image with build of forked trtllm: https://github.com/orgs/SemiAnalysisAI/packages/container/package/trtllm-deepseek-v4

Validation

@Oseltamivir Oseltamivir requested a review from a team as a code owner May 3, 2026 05:34
@Oseltamivir Oseltamivir requested review from mikeiovine and removed request for a team May 3, 2026 05:34
@Oseltamivir Oseltamivir changed the title Fix fused MHC for DeepSeek-V4-Pro hidden size [fix] Fused MHC for DSv4 hidden size May 3, 2026
@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label May 3, 2026
@Oseltamivir Oseltamivir changed the title [fix] Fused MHC for DSv4 hidden size [None][fix] Fix fused MHC for DeepSeek-V4-Pro hidden size May 3, 2026
@Oseltamivir Oseltamivir force-pushed the fix/dsv4-pro-fused-mhc-hidden-7168 branch from eb20e9e to 23b1492 Compare May 3, 2026 17:26
@mikeiovine
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #46665 [ run ] triggered by Bot. Commit: 23b1492 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46665 [ run ] completed with state SUCCESS. Commit: 23b1492
/LLM/main/L0_MergeRequest_PR pipeline #36706 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Oseltamivir
Author

@mikeiovine

Can I get details of the CI failure?

@juney-nvidia juney-nvidia requested a review from mingyangHao May 5, 2026 01:49
@juney-nvidia
Collaborator

Tagging @mingyangHao for vis on this PR.

Comment thread cpp/tensorrt_llm/kernels/mhcKernels/mhcFusedHcKernel.cu Outdated
Comment thread cpp/tensorrt_llm/kernels/mhcKernels/mhcFusedHcKernel.cu
Comment thread tensorrt_llm/_torch/modules/mhc/mhc_cuda.py
@Oseltamivir Oseltamivir force-pushed the fix/dsv4-pro-fused-mhc-hidden-7168 branch from 23b1492 to 5e7c96f Compare May 5, 2026 03:33
Signed-off-by: Oseltamivir <bryansg2013@gmail.com>
@Oseltamivir Oseltamivir force-pushed the fix/dsv4-pro-fused-mhc-hidden-7168 branch from 5e7c96f to c43326b Compare May 5, 2026 03:37
@Oseltamivir
Author

@mingyangHao do you have info on CI failure?

@mingyangHao
Collaborator

> @mingyangHao do you have info on CI failure?

Hi, I can see there is a build error, but I don't think it is related to your commit.

@Oseltamivir
Author

oof

@Oseltamivir
Author

@mingyangHao if you want to test, I have an image at https://github.com/orgs/SemiAnalysisAI/packages/container/package/trtllm-deepseek-v4

Build script
#!/usr/bin/env bash

set -euo pipefail

TRTLLM_REPO="${TRTLLM_REPO:-https://github.com/NVIDIA/TensorRT-LLM.git}"
TRTLLM_REF="${TRTLLM_REF:-feat/deepseek_v4}"
TRTLLM_COMMIT="${TRTLLM_COMMIT:-HEAD}"
IMAGE_REPO="${IMAGE_REPO:-ghcr.io/semianalysisai/trtllm-deepseek-v4}"
IMAGE_WITH_TAG="${IMAGE_WITH_TAG:-}"
CUDA_ARCHS="${CUDA_ARCHS:-100-real;103-real}"
PUSH="${PUSH:-0}"
KEEP_SRC="${KEEP_SRC:-0}"

require_cmd() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "Missing required command: $1" >&2
        exit 1
    fi
}

to_enroot_image() {
    local image="$1"
    local registry="${image%%/*}"
    local rest="${image#*/}"

    if [[ "$image" == "$rest" ]]; then
        printf '%s\n' "$image"
    elif [[ "$registry" == *.* || "$registry" == *:* || "$registry" == "localhost" ]]; then
        printf '%s#%s\n' "$registry" "$rest"
    else
        printf '%s\n' "$image"
    fi
}

require_cmd docker
require_cmd git
require_cmd make

if ! docker buildx version >/dev/null 2>&1; then
    echo "docker buildx is required to build TensorRT-LLM release images." >&2
    exit 1
fi

if ! git lfs version >/dev/null 2>&1; then
    echo "git-lfs is required. Install it, then rerun this script." >&2
    exit 1
fi

WORKDIR=""
if [[ -n "${TRTLLM_SRC_DIR:-}" ]]; then
    SRC_DIR="$TRTLLM_SRC_DIR"
else
    WORKDIR="$(mktemp -d "${TMPDIR:-/tmp}/trtllm-dsv4-build.XXXXXX")"
    SRC_DIR="$WORKDIR/TensorRT-LLM"
fi

cleanup() {
    if [[ -n "$WORKDIR" && "$KEEP_SRC" != "1" ]]; then
        rm -rf "$WORKDIR"
    elif [[ -n "$WORKDIR" ]]; then
        echo "Keeping TensorRT-LLM checkout at $SRC_DIR"
    fi
}
trap cleanup EXIT

if [[ ! -d "$SRC_DIR/.git" ]]; then
    git clone --recurse-submodules --branch "$TRTLLM_REF" "$TRTLLM_REPO" "$SRC_DIR"
fi

cd "$SRC_DIR"
git fetch origin "$TRTLLM_REF"
git checkout -B "$TRTLLM_REF" "origin/$TRTLLM_REF" 2>/dev/null || git checkout "$TRTLLM_REF"
if [[ -n "$TRTLLM_COMMIT" ]]; then
    git checkout "$TRTLLM_COMMIT"
fi
git submodule update --init --recursive
git lfs install --local
git lfs pull

ACTUAL_COMMIT="$(git rev-parse HEAD)"
SHORT_COMMIT="$(git rev-parse --short=7 HEAD)"
REF_TAG="$(printf '%s' "$TRTLLM_REF" | tr '/:@' '-' | tr -c 'A-Za-z0-9_.-' '-')"

if [[ -z "$IMAGE_WITH_TAG" ]]; then
    IMAGE_WITH_TAG="${IMAGE_REPO}:${REF_TAG}-${SHORT_COMMIT}"
fi

echo "Building TensorRT-LLM DeepSeek-V4 image"
echo "  source: $TRTLLM_REPO"
echo "  ref:    $TRTLLM_REF"
echo "  commit: $ACTUAL_COMMIT"
echo "  image:  $IMAGE_WITH_TAG"
echo "  archs:  $CUDA_ARCHS"

make -C docker release_build \
    IMAGE_WITH_TAG="$IMAGE_WITH_TAG" \
    CUDA_ARCHS="$CUDA_ARCHS" \
    GIT_COMMIT="$ACTUAL_COMMIT"

if [[ "$PUSH" == "1" ]]; then
    docker push "$IMAGE_WITH_TAG"
fi

echo
echo "Docker image: $IMAGE_WITH_TAG"
echo "InferenceX/enroot image string: $(to_enroot_image "$IMAGE_WITH_TAG")"

Signed-off-by: Mingyang Hao <mingyangHao@users.noreply.github.com>
Collaborator

@mingyangHao mingyangHao left a comment


LGTM. I have tested it locally and all tests passed. Some test coverage has been added as well.

@mingyangHao
Collaborator

Please make sure the pre-commit checks pass, thank you.

@pcastonguay
Collaborator

Fixed pre-commit in #13771 and merged that one instead. Closing this PR.

@pcastonguay pcastonguay closed this May 5, 2026
