[None][fix] Fix fused MHC for DeepSeek-V4-Pro hidden size#13710
Oseltamivir wants to merge 2 commits into NVIDIA:feat/deepseek_v4
eb20e9e to 23b1492
/bot run
PR_Github #46665 [ run ] triggered by Bot. Commit:
PR_Github #46665 [ run ] completed with state
Can I get details of CI failure?
Tagging @mingyangHao for vis on this PR.
23b1492 to 5e7c96f
Signed-off-by: Oseltamivir <bryansg2013@gmail.com>
5e7c96f to c43326b
@mingyangHao do you have info on CI failure?
Hi, I can see there is a build error, but I don't think that is related to your commit.
oof
@mingyangHao if you wanna test, I have an image at https://github.com/orgs/SemiAnalysisAI/packages/container/package/trtllm-deepseek-v4 Build script
Signed-off-by: Mingyang Hao <mingyangHao@users.noreply.github.com>
mingyangHao left a comment
LGTM. I have tested it locally and they all passed. Some test coverage has been added as well.
Please make sure the pre-commit check passes, thank you.
Fixed pre-commit in #13771 and merged it. Closing this one.
Summary
This fixes the SM100 fused mHC hyper-connection path for DeepSeek-V4-Pro.
DeepSeek-V4-Pro uses hidden size 7168, but the fused-HC MMA launcher was still effectively wired for hidden size 4096. The Python runner could select `trtllm::mhc_fused_hc` for 7168 tensors, while the C++ MMA path used compile-time shape constants and TMA descriptors built around the previous 4096-only instantiation. That can run without an immediate crash, but it corrupts hidden states and produces invalid generations.
Issue
The fused-HC MMA kernels are statically instantiated. Before this change:
- `mhcFusedHcKernel.cu` had a single `FHC_HIDDEN = 4096` constant.
- `SHAPE_K`, residual/x TMA descriptors, and the MMA kernel template instantiations were all tied to that hidden size.
A direct 7168 instantiation also cannot blindly compile every existing `kNumSplits` value. With `BLOCK_K=64`, hidden size 7168 has `7168 / 64 = 112` H tiles, so `kNumSplits=32` and `64` violate the kernel's compile-time split constraints. Valid MMA split sizes for 7168 are `1, 2, 4, 8, 16`.
Run with failed evals: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25231354124/job/73987414168
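The tile arithmetic above can be checked with a short sketch. It assumes the split constraint is simply that the number of H tiles must be divisible by `kNumSplits`, and that the candidate splits are the powers of two up to 64; both are assumptions chosen because they reproduce the numbers quoted here, not a copy of the kernel's actual static assertions. The helper names are hypothetical.

```python
BLOCK_K = 64  # K-tile width quoted in the PR description

# Assumed candidate split counts (powers of two up to 64); hypothetical.
CANDIDATE_SPLITS = (1, 2, 4, 8, 16, 32, 64)


def num_h_tiles(hidden_size: int, block_k: int = BLOCK_K) -> int:
    """Number of BLOCK_K-wide tiles along the hidden dimension."""
    if hidden_size % block_k != 0:
        raise ValueError(f"hidden_size {hidden_size} is not a multiple of {block_k}")
    return hidden_size // block_k


def valid_splits(hidden_size: int, candidates=CANDIDATE_SPLITS) -> list:
    """Keep only split counts that divide the tile count evenly (assumed rule)."""
    tiles = num_h_tiles(hidden_size)
    return [s for s in candidates if tiles % s == 0]


print(num_h_tiles(7168))   # 112
print(valid_splits(7168))  # [1, 2, 4, 8, 16]
print(valid_splits(4096))  # [1, 2, 4, 8, 16, 32, 64]
```

Under this rule, 4096 accepts every candidate because it yields 64 tiles, while 7168 yields 112 = 16 × 7 tiles, so 32 and 64 drop out.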
Fix
- `mhcFusedHcLaunch` and `mhcFusedHcAllInOneLaunch` based on `hidden_size`.
- `Hidden`, so `SHAPE_K` and TMA descriptors use the runtime-matched compile-time hidden size.
- `hidden_size`/`kNumSplits` validation so unsupported specializations are not instantiated.
- `MhcFusedHcRunner.get_valid_tactics`, so the autotuner does not emit invalid MMA tactics for 7168.
- `mhcKernels.h`.
- `hidden_size % 64 == 0`.
Image with build of forked trtllm: https://github.com/orgs/SemiAnalysisAI/packages/container/package/trtllm-deepseek-v4
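As a rough illustration of how tactic filtering and launch-time validation can be kept in agreement, here is a minimal sketch. It is not the code from this PR: `Tactic`, `is_supported`, `sketch_valid_tactics`, and `sketch_launch` are hypothetical names, the supported hidden sizes `(4096, 7168)` and candidate splits are assumptions, and the divisibility rule is inferred from the tile counts quoted in the Issue section.

```python
from dataclasses import dataclass

BLOCK_K = 64                                  # K-tile width quoted in the PR
SUPPORTED_HIDDEN_SIZES = (4096, 7168)         # assumption: sizes with an instantiation
CANDIDATE_SPLITS = (1, 2, 4, 8, 16, 32, 64)   # assumption: powers of two up to 64


@dataclass(frozen=True)
class Tactic:
    """Hypothetical stand-in for one autotuner choice."""
    hidden_size: int
    num_splits: int


def is_supported(hidden_size: int, num_splits: int) -> bool:
    """One rule shared by the tactic filter and the launcher (assumed rule)."""
    if hidden_size not in SUPPORTED_HIDDEN_SIZES:
        return False
    if hidden_size % BLOCK_K != 0:
        return False
    return (hidden_size // BLOCK_K) % num_splits == 0


def sketch_valid_tactics(hidden_size: int) -> list:
    """Only emit tactics that the launcher below would accept."""
    return [Tactic(hidden_size, s) for s in CANDIDATE_SPLITS
            if is_supported(hidden_size, s)]


def sketch_launch(tactic: Tactic) -> str:
    """Refuse unsupported combinations instead of running a mismatched kernel."""
    if not is_supported(tactic.hidden_size, tactic.num_splits):
        raise ValueError(f"no specialization for {tactic}")
    return f"mma<hidden={tactic.hidden_size}, splits={tactic.num_splits}>"


for t in sketch_valid_tactics(7168):
    print(sketch_launch(t))
```

Routing both the filter and the launcher through one predicate means the autotuner cannot propose a combination that the launcher would reject, and an unsupported hidden size fails loudly rather than silently corrupting hidden states.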
Validation
- `python3 -m py_compile tensorrt_llm/_torch/modules/mhc/mhc_cuda.py`
- `ghcr.io/semianalysisai/trtllm-deepseek-v4:fix-mhc7168-eb20e9e`
- `TRTLLM_MHC_ENABLE_FUSED_HC=1`): https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25270557269
- `trtllm::mhc_fused_hc` on 7168 hidden-size tensors and completed successfully.