Add comprehensive deep research documentation for TVM TIR to Tensor Core (WMMA) lowering pipeline #5
## Overview
This PR provides a complete, source-code-grounded deep dive into how Apache TVM lowers TensorIR (TIR) to NVIDIA Tensor Core instructions, covering the entire pipeline from Python-level tensor intrinsics to final PTX assembly. This addresses the need for comprehensive documentation of TVM's tensor core support and provides concrete examples for developers working with CUDA tensor cores.
## What's Included

### 📚 Comprehensive Documentation (95KB+)
- **Main Research Document** (`docs/TVM_TIR_to_TensorCore_Lowering_Deep_Research.md`)
- **Visual Flow Diagrams** (`docs/TVM_TensorCore_Lowering_Flow_Diagrams.md`)
- **Quick Start Guide** (`RESEARCH_README.md`)

### 🚀 Minimal Reproducible Example

- **Executable Demo** (`examples/tensor_core_mre.py`)

The MRE demonstrates the documented pipeline end to end; the expected output shows `mma.sync.aligned` instructions in the generated PTX.

## Key Findings
### Two Parallel Lowering Paths
|  | WMMA path | MMA (inline PTX) path |
|---|---|---|
| TIR intrinsics | `wmma_sync_16x16x16_*` | `mma_f16f16f32`, `mma_i8i8i32` |
| Buffer scopes | `wmma.matrix_a`/`b`/`accumulator` | `warp`, `m16n8k8.matrixA/B/C` fragments |
| CodeGen emission | `nvcuda::wmma::mma_sync()` | `__asm__ volatile` inline PTX |

### Architecture Support
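To make the split concrete, here is a toy classifier (illustrative only, not a TVM API; the intrin-name prefixes follow the registrations in `python/tvm/tir/tensor_intrin/cuda.py`):

```python
# Illustrative helper, not a TVM API: map a CUDA tensor-intrin name, as
# registered in python/tvm/tir/tensor_intrin/cuda.py, to the documented
# lowering path it takes.
def lowering_path(intrin_name: str) -> str:
    if intrin_name.startswith("wmma_"):
        # Fragment scopes wmma.matrix_a/b/accumulator -> C++ WMMA API
        return "nvcuda::wmma C++ API"
    if intrin_name.startswith("mma_"):
        # warp scope, m16n8k8 tiles -> inline PTX mma.sync.aligned
        return "inline PTX mma.sync.aligned"
    return "non-Tensor-Core path"

print(lowering_path("wmma_sync_16x16x16_f16f16f32"))  # nvcuda::wmma C++ API
print(lowering_path("mma_f16f16f32"))                 # inline PTX mma.sync.aligned
```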
On supported architectures, the generated PTX contains `mma.sync.aligned` instructions.

## Source Code References
All documentation includes specific file paths and line numbers for version `d03d0ba9340c509e983dd7066d3a182ad00e9622`:

- `python/tvm/tir/tensor_intrin/cuda.py`: lines 222-1371 (intrinsic definitions)
- `src/target/source/codegen_cuda.cc`: lines 933-1720 (CodeGen emission)
- `src/target/source/ptx.cc`: lines 542-596 (PTX assembly generation)
- `src/tir/transforms/tensorcore_infer_fragment.cc`: fragment metadata inference

## Usage Example
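The body of this section did not survive rendering; below is a hedged sketch of the intended flow. The full, tested script lives in `examples/tensor_core_mre.py`; `tvm.build` and `imported_modules[0].get_source()` are standard TVM APIs, while the helper function and the PTX sample are ours for illustration.

```python
def tensor_core_lines(ptx: str) -> list[str]:
    """Return the PTX lines that indicate Tensor Core usage."""
    markers = ("mma.sync.aligned", "wmma.mma.sync", "ldmatrix")
    return [line.strip() for line in ptx.splitlines()
            if any(m in line for m in markers)]

# With a CUDA-enabled TVM build and a tensorized TIR module `mod`
# (see examples/tensor_core_mre.py), the PTX would be obtained via:
#   lib = tvm.build(mod, target="cuda -arch=sm_89")
#   ptx = lib.imported_modules[0].get_source()
# and then checked with:
#   assert tensor_core_lines(ptx), "no Tensor Core instructions emitted!"

# Demonstration on a hand-written PTX fragment:
sample = "ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%r0}, [%rd1];"
print(tensor_core_lines(sample))  # -> the one matching line
```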
## Validation Methods
Four methods are documented for verifying Tensor Core usage, including inspecting the generated PTX for `mma.sync` and `ldmatrix` instructions and checking fragment buffer scopes (`wmma.*` or `warp`).

## Files Changed
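The buffer-scope check can be sketched as a plain-text scan of a printed TIR module; the scope strings are the ones documented above, while the regex and function are illustrative assumptions of ours:

```python
import re

def fragment_scopes(tir_script: str) -> set[str]:
    """Collect Tensor Core buffer scopes mentioned in a printed TIR module."""
    pattern = r'"(wmma\.matrix_[ab]|wmma\.accumulator|warp)"'
    return set(re.findall(pattern, tir_script))

# Example input mimicking a printed TVMScript module:
snippet = '''
A_frag = T.alloc_buffer((16, 16), "float16", scope="wmma.matrix_a")
C_warp = T.alloc_buffer((32, 8), "float32", scope="warp")
'''
print(sorted(fragment_scopes(snippet)))  # ['warp', 'wmma.matrix_a']
```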
- `docs/TVM_TIR_to_TensorCore_Lowering_Deep_Research.md` (new, 30KB)
- `docs/TVM_TensorCore_Lowering_Flow_Diagrams.md` (new, 30KB)
- `RESEARCH_README.md` (new, 13KB)
- `SUMMARY.md` (new, 7KB)
- `examples/tensor_core_mre.py` (new, 14KB)
- `examples/README.md` (new, 1KB)

## Benefits

This documentation enables developers to trace, reproduce, and validate TVM's Tensor Core lowering end to end.
All claims are backed by source code evidence with specific file paths and line numbers. The MRE provides executable proof of the documented behavior on sm_89 (Ada) architecture.
## Original prompt
🔎 Deep Research Prompt: TVM TIR → Tensor Core (WMMA) Lowering
**ROLE**

You are a senior researcher in TVM compilers and GPU kernels. Your task is to use source-code evidence + a minimal reproducible experiment (MRE) + artifact verification to fully trace how TVM's TIR is tensorized/lowered into NVIDIA Tensor Core instructions (`nvcuda::wmma::*` / `mma.sync.aligned.*`, and `wgmma.*` where available).

**GOAL**

- The end-to-end data and control flow: TensorIR → (`tensorize`) → TIR intrin call → TIR transform passes → CUDA/NVVM/LLVM CodeGen → C++ WMMA API / LLVM NVVM intrinsics / inline PTX → final PTX/SASS.
- Focus on `mma.sync.aligned.*` (Ada: no WGMMA), and also note whether sm_90 (Hopper) can take the `wgmma.mma_async.*` path (if the TVM version supports it).

**Deliverables (required)**
A. **Architecture diagram**: the flow from TensorIR → (`tensorize`) → TIR intrin → TIR passes (names + order) → CodeGen CUDA/LLVM NVPTX → WMMA API/NVVM/inline PTX → PTX/SASS.
B. **Source inventory**: list each key file/function/class with 1-2 sentences on its responsibility and relationships; give the Git commit SHA that pins the version.
C. **Explicit mapping table**: memory scopes (`wmma.matrix_a`, `wmma.matrix_b`, `wmma.accumulator`) → CodeGen targets (`nvcuda::wmma::mma_sync` / `llvm.nvvm.wmma.*` / `mma.sync.aligned.*`) → shape/layout/type combinations (`m16n16k16`, row/col-major, `f16/f16/f16/f16` combinations).
D. **MRE evidence**: a complete script (TensorIR + schedule `tensorize`); a build with `target="cuda -arch=sm_89"`; PTX fragments from the build artifact (at least 2 lines containing `mma.sync.aligned.m16n16k16.*`); evidence (source and PTX) for whether the `wgmma` branch is supported and triggered by TVM.
E. **Decision conditions**: which flags/targets/attrs decide WMMA vs. ordinary CUDA cores; where the choice among C++ WMMA API vs. LLVM NVVM intrinsics vs. inline PTX is made.
F. **Version differences**: summarize TVM's changes over the last two years (e.g. v0.10 → main): file relocations, API renames, WGMMA support PRs, etc.
G. **Pitfalls and boundaries**: layouts (A/B row/col), alignment, fragment shapes, the paired constraints of `ldmatrix`/store, and when lowering falls back to a non-Tensor-Core path.

**Directions and leads to check (verify each against the source)**
TensorIntrin definitions (Python side):

- `python/tvm/tir/tensor_intrin/cuda/*.py`, `python/tvm/topi/cuda/tensor_intrin.py`, and the related `tvm/script/ir_builder/tir`.
- Look for `wmma_load_matrix_sync` / `wmma_store_matrix_sync` / `wmma_fill` / `wmma_mma_sync` and their shape/type/layout signatures.

Schedule → Tensorize → TIR rewriting:

- How does `tensorize` map loops onto the TensorIntrins above? Which calls are emitted (`T.call_intrin` / `T.call_extern` / `tir.ptx_*`, etc.)?

TIR pass pipeline (TVM Pass Infra):

- `LowerTensorCore`, `RewriteTensorCore`, `LowerWarpMemory`, `LowerIntrin`, `StorageRewrite`, etc. (list the full names as they appear in the source).

CodeGen layer:

- `src/target/source/codegen_cuda.cc`, `src/target/llvm/codegen_llvm.cc`, `src/target/source/codegen_nvptx.cc`, etc.
- Which branch maps a given TIR Call to: the `nvcuda::wmma::load/mma/store` C++ API (requires `#include <mma.h>`), `llvm.nvvm.wmma.*` intrinsics, or inline PTX (`mma.sync.aligned.*` / `wgmma.*`)?
- Quote the key switch/case/if statements in the source and point out the trigger conditions (target architecture, dtype, layout).
Target / attr triggers:

- `target = "cuda -arch=sm_89"` (Ada: should take `mma.sync`; no `wgmma`)
- `sm_90` (Hopper: possibly `wgmma`; verify TVM's support status and trigger path)
- Do `tir.use_async_copy`, `-max_num_threads`, `tvm_use_tensorcore`, etc. affect the path?

Minimal reproducible experiment (provide the full script & artifacts):

- Build a 16×16×16 FP16 GEMM (A: row-major, B: col-major, or as the TVM intrin requires) and use TensorIR + schedule `tensorize` to map it to the WMMA intrin.
- After `tvm.build(mod, target="cuda -arch=sm_89")`, export and print the PTX: `mod.imported_modules[0].get_source()` or `mod.get_source("ptx")` (adapt per version), and capture the `mma.sync.aligned.m16n16k16.*` PTX lines.
- Extra: if the environment allows, switch to `-arch=sm_90` and repeat to verify whether `wgmma.mma_async.*` appears; if unsupported, provide corroboration at the source and issue/PR level.

**Methodology & tools**
Source search suggestions (ripgrep):

- `rg -n "wmma" src/ python/`
- `rg -n "mma_sync|mma\.sync|wgmma" -g "*.{cc,py,cu}"`
- `rg -n "tensorize|TensorIntrin|matrix_a|matrix_b|accumulator"`

Printing the pass flow: look up how to print the `relay.build` / `tir.transform` pipeline (via PassInstrument or `TVM_LOG_DEBUG`). If intrinsics are registered on the Python side, confirm the binding between the intrin descriptor and its implementation (desc/impl).
**Output format (follow strictly)**

- `file path :: short description :: key function/class :: trigger condition / key points`
- `TIR intrin → CodeGen API/intrinsic → PTX instruction (with shape/layout/type)`

**Quality red lines**

**Environment assumptions (adjust to reality, but record them)**

- Base on the latest commit of the `main` branch and record its SHA; if comparing against a stable release, state the tag.