Conversation
Contributor
Pull Request Overview
This PR introduces a new fpA intB GEMM kernel based on the WeightOnlyGroupwiseQuantGemmPlugin from TensorRT-LLM to improve LLM throughput. Key changes include the addition of several new header files implementing specialized dequantized multistage MMA operations, updated traits for int8 and fpA intB GEMM kernels, and CMake configuration updates to support enhanced CUDA architectures and features.
Reviewed Changes
Copilot reviewed 80 out of 80 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/threadblock/default_dq_mma_multistage.h | Implements dequantized multistage MMA operations with template specializations for different quantization methods. |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/threadblock/default_dq_mma.h | Declares threadblock-scoped GEMM kernels with dequantization support. |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/kernel/mixed_gemm_B_layout.h | Provides layout specializations to unify weight layouts for quantized GEMM kernels. |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/kernel/default_int8_traits.h | Defines architecture traits for int8 GEMM implementations. |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/kernel/default_fpA_intB_traits.h | Introduces fpA intB traits to support mixed precision GEMM operations. |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/device/gemm_universal_base_compat.h | Implements a compatibility layer for universal GEMM kernels supporting different modes and split-K configurations. |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/collective/*.hpp and *.inl | Adds collective implementations and builders for interleaved MMA operations targeting SM90 and later architectures. |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/epilogue_helpers.h and fused_activations.h | Defines helper types and fused activation functions to support various epilogue operations. |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/compute_occupancy.h | Provides utilities for computing the occupancy of CUDA kernels based on shared memory usage. |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/arch/mma.h | Defines new tags and helper templates for dequantized MMA operators based on the quantization scheme. |
| onnxruntime/contrib_ops/cuda/llm/common/* | Adds common utilities for workspace memory alignment, CUDA runtime queries, and logging. |
| cmake/* | Updates CMake configuration to properly detect and set up supported CUDA architectures and features (e.g. enabling FP8/FP4, TMA support). |
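The occupancy utility in `compute_occupancy.h` bounds how many thread blocks can be resident on an SM given each kernel's shared-memory footprint. A minimal sketch of that idea, with illustrative per-SM limits (roughly an A100) rather than values queried from a real device:

```python
# Hedged sketch of the occupancy calculation: resident blocks per SM are
# capped by the hardware block limit and by shared-memory capacity.
# Both limits below are assumptions for illustration.
SMEM_PER_SM = 164 * 1024   # bytes of shared memory per SM (assumed)
MAX_BLOCKS_PER_SM = 32     # hardware cap on resident blocks (assumed)

def blocks_per_sm(smem_per_block: int) -> int:
    """Upper bound on resident blocks per SM for a given shared-memory request."""
    if smem_per_block == 0:
        return MAX_BLOCKS_PER_SM
    return min(MAX_BLOCKS_PER_SM, SMEM_PER_SM // smem_per_block)
```

In the real header this query is delegated to the CUDA runtime (e.g. `cudaOccupancyMaxActiveBlocksPerMultiprocessor`), which also accounts for registers and thread limits.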
nenad1002
reviewed
May 29, 2025
onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/epilogue/thread/fused_activations.h
nenad1002
approved these changes
May 30, 2025
tianleiwu
added a commit
that referenced
this pull request
Jun 3, 2025
### Description
Implement the fpA intB GEMM preprocess in a CUDA kernel to speed up weight prepacking.

### Motivation and Context
The original preprocess code (in #24854) runs on the CPU, which is slow and requires an extra memory copy between CPU and GPU.
tianleiwu
added a commit
that referenced
this pull request
Jun 5, 2025
### Description
* Enable fp16 intB GEMM kernels when zero points are not provided.
* Minor changes to `fpA_intB_gemv/dispatcher.h` to fix a build error for sm < 5.3.
* Minor changes to `fpA_intB_gemm_preprocessors_impl.h` to fix unreachable-code warnings in debug builds.

Note that existing test cases such as `MatMulNBits.Fp16_Int4_NoZeroPoint` cover this path.

### Motivation and Context
The zero point input is optional for MatMulNBits. In #24854, fp16 intB GEMM was only enabled when zero points are provided.
javier-intel
pushed a commit
to intel/onnxruntime
that referenced
this pull request
Jun 15, 2025
### Description
Implement the fpA intB GEMM preprocess in a CUDA kernel to speed up weight prepacking.

### Motivation and Context
The original preprocess code (in microsoft#24854) runs on the CPU, which is slow and requires an extra memory copy between CPU and GPU.
javier-intel
pushed a commit
to intel/onnxruntime
that referenced
this pull request
Jun 15, 2025
### Description
* Enable fp16 intB GEMM kernels when zero points are not provided.
* Minor changes to `fpA_intB_gemv/dispatcher.h` to fix a build error for sm < 5.3.
* Minor changes to `fpA_intB_gemm_preprocessors_impl.h` to fix unreachable-code warnings in debug builds.

Note that existing test cases such as `MatMulNBits.Fp16_Int4_NoZeroPoint` cover this path.

### Motivation and Context
The zero point input is optional for MatMulNBits. In microsoft#24854, fp16 intB GEMM was only enabled when zero points are provided.
ankus-qti
pushed a commit
to CodeLinaro/onnxruntime
that referenced
this pull request
Nov 25, 2025
### Description
* Add the fpA intB GEMM kernel from WeightOnlyGroupwiseQuantGemmPlugin of TensorRT-LLM.
* Add prepacking to convert weights/scales/zero points so MatMulNBits can use the kernel.

Limitations:
* Only the fp16 kernel is enabled. BF16 support will be added later.
* Zero points are required. Support for scales only might be added later.
* Bias is not enabled, since the previous MatMulNBits kernel does not support bias.

### Motivation and Context
To improve LLM performance. Initial results show 2.2x throughput on prompt processing and 1.25x throughput on token generation using onnxruntime-genai benchmark_e2e.py on phi-4-mini-instruct on A100.
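The prepacked weights/scales/zero points above follow the usual groupwise weight-only scheme: each group of quantized integer weights shares one scale and one zero point, and is expanded to floating point as `(w - zero_point) * scale`. A minimal sketch of that per-group dequantization (names and values are illustrative, not the kernel's actual layout):

```python
# Hedged sketch of groupwise weight-only dequantization as used by
# fpA intB GEMM style kernels: one scale and one zero point per group.
def dequantize_group(w_int, scale, zero_point):
    """Expand one group of quantized integer weights to float."""
    return [(w - zero_point) * scale for w in w_int]
```

For int4 weights, `zero_point` is typically near the middle of the 0..15 range, so e.g. `dequantize_group([0, 8, 15], 0.5, 8)` yields `[-4.0, 0.0, 3.5]`.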
ankus-qti
pushed a commit
to CodeLinaro/onnxruntime
that referenced
this pull request
Nov 25, 2025
### Description
Implement the fpA intB GEMM preprocess in a CUDA kernel to speed up weight prepacking.

### Motivation and Context
The original preprocess code (in microsoft#24854) runs on the CPU, which is slow and requires an extra memory copy between CPU and GPU.
ankus-qti
pushed a commit
to CodeLinaro/onnxruntime
that referenced
this pull request
Nov 25, 2025
### Description
* Enable fp16 intB GEMM kernels when zero points are not provided.
* Minor changes to `fpA_intB_gemv/dispatcher.h` to fix a build error for sm < 5.3.
* Minor changes to `fpA_intB_gemm_preprocessors_impl.h` to fix unreachable-code warnings in debug builds.

Note that existing test cases such as `MatMulNBits.Fp16_Int4_NoZeroPoint` cover this path.

### Motivation and Context
The zero point input is optional for MatMulNBits. In microsoft#24854, fp16 intB GEMM was only enabled when zero points are provided.
qti-jkilpatrick
pushed a commit
to onnxruntime/onnxruntime-qnn
that referenced
this pull request
Jan 26, 2026
### Description
Implement the fpA intB GEMM preprocess in a CUDA kernel to speed up weight prepacking.

### Motivation and Context
The original preprocess code (in microsoft/onnxruntime#24854) runs on the CPU, which is slow and requires an extra memory copy between CPU and GPU.
### Description
Limitations:
The feature is currently disabled by default. To enable it, set the environment variable `ORT_FPA_INTB_GEMM=1`.

### Motivation and Context
To improve LLM performance. Initial results show 2.2x throughput on prompt processing and 1.25x throughput on token generation using onnxruntime-genai benchmark_e2e.py on phi-4-mini-instruct on A100.
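Since the feature is gated behind `ORT_FPA_INTB_GEMM`, a minimal sketch of enabling it from Python before inference runs (whether onnxruntime reads the variable at import time or at session creation is an assumption here, so setting it as early as possible is the safe choice):

```python
import os

# Opt in to the fpA intB GEMM path; the variable should be set before
# onnxruntime initializes (an assumption about when it is read).
os.environ["ORT_FPA_INTB_GEMM"] = "1"
```

Equivalently, `export ORT_FPA_INTB_GEMM=1` in the shell before launching the process.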