
[CUDA] fp16 intB gemm #24854

Merged
tianleiwu merged 26 commits into main from tlwu/fp16_intB_gemm on May 30, 2025

Conversation

@tianleiwu
Contributor

@tianleiwu tianleiwu commented May 23, 2025

Description

  • Add the fpA intB gemm kernel from the WeightOnlyGroupwiseQuantGemmPlugin of TensorRT-LLM.
  • Add prepacking to convert weight/scales/zero_points so that MatMulNBits can use the kernel.
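As an illustrative sketch only (not the actual kernel or prepacking code), the groupwise weight-only scheme this kernel computes reduces, per group of K rows, to `w = (q - zero_point) * scale`. The shapes and group size below are hypothetical:

```python
import numpy as np

def dequantize_groupwise(q, scales, zero_points, group_size):
    """Illustrative groupwise dequantization: w = (q - zp) * scale.

    q:           (K, N) integer weight matrix (0..15 for 4-bit)
    scales:      (K // group_size, N) per-group scales
    zero_points: (K // group_size, N) per-group zero points
    """
    # Expand each per-group parameter to cover its group_size rows of K.
    s = np.repeat(scales, group_size, axis=0)
    zp = np.repeat(zero_points, group_size, axis=0)
    return ((q.astype(np.float32) - zp) * s).astype(np.float16)

# Tiny example: K=4, N=2, group_size=2.
q = np.array([[1, 2], [3, 4], [5, 6], [7, 8]], dtype=np.uint8)
scales = np.array([[0.5, 1.0], [2.0, 0.25]], dtype=np.float32)
zeros = np.array([[0.0, 1.0], [4.0, 2.0]], dtype=np.float32)
w = dequantize_groupwise(q, scales, zeros, group_size=2)
```

The fp16 kernel fuses this dequantization into the GEMM mainloop instead of materializing `w`, which is where the speedup over a dequantize-then-matmul pipeline comes from.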

Limitations:

  • Only the fp16 kernel is enabled. BF16 support will be added later.
  • Requires zero points. Support for scales only might be added later.
  • Bias is not enabled, since the previous MatMulNBits kernel does not support bias.
  • Weight prepacking (preprocessing) is done on the CPU, which could be slow for large models. We will move that part to CUDA later.
  • GEMM profiling is done during session creation, which could take a long time for large models. We might add a cache of profiling results (and load it from file if available).

The feature is currently disabled by default. To enable it, set the environment variable ORT_FPA_INTB_GEMM=1.
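For example, from Python the flag can be set before constructing the session (the model path and provider list in the comment are placeholders):

```python
import os

# The flag must be set before the InferenceSession is created, since
# weight prepacking and GEMM profiling happen during session initialization.
os.environ["ORT_FPA_INTB_GEMM"] = "1"

# import onnxruntime
# session = onnxruntime.InferenceSession(
#     "model.onnx", providers=["CUDAExecutionProvider"])
```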

Motivation and Context

To improve LLM inference performance.

Initial results show 2.2x throughput on prompt processing and 1.25x throughput on token generation, measured with the onnxruntime-genai benchmark_e2e.py script on phi-4-mini-instruct on an A100.

@tianleiwu tianleiwu marked this pull request as draft May 23, 2025 19:46
@tianleiwu tianleiwu marked this pull request as ready for review May 29, 2025 00:21
Contributor

Copilot AI left a comment


Pull Request Overview

This PR introduces a new fpA intB GEMM kernel based on the WeightOnlyGroupwiseQuantGemmPlugin from TensorRT-LLM to improve LLM throughput. Key changes include the addition of several new header files implementing specialized dequantized multistage MMA operations, updated traits for int8 and fpA intB GEMM kernels, and CMake configuration updates to support enhanced CUDA architectures and features.

Reviewed Changes

Copilot reviewed 80 out of 80 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/threadblock/default_dq_mma_multistage.h | Implements dequantized multistage MMA operations with template specializations for different quantization methods. |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/threadblock/default_dq_mma.h | Declares threadblock-scoped GEMM kernels with dequantization support. |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/kernel/mixed_gemm_B_layout.h | Provides layout specializations to unify weight layouts for quantized GEMM kernels. |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/kernel/default_int8_traits.h | Defines architecture traits for int8 GEMM implementations. |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/kernel/default_fpA_intB_traits.h | Introduces fpA intB traits to support mixed-precision GEMM operations. |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/device/gemm_universal_base_compat.h | Implements a compatibility layer for universal GEMM kernels supporting different modes and split-K configurations. |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/collective/*.hpp and *.inl | Adds collective implementations and builders for interleaved MMA operations targeting SM90 and later architectures. |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/epilogue_helpers.h and fused_activations.h | Defines helper types and fused activation functions to support various epilogue operations. |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/compute_occupancy.h | Provides utilities for computing the occupancy of CUDA kernels based on shared memory usage. |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/arch/mma.h | Defines new tags and helper templates for dequantized MMA operators based on the quantization scheme. |
| onnxruntime/contrib_ops/cuda/llm/common/* | Adds common utilities for workspace memory alignment, CUDA runtime queries, and logging. |
| cmake/* | Updates CMake configuration to properly detect and set up supported CUDA architectures and features (e.g. enabling FP8/FP4, TMA support). |
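As a rough sketch of the kind of calculation an occupancy helper performs (the real `compute_occupancy.h` queries the CUDA runtime; the per-SM limits below are hypothetical, roughly A100-like numbers):

```python
def blocks_per_sm(smem_per_block, threads_per_block,
                  smem_per_sm=164 * 1024,    # assumed per-SM shared memory
                  max_threads_per_sm=2048,   # assumed per-SM thread limit
                  max_blocks_per_sm=32):     # assumed per-SM block limit
    """Theoretical resident blocks per SM, limited by shared memory,
    thread count, and the hardware block limit."""
    if smem_per_block > smem_per_sm or threads_per_block > max_threads_per_sm:
        return 0  # kernel cannot launch with this configuration
    by_smem = (smem_per_sm // smem_per_block) if smem_per_block else max_blocks_per_sm
    by_threads = max_threads_per_sm // threads_per_block
    return min(by_smem, by_threads, max_blocks_per_sm)
```

With these assumed limits, a kernel using 48 KB of shared memory and 128 threads per block is shared-memory-bound at 3 blocks per SM, which is the kind of signal the GEMM profiler uses to rank tile configurations.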

@tianleiwu tianleiwu requested a review from kunal-vaishnavi May 29, 2025 00:27
@tianleiwu tianleiwu merged commit 9d6546e into main May 30, 2025
88 checks passed
@tianleiwu tianleiwu deleted the tlwu/fp16_intB_gemm branch May 30, 2025 17:55
tianleiwu added a commit that referenced this pull request Jun 3, 2025
### Description

Implement fpA intB gemm preprocess in cuda kernel to speed up weight
prepacking.

### Motivation and Context

The original preprocess code (in
#24854) runs on the CPU, which
is slow and needs an extra memory copy between CPU and GPU.
tianleiwu added a commit that referenced this pull request Jun 5, 2025
### Description
* Enable fp16 intB gemm kernels when zero points are not provided.
* Minor changes to `fpA_intB_gemv/dispatcher.h` to fix build errors for
sm < 5.3.
* Minor changes to `fpA_intB_gemm_preprocessors_impl.h` to fix
unreachable code warnings in debug builds.

Note that existing test cases such as
`MatMulNBits.Fp16_Int4_NoZeroPoint` already cover this path.

### Motivation and Context
The zero point input is optional for MatMulNBits. In
#24854, we only enabled fp16
intB gemm when zero points are provided.
javier-intel pushed a commit to intel/onnxruntime that referenced this pull request Jun 15, 2025
javier-intel pushed a commit to intel/onnxruntime that referenced this pull request Jun 15, 2025
ankus-qti pushed a commit to CodeLinaro/onnxruntime that referenced this pull request Nov 25, 2025
ankus-qti pushed a commit to CodeLinaro/onnxruntime that referenced this pull request Nov 25, 2025
ankus-qti pushed a commit to CodeLinaro/onnxruntime that referenced this pull request Nov 25, 2025
qti-jkilpatrick pushed a commit to onnxruntime/onnxruntime-qnn that referenced this pull request Jan 26, 2026
