
[CUDA] fp16 intB gemm #24854

Merged
tianleiwu merged 26 commits into main from tlwu/fp16_intB_gemm on May 30, 2025

Conversation

@tianleiwu
Contributor

@tianleiwu tianleiwu commented May 23, 2025

Description

  • Add the fpA intB gemm kernel from the WeightOnlyGroupwiseQuantGemmPlugin of TensorRT-LLM.
  • Add prepacking to convert weight/scales/zero_points so that MatMulNBits can use the kernel.
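As an illustrative sketch only (not the actual kernel or prepacking code), the groupwise weight-only scheme this kernel computes reduces, per group of K rows, to `w = (q - zero_point) * scale`. The shapes and group size below are hypothetical:

```python
import numpy as np

def dequantize_groupwise(q, scales, zero_points, group_size):
    """Illustrative groupwise dequantization: w = (q - zp) * scale.

    q:           (K, N) integer weight matrix (0..15 for 4-bit)
    scales:      (K // group_size, N) per-group scales
    zero_points: (K // group_size, N) per-group zero points
    """
    # Expand each per-group parameter to cover its group_size rows of K.
    s = np.repeat(scales, group_size, axis=0)
    zp = np.repeat(zero_points, group_size, axis=0)
    return ((q.astype(np.float32) - zp) * s).astype(np.float16)

# Tiny example: K=4, N=2, group_size=2.
q = np.array([[1, 2], [3, 4], [5, 6], [7, 8]], dtype=np.uint8)
scales = np.array([[0.5, 1.0], [2.0, 0.25]], dtype=np.float32)
zeros = np.array([[0.0, 1.0], [4.0, 2.0]], dtype=np.float32)
w = dequantize_groupwise(q, scales, zeros, group_size=2)
```

The fp16 kernel fuses this dequantization into the GEMM mainloop instead of materializing `w`, which is where the speedup over a dequantize-then-matmul pipeline comes from.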

Limitations:

  • Only the fp16 kernel is enabled. BF16 support will be added later.
  • Requires zero points. Support for scales only might be added later.
  • Bias is not enabled, since the previous MatMulNBits kernel does not support bias.
  • Weight prepacking (preprocessing) is done on the CPU, which could be slow for large models. We will move that part to CUDA later.
  • GEMM profiling is done during session creation, which could take a long time for large models. We might add a cache of profiling results (and load it from file if available).

The feature is currently disabled by default. To enable it, set the environment variable ORT_FPA_INTB_GEMM=1.
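For example, from Python the flag can be set before constructing the session (the model path and provider list in the comment are placeholders):

```python
import os

# The flag must be set before the InferenceSession is created, since
# weight prepacking and GEMM profiling happen during session initialization.
os.environ["ORT_FPA_INTB_GEMM"] = "1"

# import onnxruntime
# session = onnxruntime.InferenceSession(
#     "model.onnx", providers=["CUDAExecutionProvider"])
```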

Motivation and Context

To improve LLM inference performance.

Initial results show 2.2x throughput on prompt processing and 1.25x throughput on token generation, measured with the onnxruntime-genai benchmark_e2e.py script on phi-4-mini-instruct on an A100.

@tianleiwu tianleiwu marked this pull request as draft May 23, 2025 19:46
@tianleiwu tianleiwu marked this pull request as ready for review May 29, 2025 00:21
Contributor

Copilot AI left a comment


Pull Request Overview

This PR introduces a new fpA intB GEMM kernel based on the WeightOnlyGroupwiseQuantGemmPlugin from TensorRT-LLM to improve LLM throughput. Key changes include the addition of several new header files implementing specialized dequantized multistage MMA operations, updated traits for int8 and fpA intB GEMM kernels, and CMake configuration updates to support enhanced CUDA architectures and features.

Reviewed Changes

Copilot reviewed 80 out of 80 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/threadblock/default_dq_mma_multistage.h | Implements dequantized multistage MMA operations with template specializations for different quantization methods. |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/threadblock/default_dq_mma.h | Declares threadblock-scoped GEMM kernels with dequantization support. |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/kernel/mixed_gemm_B_layout.h | Provides layout specializations to unify weight layouts for quantized GEMM kernels. |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/kernel/default_int8_traits.h | Defines architecture traits for int8 GEMM implementations. |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/kernel/default_fpA_intB_traits.h | Introduces fpA intB traits to support mixed-precision GEMM operations. |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/device/gemm_universal_base_compat.h | Implements a compatibility layer for universal GEMM kernels supporting different modes and split-K configurations. |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/collective/*.hpp and *.inl | Adds collective implementations and builders for interleaved MMA operations targeting SM90 and later architectures. |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/epilogue_helpers.h and fused_activations.h | Defines helper types and fused activation functions to support various epilogue operations. |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/compute_occupancy.h | Provides utilities for computing the occupancy of CUDA kernels based on shared memory usage. |
| onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/arch/mma.h | Defines new tags and helper templates for dequantized MMA operators based on the quantization scheme. |
| onnxruntime/contrib_ops/cuda/llm/common/* | Adds common utilities for workspace memory alignment, CUDA runtime queries, and logging. |
| cmake/* | Updates CMake configuration to properly detect and set up supported CUDA architectures and features (e.g. enabling FP8/FP4, TMA support). |
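As a rough sketch of the kind of calculation an occupancy helper performs (the real `compute_occupancy.h` queries the CUDA runtime; the per-SM limits below are hypothetical, roughly A100-like numbers):

```python
def blocks_per_sm(smem_per_block, threads_per_block,
                  smem_per_sm=164 * 1024,    # assumed per-SM shared memory
                  max_threads_per_sm=2048,   # assumed per-SM thread limit
                  max_blocks_per_sm=32):     # assumed per-SM block limit
    """Theoretical resident blocks per SM, limited by shared memory,
    thread count, and the hardware block limit."""
    if smem_per_block > smem_per_sm or threads_per_block > max_threads_per_sm:
        return 0  # kernel cannot launch with this configuration
    by_smem = (smem_per_sm // smem_per_block) if smem_per_block else max_blocks_per_sm
    by_threads = max_threads_per_sm // threads_per_block
    return min(by_smem, by_threads, max_blocks_per_sm)
```

With these assumed limits, a kernel using 48 KB of shared memory and 128 threads per block is shared-memory-bound at 3 blocks per SM, which is the kind of signal the GEMM profiler uses to rank tile configurations.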

@tianleiwu tianleiwu requested a review from kunal-vaishnavi May 29, 2025 00:27
@tianleiwu tianleiwu merged commit 9d6546e into main May 30, 2025
88 checks passed
@tianleiwu tianleiwu deleted the tlwu/fp16_intB_gemm branch May 30, 2025 17:55
tianleiwu added a commit that referenced this pull request Jun 3, 2025
### Description

Implement fpA intB gemm preprocess in cuda kernel to speed up weight
prepacking.

### Motivation and Context

The original preprocess code (in
#24854) runs on the CPU, which
is slow and needs an extra memory copy between CPU and GPU.
tianleiwu added a commit that referenced this pull request Jun 5, 2025
### Description
* Enable fp16 intB gemm kernels when zero points are not provided.
* Minor changes to `fpA_intB_gemv/dispatcher.h` to fix build errors for
sm < 5.3.
* Minor changes to `fpA_intB_gemm_preprocessors_impl.h` to fix
unreachable code warnings in debug builds.

Note that existing test cases such as
`MatMulNBits.Fp16_Int4_NoZeroPoint` already cover this path.

### Motivation and Context
The zero point input is optional for MatMulNBits. In
#24854, we only enabled fp16
intB gemm when zero points are provided.
javier-intel pushed a commit to intel/onnxruntime that referenced this pull request Jun 15, 2025
javier-intel pushed a commit to intel/onnxruntime that referenced this pull request Jun 15, 2025
ankus-qti pushed a commit to CodeLinaro/onnxruntime that referenced this pull request Nov 25, 2025
ankus-qti pushed a commit to CodeLinaro/onnxruntime that referenced this pull request Nov 25, 2025
ankus-qti pushed a commit to CodeLinaro/onnxruntime that referenced this pull request Nov 25, 2025
qti-jkilpatrick pushed a commit to onnxruntime/onnxruntime-qnn that referenced this pull request Jan 26, 2026
