Merged
25 commits
- d7f5aa1 Scalar support for custom position ids and mask in GQA (derdeljan-msft, Mar 4, 2025)
- 15172c3 Vectorized attention mask application for fp32 (derdeljan-msft, Mar 6, 2025)
- d7eae78 Vectorized attention mask application for fp16 (derdeljan-msft, Mar 6, 2025)
- 9d244dd Add mask upscale to fp32 if the platform doesn't support fp16 (derdeljan-msft, Mar 6, 2025)
- 8faee66 Fix typo in fp16 eltwise kernels (derdeljan-msft, Mar 6, 2025)
- 147d19b Add validation for custom attention parameters (derdeljan-msft, Mar 7, 2025)
- 4b1262e Add mlas unit test for eltwise kernels (derdeljan-msft, Mar 7, 2025)
- f7a0788 Refactor python unit GQA tests (derdeljan-msft, Mar 7, 2025)
- 9dec056 Cleanup comments (derdeljan-msft, Mar 7, 2025)
- 5d23817 Fix CI pipeline errors (derdeljan-msft, Mar 7, 2025)
- 42e83d6 Apply suggestions from code review (derdeljan-msft, Mar 7, 2025)
- bc0d69b Fix docs pipeline build (derdeljan-msft, Mar 8, 2025)
- ab60cbc Fix docs pipeline build (derdeljan-msft, Mar 8, 2025)
- 4e0ca5c Fix first batch of PR comments (derdeljan-msft, Mar 10, 2025)
- 949118f Fix PR comments (derdeljan-msft, Mar 13, 2025)
- 62d39a5 Linter fix (derdeljan-msft, Mar 13, 2025)
- 0349678 Update attention_mask input description (derdeljan-msft, Mar 13, 2025)
- 0865ddb Fix build break (derdeljan-msft, Mar 13, 2025)
- 55e09c9 Fix docs gen CI pipeline (derdeljan-msft, Mar 13, 2025)
- e3bc338 Apply attention mask after softcap (derdeljan-msft, Mar 13, 2025)
- 757af32 Cleanup mlas eltwise module (derdeljan-msft, Mar 13, 2025)
- 0c268c9 Fix PR comments (derdeljan-msft, Mar 13, 2025)
- c36a9cf Fix position_ids handling for the first prompt (derdeljan-msft, Mar 13, 2025)
- 86a7737 Fix build break (derdeljan-msft, Mar 13, 2025)
- 56fe768 Fix PR comments and fix docs gen CI pipeline (derdeljan-msft, Mar 14, 2025)
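Several of these commits pin down the order of operations on the attention scores: the additive mask/bias is applied after softcap (e3bc338), and the mask is upscaled to fp32 when the platform lacks fp16 support (9d244dd). A minimal NumPy sketch of that pipeline, with illustrative names and the usual softcap formula assumed:

```python
import numpy as np

def gqa_score_pipeline(qk_scores, attention_bias=None, softcap=0.0):
    """Illustrative sketch of the score pipeline these commits describe:
    softcap first, then the additive attention bias/mask, then softmax.
    The real kernel runs vectorized MLAS routines on fp32/fp16 buffers."""
    scores = qk_scores.astype(np.float32)  # mirrors the fp32 upscale fallback
    if softcap > 0.0:
        scores = softcap * np.tanh(scores / softcap)  # cf. MlasComputeSoftcap
    if attention_bias is not None:
        scores = scores + attention_bias.astype(np.float32)  # mask after softcap
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    probs = np.exp(scores)
    return probs / probs.sum(axis=-1, keepdims=True)
```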
cmake/onnxruntime_mlas.cmake (9 additions, 0 deletions)

```diff
@@ -27,6 +27,8 @@ onnxruntime_add_static_library(onnxruntime_mlas
   ${MLAS_SRC_DIR}/activate.cpp
   ${MLAS_SRC_DIR}/logistic.cpp
   ${MLAS_SRC_DIR}/tanh.cpp
+  ${MLAS_SRC_DIR}/eltwise.h
+  ${MLAS_SRC_DIR}/eltwise.cpp
   ${MLAS_SRC_DIR}/erf.cpp
   ${MLAS_SRC_DIR}/compute.cpp
   ${MLAS_SRC_DIR}/quantize.cpp
@@ -101,6 +103,9 @@ function(setup_mlas_source_for_windows)
     ${MLAS_SRC_DIR}/softmax_kernel_neon.h
     ${MLAS_SRC_DIR}/softmax_kernel_neon.cpp
     ${MLAS_SRC_DIR}/softmax_kernel_neon_fp16.cpp
+    ${MLAS_SRC_DIR}/eltwise_kernel_neon.h
+    ${MLAS_SRC_DIR}/eltwise_kernel_neon.cpp
+    ${MLAS_SRC_DIR}/eltwise_kernel_neon_fp16.cpp
   )

   set(mlas_platform_preprocess_srcs
@@ -387,6 +392,8 @@ else()
     ${MLAS_SRC_DIR}/hgemm_kernel_neon.cpp
     ${MLAS_SRC_DIR}/softmax_kernel_neon.h
     ${MLAS_SRC_DIR}/softmax_kernel_neon.cpp
+    ${MLAS_SRC_DIR}/eltwise_kernel_neon.h
+    ${MLAS_SRC_DIR}/eltwise_kernel_neon.cpp
   )
   set_source_files_properties(${MLAS_SRC_DIR}/sqnbitgemm_kernel_neon_int8.cpp
     PROPERTIES COMPILE_FLAGS " -march=armv8.2-a+dotprod")
@@ -409,6 +416,7 @@
     ${MLAS_SRC_DIR}/rotary_embedding_kernel_neon_fp16.cpp
     ${MLAS_SRC_DIR}/halfgemm_kernel_neon_fp16.cpp
     ${MLAS_SRC_DIR}/softmax_kernel_neon_fp16.cpp
+    ${MLAS_SRC_DIR}/eltwise_kernel_neon_fp16.cpp
   )
   set_source_files_properties(${MLAS_SRC_DIR}/aarch64/HalfGemmKernelNeon.S PROPERTIES COMPILE_FLAGS " -march=armv8.2-a+fp16 ")
   set_source_files_properties(${MLAS_SRC_DIR}/aarch64/QgemmS8S8KernelSmmla.S PROPERTIES COMPILE_FLAGS " -march=armv8.2-a+i8mm ")
@@ -423,6 +431,7 @@
   set_source_files_properties(${MLAS_SRC_DIR}/rotary_embedding_kernel_neon_fp16.cpp PROPERTIES COMPILE_FLAGS " -march=armv8.2-a+fp16 ")
   set_source_files_properties(${MLAS_SRC_DIR}/halfgemm_kernel_neon_fp16.cpp PROPERTIES COMPILE_FLAGS " -march=armv8.2-a+fp16 ")
   set_source_files_properties(${MLAS_SRC_DIR}/softmax_kernel_neon_fp16.cpp PROPERTIES COMPILE_FLAGS " -march=armv8.2-a+fp16 ")
+  set_source_files_properties(${MLAS_SRC_DIR}/eltwise_kernel_neon_fp16.cpp PROPERTIES COMPILE_FLAGS " -march=armv8.2-a+fp16 ")
 endif()

 if(ONNXRUNTIME_MLAS_MULTI_ARCH)
```
docs/ContribOperators.md (5 additions, 1 deletion)

```diff
@@ -2551,7 +2551,7 @@
 <dd>Softcap value for attention weights. Default value is 0.</dd>
 </dl>

-#### Inputs (7 - 9)
+#### Inputs (7 - 11)

 <dl>
 <dt><tt>query</tt> : T</dt>
@@ -2572,6 +2572,10 @@
 <dd>2D tensor with shape (max_sequence_length, head_size / 2).</dd>
 <dt><tt>sin_cache</tt> (optional) : T</dt>
 <dd>2D tensor with shape (max_sequence_length, head_size / 2).</dd>
+<dt><tt>position_ids</tt> (optional) : tensor(int64)</dt>
+<dd>2D tensor with shape (batch_size, sequence_length). When processing the first prompt the kernel uses only the first element</dd>
+<dt><tt>attention_bias</tt> (optional) : T</dt>
+<dd>additional add to QxK' with shape (batch_size or 1, num_heads or 1, sequence_length, total_sequence_length)</dd>
 </dl>

 #### Outputs
```
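To make the extended signature concrete, here is a hedged sketch of building a GroupQueryAttention node with the two new optional inputs via onnx.helper; the tensor names and attribute values are placeholders, not taken from this PR:

```python
from onnx import helper

# Hypothetical node construction; query/key/value, the KV cache tensors, and
# the rotary caches are assumed to exist in the surrounding graph.
gqa_node = helper.make_node(
    "GroupQueryAttention",
    inputs=[
        "query", "key", "value",
        "past_key", "past_value",
        "seqlens_k", "total_sequence_length",
        "cos_cache", "sin_cache",
        "position_ids",    # new optional input, tensor(int64), (batch_size, sequence_length)
        "attention_bias",  # new optional input, added to QxK'
    ],
    outputs=["output", "present_key", "present_value"],
    domain="com.microsoft",
    num_heads=32,     # placeholder attribute values
    kv_num_heads=8,
)
```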
docs/OperatorKernels.md (3 additions, 3 deletions)

```diff
@@ -520,7 +520,7 @@ Do not modify directly.*
 |Gelu|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(float)|
 |GreedySearch|*in* input_ids:**I**<br> *in* max_length:**I**<br> *in* min_length:**I**<br> *in* repetition_penalty:**T**<br> *in* vocab_mask:**I**<br> *in* prefix_vocab_mask:**I**<br> *in* attention_mask:**I**<br> *out* sequences:**I**|1+|**T** = tensor(float)|
 |GridSample|*in* X:**T1**<br> *in* Grid:**T1**<br> *out* Y:**T2**|1+|**T1** = tensor(float)<br/> **T2** = tensor(float)|
-|GroupQueryAttention|*in* query:**T**<br> *in* key:**T**<br> *in* value:**T**<br> *in* past_key:**T**<br> *in* past_value:**T**<br> *in* seqlens_k:**M**<br> *in* total_sequence_length:**M**<br> *in* cos_cache:**T**<br> *in* sin_cache:**T**<br> *out* output:**T**<br> *out* present_key:**T**<br> *out* present_value:**T**|1+|**M** = tensor(int32)<br/> **T** = tensor(float), tensor(float16)|
+|GroupQueryAttention|*in* query:**T**<br> *in* key:**T**<br> *in* value:**T**<br> *in* past_key:**T**<br> *in* past_value:**T**<br> *in* seqlens_k:**M**<br> *in* total_sequence_length:**M**<br> *in* cos_cache:**T**<br> *in* sin_cache:**T**<br> *in* position_ids:**tensor(int64)**<br> *in* attention_bias:**T**<br> *out* output:**T**<br> *out* present_key:**T**<br> *out* present_value:**T**|1+|**M** = tensor(int32)<br/> **T** = tensor(float), tensor(float16)|
 |Inverse|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(double), tensor(float), tensor(float16)|
 |MatMulBnb4|*in* A:**T1**<br> *in* B:**T2**<br> *in* absmax:**T1**<br> *out* Y:**T1**|1+|**T1** = tensor(float)<br/> **T2** = tensor(uint8)|
 |MatMulFpQ4|*in* A:**T1**<br> *in* B:**T2**<br> *in* B_shape:**T3**<br> *out* Y:**T1**|1+|**T1** = tensor(float)<br/> **T2** = tensor(uint8)<br/> **T3** = tensor(int64)|
@@ -922,7 +922,7 @@ Do not modify directly.*
 |GreedySearch|*in* input_ids:**I**<br> *in* max_length:**I**<br> *in* min_length:**I**<br> *in* repetition_penalty:**T**<br> *in* vocab_mask:**I**<br> *in* prefix_vocab_mask:**I**<br> *in* attention_mask:**I**<br> *out* sequences:**I**|1+|**T** = tensor(float), tensor(float16)|
 |GridSample|*in* X:**T1**<br> *in* Grid:**T1**<br> *out* Y:**T2**|1+|**T1** = tensor(float)<br/> **T2** = tensor(float)|
 |GroupNorm|*in* X:**T**<br> *in* gamma:**M**<br> *in* beta:**M**<br> *out* Y:**T**|1+|**T** = tensor(float), tensor(float16)|
-|GroupQueryAttention|*in* query:**T**<br> *in* key:**T**<br> *in* value:**T**<br> *in* past_key:**T**<br> *in* past_value:**T**<br> *in* seqlens_k:**M**<br> *in* total_sequence_length:**M**<br> *in* cos_cache:**T**<br> *in* sin_cache:**T**<br> *out* output:**T**<br> *out* present_key:**T**<br> *out* present_value:**T**|1+|**M** = tensor(int32)<br/> **T** = tensor(bfloat16), tensor(float16)|
+|GroupQueryAttention|*in* query:**T**<br> *in* key:**T**<br> *in* value:**T**<br> *in* past_key:**T**<br> *in* past_value:**T**<br> *in* seqlens_k:**M**<br> *in* total_sequence_length:**M**<br> *in* cos_cache:**T**<br> *in* sin_cache:**T**<br> *in* position_ids:**tensor(int64)**<br> *in* attention_bias:**T**<br> *out* output:**T**<br> *out* present_key:**T**<br> *out* present_value:**T**|1+|**M** = tensor(int32)<br/> **T** = tensor(bfloat16), tensor(float16)|
 |Inverse|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(double), tensor(float), tensor(float16)|
 |Irfft|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(double), tensor(float), tensor(float16)|
 |LongformerAttention|*in* input:**T**<br> *in* weight:**T**<br> *in* bias:**T**<br> *in* mask:**T**<br> *in* global_weight:**T**<br> *in* global_bias:**T**<br> *in* global:**G**<br> *out* output:**T**|1+|**T** = tensor(float), tensor(float16)|
@@ -1399,7 +1399,7 @@ Do not modify directly.*
 |FusedMatMulActivation|*in* A:**T**<br> *in* B:**T**<br> *out* Y:**T**|1+|**T** = tensor(float), tensor(float16)|
 |Gelu|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(float), tensor(float16)|
 |GroupNorm|*in* X:**T**<br> *in* gamma:**M**<br> *in* beta:**M**<br> *out* Y:**T**|1+|**M** = tensor(float), tensor(float16)<br/> **T** = tensor(float), tensor(float16)|
-|GroupQueryAttention|*in* query:**T**<br> *in* key:**T**<br> *in* value:**T**<br> *in* past_key:**T**<br> *in* past_value:**T**<br> *in* seqlens_k:**M**<br> *in* total_sequence_length:**M**<br> *in* cos_cache:**T**<br> *in* sin_cache:**T**<br> *out* output:**T**<br> *out* present_key:**T**<br> *out* present_value:**T**|1+|**M** = tensor(int32)<br/> **T** = tensor(float), tensor(float16)|
+|GroupQueryAttention|*in* query:**T**<br> *in* key:**T**<br> *in* value:**T**<br> *in* past_key:**T**<br> *in* past_value:**T**<br> *in* seqlens_k:**M**<br> *in* total_sequence_length:**M**<br> *in* cos_cache:**T**<br> *in* sin_cache:**T**<br> *in* position_ids:**tensor(int64)**<br> *in* attention_bias:**T**<br> *out* output:**T**<br> *out* present_key:**T**<br> *out* present_value:**T**|1+|**M** = tensor(int32)<br/> **T** = tensor(float), tensor(float16)|
 |MatMulIntegerToFloat|*in* A:**T1**<br> *in* B:**T2**<br> *in* a_scale:**T3**<br> *in* b_scale:**T3**<br> *in* a_zero_point:**T1**<br> *in* b_zero_point:**T2**<br> *in* bias:**T3**<br> *out* Y:**T3**|1+|**T1** = tensor(int8), tensor(uint8)<br/> **T2** = tensor(int8), tensor(uint8)<br/> **T3** = tensor(float), tensor(float16)|
 |MatMulNBits|*in* A:**T1**<br> *in* B:**T2**<br> *in* scales:**T1**<br> *in* zero_points:**T3**<br> *in* g_idx:**T4**<br> *in* bias:**T1**<br> *out* Y:**T1**|1+|**T1** = tensor(float), tensor(float16)<br/> **T2** = tensor(uint8)|
 |MultiHeadAttention|*in* query:**T**<br> *in* key:**T**<br> *in* value:**T**<br> *in* bias:**T**<br> *in* key_padding_mask:**M**<br> *in* attention_bias:**T**<br> *in* past_key:**T**<br> *in* past_value:**T**<br> *out* output:**T**<br> *out* present_key:**T**<br> *out* present_value:**T**|1+|**M** = tensor(int32)<br/> **T** = tensor(float), tensor(float16)|
```
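The attention_bias input is broadcastable over batch and heads, per the documented shape (batch_size or 1, num_heads or 1, sequence_length, total_sequence_length). A small NumPy sketch of constructing a causal bias in that layout, using a large negative value for masked positions (an illustrative convention, not mandated by the kernel):

```python
import numpy as np

# One decode-style example: 4 new query tokens attending over 8 total
# positions (4 cached + 4 new); disallowed positions get a large negative
# additive bias, broadcast over batch and heads via the leading 1-dims.
seq_len, total_seq_len = 4, 8
past_len = total_seq_len - seq_len
col = np.arange(total_seq_len)[None, :]       # key/value positions
row = past_len + np.arange(seq_len)[:, None]  # absolute query positions
bias = np.where(col <= row, 0.0, -1e4).astype(np.float32)
attention_bias = bias[None, None, :, :]       # shape (1, 1, 4, 8)
```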
onnxruntime/contrib_ops/cpu/bert/attention_helper.h (5 additions, 0 deletions)

```diff
@@ -31,6 +31,11 @@ void ComputeAttentionSoftcapInplace(T* scores, int sequence_length, T softcap) {
   MlasComputeSoftcap(scores, scores, sequence_length, softcap);
 }

+template <typename T>
+void ApplyAttentionBias(T* softmax_logits, const T* attention_mask, int N) {
+  MlasEltwiseAdd(softmax_logits, attention_mask, softmax_logits, N);
+}
+
 template <typename T>
 void PrepareMask(const int32_t* mask_index,
                  gsl::span<const int64_t> mask_index_dims,
```
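As a reading aid, a NumPy stand-in for the new helper, assuming MlasEltwiseAdd is a plain vectorized elementwise add over N contiguous elements (as its name and call pattern suggest):

```python
import numpy as np

def apply_attention_bias(softmax_logits: np.ndarray, attention_mask: np.ndarray) -> None:
    # In-place elementwise add of the (already expanded) bias onto the
    # attention logits, mirroring MlasEltwiseAdd(logits, mask, logits, N).
    np.add(softmax_logits, attention_mask, out=softmax_logits)
```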