[MLAS] Enable FP16 for Gelu #26815

Merged
hariharans29 merged 23 commits into microsoft:main from MonakaResearch:gelu_fp16
Apr 23, 2026

@akote123
Contributor

akote123 commented Dec 17, 2025

Enabled FP16 Gelu for opset 20. Gelu uses either the tanh or the erf function, depending on the approximation method selected. Implemented tanh in SVE, and erf in both SVE and NEON.
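For reference, the two approximation modes named in the description can be written as a scalar FP32 sketch. This is not the PR's SVE or NEON kernel: the function names `GeluErf` and `GeluTanh` are hypothetical, and the formulas are the standard erf and tanh forms of GELU, which is assumed here to be what the opset 20 `approximate` attribute (`"none"` or `"tanh"`) selects between.

```cpp
#include <cmath>

// Exact form: 0.5 * x * (1 + erf(x / sqrt(2))).
// Hypothetical helper name; not taken from the PR.
inline float GeluErf(float x) {
    const float kInvSqrt2 = 0.70710678118654752f;
    return 0.5f * x * (1.0f + std::erf(x * kInvSqrt2));
}

// Tanh approximation:
// 0.5 * x * (1 + tanh(sqrt(2 / pi) * (x + 0.044715 * x^3))).
// Hypothetical helper name; not taken from the PR.
inline float GeluTanh(float x) {
    const float kSqrt2OverPi = 0.79788456080286536f;
    const float kCoeff = 0.044715f;
    return 0.5f * x * (1.0f + std::tanh(kSqrt2OverPi * (x + kCoeff * x * x * x)));
}
```

The two forms agree closely but not exactly, which is one reason tests for FP16 kernels usually compare against the matching reference with a tolerance rather than for bit equality.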
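Gr3E results with tanh and erf approximation:

| GELU (ms) | Tanh_SVE | ERF_SVE | Tanh_NEON | ERF_NEON |
|-----------|----------|---------|-----------|----------|
| Shape     | F32      | F16     | F32       | F16      |
| 100       | 0.007    | 0.007   | 0.007     | 0.007    |
| 1000      | 0.008    | 0.007   | 0.012     | 0.008    |
| 1000000   | 0.076    | 0.039   | 0.203     | 0.07     |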

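Gr4 results with tanh and erf approximation:

| GELU (ms) | Tanh_SVE | ERF_SVE | Tanh_NEON | ERF_NEON |
|-----------|----------|---------|-----------|----------|
| Shape     | F32      | F16     | F32       | F16      |
| 100       | 0.005    | 0.005   | 0.005     | 0.005    |
| 1000      | 0.006    | 0.006   | 0.008     | 0.006    |
| 1000000   | 0.092    | 0.046   | 0.224     | 0.088    |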

This PR is a joint contribution by:
Aruna K (@akote123)
Abhishek Jain (@abhijain1204fujitsu)
Sanket Kale (@sanketkaleoss)

Contributor

Copilot AI left a comment

Pull request overview

This PR enables FP16 (half-precision floating-point) support for the GELU (Gaussian Error Linear Unit) activation operator in ONNX Runtime opset 20. The implementation provides optimized compute paths using ARM SVE (Scalable Vector Extension) and NEON intrinsics for both tanh and erf approximation methods, with fallback to scalar FP32 computation when vector intrinsics are not available.

Key changes:

  • Adds FP16 kernel registration for GELU operator alongside the existing FP32 implementation
  • Implements optimized FP16 ERF and TANH kernels using ARM SVE and NEON intrinsics
  • Adds comprehensive test coverage for both tanh and erf approximation modes with FP16 inputs
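The scalar fallback mentioned in the overview can be pictured as three steps: widen each FP16 value to FP32, evaluate GELU in FP32, then round the result back to FP16. The sketch below only illustrates that idea. `HalfToFloat`, `FloatToHalf`, and `GeluFp16ScalarFallback` are hypothetical names, FP16 values are handled as raw IEEE 754 binary16 bit patterns in `std::uint16_t` (the PR itself works with `MLFloat16`), and only the erf form is shown.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Widen a binary16 bit pattern to float (exact for every input).
inline float HalfToFloat(std::uint16_t h) {
    const std::uint32_t sign = static_cast<std::uint32_t>(h & 0x8000u) << 16;
    const std::uint32_t exp = (h >> 10) & 0x1Fu;
    std::uint32_t mant = h & 0x3FFu;
    std::uint32_t bits;
    if (exp == 0x1Fu) {                        // infinity or NaN
        bits = sign | 0x7F800000u | (mant << 13);
    } else if (exp != 0) {                     // normal number
        bits = sign | ((exp + 112u) << 23) | (mant << 13);
    } else if (mant == 0) {                    // signed zero
        bits = sign;
    } else {                                   // subnormal: renormalize
        std::uint32_t e = 113u;
        while ((mant & 0x400u) == 0) { mant <<= 1; --e; }
        bits = sign | (e << 23) | ((mant & 0x3FFu) << 13);
    }
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Narrow float to binary16 with round-to-nearest-even.
inline std::uint16_t FloatToHalf(float f) {
    std::uint32_t x;
    std::memcpy(&x, &f, sizeof(x));
    const std::uint32_t sign = (x >> 16) & 0x8000u;
    const std::uint32_t mag = x & 0x7FFFFFFFu;
    std::uint32_t out;
    if (mag > 0x7F800000u) {                   // NaN -> quiet NaN
        out = 0x7E00u;
    } else if (mag >= 0x477FF000u) {           // 65520 and above (or inf) -> infinity
        out = 0x7C00u;
    } else if (mag <= 0x33000000u) {           // at most 2^-25 -> zero
        out = 0;
    } else {
        const std::uint32_t exp = mag >> 23;
        std::uint32_t mant = mag & 0x7FFFFFu;
        std::uint32_t shift = 13;              // bits dropped for a normal result
        if (exp < 113u) {                      // result is a binary16 subnormal
            mant |= 0x800000u;
            shift = 126u - exp;
            out = mant >> shift;
        } else {
            out = ((exp - 112u) << 10) | (mant >> 13);
        }
        const std::uint32_t rem = mant & ((1u << shift) - 1u);
        const std::uint32_t halfway = 1u << (shift - 1u);
        if (rem > halfway || (rem == halfway && (out & 1u))) ++out;
    }
    return static_cast<std::uint16_t>(sign | out);
}

// Scalar fallback sketch: erf-based GELU computed in FP32 on FP16 data.
inline void GeluFp16ScalarFallback(const std::uint16_t* input,
                                   std::uint16_t* output, std::size_t n) {
    const float kInvSqrt2 = 0.70710678118654752f;
    for (std::size_t i = 0; i < n; ++i) {
        const float x = HalfToFloat(input[i]);
        const float y = 0.5f * x * (1.0f + std::erf(x * kInvSqrt2));
        output[i] = FloatToHalf(y);
    }
}
```

Computing in FP32 and rounding once at the end keeps intermediate error below FP16 resolution, at the cost of two conversions per element; that cost is what a native FP16 vector kernel avoids.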

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 27 comments.

Show a summary per file
File Description
onnxruntime/core/providers/cpu/cpu_execution_provider.cc Registers typed GELU kernels for float and MLFloat16 types
onnxruntime/core/providers/cpu/tensor/gelu.cc Implements FP16 GELU computation with SVE/NEON optimizations and scalar fallback
onnxruntime/core/providers/cpu/math/element_wise_ops.cc Adds FP16 ERF operator support using new SVE/NEON kernels
onnxruntime/test/providers/cpu/activation/activation_op_test.cc Adds FP16 GELU tests for both tanh and erf approximations
onnxruntime/core/mlas/lib/tanh.cpp Adds SVE path for FP16 tanh computation
onnxruntime/core/mlas/lib/sve/mlasi_sve.h Declares SVE FP16 function signatures
onnxruntime/core/mlas/lib/sve/mlas_sve_fp16.h Adds SVE FP16 intrinsic wrapper functions
onnxruntime/core/mlas/lib/sve/Elementwise_sve_fp16.cpp Implements SVE FP16 tanh, erf, and GELU kernels
onnxruntime/core/mlas/lib/fp16_common.h Adds NEON FP16 helper functions for erf computation
onnxruntime/core/mlas/lib/erf.cpp Implements NEON FP16 erf kernel
onnxruntime/core/mlas/inc/mlas.h Exports NEON FP16 erf kernel function
cmake/onnxruntime_providers_cpu.cmake Adds ARM FP16 compile flags for gelu.cc and includes MLAS headers
cmake/onnxruntime_mlas.cmake Adds SVE FP16 elementwise source and compile flags for erf.cpp

Comment thread onnxruntime/core/providers/cpu/cpu_execution_provider.cc Outdated
Comment thread onnxruntime/core/providers/cpu/cpu_execution_provider.cc Outdated
Comment thread onnxruntime/core/mlas/lib/erf.cpp Outdated
Comment thread onnxruntime/core/mlas/lib/erf.cpp Outdated
Comment thread cmake/onnxruntime_providers_cpu.cmake Outdated
Comment thread onnxruntime/core/mlas/lib/erf.cpp Outdated
Comment thread onnxruntime/core/mlas/lib/fp16_common.h
Comment thread onnxruntime/core/mlas/lib/fp16_common.h Outdated
Comment thread onnxruntime/core/mlas/lib/fp16_common.h Outdated
Comment thread onnxruntime/core/providers/cpu/cpu_execution_provider.cc Outdated
hariharans29 changed the title from "Enable FP16 for Gelu" to "[MLAS] Enable FP16 for Gelu" on Dec 18, 2025
@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

Comment thread cmake/onnxruntime_mlas.cmake Outdated
Comment thread onnxruntime/core/providers/cpu/math/element_wise_ops.cc Outdated
Comment thread onnxruntime/core/providers/cpu/tensor/gelu.cc Outdated
Comment thread onnxruntime/core/mlas/lib/sve/mlasi_sve.h Outdated
Comment thread cmake/onnxruntime_providers_cpu.cmake Outdated
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 19 comments.

Comments suppressed due to low confidence (1)

onnxruntime/core/providers/cpu/math/element_wise_ops.cc:2034

  • The allocator is retrieved but no longer used after switching to the native FP16 ERF implementation. The lines getting the temp space allocator (lines 2032-2034) should be removed as they are now unnecessary.
  // get allocator for temporary buffers
  AllocatorPtr alloc;
  ORT_RETURN_IF_ERROR(context->GetTempSpaceAllocator(&alloc));

Comment thread onnxruntime/core/mlas/lib/gelu.h Outdated
Comment thread onnxruntime/core/mlas/lib/gelu.h Outdated
Comment thread onnxruntime/core/mlas/lib/gelu.cpp Outdated
Comment thread onnxruntime/core/mlas/lib/sve/mlasi_sve.h Outdated
Comment thread onnxruntime/core/mlas/lib/sve/mlasi_sve.h Outdated
Comment thread onnxruntime/core/mlas/lib/erf_neon_fp16.cpp Outdated
Comment thread onnxruntime/core/providers/cpu/tensor/gelu.cc Outdated
Comment thread onnxruntime/core/providers/cpu/tensor/gelu.cc Outdated
Comment thread onnxruntime/core/mlas/lib/gelu.cpp
Comment thread onnxruntime/core/mlas/lib/gelu_neon_fp16.cpp Outdated
@abhijain1204fujitsu

@hariharans29 we have pushed the code to resolve all the above comments.
Kindly support further review and merging of the PR.

@hariharans29
Member

hariharans29 commented Jan 29, 2026

> @hariharans29 we have pushed the code to resolve all the above comments. Kindly support further review and merging of the PR.

Please manually "resolve" Copilot's comments, and add a comment wherever you think Copilot's suggestion is not applicable and you are not taking it. Thanks.

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.

Comment thread onnxruntime/core/mlas/lib/gelu.h Outdated
Comment thread onnxruntime/core/mlas/lib/sve/mlasi_sve.h Outdated
@abhijain1204fujitsu

Hi @hariharans29,
Thank you for your previous review comments.
We have pushed the changes requested in review. Kindly help us check the CI pipelines and continue reviewing the code for merging.

@hariharans29
Member

hariharans29 commented Feb 4, 2026

> Hi @hariharans29, thank you for your previous review comments. We have pushed the changes requested in review. Kindly help us check the CI pipelines and continue reviewing the code for merging.

There are still some unaddressed Copilot comments. Please manually resolve each one with a comment stating whether or not you are going with Copilot's recommendation.

@sanketkaleoss
Contributor

Hi @hariharans29,
We have pushed new changes that resolve the CI failures. Kindly restart the CI pipeline.

@hariharans29
Member

> Hi @hariharans29, we have pushed new changes that resolve the CI failures. Kindly restart the CI pipeline.

As stated above, please resolve all Copilot comments and my old comments. Resolving comments is a gating check for merging a PR.

I'll start another round of CI.

@sanketkaleoss
Contributor

> > Hi @hariharans29, we have pushed new changes that resolve the CI failures. Kindly restart the CI pipeline.
>
> As stated above, please resolve all Copilot comments and my old comments. Resolving comments is a gating check for merging a PR.
>
> I'll start another round of CI.

Thanks, will resolve them.

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 5 comments.

Comment thread onnxruntime/core/mlas/lib/gelu_neon_fp16.cpp
Comment thread onnxruntime/core/mlas/lib/sve/mlas_sve_fp16.h
Comment thread onnxruntime/core/mlas/lib/erf_neon_fp16.cpp Outdated
Comment thread onnxruntime/core/providers/cpu/tensor/gelu.cc Outdated
Comment thread onnxruntime/core/mlas/lib/sve/mlasi_sve.h
@sanketkaleoss
Contributor

sanketkaleoss commented Apr 21, 2026

@hariharans29 Resolved pending comments and incorporated required suggestion.

@hariharans29
Member

> @hariharans29 Resolved pending comments and incorporated required suggestion.

Overall - LGTM. I just re-opened discussion on one comment (manual aligned alloc and aligned free) - nothing that is a blocker.

I'll seek Copilot's opinion once more and we can merge it soon.

Thanks.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated no new comments.

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 5 comments.

Comment thread onnxruntime/core/mlas/lib/erf.cpp Outdated
Comment thread onnxruntime/core/mlas/lib/gelu.cpp
Comment thread onnxruntime/core/mlas/lib/sve/elementwise_sve_fp16.cpp
Comment thread onnxruntime/core/mlas/lib/gelu_neon_fp16.cpp
Comment thread onnxruntime/core/providers/cpu/tensor/gelu.cc
@hariharans29
Member

Overall looks good to me:

  1. I had a question on the open comment (aligned allocator)
  2. Copilot has a last set of comments - please check if it adds value
  3. I think the pending CI failures have nothing to do with the PR, but can you please double-check?
  4. Please resolve any pending old comments so that it is mergeable

Thanks!

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@sanketkaleoss
Contributor

> Overall looks good to me:
>
>   1. I had a question on the open comment (aligned allocator)
>   2. Copilot has a last set of comments - please check if it adds value
>   3. I think the pending CI failures have nothing to do with the PR, but can you please double-check?
>   4. Please resolve any pending old comments so that it is mergeable
>
> Thanks!

Hi @hariharans29 , resolved the pending comments and incorporated one of the Copilot suggestions. The CI issues appear to be unrelated, and performance has been preserved across these PR iterations. Thanks.

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

hariharans29 merged commit a02f769 into microsoft:main on Apr 23, 2026
87 checks passed
@hariharans29
Member

> > Overall looks good to me:
> >
> >   1. I had a question on the open comment (aligned allocator)
> >   2. Copilot has a last set of comments - please check if it adds value
> >   3. I think the pending CI failures have nothing to do with the PR, but can you please double-check?
> >   4. Please resolve any pending old comments so that it is mergeable
> >
> > Thanks!
>
> Hi @hariharans29 , resolved the pending comments and incorporated one of the Copilot suggestions. The CI issues appear to be unrelated, and performance has been preserved across these PR iterations. Thanks.

Thanks for the contribution. I merged it. I take it you have run onnxruntime_mlas_test.exe and onnxruntime_test_all.exe in your environment, right?

@sanketkaleoss
Contributor

> > > Overall looks good to me:
> > >
> > >   1. I had a question on the open comment (aligned allocator)
> > >   2. Copilot has a last set of comments - please check if it adds value
> > >   3. I think the pending CI failures have nothing to do with the PR, but can you please double-check?
> > >   4. Please resolve any pending old comments so that it is mergeable
> > >
> > > Thanks!
> >
> > Hi @hariharans29 , resolved the pending comments and incorporated one of the Copilot suggestions. The CI issues appear to be unrelated, and performance has been preserved across these PR iterations. Thanks.
>
> Thanks for the contribution. I merged it. I take it you have run onnxruntime_mlas_test.exe and onnxruntime_test_all.exe in your environment, right?

I have executed both test suites in my environment, and all tests are passing. Thank you for merging this, and for your continued reviews and suggestions.

5 participants