[MLAS] Add fused Silu and Gelu kernels for AVX512#27690
Conversation
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Pull request overview
This PR extends MLAS and the CPU EP to use new fused/unary compute entry points for exact GELU(erf) and SiLU, adding an AVX512-optimized implementation (including an optional minimax erf approximation for GELU) plus associated tests and benchmarks.
Changes:
- Add MLAS APIs `MlasComputeGeluErf` (exact vs. minimax mode) and `MlasComputeSilu`, and wire CPU EP GELU and contrib QuickGelu (alpha=1) to use them.
- Add AVX512F intrinsic implementations for GELU(erf) (exact and minimax approximation) and SiLU, with platform dispatch wiring.
- Add new MLAS AVX512 transcendental unit tests and a new MLAS transcendental benchmark; add a session option to enable GELU minimax mode.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| onnxruntime/test/providers/cpu/activation/activation_op_test.cc | Adds an ORT-level test for the GELU minimax-enabled session option. |
| onnxruntime/test/mlas/unittest/test_transcendental_avx512.cpp | Adds AVX512 transcendental unit tests comparing AVX512 kernels vs reference/generic paths. |
| onnxruntime/test/mlas/bench/bench_transcendental.cpp | Adds benchmarks comparing fused MLAS unary paths vs unfused baselines for SiLU and GELU(erf). |
| onnxruntime/core/providers/cpu/tensor/gelu.h | Adds reading of the new session option to toggle minimax mode. |
| onnxruntime/core/providers/cpu/tensor/gelu.cc | Switches exact GELU(erf) implementation to MlasComputeGeluErf with mode selection. |
| onnxruntime/core/mlas/lib/silu.cpp | Adds baseline MLAS SiLU kernel + dispatched entry point. |
| onnxruntime/core/mlas/lib/platform.cpp | Initializes and dispatches new GELU/SiLU kernel routine pointers; wires AVX512 variants. |
| onnxruntime/core/mlas/lib/mlasi.h | Adds new kernel declarations and new platform routine pointers. |
| onnxruntime/core/mlas/lib/intrinsics/avx512/silu_avx512f.cpp | Adds AVX512F SiLU kernel implementation. |
| onnxruntime/core/mlas/lib/intrinsics/avx512/gelu_avx512f.cpp | Adds AVX512F exact GELU(erf) and minimax approximation implementations. |
| onnxruntime/core/mlas/lib/gelu.cpp | Adds baseline MLAS exact GELU(erf) kernel and dispatched MlasComputeGeluErf. |
| onnxruntime/core/mlas/inc/mlas.h | Exposes new MLAS public APIs and GELU(erf) mode enum. |
| onnxruntime/contrib_ops/cpu/activations.h | Uses MlasComputeSilu fast path for QuickGelu when alpha==1. |
| include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h | Adds session option key for enabling GELU minimax mode. |
| cmake/onnxruntime_mlas.cmake | Adds new MLAS sources and AVX512 intrinsics sources to the build. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
tianleiwu
left a comment
1. AVX512 dispatch coverage (onnxruntime/test/mlas/unittest/test_transcendental_avx512.cpp)
Positive:
- The new MLAS tests do a good job covering tail lengths, random inputs, and special values for the exact AVX512 GELU/SiLU kernels, which is the right level of stress for new vector math code.
Concern:
⚠️ AVX512F-only machines will skip these tests even though the runtime enables the kernels there: the test gate checks `GetMlasPlatform().Avx512Supported_` (test), but the new GELU/SiLU dispatch flips over as soon as plain AVX512F is available (platform dispatch). `Avx512Supported_` is only set later, for the stricter AVX512 core feature set (BW/DQ/VL) (platform flag). On an AVX512F-only CPU, production code will run the new kernels, but the dedicated unit test will report "AVX512 is not available" and skip them. Consider gating the test on the dispatched routine instead:

```cpp
bool IsAvx512Available() { return GetMlasPlatform().GeluKernelRoutine == MlasGeluKernelAvx512F; }
```
2. Minimax path coverage (onnxruntime/core/mlas/lib/intrinsics/avx512/gelu_avx512f.cpp)
Positive:
- The minimax path is wired conservatively: the session option only requests it through
MlasComputeGeluErf, and non-supporting platforms fall back to the existing exact path instead of failing.
Concern:
⚠️ The riskiest new kernel is only lightly covered: the new table-driven minimax routine adds its own special-value handling, including an explicit -inf -> -0 fixup (kernel), but the new MLAS unit test only exercises the exact AVX512 kernel (test body), and the operator-level test for the session option uses `input_values`, which include +inf but not -inf (inputs, test). That means the new session-option path can change observable special-value behavior without any direct regression coverage. Suggestion: add a direct AVX512 minimax-vs-reference MLAS test, and extend the operator/session-option test corpus with -inf:

```cpp
MlasGeluKernelAvx512FMinimaxApprox(input, avx512_output, size);
```
Summary of Concerns
| # | Severity | Component | Issue |
|---|---|---|---|
| 1 | Suggestion | AVX512 unit tests | Test gating is stricter than runtime dispatch, so AVX512F-only machines skip coverage for kernels they actually run. |
| 2 | Suggestion | Minimax GELU coverage | The new minimax kernel lacks direct MLAS stress coverage and the operator test corpus misses -inf, so special-value regressions can slip through. |
Hi @tianleiwu, thanks for the feedback.
Pull request overview
This PR extends MLAS with dispatched “fused” unary activation entry points for SiLU and exact GELU (erf formulation) on AVX512F, and wires those entry points into the CPU provider (GELU exact path; QuickGelu’s α=1 path). It also adds MLAS unit tests and a micro-benchmark to validate/measure the new dispatch paths.
Changes:
- Add MLAS public APIs `MlasComputeSilu` and `MlasComputeGeluErf`, plus baseline kernels and AVX512F implementations, and dispatch them via `MLAS_PLATFORM`.
- Switch the CPU GELU exact implementation to call `MlasComputeGeluErf`, and switch QuickGelu (α=1) to call `MlasComputeSilu`.
- Add AVX512-focused MLAS unit tests and a benchmark for fused vs. unfused unary activation paths.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| onnxruntime/test/providers/cpu/activation/activation_op_test.cc | Adds a session-options config keys include (currently unused). |
| onnxruntime/test/mlas/unittest/test_transcendental_avx512.cpp | New MLAS unit tests for AVX512 GELU/SiLU dispatch and numeric validation. |
| onnxruntime/test/mlas/bench/bench_transcendental.cpp | New benchmark comparing fused dispatch vs unfused baselines for SiLU and GELU(erf). |
| onnxruntime/core/providers/cpu/tensor/gelu.h | Adds #pragma once. |
| onnxruntime/core/providers/cpu/tensor/gelu.cc | Replaces manual erf-based GELU loop with MlasComputeGeluErf. |
| onnxruntime/core/mlas/lib/silu.cpp | Adds baseline SiLU kernel and dispatched MlasComputeSilu. |
| onnxruntime/core/mlas/lib/gelu.cpp | Adds baseline exact GELU(erf) kernel and dispatched MlasComputeGeluErf. |
| onnxruntime/core/mlas/lib/platform.cpp | Initializes and AVX512-dispatches GELU/SiLU kernel routine pointers. |
| onnxruntime/core/mlas/lib/mlasi.h | Adds declarations for new kernels and platform dispatch function pointers. |
| onnxruntime/core/mlas/lib/intrinsics/avx512/silu_avx512f.cpp | New AVX512F SiLU implementation using approximations and masked tails. |
| onnxruntime/core/mlas/lib/intrinsics/avx512/gelu_avx512f.cpp | New AVX512F exact GELU(erf) implementation (currently has scalar-tail behavior concerns). |
| onnxruntime/core/mlas/inc/mlas.h | Exposes MlasComputeGeluErf/MlasComputeSilu in the public MLAS header with aliasing notes. |
| onnxruntime/contrib_ops/cpu/activations.h | Uses MlasComputeSilu for QuickGelu when alpha_ == 1.0f. |
| cmake/onnxruntime_mlas.cmake | Adds new sources and ensures AVX512 compilation for the new intrinsic files. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Description
Add fused SiLU and exact GELU (erf-based) kernels for AVX512F.
Silu benchmarks:

GELU exact (Erf) benchmarks:

Motivation and Context
Improve performance on AVX512F
SiLU shows a small regression at B=1, but I don't think the absolute difference is significant.