
[MLAS] Add fused Silu and Gelu kernels for AVX512#27690

Merged
hariharans29 merged 40 commits into main from hari/fused_silu_avx512 on Mar 24, 2026

Conversation

@hariharans29
Member

@hariharans29 hariharans29 commented Mar 17, 2026

Description

Add fused SiLU and exact GELU (erf-based) kernels for AVX512F.

Silu benchmarks:
[image]

GELU exact (Erf) benchmarks:
[image]

Motivation and Context

Improve performance on AVX512F

SiLU shows a small regression at B=1, but I don't think the absolute difference is significant.
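
For reference, the two activations named in the description can be written as plain scalar functions. This is only a sketch of the math, not the MLAS kernels from this PR; the names `SiluReference`, `GeluErfReference`, and `ApplyElementwise` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Scalar reference definitions (illustrative; not the kernels added by the PR).

// SiLU(x) = x * sigmoid(x) = x / (1 + exp(-x))
inline float SiluReference(float x) {
    return x / (1.0f + std::exp(-x));
}

// Exact GELU(x) = 0.5 * x * (1 + erf(x / sqrt(2)))
inline float GeluErfReference(float x) {
    constexpr float kInvSqrt2 = 0.70710678118654752440f;
    return 0.5f * x * (1.0f + std::erf(x * kInvSqrt2));
}

// Hypothetical helper: apply a scalar activation to every element of a buffer.
template <typename Fn>
void ApplyElementwise(const float* input, float* output, std::size_t n, Fn fn) {
    assert(input != nullptr && output != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        output[i] = fn(input[i]);
    }
}
```

A scalar loop like this is the kind of baseline a vectorized kernel would typically be compared against within a small tolerance.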

Contributor

@github-actions github-actions Bot left a comment

You can commit the suggested changes from lintrunner.

Comment thread onnxruntime/test/mlas/unittest/test_transcendental_avx512.cpp Outdated
Comment thread onnxruntime/test/mlas/unittest/test_transcendental_avx512.cpp Outdated
Comment thread onnxruntime/test/mlas/unittest/test_transcendental_avx512.cpp Outdated
Comment thread onnxruntime/test/mlas/unittest/test_transcendental_avx512.cpp Fixed
hariharans29 and others added 3 commits March 16, 2026 20:40
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull request overview

This PR extends MLAS and the CPU EP to use new fused/unary compute entry points for exact GELU(erf) and SiLU, adding an AVX512-optimized implementation (including an optional minimax erf approximation for GELU) plus associated tests and benchmarks.

Changes:

  • Add MLAS APIs MlasComputeGeluErf (exact vs minimax mode) and MlasComputeSilu, and wire CPU EP GELU + contrib QuickGelu(alpha=1) to use them.
  • Add AVX512F intrinsic implementations for GELU(erf) (exact + minimax approximation) and SiLU, with platform dispatch wiring.
  • Add new MLAS AVX512 transcendental unit tests and a new MLAS transcendental benchmark; add a session option to enable GELU minimax mode.
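
The reason QuickGelu with alpha equal to 1 can reuse a SiLU kernel is that the two functions coincide. A minimal sketch, assuming the usual definition QuickGelu(x) = x * sigmoid(alpha * x); the function names here are hypothetical and do not come from the PR.

```cpp
#include <cassert>
#include <cmath>

inline float Sigmoid(float x) {
    return 1.0f / (1.0f + std::exp(-x));
}

// QuickGelu as commonly defined: x * sigmoid(alpha * x).
inline float QuickGelu(float x, float alpha) {
    return x * Sigmoid(alpha * x);
}

// SiLU: x * sigmoid(x), i.e. QuickGelu with alpha == 1.
inline float Silu(float x) {
    return x * Sigmoid(x);
}

// Hypothetical wrapper mirroring the idea of an alpha == 1 fast path.
inline float QuickGeluWithFastPath(float x, float alpha) {
    if (alpha == 1.0f) {
        return Silu(x);
    }
    return QuickGelu(x, alpha);
}
```

For any other alpha the general formula is still needed, which is why the fast path is conditional.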

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| onnxruntime/test/providers/cpu/activation/activation_op_test.cc | Adds an ORT-level test for the GELU minimax-enabled session option. |
| onnxruntime/test/mlas/unittest/test_transcendental_avx512.cpp | Adds AVX512 transcendental unit tests comparing AVX512 kernels vs reference/generic paths. |
| onnxruntime/test/mlas/bench/bench_transcendental.cpp | Adds benchmarks comparing fused MLAS unary paths vs unfused baselines for SiLU and GELU(erf). |
| onnxruntime/core/providers/cpu/tensor/gelu.h | Adds reading of the new session option to toggle minimax mode. |
| onnxruntime/core/providers/cpu/tensor/gelu.cc | Switches exact GELU(erf) implementation to MlasComputeGeluErf with mode selection. |
| onnxruntime/core/mlas/lib/silu.cpp | Adds baseline MLAS SiLU kernel + dispatched entry point. |
| onnxruntime/core/mlas/lib/platform.cpp | Initializes and dispatches new GELU/SiLU kernel routine pointers; wires AVX512 variants. |
| onnxruntime/core/mlas/lib/mlasi.h | Adds new kernel declarations and new platform routine pointers. |
| onnxruntime/core/mlas/lib/intrinsics/avx512/silu_avx512f.cpp | Adds AVX512F SiLU kernel implementation. |
| onnxruntime/core/mlas/lib/intrinsics/avx512/gelu_avx512f.cpp | Adds AVX512F exact GELU(erf) and minimax approximation implementations. |
| onnxruntime/core/mlas/lib/gelu.cpp | Adds baseline MLAS exact GELU(erf) kernel and dispatched MlasComputeGeluErf. |
| onnxruntime/core/mlas/inc/mlas.h | Exposes new MLAS public APIs and GELU(erf) mode enum. |
| onnxruntime/contrib_ops/cpu/activations.h | Uses MlasComputeSilu fast path for QuickGelu when alpha==1. |
| include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h | Adds session option key for enabling GELU minimax mode. |
| cmake/onnxruntime_mlas.cmake | Adds new MLAS sources and AVX512 intrinsics sources to the build. |
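
To make the "fused vs unfused" comparison in the benchmark row concrete, here is a rough scalar sketch of the two shapes for SiLU. It is an assumption about what the terms mean in general, not a copy of bench_transcendental.cpp; `SiluUnfused` and `SiluFused` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Unfused: two passes over memory plus a temporary buffer
// (first sigmoid, then an elementwise multiply).
void SiluUnfused(const float* input, float* output, std::size_t n) {
    std::vector<float> sigmoid(n);
    for (std::size_t i = 0; i < n; ++i) {
        sigmoid[i] = 1.0f / (1.0f + std::exp(-input[i]));
    }
    for (std::size_t i = 0; i < n; ++i) {
        output[i] = input[i] * sigmoid[i];
    }
}

// Fused: a single pass with no temporary buffer.
void SiluFused(const float* input, float* output, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        const float x = input[i];
        output[i] = x * (1.0f / (1.0f + std::exp(-x)));
    }
}
```

Both produce the same values; the fused form simply avoids the intermediate buffer and the second pass over the data.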

Comment thread onnxruntime/test/providers/cpu/activation/activation_op_test.cc Outdated
Comment thread onnxruntime/test/mlas/bench/bench_transcendental.cpp Outdated
Comment thread onnxruntime/core/mlas/inc/mlas.h
Comment thread onnxruntime/test/mlas/unittest/test_transcendental_avx512.cpp
Comment thread onnxruntime/test/mlas/unittest/test_transcendental_avx512.cpp
Comment thread onnxruntime/core/providers/cpu/tensor/gelu.h
hariharans29 and others added 7 commits March 16, 2026 20:52
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Contributor

@github-actions github-actions Bot left a comment

You can commit the suggested changes from lintrunner.

Comment thread onnxruntime/test/mlas/unittest/test_transcendental_avx512.cpp Outdated
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Contributor

@tianleiwu tianleiwu left a comment

1. AVX512 dispatch coverage (onnxruntime/test/mlas/unittest/test_transcendental_avx512.cpp)

Positive:

  • The new MLAS tests do a good job covering tail lengths, random inputs, and special values for the exact AVX512 GELU/SiLU kernels, which is the right level of stress for new vector math code.

Concern:

  • ⚠️ AVX512F-only machines will skip these tests even though the runtime enables the kernels there: the test gate checks GetMlasPlatform().Avx512Supported_ (test), but the new GELU/SiLU dispatch flips over as soon as plain AVX512F is available (platform dispatch). Avx512Supported_ is only set later for the stricter AVX512 core feature set (BW/DQ/VL) (platform flag). On an AVX512F-only CPU, production code will run the new kernels, but the dedicated unit test will report "AVX512 is not available" and skip them.
    bool IsAvx512Available() {
      return GetMlasPlatform().GeluKernelRoutine == MlasGeluKernelAvx512F;
    }
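
The mismatch described in concern 1 can be reproduced with a small stand-alone model. Everything below is hypothetical (the `PlatformModel` struct, its fields, and the stub kernels are stand-ins, not MLAS code); it only illustrates why gating a test on a stricter feature flag can skip a kernel that dispatch has already selected.

```cpp
#include <cassert>
#include <cstddef>

using UnaryKernel = void (*)(const float*, float*, std::size_t);

// Placeholder kernels; each writes a different tag so the two stay distinct.
void GeluKernelGenericStub(const float*, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = 0.0f;
}
void GeluKernelAvx512FStub(const float*, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = 1.0f;
}

// Hypothetical, simplified stand-in for a platform descriptor.
struct PlatformModel {
    bool avx512_core_supported = false;  // stricter flag: F plus BW/DQ/VL
    UnaryKernel gelu_kernel = GeluKernelGenericStub;
};

PlatformModel MakePlatform(bool has_avx512f, bool has_avx512_core_extensions) {
    PlatformModel p;
    if (has_avx512f) {
        p.gelu_kernel = GeluKernelAvx512FStub;  // dispatch flips on plain AVX512F
    }
    p.avx512_core_supported = has_avx512f && has_avx512_core_extensions;
    return p;
}

// Gate on the stricter flag: can skip tests on AVX512F-only machines.
bool GateOnFlag(const PlatformModel& p) {
    return p.avx512_core_supported;
}

// Gate on what dispatch actually selected: matches production behavior.
bool GateOnDispatch(const PlatformModel& p) {
    return p.gelu_kernel == GeluKernelAvx512FStub;
}
```

On the modeled AVX512F-only machine, `GateOnFlag` returns false while `GateOnDispatch` returns true, which is the gap the suggested `IsAvx512Available()` closes.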

2. Minimax path coverage (onnxruntime/core/mlas/lib/intrinsics/avx512/gelu_avx512f.cpp)

Positive:

  • The minimax path is wired conservatively: the session option only requests it through MlasComputeGeluErf, and non-supporting platforms fall back to the existing exact path instead of failing.

Concern:

  • ⚠️ The riskiest new kernel is only lightly covered: the new table-driven minimax routine adds its own special-value handling, including an explicit -inf -> -0 fixup (kernel), but the new MLAS unit test only exercises the exact AVX512 kernel (test body), and the operator-level test for the session option uses input_values, which include +inf but not -inf (inputs, test). That means the new session-option path can change observable special-value behavior without any direct regression coverage.
    // Add a direct AVX512 minimax-vs-reference MLAS test, and extend the
    // operator/session-option test corpus with -inf.
    MlasGeluKernelAvx512FMinimaxApprox(input, avx512_output, size);
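
As an illustration of why -inf deserves its own entry in the test corpus: evaluating the erf formula naively at -inf multiplies -inf by 0 and yields NaN, whereas the mathematical limit is -0. The sketch below uses hypothetical names and is not the PR's kernel; it only shows the special-value behavior a regression test would pin down.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Naive exact GELU: 0.5 * x * (1 + erf(x / sqrt(2))).
// At x = -inf this evaluates (-inf) * 0, which is NaN.
inline float GeluErfNaive(float x) {
    return 0.5f * x * (1.0f + std::erf(x * 0.70710678f));
}

// Same formula with an explicit fixup for -inf, returning the limit value -0.
inline float GeluErfWithFixup(float x) {
    if (std::isinf(x) && x < 0.0f) {
        return -0.0f;
    }
    return GeluErfNaive(x);
}
```

A corpus containing +inf, -inf, and NaN exercises all three branches of this behavior; +inf alone does not reach the fixup.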

Summary of Concerns

| # | Severity | Component | Issue |
| --- | --- | --- | --- |
| 1 | Suggestion | AVX512 unit tests | Test gating is stricter than runtime dispatch, so AVX512F-only machines skip coverage for kernels they actually run. |
| 2 | Suggestion | Minimax GELU coverage | The new minimax kernel lacks direct MLAS stress coverage and the operator test corpus misses -inf, so special-value regressions can slip through. |

@hariharans29
Member Author

hariharans29 commented Mar 21, 2026

Hi @tianleiwu

Thanks for the feedback.

  1. Unit test gating is fixed to match the dispatch.
  2. Removed the minimax GELU path altogether, as I have some doubts about that code and the customer did not see any gains from it anyway. I will add it back later if there is value in it.

Contributor

Copilot AI left a comment

Pull request overview

This PR extends MLAS with dispatched “fused” unary activation entry points for SiLU and exact GELU (erf formulation) on AVX512F, and wires those entry points into the CPU provider (GELU exact path; QuickGelu’s α=1 path). It also adds MLAS unit tests and a micro-benchmark to validate/measure the new dispatch paths.

Changes:

  • Add MLAS public APIs MlasComputeSilu and MlasComputeGeluErf, plus baseline kernels and AVX512F implementations, and dispatch them via MLAS_PLATFORM.
  • Switch CPU GELU exact implementation to call MlasComputeGeluErf, and switch QuickGelu (α=1) to call MlasComputeSilu.
  • Add AVX512-focused MLAS unit tests and a benchmark for fused vs unfused unary activation paths.
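
The "public entry point plus dispatched kernel pointer" shape described above can be sketched as follows. This is a simplified assumption about the pattern, with hypothetical names (`Platform`, `GetPlatform`, `ComputeSilu`, `SiluKernelBaseline`); the real MLAS_PLATFORM structure and signatures are not shown in this thread.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

using UnaryKernel = void (*)(const float*, float*, std::size_t);

// Baseline scalar kernel used when no specialized variant is installed.
void SiluKernelBaseline(const float* input, float* output, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        output[i] = input[i] / (1.0f + std::exp(-input[i]));
    }
}

// Hypothetical platform object holding the selected kernel pointer.
struct Platform {
    UnaryKernel silu_kernel = SiluKernelBaseline;
};

// One process-wide instance; a real implementation would pick the kernel
// from CPU feature detection during initialization.
Platform& GetPlatform() {
    static Platform platform;
    return platform;
}

// Public entry point: callers never name a specific kernel.
void ComputeSilu(const float* input, float* output, std::size_t n) {
    assert(input != nullptr && output != nullptr);
    GetPlatform().silu_kernel(input, output, n);
}
```

Callers such as an operator implementation only see `ComputeSilu`, so swapping in a faster kernel requires no change on their side.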

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| onnxruntime/test/providers/cpu/activation/activation_op_test.cc | Adds a session-options config keys include (currently unused). |
| onnxruntime/test/mlas/unittest/test_transcendental_avx512.cpp | New MLAS unit tests for AVX512 GELU/SiLU dispatch and numeric validation. |
| onnxruntime/test/mlas/bench/bench_transcendental.cpp | New benchmark comparing fused dispatch vs unfused baselines for SiLU and GELU(erf). |
| onnxruntime/core/providers/cpu/tensor/gelu.h | Adds #pragma once. |
| onnxruntime/core/providers/cpu/tensor/gelu.cc | Replaces manual erf-based GELU loop with MlasComputeGeluErf. |
| onnxruntime/core/mlas/lib/silu.cpp | Adds baseline SiLU kernel and dispatched MlasComputeSilu. |
| onnxruntime/core/mlas/lib/gelu.cpp | Adds baseline exact GELU(erf) kernel and dispatched MlasComputeGeluErf. |
| onnxruntime/core/mlas/lib/platform.cpp | Initializes and AVX512-dispatches GELU/SiLU kernel routine pointers. |
| onnxruntime/core/mlas/lib/mlasi.h | Adds declarations for new kernels and platform dispatch function pointers. |
| onnxruntime/core/mlas/lib/intrinsics/avx512/silu_avx512f.cpp | New AVX512F SiLU implementation using approximations and masked tails. |
| onnxruntime/core/mlas/lib/intrinsics/avx512/gelu_avx512f.cpp | New AVX512F exact GELU(erf) implementation (currently has scalar-tail behavior concerns). |
| onnxruntime/core/mlas/inc/mlas.h | Exposes MlasComputeGeluErf/MlasComputeSilu in the public MLAS header with aliasing notes. |
| onnxruntime/contrib_ops/cpu/activations.h | Uses MlasComputeSilu for QuickGelu when alpha_ == 1.0f. |
| cmake/onnxruntime_mlas.cmake | Adds new sources and ensures AVX512 compilation for the new intrinsic files. |

Comment thread onnxruntime/core/mlas/lib/intrinsics/avx512/gelu_avx512f.cpp Outdated
Comment thread onnxruntime/test/providers/cpu/activation/activation_op_test.cc Outdated
Comment thread onnxruntime/test/mlas/bench/bench_transcendental.cpp
hariharans29 and others added 3 commits March 20, 2026 19:27
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.


Comment thread onnxruntime/test/mlas/bench/bench_transcendental.cpp
Comment thread onnxruntime/core/mlas/lib/silu.cpp Outdated
Comment thread onnxruntime/core/mlas/lib/gelu.cpp Outdated
@hariharans29 hariharans29 requested a review from tianleiwu March 22, 2026 06:01
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.


Comment thread onnxruntime/core/mlas/lib/intrinsics/avx512/silu_avx512f.cpp Outdated
@hariharans29 hariharans29 marked this pull request as draft March 22, 2026 22:23
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.


Comment thread onnxruntime/test/mlas/bench/bench_transcendental.cpp
Comment thread onnxruntime/test/mlas/bench/bench_transcendental.cpp Outdated
Comment thread onnxruntime/test/mlas/unittest/test_transcendental_avx512.cpp
@hariharans29 hariharans29 marked this pull request as ready for review March 23, 2026 04:24
Comment thread onnxruntime/core/mlas/lib/platform.cpp Outdated
@hariharans29 hariharans29 requested a review from Copilot March 23, 2026 20:37
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.


Comment thread onnxruntime/core/mlas/lib/platform.cpp
Comment thread onnxruntime/core/mlas/lib/intrinsics/avx512/silu_avx512f.cpp Outdated
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@hariharans29 hariharans29 requested a review from tianleiwu March 23, 2026 21:16
@hariharans29 hariharans29 merged commit 38a2625 into main Mar 24, 2026
93 of 94 checks passed
@hariharans29 hariharans29 deleted the hari/fused_silu_avx512 branch March 24, 2026 17:27