[MLAS] Add fused Silu and Gelu kernels for AVX512#27690
Conversation
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Pull request overview
This PR extends MLAS and the CPU EP to use new fused/unary compute entry points for exact GELU(erf) and SiLU, adding an AVX512-optimized implementation (including an optional minimax erf approximation for GELU) plus associated tests and benchmarks.
Changes:
- Add MLAS APIs `MlasComputeGeluErf` (exact vs. minimax mode) and `MlasComputeSilu`, and wire CPU EP GELU and contrib QuickGelu (alpha=1) to use them.
- Add AVX512F intrinsic implementations for GELU(erf) (exact and minimax approximation) and SiLU, with platform dispatch wiring.
- Add new MLAS AVX512 transcendental unit tests and a new MLAS transcendental benchmark; add a session option to enable GELU minimax mode.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| onnxruntime/test/providers/cpu/activation/activation_op_test.cc | Adds an ORT-level test for the GELU minimax-enabled session option. |
| onnxruntime/test/mlas/unittest/test_transcendental_avx512.cpp | Adds AVX512 transcendental unit tests comparing AVX512 kernels vs reference/generic paths. |
| onnxruntime/test/mlas/bench/bench_transcendental.cpp | Adds benchmarks comparing fused MLAS unary paths vs unfused baselines for SiLU and GELU(erf). |
| onnxruntime/core/providers/cpu/tensor/gelu.h | Adds reading of the new session option to toggle minimax mode. |
| onnxruntime/core/providers/cpu/tensor/gelu.cc | Switches exact GELU(erf) implementation to MlasComputeGeluErf with mode selection. |
| onnxruntime/core/mlas/lib/silu.cpp | Adds baseline MLAS SiLU kernel + dispatched entry point. |
| onnxruntime/core/mlas/lib/platform.cpp | Initializes and dispatches new GELU/SiLU kernel routine pointers; wires AVX512 variants. |
| onnxruntime/core/mlas/lib/mlasi.h | Adds new kernel declarations and new platform routine pointers. |
| onnxruntime/core/mlas/lib/intrinsics/avx512/silu_avx512f.cpp | Adds AVX512F SiLU kernel implementation. |
| onnxruntime/core/mlas/lib/intrinsics/avx512/gelu_avx512f.cpp | Adds AVX512F exact GELU(erf) and minimax approximation implementations. |
| onnxruntime/core/mlas/lib/gelu.cpp | Adds baseline MLAS exact GELU(erf) kernel and dispatched MlasComputeGeluErf. |
| onnxruntime/core/mlas/inc/mlas.h | Exposes new MLAS public APIs and GELU(erf) mode enum. |
| onnxruntime/contrib_ops/cpu/activations.h | Uses MlasComputeSilu fast path for QuickGelu when alpha==1. |
| include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h | Adds session option key for enabling GELU minimax mode. |
| cmake/onnxruntime_mlas.cmake | Adds new MLAS sources and AVX512 intrinsics sources to the build. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
tianleiwu
left a comment
1. AVX512 dispatch coverage (onnxruntime/test/mlas/unittest/test_transcendental_avx512.cpp)
Positive:
- The new MLAS tests do a good job covering tail lengths, random inputs, and special values for the exact AVX512 GELU/SiLU kernels, which is the right level of stress for new vector math code.
Concern:
⚠️ AVX512F-only machines will skip these tests even though the runtime enables the kernels there: the test gate checks `GetMlasPlatform().Avx512Supported_` (test), but the new GELU/SiLU dispatch flips over as soon as plain AVX512F is available (platform dispatch). `Avx512Supported_` is only set later, for the stricter AVX512 core feature set (BW/DQ/VL) (platform flag). On an AVX512F-only CPU, production code will run the new kernels, but the dedicated unit test will report "AVX512 is not available" and skip them. Consider gating the test on the dispatched routine instead:

```cpp
bool IsAvx512Available() { return GetMlasPlatform().GeluKernelRoutine == MlasGeluKernelAvx512F; }
```
2. Minimax path coverage (onnxruntime/core/mlas/lib/intrinsics/avx512/gelu_avx512f.cpp)
Positive:
- The minimax path is wired conservatively: the session option only requests it through
MlasComputeGeluErf, and non-supporting platforms fall back to the existing exact path instead of failing.
Concern:
⚠️ The riskiest new kernel is only lightly covered: the new table-driven minimax routine adds its own special-value handling, including an explicit -inf -> -0 fixup (kernel), but the new MLAS unit test only exercises the exact AVX512 kernel (test body), and the operator-level test for the session option uses `input_values`, which include +inf but not -inf (inputs, test). That means the new session-option path can change observable special-value behavior without any direct regression coverage. Suggestion: add a direct AVX512 minimax-vs-reference MLAS test, and extend the operator/session-option test corpus with -inf:

```cpp
MlasGeluKernelAvx512FMinimaxApprox(input, avx512_output, size);
```
Summary of Concerns
| # | Severity | Component | Issue |
|---|---|---|---|
| 1 | Suggestion | AVX512 unit tests | Test gating is stricter than runtime dispatch, so AVX512F-only machines skip coverage for kernels they actually run. |
| 2 | Suggestion | Minimax GELU coverage | The new minimax kernel lacks direct MLAS stress coverage and the operator test corpus misses -inf, so special-value regressions can slip through. |
Hi @tianleiwu, thanks for the feedback.
Pull request overview
This PR extends MLAS with dispatched “fused” unary activation entry points for SiLU and exact GELU (erf formulation) on AVX512F, and wires those entry points into the CPU provider (GELU exact path; QuickGelu’s α=1 path). It also adds MLAS unit tests and a micro-benchmark to validate/measure the new dispatch paths.
Changes:
- Add MLAS public APIs `MlasComputeSilu` and `MlasComputeGeluErf`, plus baseline kernels and AVX512F implementations, and dispatch them via `MLAS_PLATFORM`.
- Switch the CPU GELU exact implementation to call `MlasComputeGeluErf`, and switch QuickGelu (α=1) to call `MlasComputeSilu`.
- Add AVX512-focused MLAS unit tests and a benchmark for fused vs. unfused unary activation paths.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| onnxruntime/test/providers/cpu/activation/activation_op_test.cc | Adds a session-options config keys include (currently unused). |
| onnxruntime/test/mlas/unittest/test_transcendental_avx512.cpp | New MLAS unit tests for AVX512 GELU/SiLU dispatch and numeric validation. |
| onnxruntime/test/mlas/bench/bench_transcendental.cpp | New benchmark comparing fused dispatch vs unfused baselines for SiLU and GELU(erf). |
| onnxruntime/core/providers/cpu/tensor/gelu.h | Adds #pragma once. |
| onnxruntime/core/providers/cpu/tensor/gelu.cc | Replaces manual erf-based GELU loop with MlasComputeGeluErf. |
| onnxruntime/core/mlas/lib/silu.cpp | Adds baseline SiLU kernel and dispatched MlasComputeSilu. |
| onnxruntime/core/mlas/lib/gelu.cpp | Adds baseline exact GELU(erf) kernel and dispatched MlasComputeGeluErf. |
| onnxruntime/core/mlas/lib/platform.cpp | Initializes and AVX512-dispatches GELU/SiLU kernel routine pointers. |
| onnxruntime/core/mlas/lib/mlasi.h | Adds declarations for new kernels and platform dispatch function pointers. |
| onnxruntime/core/mlas/lib/intrinsics/avx512/silu_avx512f.cpp | New AVX512F SiLU implementation using approximations and masked tails. |
| onnxruntime/core/mlas/lib/intrinsics/avx512/gelu_avx512f.cpp | New AVX512F exact GELU(erf) implementation (currently has scalar-tail behavior concerns). |
| onnxruntime/core/mlas/inc/mlas.h | Exposes MlasComputeGeluErf/MlasComputeSilu in the public MLAS header with aliasing notes. |
| onnxruntime/contrib_ops/cpu/activations.h | Uses MlasComputeSilu for QuickGelu when alpha_ == 1.0f. |
| cmake/onnxruntime_mlas.cmake | Adds new sources and ensures AVX512 compilation for the new intrinsic files. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Description
Add fused SiLU and exact GELU (erf-based) kernels for AVX512F.
Silu benchmarks:

GELU exact (Erf) benchmarks:

Motivation and Context
Improve performance on AVX512F
SiLU shows a small regression at B=1, but I don't think the absolute difference is significant.