[CPU] Update the NCHWc transformer to handle more patterns #27691
hariharans29 merged 16 commits into main
Conversation
Pull request overview
This PR extends the CPU NCHWc graph transformer to recognize and optimize additional post-convolution patterns, primarily targeting elementwise activations and channel-wise scaling with Mul, and adds unit tests to validate the new transformations.
Changes:
- Add `TransformMul` to rewrite `Mul(NCHWc_tensor, constant_channel_scale)` into an equivalent depthwise `com.microsoft.nchwc.Conv`, eliminating reorders and removing the `Mul` node.
- Extend activation handling so `Gelu` and `QuickGelu` (MS domain) are treated as NCHWc-compatible elementwise activations but are not fused into `com.microsoft.nchwc.Conv` via the `activation` attribute.
- Update and add NCHWc optimizer tests for the new `Mul` scaling pattern and the expanded activation set.
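The `Mul`-to-depthwise-`Conv` rewrite rests on a simple identity: multiplying an NCHW tensor by a per-channel scale is exactly a 1x1 grouped convolution with `group = C` and weight shape `[C, 1, 1, 1]`. A minimal standalone sketch (plain C++, not the onnxruntime API; names are illustrative) that checks the equivalence numerically:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Elementwise channel scale: y[n][c][h][w] = x[n][c][h][w] * scale[c].
std::vector<float> ChannelScaleMul(const std::vector<float>& x,
                                   const std::vector<float>& scale,
                                   int N, int C, int H, int W) {
  std::vector<float> y(x.size());
  for (int n = 0; n < N; ++n)
    for (int c = 0; c < C; ++c)
      for (int i = 0; i < H * W; ++i) {
        int idx = (n * C + c) * H * W + i;
        y[idx] = x[idx] * scale[c];
      }
  return y;
}

// General 1x1 grouped conv, NCHW layout, weight shape [Cout, Cin/group, 1, 1].
std::vector<float> GroupedConv1x1(const std::vector<float>& x,
                                  const std::vector<float>& w,
                                  int N, int Cin, int H, int W,
                                  int Cout, int group) {
  int cin_g = Cin / group, cout_g = Cout / group;
  std::vector<float> y(static_cast<size_t>(N) * Cout * H * W, 0.f);
  for (int n = 0; n < N; ++n)
    for (int g = 0; g < group; ++g)
      for (int oc = 0; oc < cout_g; ++oc)
        for (int ic = 0; ic < cin_g; ++ic)
          for (int i = 0; i < H * W; ++i) {
            int out_c = g * cout_g + oc;
            int in_c = g * cin_g + ic;
            y[(n * Cout + out_c) * H * W + i] +=
                x[(n * Cin + in_c) * H * W + i] * w[out_c * cin_g + ic];
          }
  return y;
}

// With group == Cout == Cin == C, the conv weight degenerates to one float
// per channel -- the same vector as the Mul scale.
bool SameResult(int N, int C, int H, int W) {
  std::vector<float> x(static_cast<size_t>(N) * C * H * W), scale(C);
  for (size_t i = 0; i < x.size(); ++i) x[i] = 0.1f * static_cast<float>(i % 7) - 0.3f;
  for (int c = 0; c < C; ++c) scale[c] = 0.5f + 0.25f * static_cast<float>(c);
  auto a = ChannelScaleMul(x, scale, N, C, H, W);
  auto b = GroupedConv1x1(x, scale, N, C, H, W, C, C);
  for (size_t i = 0; i < a.size(); ++i)
    if (std::fabs(a[i] - b[i]) > 1e-6f) return false;
  return true;
}
```

This is why the transformer can delete the `Mul` node outright: the depthwise conv computes bit-for-bit the same per-channel product while staying in the NCHWc layout.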
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| onnxruntime/test/optimizer/nchwc_optimizer_test.cc | Adds a Conv→Mul(channel-scale)→Conv test and extends activation tests to include MS-domain Gelu/QuickGelu; updates test helper to support non-ONNX domains. |
| onnxruntime/core/optimizer/nchwc_transformer.cc | Implements TransformMul for channel-scale Mul elimination; updates activation fusion gating and expands supported activation op/domain set. |
Pull request overview
This PR expands the CPU NCHWc (blocked-channel) graph transformer to better accommodate patterns common in modern CNNs by (a) treating additional activation ops as layout-agnostic to avoid unnecessary reorders and (b) converting certain channel-scale Mul patterns into an equivalent depthwise NCHWc Conv to reduce layout conversions and enable downstream fusions.
Changes:
- Added NCHWc transformer handling for additional activations (`HardSigmoid`, `Gelu`, `QuickGelu`) so they can operate directly on NCHWc tensors without inserting NCHWc↔NCHW reorder nodes.
- Added a `Mul` transform that rewrites static channel-scale multiplies into a depthwise `com.microsoft.nchwc.Conv`.
- Extended optimizer tests to cover the new activation patterns and the `Mul`→depthwise-`Conv` rewrite, including pre-optimization graph assertions.
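The reorder-avoidance argument for the new activations is that an elementwise op commutes with any data-layout permutation: reorder-then-activate equals activate-then-reorder, so the NCHWc↔NCHW reorder pair around the activation is redundant. A small sketch illustrating the commutation (using a tanh-approximation Gelu for demonstration, not the onnxruntime kernel):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// tanh-approximation Gelu, applied to a single element.
float Gelu(float x) {
  return 0.5f * x * (1.f + std::tanh(0.7978845608f * (x + 0.044715f * x * x * x)));
}

// Any layout change (NCHW -> NCHWc, transpose, ...) is just a permutation
// of the underlying elements.
std::vector<float> Permute(const std::vector<float>& x, const std::vector<size_t>& perm) {
  std::vector<float> y(x.size());
  for (size_t i = 0; i < x.size(); ++i) y[i] = x[perm[i]];
  return y;
}

std::vector<float> Map(const std::vector<float>& x) {
  std::vector<float> y(x.size());
  for (size_t i = 0; i < x.size(); ++i) y[i] = Gelu(x[i]);
  return y;
}

// Gelu(Permute(x)) == Permute(Gelu(x)) for every permutation, so the
// transformer may run the activation directly on the NCHWc tensor.
bool Commutes(const std::vector<float>& x, const std::vector<size_t>& perm) {
  auto a = Map(Permute(x, perm));
  auto b = Permute(Map(x), perm);
  for (size_t i = 0; i < a.size(); ++i)
    if (std::fabs(a[i] - b[i]) > 1e-6f) return false;
  return true;
}
```

The same reasoning applies to any unary elementwise op; it does not hold for ops that read across the channel axis (e.g. `Softmax` over `C`), which is why only true elementwise activations qualify.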
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| onnxruntime/test/optimizer/nchwc_optimizer_test.cc | Adds test coverage for new activation handling and Mul channel-scale rewrite; extends test helper utilities to support MS domain ops and optional pre-optimization checks. |
| onnxruntime/core/optimizer/nchwc_transformer.cc | Implements TransformMul and extends activation handling to keep more elementwise activations in NCHWc, reducing reorder insertion and enabling more NCHWc-compatible compute. |
tianleiwu
left a comment
1. TransformMul — channel-scale Mul → depthwise NCHWc Conv (onnxruntime/core/optimizer/nchwc_transformer.cc)
Positive:
- The new function mirrors `TransformBatchNormalization` exactly: pad the 1-D weight to `nchwc_channels`, create a `[nchwc_channels, 1, 1, 1]` initializer, emit a depthwise NCHWc `Conv` with `group = nchwc_channels`. The weight shape, `group` attribute, and `remaining_original_uses_--` bookkeeping all match the established pattern.
- Dispatching `Mul` before the `input_edges == 0` gate with an explanatory comment is the correct design: a constant-initializer scale has no live input edge, so the gate would never fire for the interesting case.
- The two-NCHWc-input fast-exit to `TransformBinary(node, false)` avoids duplicating existing logic.
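The weight-padding step mentioned above can be sketched in isolation: the per-channel scale is zero-padded from `C` entries up to the next multiple of the NCHWc block size, and the padded vector, viewed as `[nchwc_channels, 1, 1, 1]`, is the depthwise conv weight. This is a standalone illustration under assumed names (`block_size`, the zero-padding policy); it is not the onnxruntime code:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Round the channel count up to the next multiple of the NCHWc block size.
int RoundUpToBlock(int channels, int block_size) {
  return ((channels + block_size - 1) / block_size) * block_size;
}

// Zero-pad a per-channel scale of length C to nchwc_channels entries.
// Flattened, this is the [nchwc_channels, 1, 1, 1] depthwise conv weight;
// the zero tail leaves the padding channels at zero, which is harmless
// because their outputs are never consumed.
std::vector<float> PadChannelScale(const std::vector<float>& scale, int block_size) {
  int nchwc_channels = RoundUpToBlock(static_cast<int>(scale.size()), block_size);
  std::vector<float> weight(static_cast<size_t>(nchwc_channels), 0.f);
  std::copy(scale.begin(), scale.end(), weight.begin());
  return weight;
}
```
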
Concern:
⚠️ `scale_dims` check does not cover shape `[C]`: The PR description says the optimization handles scale shapes `1,C,1,1` or `C,1,1` or `C`, but the code only handles rank-3 `[C,1,1]` and rank-4 `[1,C,1,1]`. A 1-D tensor `[C]` broadcasts to the last dimension in ONNX (i.e., to width, not channels), so it cannot actually represent a channel scale on an NCHW tensor; the description is therefore misleading rather than the code being wrong. Worth a one-line clarification in the comment.
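The broadcast point is easy to verify with numpy/ONNX trailing-dimension alignment: against an NCHW tensor, a 1-D scale of length `K` lines up with the `W` axis, while `[C,1,1]` lines up with the channel axis. A minimal sketch (a hypothetical helper written for this review, not onnxruntime code) showing which axis each scale shape actually multiplies:

```cpp
#include <cassert>
#include <vector>

// Multiply an NCHW tensor by a scale tensor using numpy/ONNX trailing-dim
// broadcast alignment: scale dims are left-padded with 1s to rank 4, and
// size-1 dims broadcast.
std::vector<float> BroadcastMul(const std::vector<float>& x,
                                const std::vector<float>& scale,
                                const std::vector<int>& x_dims,  // {N, C, H, W}
                                std::vector<int> scale_dims) {
  while (scale_dims.size() < 4) scale_dims.insert(scale_dims.begin(), 1);
  int N = x_dims[0], C = x_dims[1], H = x_dims[2], W = x_dims[3];
  std::vector<float> y(x.size());
  for (int n = 0; n < N; ++n)
    for (int c = 0; c < C; ++c)
      for (int h = 0; h < H; ++h)
        for (int w = 0; w < W; ++w) {
          int coord[4] = {n, c, h, w};
          int sidx = 0;  // row-major index into the (broadcast) scale tensor
          for (int d = 0; d < 4; ++d)
            sidx = sidx * scale_dims[d] + (scale_dims[d] == 1 ? 0 : coord[d]);
          int xidx = ((n * C + c) * H + h) * W + w;
          y[xidx] = x[xidx] * scale[sidx];
        }
  return y;
}
```

On an all-ones `{1,2,1,2}` input, a scale `{10,20}` with dims `{2,1,1}` multiplies per channel, while the same values with dims `{2}` multiply per width position, confirming that a bare `[C]` is not a channel scale.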
2. TransformActivation — HardSigmoid fusion gap (onnxruntime/core/optimizer/nchwc_transformer.cc)
Positive:
- Gating the activation-into-Conv fusion behind `can_fuse_activation` and keeping the `else` path (`CreateNchwcArgument`) for unsupported activations is clean: `Gelu` and `QuickGelu` are correctly kept as NCHWc pass-through nodes because `GetFusedActivationAttr` in `fused_activation.cc` does not recognize them and would return an error.
Concern:
- ⚠️ `HardSigmoid` is missing from `can_fuse_activation` and `activation_params` is never written: `ConvMulChannelScaleHardSigmoid` will fail and a fused kernel would crash. `GetFusedActivationAttr` (`fused_activation.cc:29-44`) handles `HardSigmoid` as a parameterized activation that requires an `activation_params` float-list attribute `[alpha, beta]`. If it is absent the function returns `INVALID_ARGUMENT`, and `NchwcConv::NchwcConv` calls `ORT_ENFORCE(GetFusedActivationAttr(...).IsOK())`, a crash at kernel construction time. The current code:

  ```cpp
  const bool can_fuse_activation = (node.OpType() == "Relu") ||
                                   (node.OpType() == "Sigmoid") ||
                                   (node.OpType() == "Tanh");
  if (...can_fuse_activation...) {
    nchwc_node.AddAttribute("activation", node.OpType());
    FuseNchwcArgument(node, *nchwc_input);
    removed_nodes_.push_front(node.Index());
  }
  ```

  leaves `HardSigmoid` unfused. The test `ConvMulChannelScaleHardSigmoid` expects `op_to_count["HardSigmoid"] == 0` and `hard_sigmoid_fused_count == 1`; both will be wrong. The fix requires two changes in `TransformActivation`:

  ```cpp
  const bool can_fuse_activation = (node.OpType() == "Relu") ||
                                   (node.OpType() == "Sigmoid") ||
                                   (node.OpType() == "Tanh") ||
                                   (node.OpType() == "HardSigmoid");  // ADD THIS
  if (... && can_fuse_activation && ...) {
    nchwc_node.AddAttribute("activation", node.OpType());
    // ADD: write activation_params for parameterized activations
    if (node.OpType() == "HardSigmoid") {
      auto* alpha_attr = graph_utils::GetNodeAttribute(node, "alpha");
      auto* beta_attr = graph_utils::GetNodeAttribute(node, "beta");
      float alpha = (alpha_attr != nullptr ? alpha_attr->f() : 0.2f);
      float beta = (beta_attr != nullptr ? beta_attr->f() : 0.5f);
      nchwc_node.AddAttribute("activation_params", std::vector<float>{alpha, beta});
    }
    FuseNchwcArgument(node, *nchwc_input);
    removed_nodes_.push_front(node.Index());
  }
  ```
The default values `alpha=0.2, beta=0.5` match the ONNX spec defaults for `HardSigmoid` and are what `conv_activation_fusion.cc` uses for the same case.
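For reference, ONNX defines `HardSigmoid(x) = max(0, min(1, alpha * x + beta))`, with spec defaults `alpha = 0.2` and `beta = 0.5`, which is what the attribute fallbacks in the snippet above correspond to. A standalone sketch of the operator:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// ONNX HardSigmoid: y = max(0, min(1, alpha * x + beta)).
// Defaults alpha = 0.2, beta = 0.5 per the ONNX operator spec.
float HardSigmoid(float x, float alpha = 0.2f, float beta = 0.5f) {
  return std::max(0.f, std::min(1.f, alpha * x + beta));
}
```

With the defaults, the function is the identity-slope ramp through (0, 0.5), saturating at x <= -2.5 and x >= 2.5; these are exactly the two parameters that must travel in `activation_params` for the fused kernel to reproduce the standalone node.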
3. NCHWc optimizer tests (onnxruntime/test/optimizer/nchwc_optimizer_test.cc)
Positive:
- Adding a `domain` parameter to `AddNode` and threading `kMSDomain` through the test harness is minimal and correct.
- Adding `check_pre_optimization_graph` as an optional callback to `NchwcOptimizerTester` lets tests assert the starting graph shape, making it much easier to diagnose if a pre-optimization step changes.
- `ConvHardSigmoidTwoConsumers` specifically exercises the `starting_original_uses_ == 1` guard and confirms `HardSigmoid` is kept as a separate NCHWc pass-through node when `Conv` has two consumers; this is exactly the right boundary case.
- `ConvMulChannelScale` exercises all four combinations of explicit-batch-dim × scale-first order, which pins the operand-order symmetry in `TransformMul`.
Concern:
⚠️ `ConvMulChannelScaleHardSigmoid` assertions are inconsistent with the implementation: The test expects `HardSigmoid` to be eliminated and its `activation_params` to appear on a `nchwc.Conv` node. As written today, the implementation does not fuse `HardSigmoid`, so these assertions will fail. This is directly caused by the gap described in §2 above; fixing §2 will make these assertions pass.
Summary of Concerns
| # | Severity | Component | Issue |
|---|---|---|---|
| 1 | High | `TransformActivation` | `HardSigmoid` excluded from `can_fuse_activation`; `activation_params` never set; `ConvMulChannelScaleHardSigmoid` test will fail and a (hypothetically) fused kernel would crash at init. |
| 2 | Nitpick | `TransformMul` comment / PR description | PR description claims `[C]` scale shape is handled; code correctly rejects it (wrong broadcast semantics), but description should be corrected to avoid confusion. |
Verdict
REQUEST CHANGES — the HardSigmoid activation fusion is incomplete: can_fuse_activation must include HardSigmoid and TransformActivation must write the activation_params attribute before calling FuseNchwcArgument, otherwise the test fails and any machine that does reach a fused path will crash at kernel construction time.
Hi @tianleiwu, thanks for the feedback.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
tianleiwu
left a comment
There was a problem hiding this comment.
LGTM.
It would be nice to add some tests exercising the `can_fuse_activation` guard for `Gelu`/`QuickGelu` with a single-consumer `Conv`.
Thanks. Added the test.
…c transformer suite (#27821)

### Description
As title

### Motivation and Context
Tiny continuation to #27691

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Description
2 main changes:

1. Handle activations seen in modern CNNs (`QuickGelu`, `Gelu`) in the NCHWc transformer and avoid reorder nodes being inserted before and after them to perform the NCHWc <-> NCHW data layout transforms. These can be avoided because they are elementwise ops that are otherwise data-layout agnostic.
2. Rewrite a channel-scaling `Mul` (scale input shape `1,C,1,1` or `C,1,1`) into a depthwise NCHWc `Conv` operation. This avoids reorder nodes and enables fusion of any subsequent `Add` operations into the new `Conv` node.

Motivation and Context
Avoid unnecessary data layout operations and enable more NCHWc compatible compute and fusions