[CPU] Update the NCHWc transformer to handle more patterns #27691

Merged

hariharans29 merged 16 commits into main from hari/nchwc_transformer_fixes on Mar 22, 2026
Conversation

hariharans29 (Member) commented Mar 17, 2026

Description

Two main changes:

  1. Handle activations seen in modern CNNs (QuickGelu, Gelu) in the NCHWc transformer, avoiding the reorder nodes that would otherwise be inserted before and after them to perform the NCHWc <-> NCHW data layout transforms. The reorders are unnecessary because these are elementwise ops and hence data layout agnostic.

  2. Rewrite a channel-scaling Mul (one whose scaling input has shape [1,C,1,1] or [C,1,1]) into a depthwise NCHWc Conv operation. This avoids reorder nodes and enables fusion of any subsequent Add operations into the new Conv node; a sketch of the equivalence follows this list.
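
For illustration, here is a minimal sketch of the rewrite; the helper name and signature are assumptions for this sketch, not the transformer's actual code. A per-channel scale applied by Mul becomes the weight of a depthwise com.microsoft.nchwc.Conv with one [1, 1, 1] filter per channel:

    #include <cstddef>
    #include <vector>

    // Builds the depthwise-Conv weight that reproduces Mul(X, scale), where X is
    // NCHW and scale has shape [1, C, 1, 1] (or [C, 1, 1]). The weight has shape
    // [nchwc_channels, 1, 1, 1] and the emitted Conv uses group = nchwc_channels,
    // so output[n][c][h][w] = X[n][c][h][w] * weight[c][0][0][0].
    std::vector<float> BuildChannelScaleConvWeight(const std::vector<float>& scale_data,
                                                   std::size_t nchwc_channels) {
      std::vector<float> weight(nchwc_channels, 0.0f);  // zero-padded to the NCHWc block size
      for (std::size_t c = 0; c < scale_data.size(); ++c) {
        weight[c] = scale_data[c];  // weight[c][0][0][0] = scale[c]
      }
      return weight;
    }

The transformer then emits the com.microsoft.nchwc.Conv in place of the Mul, so no NCHWc <-> NCHW reorders are needed and a subsequent Add can fuse into the new Conv.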

Motivation and Context

Avoid unnecessary data layout operations and enable more NCHWc-compatible compute and fusions.

Copilot AI (Contributor) left a comment

Pull request overview

This PR extends the CPU NCHWc graph transformer to recognize and optimize additional post-convolution patterns, primarily targeting elementwise activations and channel-wise scaling with Mul, and adds unit tests to validate the new transformations.

Changes:

  • Add TransformMul to rewrite Mul(NCHWc_tensor, constant_channel_scale) into an equivalent depthwise com.microsoft.nchwc.Conv, eliminating reorders and removing the Mul node.
  • Extend activation handling so Gelu and QuickGelu (MS domain) are treated as NCHWc-compatible elementwise activations but are not fused into com.microsoft.nchwc.Conv via the activation attribute.
  • Update and add NCHWc optimizer tests for the new Mul scaling pattern and the expanded activation set.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File: onnxruntime/test/optimizer/nchwc_optimizer_test.cc
Adds a Conv→Mul(channel-scale)→Conv test and extends activation tests to include MS-domain Gelu/QuickGelu; updates the test helper to support non-ONNX domains.

File: onnxruntime/core/optimizer/nchwc_transformer.cc
Implements TransformMul for channel-scale Mul elimination; updates activation fusion gating and expands the supported activation op/domain set.


github-actions bot (Contributor) left a comment

You can commit the suggested changes from lintrunner.

Copilot AI (Contributor) left a comment

Pull request overview

This PR expands the CPU NCHWc (blocked-channel) graph transformer to better accommodate patterns common in modern CNNs by (a) treating additional activation ops as layout-agnostic to avoid unnecessary reorders and (b) converting certain channel-scale Mul patterns into an equivalent depthwise NCHWc Conv to reduce layout conversions and enable downstream fusions.

Changes:

  • Added NCHWc transformer handling for additional activations (HardSigmoid, Gelu, QuickGelu) so they can operate directly on NCHWc tensors without inserting NCHWc↔NCHW reorder nodes.
  • Added a Mul transform that rewrites static channel-scale multiplies into a depthwise com.microsoft.nchwc.Conv.
  • Extended optimizer tests to cover the new activation patterns and the Mul→depthwise-Conv rewrite, including pre-optimization graph assertions.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File: onnxruntime/test/optimizer/nchwc_optimizer_test.cc
Adds test coverage for the new activation handling and the Mul channel-scale rewrite; extends test helper utilities to support MS-domain ops and optional pre-optimization checks.

File: onnxruntime/core/optimizer/nchwc_transformer.cc
Implements TransformMul and extends activation handling to keep more elementwise activations in NCHWc, reducing reorder insertion and enabling more NCHWc-compatible compute.


tianleiwu (Contributor) left a comment

1. TransformMul — channel-scale Mul → depthwise NCHWc Conv (onnxruntime/core/optimizer/nchwc_transformer.cc)

Positive:

  • The new function mirrors TransformBatchNormalization exactly: pad the 1D weight to nchwc_channels, create a [nchwc_channels, 1, 1, 1] initializer, emit a depthwise NCHWc Conv with group = nchwc_channels. The weight shape, group attribute, and remaining_original_uses_-- bookkeeping all match the established pattern.
  • Dispatching Mul before the input_edges == 0 gate with an explanatory comment is the correct design: a constant-initializer scale has no live input edge, so the gate would never fire for the interesting case.
  • The two-NCHWc-input fast-exit to TransformBinary(node, false) avoids duplicating existing logic.

Concern:

  • ⚠️ scale_dims check does not cover shape [C]: The PR description says the optimization handles scale shapes 1,C,1,1 or C,1,1 or C, but the code only handles rank-3 [C,1,1] and rank-4 [1,C,1,1]. A 1-D tensor [C] broadcasts to the last dimension in ONNX (i.e., to width, not channels), so it cannot actually represent a channel scale on an NCHW tensor — the description is therefore misleading rather than the code being wrong. Worth a one-line clarification in the comment.
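
To make the broadcast point concrete, here is a sketch of the shape gate being described; the helper name and signature are assumed for illustration, not the PR's exact code:

    #include <cstdint>
    #include <vector>

    // Returns true only for shapes that broadcast onto the channel axis of an
    // NCHW tensor under ONNX's right-aligned (NumPy-style) broadcasting rules.
    bool IsChannelScaleShape(const std::vector<int64_t>& dims) {
      // [1, C, 1, 1] multiplies along C: a true channel scale.
      if (dims.size() == 4) return dims[0] == 1 && dims[2] == 1 && dims[3] == 1;
      // [C, 1, 1] right-aligns to [1, C, 1, 1]: also a channel scale.
      if (dims.size() == 3) return dims[1] == 1 && dims[2] == 1;
      // Rank-1 [C] right-aligns to [1, 1, 1, C] and would scale along W, not C,
      // so it is correctly rejected.
      return false;
    }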

2. TransformActivation — HardSigmoid fusion gap (onnxruntime/core/optimizer/nchwc_transformer.cc)

Positive:

  • Gating the activation-into-Conv fusion behind can_fuse_activation and keeping the else path (CreateNchwcArgument) for unsupported activations is clean: Gelu and QuickGelu are correctly kept as NCHWc pass-through nodes because GetFusedActivationAttr in fused_activation.cc does not recognize them and would return an error.

Concern:

  • ⚠️ HardSigmoid is missing from can_fuse_activation and activation_params is never written — ConvMulChannelScaleHardSigmoid will fail and a fused kernel would crash: GetFusedActivationAttr (fused_activation.cc:29-44) handles HardSigmoid as a parameterized activation that requires an activation_params float-list attribute [alpha, beta]. If it is absent the function returns INVALID_ARGUMENT, and NchwcConv::NchwcConv calls ORT_ENFORCE(GetFusedActivationAttr(...).IsOK()) — crash at kernel construction time.

    The current code:

    const bool can_fuse_activation = (node.OpType() == "Relu") ||
                                     (node.OpType() == "Sigmoid") ||
                                     (node.OpType() == "Tanh");
    if (...can_fuse_activation...) {
      nchwc_node.AddAttribute("activation", node.OpType());
      FuseNchwcArgument(node, *nchwc_input);
      removed_nodes_.push_front(node.Index());
    }

    leaves HardSigmoid unfused. The test ConvMulChannelScaleHardSigmoid expects op_to_count["HardSigmoid"] == 0 and hard_sigmoid_fused_count == 1 — both will be wrong.

    The fix requires two changes in TransformActivation:

    const bool can_fuse_activation = (node.OpType() == "Relu") ||
                                     (node.OpType() == "Sigmoid") ||
                                     (node.OpType() == "Tanh") ||
                                     (node.OpType() == "HardSigmoid");  // ADD THIS
    if (... && can_fuse_activation && ...) {
      nchwc_node.AddAttribute("activation", node.OpType());
      // ADD: write activation_params for parameterized activations
      if (node.OpType() == "HardSigmoid") {
        auto* alpha_attr = graph_utils::GetNodeAttribute(node, "alpha");
        auto* beta_attr  = graph_utils::GetNodeAttribute(node, "beta");
        float alpha = (alpha_attr != nullptr ? alpha_attr->f() : 0.2f);
        float beta  = (beta_attr  != nullptr ? beta_attr->f()  : 0.5f);
        nchwc_node.AddAttribute("activation_params",
                                std::vector<float>{alpha, beta});
      }
      FuseNchwcArgument(node, *nchwc_input);
      removed_nodes_.push_front(node.Index());
    }

    The default values alpha=0.2, beta=0.5 match the ONNX spec defaults for HardSigmoid and are what conv_activation_fusion.cc uses for the same case.

3. NCHWc optimizer tests (onnxruntime/test/optimizer/nchwc_optimizer_test.cc)

Positive:

  • Adding a domain parameter to AddNode and threading kMSDomain through the test harness is minimal and correct.
  • Adding check_pre_optimization_graph as an optional callback to NchwcOptimizerTester lets tests assert the starting graph shape, making it much easier to diagnose if a pre-optimization step changes.
  • ConvHardSigmoidTwoConsumers specifically exercises the starting_original_uses_ == 1 guard and confirms HardSigmoid is kept as a separate NCHWc pass-through node when Conv has two consumers — this is exactly the right boundary case.
  • ConvMulChannelScale exercises all four combinations of explicit-batch-dim × scale-first order, which pins the operand-order symmetry in TransformMul.

Concern:

  • ⚠️ ConvMulChannelScaleHardSigmoid assertions are inconsistent with the implementation: The test expects HardSigmoid to be eliminated and its activation_params to appear on a nchwc.Conv node. As written today, the implementation does not fuse HardSigmoid, so these assertions will fail. This is directly caused by the gap described in §2 above; fixing §2 will make these assertions pass.

Summary of Concerns

1. Severity: High. Component: TransformActivation. Issue: HardSigmoid is excluded from can_fuse_activation and activation_params is never set; the ConvMulChannelScaleHardSigmoid test will fail and a (hypothetically) fused kernel would crash at init.

2. Severity: Nitpick. Component: TransformMul comment / PR description. Issue: the PR description claims the [C] scale shape is handled; the code correctly rejects it (wrong broadcast semantics), but the description should be corrected to avoid confusion.

Verdict

REQUEST CHANGES: the HardSigmoid activation fusion is incomplete. can_fuse_activation must include HardSigmoid, and TransformActivation must write the activation_params attribute before calling FuseNchwcArgument; otherwise the test fails, and any model that does reach a fused path will crash at kernel construction time.

hariharans29 (Member, Author) commented

Hi @tianleiwu

Thanks for the feedback.

  1. Removed HardSigmoid for now. Will add it back later after more testing.
  2. The comment in my PR about supporting a [C]-shaped initializer for Mul was an oversight. I have removed it now.

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.



Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.



hariharans29 requested a review from tianleiwu on March 21, 2026 at 05:07
tianleiwu previously approved these changes Mar 21, 2026
tianleiwu (Contributor) left a comment

LGTM.

It would be nice to add tests exercising the can_fuse_activation guard for Gelu/QuickGelu with a single-consumer Conv.

hariharans29 (Member, Author) commented Mar 22, 2026

> LGTM.
>
> It would be nice to add tests exercising the can_fuse_activation guard for Gelu/QuickGelu with a single-consumer Conv.

Thanks. Added the test.

hariharans29 merged commit aa6f2e3 into main on Mar 22, 2026
91 checks passed
hariharans29 deleted the hari/nchwc_transformer_fixes branch on March 22, 2026 at 21:15
hariharans29 added a commit that referenced this pull request Mar 24, 2026
…c transformer suite (#27821)

### Description
As title



### Motivation and Context
Tiny continuation to #27691

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>