[CPU] Update the NCHWc transformer to handle more patterns #27691
hariharans29 merged 16 commits into main
Conversation
Pull request overview
This PR extends the CPU NCHWc graph transformer to recognize and optimize additional post-convolution patterns, primarily targeting elementwise activations and channel-wise scaling with Mul, and adds unit tests to validate the new transformations.
Changes:
- Add `TransformMul` to rewrite `Mul(NCHWc_tensor, constant_channel_scale)` into an equivalent depthwise `com.microsoft.nchwc.Conv`, eliminating reorders and removing the `Mul` node.
- Extend activation handling so `Gelu` and `QuickGelu` (MS domain) are treated as NCHWc-compatible elementwise activations but are not fused into `com.microsoft.nchwc.Conv` via the `activation` attribute.
- Update and add NCHWc optimizer tests for the new `Mul` scaling pattern and the expanded activation set.
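The `Mul`-to-depthwise-`Conv` rewrite rests on a simple identity: multiplying an NCHW tensor by a per-channel scale is exactly a 1x1 grouped convolution with `group = C` and weight shape `[C, 1, 1, 1]`. A minimal standalone sketch (plain C++, not the onnxruntime API; names are illustrative) that checks the equivalence numerically:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Elementwise channel scale: y[n][c][h][w] = x[n][c][h][w] * scale[c].
std::vector<float> ChannelScaleMul(const std::vector<float>& x,
                                   const std::vector<float>& scale,
                                   int N, int C, int H, int W) {
  std::vector<float> y(x.size());
  for (int n = 0; n < N; ++n)
    for (int c = 0; c < C; ++c)
      for (int i = 0; i < H * W; ++i) {
        int idx = (n * C + c) * H * W + i;
        y[idx] = x[idx] * scale[c];
      }
  return y;
}

// General 1x1 grouped conv, NCHW layout, weight shape [Cout, Cin/group, 1, 1].
std::vector<float> GroupedConv1x1(const std::vector<float>& x,
                                  const std::vector<float>& w,
                                  int N, int Cin, int H, int W,
                                  int Cout, int group) {
  int cin_g = Cin / group, cout_g = Cout / group;
  std::vector<float> y(static_cast<size_t>(N) * Cout * H * W, 0.f);
  for (int n = 0; n < N; ++n)
    for (int g = 0; g < group; ++g)
      for (int oc = 0; oc < cout_g; ++oc)
        for (int ic = 0; ic < cin_g; ++ic)
          for (int i = 0; i < H * W; ++i) {
            int out_c = g * cout_g + oc;
            int in_c = g * cin_g + ic;
            y[(n * Cout + out_c) * H * W + i] +=
                x[(n * Cin + in_c) * H * W + i] * w[out_c * cin_g + ic];
          }
  return y;
}

// With group == Cout == Cin == C, the conv weight degenerates to one float
// per channel -- the same vector as the Mul scale.
bool SameResult(int N, int C, int H, int W) {
  std::vector<float> x(static_cast<size_t>(N) * C * H * W), scale(C);
  for (size_t i = 0; i < x.size(); ++i) x[i] = 0.1f * static_cast<float>(i % 7) - 0.3f;
  for (int c = 0; c < C; ++c) scale[c] = 0.5f + 0.25f * static_cast<float>(c);
  auto a = ChannelScaleMul(x, scale, N, C, H, W);
  auto b = GroupedConv1x1(x, scale, N, C, H, W, C, C);
  for (size_t i = 0; i < a.size(); ++i)
    if (std::fabs(a[i] - b[i]) > 1e-6f) return false;
  return true;
}
```

This is why the transformer can delete the `Mul` node outright: the depthwise conv computes bit-for-bit the same per-channel product while staying in the NCHWc layout.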
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| onnxruntime/test/optimizer/nchwc_optimizer_test.cc | Adds a Conv→Mul(channel-scale)→Conv test and extends activation tests to include MS-domain Gelu/QuickGelu; updates test helper to support non-ONNX domains. |
| onnxruntime/core/optimizer/nchwc_transformer.cc | Implements TransformMul for channel-scale Mul elimination; updates activation fusion gating and expands supported activation op/domain set. |
Pull request overview
This PR expands the CPU NCHWc (blocked-channel) graph transformer to better accommodate patterns common in modern CNNs by (a) treating additional activation ops as layout-agnostic to avoid unnecessary reorders and (b) converting certain channel-scale Mul patterns into an equivalent depthwise NCHWc Conv to reduce layout conversions and enable downstream fusions.
Changes:
- Added NCHWc transformer handling for additional activations (`HardSigmoid`, `Gelu`, `QuickGelu`) so they can operate directly on NCHWc tensors without inserting NCHWc↔NCHW reorder nodes.
- Added a `Mul` transform that rewrites static channel-scale multiplies into a depthwise `com.microsoft.nchwc.Conv`.
- Extended optimizer tests to cover the new activation patterns and the `Mul`→depthwise-`Conv` rewrite, including pre-optimization graph assertions.
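The reorder-avoidance argument for the new activations is that an elementwise op commutes with any data-layout permutation: reorder-then-activate equals activate-then-reorder, so the NCHWc↔NCHW reorder pair around the activation is redundant. A small sketch illustrating the commutation (using a tanh-approximation Gelu for demonstration, not the onnxruntime kernel):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// tanh-approximation Gelu, applied to a single element.
float Gelu(float x) {
  return 0.5f * x * (1.f + std::tanh(0.7978845608f * (x + 0.044715f * x * x * x)));
}

// Any layout change (NCHW -> NCHWc, transpose, ...) is just a permutation
// of the underlying elements.
std::vector<float> Permute(const std::vector<float>& x, const std::vector<size_t>& perm) {
  std::vector<float> y(x.size());
  for (size_t i = 0; i < x.size(); ++i) y[i] = x[perm[i]];
  return y;
}

std::vector<float> Map(const std::vector<float>& x) {
  std::vector<float> y(x.size());
  for (size_t i = 0; i < x.size(); ++i) y[i] = Gelu(x[i]);
  return y;
}

// Gelu(Permute(x)) == Permute(Gelu(x)) for every permutation, so the
// transformer may run the activation directly on the NCHWc tensor.
bool Commutes(const std::vector<float>& x, const std::vector<size_t>& perm) {
  auto a = Map(Permute(x, perm));
  auto b = Permute(Map(x), perm);
  for (size_t i = 0; i < a.size(); ++i)
    if (std::fabs(a[i] - b[i]) > 1e-6f) return false;
  return true;
}
```

The same reasoning applies to any unary elementwise op; it does not hold for ops that read across the channel axis (e.g. `Softmax` over `C`), which is why only true elementwise activations qualify.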
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| onnxruntime/test/optimizer/nchwc_optimizer_test.cc | Adds test coverage for new activation handling and Mul channel-scale rewrite; extends test helper utilities to support MS domain ops and optional pre-optimization checks. |
| onnxruntime/core/optimizer/nchwc_transformer.cc | Implements TransformMul and extends activation handling to keep more elementwise activations in NCHWc, reducing reorder insertion and enabling more NCHWc-compatible compute. |
tianleiwu
left a comment
1. TransformMul — channel-scale Mul → depthwise NCHWc Conv (onnxruntime/core/optimizer/nchwc_transformer.cc)
Positive:
- The new function mirrors `TransformBatchNormalization` exactly: pad the 1-D weight to `nchwc_channels`, create a `[nchwc_channels, 1, 1, 1]` initializer, emit a depthwise NCHWc `Conv` with `group = nchwc_channels`. The weight shape, `group` attribute, and `remaining_original_uses_--` bookkeeping all match the established pattern.
- Dispatching `Mul` before the `input_edges == 0` gate with an explanatory comment is the correct design: a constant-initializer scale has no live input edge, so the gate would never fire for the interesting case.
- The two-NCHWc-input fast-exit to `TransformBinary(node, false)` avoids duplicating existing logic.
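The weight-padding step mentioned above can be sketched in isolation: the per-channel scale is zero-padded from `C` entries up to the next multiple of the NCHWc block size, and the padded vector, viewed as `[nchwc_channels, 1, 1, 1]`, is the depthwise conv weight. This is a standalone illustration under assumed names (`block_size`, the zero-padding policy); it is not the onnxruntime code:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Round the channel count up to the next multiple of the NCHWc block size.
int RoundUpToBlock(int channels, int block_size) {
  return ((channels + block_size - 1) / block_size) * block_size;
}

// Zero-pad a per-channel scale of length C to nchwc_channels entries.
// Flattened, this is the [nchwc_channels, 1, 1, 1] depthwise conv weight;
// the zero tail leaves the padding channels at zero, which is harmless
// because their outputs are never consumed.
std::vector<float> PadChannelScale(const std::vector<float>& scale, int block_size) {
  int nchwc_channels = RoundUpToBlock(static_cast<int>(scale.size()), block_size);
  std::vector<float> weight(static_cast<size_t>(nchwc_channels), 0.f);
  std::copy(scale.begin(), scale.end(), weight.begin());
  return weight;
}
```
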
Concern:
⚠️ `scale_dims` check does not cover shape `[C]`: The PR description says the optimization handles scale shapes `1,C,1,1` or `C,1,1` or `C`, but the code only handles rank-3 `[C,1,1]` and rank-4 `[1,C,1,1]`. A 1-D tensor `[C]` broadcasts to the last dimension in ONNX (i.e., to width, not channels), so it cannot actually represent a channel scale on an NCHW tensor; the description is therefore misleading rather than the code being wrong. Worth a one-line clarification in the comment.
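The broadcast point is easy to verify with numpy/ONNX trailing-dimension alignment: against an NCHW tensor, a 1-D scale of length `K` lines up with the `W` axis, while `[C,1,1]` lines up with the channel axis. A minimal sketch (a hypothetical helper written for this review, not onnxruntime code) showing which axis each scale shape actually multiplies:

```cpp
#include <cassert>
#include <vector>

// Multiply an NCHW tensor by a scale tensor using numpy/ONNX trailing-dim
// broadcast alignment: scale dims are left-padded with 1s to rank 4, and
// size-1 dims broadcast.
std::vector<float> BroadcastMul(const std::vector<float>& x,
                                const std::vector<float>& scale,
                                const std::vector<int>& x_dims,  // {N, C, H, W}
                                std::vector<int> scale_dims) {
  while (scale_dims.size() < 4) scale_dims.insert(scale_dims.begin(), 1);
  int N = x_dims[0], C = x_dims[1], H = x_dims[2], W = x_dims[3];
  std::vector<float> y(x.size());
  for (int n = 0; n < N; ++n)
    for (int c = 0; c < C; ++c)
      for (int h = 0; h < H; ++h)
        for (int w = 0; w < W; ++w) {
          int coord[4] = {n, c, h, w};
          int sidx = 0;  // row-major index into the (broadcast) scale tensor
          for (int d = 0; d < 4; ++d)
            sidx = sidx * scale_dims[d] + (scale_dims[d] == 1 ? 0 : coord[d]);
          int xidx = ((n * C + c) * H + h) * W + w;
          y[xidx] = x[xidx] * scale[sidx];
        }
  return y;
}
```

On an all-ones `{1,2,1,2}` input, a scale `{10,20}` with dims `{2,1,1}` multiplies per channel, while the same values with dims `{2}` multiply per width position, confirming that a bare `[C]` is not a channel scale.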
2. TransformActivation — HardSigmoid fusion gap (onnxruntime/core/optimizer/nchwc_transformer.cc)
Positive:
- Gating the activation-into-Conv fusion behind `can_fuse_activation` and keeping the `else` path (`CreateNchwcArgument`) for unsupported activations is clean: `Gelu` and `QuickGelu` are correctly kept as NCHWc pass-through nodes because `GetFusedActivationAttr` in `fused_activation.cc` does not recognize them and would return an error.
Concern:
- ⚠️ `HardSigmoid` is missing from `can_fuse_activation` and `activation_params` is never written: `ConvMulChannelScaleHardSigmoid` will fail and a fused kernel would crash. `GetFusedActivationAttr` (`fused_activation.cc:29-44`) handles `HardSigmoid` as a parameterized activation that requires an `activation_params` float-list attribute `[alpha, beta]`. If it is absent the function returns `INVALID_ARGUMENT`, and `NchwcConv::NchwcConv` calls `ORT_ENFORCE(GetFusedActivationAttr(...).IsOK())`, a crash at kernel construction time. The current code:

  ```cpp
  const bool can_fuse_activation = (node.OpType() == "Relu") ||
                                   (node.OpType() == "Sigmoid") ||
                                   (node.OpType() == "Tanh");
  if (...can_fuse_activation...) {
    nchwc_node.AddAttribute("activation", node.OpType());
    FuseNchwcArgument(node, *nchwc_input);
    removed_nodes_.push_front(node.Index());
  }
  ```

  leaves `HardSigmoid` unfused. The test `ConvMulChannelScaleHardSigmoid` expects `op_to_count["HardSigmoid"] == 0` and `hard_sigmoid_fused_count == 1`; both will be wrong. The fix requires two changes in `TransformActivation`:

  ```cpp
  const bool can_fuse_activation = (node.OpType() == "Relu") ||
                                   (node.OpType() == "Sigmoid") ||
                                   (node.OpType() == "Tanh") ||
                                   (node.OpType() == "HardSigmoid");  // ADD THIS
  if (... && can_fuse_activation && ...) {
    nchwc_node.AddAttribute("activation", node.OpType());
    // ADD: write activation_params for parameterized activations
    if (node.OpType() == "HardSigmoid") {
      auto* alpha_attr = graph_utils::GetNodeAttribute(node, "alpha");
      auto* beta_attr = graph_utils::GetNodeAttribute(node, "beta");
      float alpha = (alpha_attr != nullptr ? alpha_attr->f() : 0.2f);
      float beta = (beta_attr != nullptr ? beta_attr->f() : 0.5f);
      nchwc_node.AddAttribute("activation_params", std::vector<float>{alpha, beta});
    }
    FuseNchwcArgument(node, *nchwc_input);
    removed_nodes_.push_front(node.Index());
  }
  ```
The default values `alpha=0.2, beta=0.5` match the ONNX spec defaults for `HardSigmoid` and are what `conv_activation_fusion.cc` uses for the same case.
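For reference, ONNX defines `HardSigmoid(x) = max(0, min(1, alpha * x + beta))`, with spec defaults `alpha = 0.2` and `beta = 0.5`, which is what the attribute fallbacks in the snippet above correspond to. A standalone sketch of the operator:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// ONNX HardSigmoid: y = max(0, min(1, alpha * x + beta)).
// Defaults alpha = 0.2, beta = 0.5 per the ONNX operator spec.
float HardSigmoid(float x, float alpha = 0.2f, float beta = 0.5f) {
  return std::max(0.f, std::min(1.f, alpha * x + beta));
}
```

With the defaults, the function is the identity-slope ramp through (0, 0.5), saturating at x <= -2.5 and x >= 2.5; these are exactly the two parameters that must travel in `activation_params` for the fused kernel to reproduce the standalone node.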
3. NCHWc optimizer tests (onnxruntime/test/optimizer/nchwc_optimizer_test.cc)
Positive:
- Adding a `domain` parameter to `AddNode` and threading `kMSDomain` through the test harness is minimal and correct.
- Adding `check_pre_optimization_graph` as an optional callback to `NchwcOptimizerTester` lets tests assert the starting graph shape, making it much easier to diagnose if a pre-optimization step changes.
- `ConvHardSigmoidTwoConsumers` specifically exercises the `starting_original_uses_ == 1` guard and confirms `HardSigmoid` is kept as a separate NCHWc pass-through node when `Conv` has two consumers; this is exactly the right boundary case.
- `ConvMulChannelScale` exercises all four combinations of explicit-batch-dim × scale-first order, which pins the operand-order symmetry in `TransformMul`.
Concern:
⚠️ `ConvMulChannelScaleHardSigmoid` assertions are inconsistent with the implementation: The test expects `HardSigmoid` to be eliminated and its `activation_params` to appear on a `nchwc.Conv` node. As written today, the implementation does not fuse `HardSigmoid`, so these assertions will fail. This is directly caused by the gap described in §2 above; fixing §2 will make these assertions pass.
Summary of Concerns
| # | Severity | Component | Issue |
|---|---|---|---|
| 1 | High | `TransformActivation` | `HardSigmoid` excluded from `can_fuse_activation`; `activation_params` never set; `ConvMulChannelScaleHardSigmoid` test will fail and a (hypothetically) fused kernel would crash at init. |
| 2 | Nitpick | `TransformMul` comment / PR description | PR description claims `[C]` scale shape is handled; code correctly rejects it (wrong broadcast semantics), but description should be corrected to avoid confusion. |
Verdict
REQUEST CHANGES — the HardSigmoid activation fusion is incomplete: can_fuse_activation must include HardSigmoid and TransformActivation must write the activation_params attribute before calling FuseNchwcArgument, otherwise the test fails and any machine that does reach a fused path will crash at kernel construction time.
Hi @tianleiwu, thanks for the feedback.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
tianleiwu
left a comment
There was a problem hiding this comment.
LGTM.
It would be nice to add some tests exercising the `can_fuse_activation` guard for `Gelu`/`QuickGelu` with a single-consumer `Conv`.
Thanks. Added the test.
…c transformer suite (#27821)

### Description
As title

### Motivation and Context
Tiny continuation to #27691

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Description
2 main changes:

1. Handle activations seen in modern CNNs (`QuickGelu`, `Gelu`) in the NCHWc transformer and avoid reorder nodes being inserted before and after them to perform the NCHWc <-> NCHW data layout transforms. These can be avoided because they are elementwise ops that are otherwise data-layout agnostic.
2. Rewrite a channel-scaling `Mul` (scale input shape `1,C,1,1` or `C,1,1`) into a depthwise NCHWc `Conv` operation. This avoids reorder nodes and enables fusion of any subsequent `Add` operations into the new `Conv` node.

Motivation and Context
Avoid unnecessary data layout operations and enable more NCHWc compatible compute and fusions