
[CoreML EP] Add FusedConv support #28289

Merged
yuslepukhin merged 9 commits into microsoft:main from maxwbuckley:coreml-fusedconv on May 9, 2026

Conversation

@maxwbuckley (Contributor) commented Apr 30, 2026

Description

Adds support for `com.microsoft:FusedConv` to the CoreML EP's MLProgram and NeuralNetwork paths. `FusedConv` is produced by ORT's `ConvActivationFusion` pass when a model is optimized with the CPU EP (or any EP in `cpu_acl_js_webgpu_eps`) and saved via `session.optimized_model_filepath` or the ORT-format conversion tool. The saved graph contains `com.microsoft:FusedConv` nodes that — before this patch — the CoreML EP could not claim, fragmenting the partition.

ORT's in-process pipeline does not currently run `ConvActivationFusion` when CoreML EP is the target (the fusion's compat set excludes CoreML), so `FusedConv` typically reaches the CoreML EP only via pre-optimized graphs. That's a real and common workflow: anyone shipping a pre-optimized model artifact (mobile pipelines, ORT-format models, session-cached optimized graphs) that's then loaded with the CoreML EP hits this path.
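
For concreteness, a minimal sketch of that pre-optimize-then-reload workflow using the public ORT C++ API (model paths are placeholders; `OrtSessionOptionsAppendExecutionProvider_CoreML` and `COREML_FLAG_CREATE_MLPROGRAM` come from the CoreML provider factory header):

```cpp
#include <onnxruntime_cxx_api.h>
#include <coreml_provider_factory.h>  // include path may vary by build layout

int main() {
  Ort::Env env;

  // Pass 1: optimize with the CPU EP and persist the optimized graph.
  // ConvActivationFusion runs at ORT_ENABLE_EXTENDED and rewrites
  // Conv + activation pairs into com.microsoft:FusedConv nodes.
  Ort::SessionOptions cpu_opts;
  cpu_opts.SetGraphOptimizationLevel(ORT_ENABLE_EXTENDED);
  cpu_opts.SetOptimizedModelFilePath(ORT_TSTR("resnet50_v2_opt.onnx"));
  Ort::Session optimize_only{env, ORT_TSTR("resnet50_v2.onnx"), cpu_opts};

  // Pass 2: reload the optimized artifact on the CoreML EP (MLProgram).
  // Before this patch, the FusedConv nodes in it fell back to CPU.
  Ort::SessionOptions coreml_opts;
  OrtSessionOptionsAppendExecutionProvider_CoreML(coreml_opts,
                                                  COREML_FLAG_CREATE_MLPROGRAM);
  Ort::Session session{env, ORT_TSTR("resnet50_v2_opt.onnx"), coreml_opts};
  return 0;
}
```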

There's no pre-existing issue tracking this; it was discovered via DWPose / ResNet50 partitioning analysis on Apple Silicon.

Empirical impact

ResNet50-v2 from the ONNX model zoo, CPU-optimized at `ORT_ENABLE_EXTENDED` and reloaded on the CoreML EP (108 nodes total, 33 of them `FusedConv` with Relu activation). M3 Max, MLProgram, batch 1, 100-iter timed runs, 3 interleaved rounds (n=597 per variant):

|                    | Partitions | Nodes on CoreML | Mean (ms) | StdDev (ms) | P99 (ms) | Max (ms) |
|--------------------|------------|-----------------|-----------|-------------|----------|----------|
| Without this patch | 18         | 75 / 108        | 23.34     | 1.01        | 27.68    | 30.59    |
| With this patch    | 1          | 108 / 108       | 2.94      | 0.16        | 3.75     | 4.32     |

7.94× mean speedup. The 33 FusedConv nodes that previously fell back to CPU now stay on the ANE/GPU. Variance also tightens 6× (stddev 1.01 → 0.16).

Partition counts on other Conv-heavy ONNX-zoo models post CPU-optimization:

| Model         | Partitions (without) | Partitions (with) | Notes |
|---------------|----------------------|-------------------|-------|
| ResNet50-v2   | 18                   | 1                 | 33 FusedConv (Relu) |
| FCN-ResNet50  | 18                   | 1                 | 35 FusedConv (Relu); fails to compile on CoreML for unrelated reasons |
| YOLOv3 (full) | 27                   | 4                 | 72 FusedConv (LeakyRelu); detection post-proc fails on CoreML for unrelated dynamic-shape reasons |
| YOLOv3-tiny   | 13                   | 7                 | 11 FusedConv (LeakyRelu); same |

Partition reduction is robust across architectures. ResNet50 is the one configuration in this exact ONNX-zoo collection that runs end-to-end on the CoreML EP today; the FCN/YOLO failures are orthogonal CoreML-EP limitations in segmentation upsampling and detection post-processing.

Implementation

`FusedConv` is handled by the same `ConvOpBuilder` class as `Conv`; the builder now branches on `op_type`:

  • `Conv`: behaviour unchanged.
  • `FusedConv`: emit the `conv` MIL op into an intermediate, then chain the activation MIL op on top. Supports all six activation types `ConvActivationFusion` produces:
| ONNX activation | MIL op         | Params                                         |
|-----------------|----------------|------------------------------------------------|
| Relu            | `relu`         | —                                              |
| Sigmoid         | `sigmoid`      | —                                              |
| Tanh            | `tanh`         | —                                              |
| LeakyRelu       | `leaky_relu`   | alpha (from `activation_params`)               |
| Clip            | `clip`         | alpha=min, beta=max (from `activation_params`) |
| HardSigmoid     | `sigmoid_hard` | alpha, beta (from `activation_params`)         |

`IsOpSupportedImpl` rejects `FusedConv` in NeuralNetwork mode (which would emit an unfused Conv and silently lose the activation) and rejects any unrecognized activation string.
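
A simplified sketch of the table-driven shape this converged to after review (see the refactor commit below); struct and variable names here are illustrative, not the exact identifiers in `conv_op_builder.cc`:

```cpp
#include <cstddef>
#include <string_view>

// Illustrative sketch of the constexpr activation table: one entry per
// activation ConvActivationFusion can produce, consulted by both the
// support gate and the MIL lowering.
struct FusedActivation {
  const char* onnx_name;  // value of FusedConv's "activation" attribute
  const char* mil_op;     // MIL op chained after the conv output
  size_t num_params;      // required activation_params arity
};

constexpr FusedActivation kSupportedActivations[] = {
    {"Relu", "relu", 0},
    {"Sigmoid", "sigmoid", 0},
    {"Tanh", "tanh", 0},
    {"LeakyRelu", "leaky_relu", 1},      // alpha
    {"Clip", "clip", 2},                 // alpha=min, beta=max
    {"HardSigmoid", "sigmoid_hard", 2},  // alpha, beta
};

// An unknown activation name (or a mismatched activation_params arity)
// yields no table entry, so the node is rejected and falls back to CPU.
inline const FusedActivation* LookupActivation(std::string_view name) {
  for (const auto& entry : kSupportedActivations) {
    if (name == entry.onnx_name) return &entry;
  }
  return nullptr;
}
```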

Tests

Six new tests in `onnxruntime/test/providers/coreml/coreml_basic_test.cc`, one per supported activation, together covering the four parameter classes (param-less, single-param, two-param-positional, two-param-named):

  • `FusedConvTestRelu` — no `activation_params` attribute
  • `FusedConvTestSigmoid` — same shape, exercises sigmoid op-name dispatch
  • `FusedConvTestTanh` — same shape, exercises tanh op-name dispatch
  • `FusedConvTestLeakyRelu` — single param (alpha); the YOLOv3 case
  • `FusedConvTestClip` — two params (min, max)
  • `FusedConvTestHardSigmoid` — two params (alpha, beta); depends on the HardSigmoid CoreML builder landed in #28182 ([CoreML EP] Add HardSigmoid support)

Each verifies CoreML output against the CPU EP reference and asserts `ExpectedEPNodeAssignment::All`. All pass locally on macOS 26.3 / M3 Max.
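
A rough sketch of the pattern each test follows (gtest-style; `MakeFusedConvModel` is the PR's helper, but the signature and the run/verify call shown here are hypothetical, not copied from the real file):

```cpp
#include <gtest/gtest.h>

// Hypothetical sketch of the per-activation test pattern. The real test
// bodies live in coreml_basic_test.cc; helper signatures here are
// illustrative only.
TEST(CoreMLExecutionProviderTest, FusedConvTestLeakyRelu) {
  // Build a model containing a single com.microsoft:FusedConv node with
  // activation = "LeakyRelu" and activation_params = {alpha}.
  auto model = MakeFusedConvModel(/*activation=*/"LeakyRelu",
                                  /*activation_params=*/{0.1f});

  // Run on the CoreML EP, compare outputs against a CPU EP reference run,
  // and assert that every node was claimed by CoreML.
  RunAndVerifyOutputsWithEP(std::move(model), MakeCoreMLExecutionProvider(),
                            ExpectedEPNodeAssignment::All);
}
```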

Also adds the supported-ops doc entry.

maxwbuckley and others added 2 commits April 30, 2026 10:42
Adds support for `com.microsoft:FusedConv` to the CoreML EP's MLProgram
and NeuralNetwork paths. FusedConv is produced by ORT's
`ConvActivationFusion` pass when a model is optimized with the CPU EP
(or any EP in `cpu_acl_js_webgpu_eps`) and saved via
`session.optimized_model_filepath` or the ORT-format conversion tool.
That saved graph contains `com.microsoft:FusedConv` nodes that — before
this patch — the CoreML EP could not claim, fragmenting the partition.

ORT's in-process pipeline does not currently run `ConvActivationFusion`
when CoreML EP is the target (the fusion's compat set excludes CoreML),
so FusedConv typically reaches the CoreML EP only via pre-optimized
graphs. That's a real and common workflow: anyone shipping a
pre-optimized model artifact (mobile pipelines, ORT-format models,
session-cached optimized graphs) that's then loaded with the CoreML EP
hits this path.

## Empirical impact (M3 Max, MLProgram, batch 1)

ResNet50-v2 from the ONNX model zoo, CPU-optimized at ORT_ENABLE_EXTENDED
and reloaded on CoreML EP (108 nodes total, 33 of them FusedConv with
Relu activation):

|                              | Partitions | Nodes on CoreML | Mean      | StdDev | P99      | Max     |
|------------------------------|------------|-----------------|-----------|--------|----------|---------|
| Without this patch           | 18         | 75 / 108        | 23.34 ms  | 1.01   | 27.68    | 30.59   |
| With this patch              |  1         | 108 / 108       |  2.94 ms  | 0.16   |  3.75    |  4.32   |

7.94× mean speedup; the 33 FusedConv nodes that previously fell back to
CPU now stay on the ANE/GPU. Variance also tightens 6× (stddev
1.01 → 0.16). 597 timed iterations per variant, 3 interleaved rounds.

Partition counts on other Conv-heavy ONNX-zoo models with FusedConv
content (post CPU optimization):

| Model              | Without | With | Notes                                  |
|--------------------|---------|------|----------------------------------------|
| ResNet50-v2        |    18   |   1  | 33 FusedConv (Relu)                    |
| FCN-ResNet50       |    18   |   1  | 35 FusedConv (Relu); fails to compile  |
|                    |         |      |   on CoreML for unrelated reasons      |
| YOLOv3 (full)      |    27   |   4  | 72 FusedConv (LeakyRelu); detection    |
|                    |         |      |   post-proc fails on CoreML for        |
|                    |         |      |   unrelated dynamic-shape reasons      |
| YOLOv3-tiny        |    13   |   7  | 11 FusedConv (LeakyRelu); same         |

The partition-count reduction is robust across architectures; ResNet50
is the configuration that runs end-to-end on this exact ONNX-zoo model
collection.

## Implementation

Reuses `ConvOpBuilder`, which now branches on `op_type`:

  - `Conv`: behavior unchanged.
  - `FusedConv`: emit the `conv` MIL op into an intermediate, then chain
    the activation MIL op on top. Supports all six activation types
    `ConvActivationFusion` produces:
      Relu        -> relu
      Sigmoid     -> sigmoid
      Tanh        -> tanh
      LeakyRelu   -> leaky_relu      (alpha from activation_params)
      Clip        -> clip            (min/max from activation_params)
      HardSigmoid -> sigmoid_hard    (alpha/beta from activation_params)

`IsOpSupportedImpl` rejects FusedConv in NeuralNetwork mode (which would
emit an unfused Conv and lose the activation) and rejects any
unrecognized activation string.

## Tests

Six new tests in `coreml_basic_test.cc`, one per supported activation
class (param-less, single-param, two-param-positional, two-param-named):

  FusedConvTestRelu          — no `activation_params` attribute
  FusedConvTestSigmoid       — same shape, exercises sigmoid op-name dispatch
  FusedConvTestTanh          — same shape, exercises tanh op-name dispatch
  FusedConvTestLeakyRelu     — single param (alpha)
  FusedConvTestClip          — two params (min, max)
  FusedConvTestHardSigmoid   — two params (alpha, beta)

Each verifies CoreML output against the CPU EP reference and asserts
`ExpectedEPNodeAssignment::All`. All pass locally on macOS 26.3 / M3 Max.

Also adds the supported-ops doc entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The activation list and the function name already convey what's
allowed; the cross-reference to a specific line range in
conv_activation_fusion.cc would rot the moment that file gets touched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@maxwbuckley (Contributor, Author) commented:

@yuslepukhin this is also a lovely one :) ResNet50-v2 goes Brrrr......!

@yuslepukhin requested a review from Copilot May 6, 2026 21:00

Copilot AI (Contributor) left a comment

Pull request overview

Note: Copilot was unable to run its full agentic suite in this review.

Adds CoreML EP support for com.microsoft:FusedConv so pre-optimized ORT graphs (via ConvActivationFusion) can be fully claimed/compiled on CoreML MLProgram, reducing partition fragmentation and improving performance.

Changes:

  • Register com.microsoft:FusedConv with the CoreML op builder factory.
  • Extend ConvOpBuilder to emit conv + fused activation MIL ops for MLProgram.
  • Add CoreML tests covering six fused activation variants and document the newly supported op.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
|------|-------------|
| tools/ci_build/github/apple/coreml_supported_mlprogram_ops.md | Documents com.microsoft:FusedConv as supported in MLProgram. |
| onnxruntime/test/providers/coreml/coreml_basic_test.cc | Adds single-node FusedConv model generator and 6 activation-focused MLProgram tests. |
| onnxruntime/core/providers/coreml/builders/op_builder_factory.cc | Registers FusedConv to use the Conv builder implementation. |
| onnxruntime/core/providers/coreml/builders/impl/conv_op_builder.cc | Implements MLProgram lowering for FusedConv as conv + activation and adds support checks. |


@yuslepukhin (Member) left a comment

The happy-path implementation does match the PR description for pure Conv + activation MLProgram nodes, but it does not safely support the full FusedConv surface that the registration now exposes. Test coverage also misses both problems: the new CoreML tests only build X/W models with supported activations in coreml_basic_test.cc:1168, so there is no B/Z case and no negative coverage for malformed attributes or rejection paths.

maxwbuckley and others added 6 commits May 7, 2026 09:54
Resolves conflict in onnxruntime/test/providers/coreml/coreml_basic_test.cc
where this branch's FusedConv test helpers + 6 tests landed in the same
file region as the Split11/13/7 tests merged via microsoft#28270. Both sets are
preserved sequentially.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the duplicated activation lists in IsSupportedFusedConvActivation
and the if/else MIL-op chain in AddToModelBuilderImpl with a single
constexpr table mapping each ONNX activation name to its MIL op, expected
activation_params arity, and MIL input port names. Both the support gate
and the dispatch path now consult that table.

Also tightens IsOpSupportedImpl to reject FusedConv nodes whose
activation_params arity does not match what the activation expects (0 for
Relu/Sigmoid/Tanh, 1 for LeakyRelu, 2 for Clip/HardSigmoid). The CPU EP
already rejects mismatches in fused_activation.cc; CoreML now matches that
behaviour instead of silently inventing defaults.

Addresses review feedback from yuslepukhin and copilot-pull-request-reviewer
on microsoft#28289.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds two early rejections in IsOpSupportedImpl that the previous
implementation was silently letting through:

1. The optional 4th input 'Z' (residual sum) — FusedConv with Z is
   Y = activation(Conv(X,W,B) + Z), but the MLProgram lowering only emits
   conv + activation and never reads input[3]. Without this guard a
   pre-optimized Conv+Add+Act graph would be fully assigned to CoreML and
   produce the wrong result by dropping the residual add. Reported by
   yuslepukhin on microsoft#28289.

2. Non-float element types — FusedConv schema's `T` permits double, but
   the activation-param lambda only handles FLOAT and FLOAT16. CoreML does
   not support double anyway; reject double explicitly so the fallback to
   CPU is what actually runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ting

Adds two ExpectedEPNodeAssignment::None tests covering the support-gating
paths added in the previous commit:

- FusedConvNeuralNetworkNotSupported — FusedConv on the NeuralNetwork EP
  is rejected so the node falls back to CPU rather than emit an unfused
  Conv that silently drops the activation.

- FusedConvWithZInputNotSupported — FusedConv with the optional residual
  Z input is rejected to prevent the silent drop of Conv+Add+Act
  semantics that yuslepukhin flagged on microsoft#28289.

The unsupported-activation and wrong-arity rejections are also live but
not testable end-to-end: the CPU FusedConv kernel rejects those same
malformed graphs at kernel construction, so TestModelLoad's Initialize
fails before partition assignment can be observed.

MakeFusedConvModel grows an `add_z` knob to wire the optional 4th input.
A small RunFusedConvNegativeTest helper packages the
serialize-then-TestModelLoad pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous comment said FusedConv "reuses the existing ConvOpBuilder",
which Copilot flagged as misleading because CreateConvOpBuilder registers
a new instance under the FusedConv op type rather than literally reusing
the Conv-registered instance. Reword to "handled by the same ConvOpBuilder
class" so it's clear the reuse is at the class/dispatch level, not the
instance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a TODO above the FusedConv Z-input rejection pointing at the
straightforward MIL lowering (`add(conv_out, Z)` between conv and
activation) and noting which optimizer pass produces the Z form
(ConvAddActivationFusion at TransformerLevel::Level3, gated to cpu_ep).
This way the next person looking at residual-block coverage on CoreML
finds the implementation hint without re-discovering the schema and
optimizer pass independently.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@maxwbuckley (Contributor, Author) commented:

@yuslepukhin friendly ping — the latest stack (bb7f4b10f85140) addresses both of your inline comments (Z residual input rejection, activation_params arity validation) plus the Copilot suggestions (single activation table, FLOAT16/DOUBLE handling, factory comment), with two new negative tests for the gating paths and an inline TODO for proper Z lowering as a follow-up. Re-benchmarked ResNet50-v2 post-refactor and the 8× speedup is intact. CI looks like it needs a maintainer trigger for the full matrix on the new commits. Ready for another look when you have a moment 🙏

yuslepukhin enabled auto-merge (squash) May 7, 2026 18:45
@yuslepukhin (Member) commented:

Merge from main, pls.

@maxwbuckley (Contributor, Author) commented:

Thanks a lot Dmitri. We are making Mac computer vision go Brrrr.....! 🔥

yuslepukhin merged commit aa92574 into microsoft:main May 9, 2026
89 of 90 checks passed
