[CoreML EP] Add FusedConv support #28289
Conversation
Adds support for `com.microsoft:FusedConv` to the CoreML EP's MLProgram
and NeuralNetwork paths. FusedConv is produced by ORT's
`ConvActivationFusion` pass when a model is optimized with the CPU EP
(or any EP in `cpu_acl_js_webgpu_eps`) and saved via
`session.optimized_model_filepath` or the ORT-format conversion tool.
That saved graph contains `com.microsoft:FusedConv` nodes that — before
this patch — the CoreML EP could not claim, fragmenting the partition.
ORT's in-process pipeline does not currently run `ConvActivationFusion`
when CoreML EP is the target (the fusion's compat set excludes CoreML),
so FusedConv typically reaches the CoreML EP only via pre-optimized
graphs. That's a real and common workflow: anyone shipping a
pre-optimized model artifact (mobile pipelines, ORT-format models,
session-cached optimized graphs) that's then loaded with the CoreML EP
hits this path. There's no pre-existing issue tracking this; it was
discovered via DWPose / ResNet50 partitioning analysis on Apple Silicon.
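A minimal sketch of that workflow with the C++ API (paths and option values are illustrative; on older ORT builds the CoreML EP is appended via the C function `OrtSessionOptionsAppendExecutionProvider_CoreML` rather than the provider-options form shown here):

```cpp
// Sketch only: produce a CPU-optimized model containing FusedConv nodes, then
// reload it under the CoreML EP. Paths and option values are illustrative.
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env;

  // Step 1: optimize with the CPU EP at ORT_ENABLE_EXTENDED and save the result.
  // ConvActivationFusion rewrites Conv+Activation into com.microsoft:FusedConv.
  Ort::SessionOptions cpu_opts;
  cpu_opts.SetGraphOptimizationLevel(ORT_ENABLE_EXTENDED);
  cpu_opts.SetOptimizedModelFilePath("resnet50_cpu_opt.onnx");
  Ort::Session optimize_only(env, "resnet50.onnx", cpu_opts);

  // Step 2: reload the saved graph with the CoreML EP. Before this patch the
  // FusedConv nodes could not be claimed and fell back to CPU.
  Ort::SessionOptions coreml_opts;
  coreml_opts.AppendExecutionProvider("CoreML", {{"ModelFormat", "MLProgram"}});
  Ort::Session session(env, "resnet50_cpu_opt.onnx", coreml_opts);
  return 0;
}
```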
## Empirical impact (M3 Max, MLProgram, batch 1)
ResNet50-v2 from the ONNX model zoo, CPU-optimized at ORT_ENABLE_EXTENDED
and reloaded on CoreML EP (108 nodes total, 33 of them FusedConv with
Relu activation):
| | Partitions | Nodes on CoreML | Mean (ms) | StdDev (ms) | P99 (ms) | Max (ms) |
|--------------------|------------|-----------------|-----------|-------------|----------|----------|
| Without this patch | 18 | 75 / 108 | 23.34 | 1.01 | 27.68 | 30.59 |
| With this patch | 1 | 108 / 108 | 2.94 | 0.16 | 3.75 | 4.32 |
7.94× mean speedup; the 33 FusedConv nodes that previously fell back to
CPU now stay on the ANE/GPU. Run-to-run spread also tightens ~6× (stddev
1.01 → 0.16 ms). 597 timed iterations per variant, 3 interleaved rounds.
Partition counts on other Conv-heavy ONNX-zoo models with FusedConv
content (post CPU optimization):
| Model | Without | With | Notes |
|---------------|---------|------|-----------------------------------------------------------------------------------------------------|
| ResNet50-v2 | 18 | 1 | 33 FusedConv (Relu) |
| FCN-ResNet50 | 18 | 1 | 35 FusedConv (Relu); fails to compile on CoreML for unrelated reasons |
| YOLOv3 (full) | 27 | 4 | 72 FusedConv (LeakyRelu); detection post-proc fails on CoreML for unrelated dynamic-shape reasons |
| YOLOv3-tiny | 13 | 7 | 11 FusedConv (LeakyRelu); same |
The partition-count reduction is robust across architectures. ResNet50
is the configuration that runs end-to-end on this exact ONNX-zoo
collection today; the FCN/YOLO failures are orthogonal CoreML-EP
limitations in segmentation upsampling and detection post-processing.
## Implementation
FusedConv is handled by the same `ConvOpBuilder` class, which now branches on `op_type`:
- `Conv`: behavior unchanged.
- `FusedConv`: emit the `conv` MIL op into an intermediate, then chain
the activation MIL op on top. Supports all six activation types
`ConvActivationFusion` produces:
| ONNX activation | MIL op | Parameters |
|-----------------|----------------|----------------------------------------|
| Relu | `relu` | none |
| Sigmoid | `sigmoid` | none |
| Tanh | `tanh` | none |
| LeakyRelu | `leaky_relu` | alpha from `activation_params` |
| Clip | `clip` | min/max from `activation_params` |
| HardSigmoid | `sigmoid_hard` | alpha/beta from `activation_params` |
`IsOpSupportedImpl` rejects FusedConv in NeuralNetwork mode (which would
emit an unfused Conv and lose the activation) and rejects any
unrecognized activation string.
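A condensed, self-contained sketch of that table-driven approach (a later commit in this PR consolidates the support gate and the MIL-op dispatch into exactly such a table; identifiers here are approximate, not the PR's actual code):

```cpp
// Sketch only: approximates the constexpr activation table this PR converges
// on. Identifier names are illustrative, not the PR's exact code.
#include <array>
#include <string_view>

struct ActivationInfo {
  std::string_view onnx_name;   // value of FusedConv's "activation" attribute
  std::string_view mil_op;      // MIL op chained after "conv"
  size_t param_count;           // expected activation_params arity
  std::array<std::string_view, 2> param_ports;  // MIL input port names
};

inline constexpr std::array<ActivationInfo, 6> kActivations{{
    {"Relu", "relu", 0, {}},
    {"Sigmoid", "sigmoid", 0, {}},
    {"Tanh", "tanh", 0, {}},
    {"LeakyRelu", "leaky_relu", 1, {"alpha"}},
    {"Clip", "clip", 2, {"alpha", "beta"}},  // MIL clip: alpha = min, beta = max
    {"HardSigmoid", "sigmoid_hard", 2, {"alpha", "beta"}},
}};

// Consulted by both the support gate (unknown name or wrong arity => reject,
// fall back to CPU) and the MLProgram lowering (which MIL op to chain on conv).
constexpr const ActivationInfo* FindActivation(std::string_view name) {
  for (const auto& a : kActivations) {
    if (a.onnx_name == name) return &a;
  }
  return nullptr;
}
```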
## Tests
Six new tests in `coreml_basic_test.cc`, one per supported activation,
covering every parameter class (param-less, single-param, two-param
positional, two-param named):
- `FusedConvTestRelu` — no `activation_params` attribute
- `FusedConvTestSigmoid` — same shape, exercises sigmoid op-name dispatch
- `FusedConvTestTanh` — same shape, exercises tanh op-name dispatch
- `FusedConvTestLeakyRelu` — single param (alpha)
- `FusedConvTestClip` — two params (min, max)
- `FusedConvTestHardSigmoid` — two params (alpha, beta)
Each verifies CoreML output against the CPU EP reference and asserts
`ExpectedEPNodeAssignment::All`. All pass locally on macOS 26.3 / M3 Max.
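For reference, a rough sketch of the shape of one positive test, assuming the `MakeFusedConvModel` helper this PR adds (its exact signature may differ) and ORT's existing `CreateMLValue` / `RunAndVerifyOutputsWithEP` test utilities:

```cpp
// Sketch only: approximates the positive-test pattern; helper signatures and
// the CoreML EP factory flag may differ from the PR's actual code.
TEST(CoreMLExecutionProviderTest, FusedConvTestLeakyRelu) {
  // Single com.microsoft:FusedConv node: Y = LeakyRelu(Conv(X, W), alpha = 0.1).
  std::string model_data = MakeFusedConvModel("LeakyRelu", /*activation_params=*/{0.1f});

  // Deterministic input; the CPU EP output serves as the reference for CoreML.
  std::vector<int64_t> x_shape = {1, 3, 8, 8};
  std::vector<float> x_values(3 * 8 * 8);
  std::iota(x_values.begin(), x_values.end(), 0.0f);
  OrtValue x;
  CreateMLValue<float>(std::make_shared<CPUAllocator>(), x_shape, x_values, &x);

  EPVerificationParams params;
  params.ep_node_assignment = ExpectedEPNodeAssignment::All;  // whole graph on CoreML
  RunAndVerifyOutputsWithEP(AsByteSpan(model_data.data(), model_data.size()),
                            "FusedConvTestLeakyRelu",
                            MakeCoreMLExecutionProvider(COREML_FLAG_CREATE_MLPROGRAM),
                            {{"X", x}}, params);
}
```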
Also adds the supported-ops doc entry.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The activation list and the function name already convey what's allowed; the cross-reference to a specific line range in conv_activation_fusion.cc would rot the moment that file gets touched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yuslepukhin this is also a lovely one :) ResNet50-v2 goes Brrrr......!
Pull request overview
Note: Copilot was unable to run its full agentic suite in this review.
Adds CoreML EP support for `com.microsoft:FusedConv` so pre-optimized ORT graphs (via `ConvActivationFusion`) can be fully claimed/compiled on CoreML MLProgram, reducing partition fragmentation and improving performance.
Changes:
- Register `com.microsoft:FusedConv` with the CoreML op builder factory.
- Extend `ConvOpBuilder` to emit `conv` + fused activation MIL ops for MLProgram.
- Add CoreML tests covering six fused activation variants and document the newly supported op.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tools/ci_build/github/apple/coreml_supported_mlprogram_ops.md | Documents com.microsoft:FusedConv as supported in MLProgram. |
| onnxruntime/test/providers/coreml/coreml_basic_test.cc | Adds single-node FusedConv model generator and 6 activation-focused MLProgram tests. |
| onnxruntime/core/providers/coreml/builders/op_builder_factory.cc | Registers FusedConv to use the Conv builder implementation. |
| onnxruntime/core/providers/coreml/builders/impl/conv_op_builder.cc | Implements MLProgram lowering for FusedConv as conv + activation and adds support checks. |
yuslepukhin left a comment:
The happy-path implementation does match the PR description for pure Conv + activation MLProgram nodes, but it does not safely support the full FusedConv surface that the registration now exposes. Test coverage also misses both problems: the new CoreML tests only build X/W models with supported activations in coreml_basic_test.cc:1168, so there is no B/Z case and no negative coverage for malformed attributes or rejection paths.
Resolves conflict in onnxruntime/test/providers/coreml/coreml_basic_test.cc where this branch's FusedConv test helpers + 6 tests landed in the same file region as the Split11/13/7 tests merged via microsoft#28270. Both sets are preserved sequentially. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the duplicated activation lists in IsSupportedFusedConvActivation and the if/else MIL-op chain in AddToModelBuilderImpl with a single constexpr table mapping each ONNX activation name to its MIL op, expected activation_params arity, and MIL input port names. Both the support gate and the dispatch path now consult that table. Also tightens IsOpSupportedImpl to reject FusedConv nodes whose activation_params arity does not match what the activation expects (0 for Relu/Sigmoid/Tanh, 1 for LeakyRelu, 2 for Clip/HardSigmoid). The CPU EP already rejects mismatches in fused_activation.cc; CoreML now matches that behaviour instead of silently inventing defaults. Addresses review feedback from yuslepukhin and copilot-pull-request-reviewer on microsoft#28289. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds two early rejections in IsOpSupportedImpl that the previous implementation was silently letting through: 1. The optional 4th input 'Z' (residual sum) — FusedConv with Z is Y = activation(Conv(X,W,B) + Z), but the MLProgram lowering only emits conv + activation and never reads input[3]. Without this guard a pre-optimized Conv+Add+Act graph would be fully assigned to CoreML and produce the wrong result by dropping the residual add. Reported by yuslepukhin on microsoft#28289. 2. Non-float element types — FusedConv schema's `T` permits double, but the activation-param lambda only handles FLOAT and FLOAT16. CoreML does not support double anyway; reject double explicitly so the fallback to CPU is what actually runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ting Adds two ExpectedEPNodeAssignment::None tests covering the support-gating paths added in the previous commit: - FusedConvNeuralNetworkNotSupported — FusedConv on the NeuralNetwork EP is rejected so the node falls back to CPU rather than emit an unfused Conv that silently drops the activation. - FusedConvWithZInputNotSupported — FusedConv with the optional residual Z input is rejected to prevent the silent drop of Conv+Add+Act semantics that yuslepukhin flagged on microsoft#28289. The unsupported-activation and wrong-arity rejections are also live but not testable end-to-end: the CPU FusedConv kernel rejects those same malformed graphs at kernel construction, so TestModelLoad's Initialize fails before partition assignment can be observed. MakeFusedConvModel grows an `add_z` knob to wire the optional 4th input. A small RunFusedConvNegativeTest helper packages the serialize-then-TestModelLoad pattern. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
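A condensed sketch of the plausible shape of that helper (names taken from the commit message above; assumes ORT's existing `TestModelLoad` utility, and the exact code may differ):

```cpp
// Sketch only: the serialize-then-TestModelLoad pattern the commit describes.
// MakeFusedConvModel's signature and the flag spelling are approximate.
static void RunFusedConvNegativeTest(const std::string& activation,
                                     const std::vector<float>& activation_params,
                                     bool add_z, uint32_t coreml_flags) {
  // Build and serialize the single-node FusedConv model, then assert the CoreML
  // EP claims none of its nodes, i.e. the whole graph falls back to CPU.
  std::string model_data = MakeFusedConvModel(activation, activation_params, add_z);
  TestModelLoad(AsByteSpan(model_data.data(), model_data.size()),
                MakeCoreMLExecutionProvider(coreml_flags),
                ExpectedEPNodeAssignment::None);
}
```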
The previous comment said FusedConv "reuses the existing ConvOpBuilder", which Copilot flagged as misleading because CreateConvOpBuilder registers a new instance under the FusedConv op type rather than literally reusing the Conv-registered instance. Reword to "handled by the same ConvOpBuilder class" so it's clear the reuse is at the class/dispatch level, not the instance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a TODO above the FusedConv Z-input rejection pointing at the straightforward MIL lowering (`add(conv_out, Z)` between conv and activation) and noting which optimizer pass produces the Z form (ConvAddActivationFusion at TransformerLevel::Level3, gated to cpu_ep). This way the next person looking at residual-block coverage on CoreML finds the implementation hint without re-discovering the schema and optimizer pass independently. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
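To make the dropped-residual hazard concrete: with Z, FusedConv computes Y = activation(Conv(X, W, B) + Z). A tiny self-contained reference of the post-conv epilogue (hypothetical helper, illustration only, with Relu standing in for the activation):

```cpp
// Illustration only: the residual Z is added to the convolution result BEFORE
// the activation. A lowering that emits just conv + activation silently drops
// the "+ z" term, which is why IsOpSupportedImpl now rejects the Z form.
#include <algorithm>
#include <cstddef>
#include <vector>

std::vector<float> FusedConvEpilogue(const std::vector<float>& conv_out,
                                     const std::vector<float>& z) {
  std::vector<float> y(conv_out.size());
  for (std::size_t i = 0; i < conv_out.size(); ++i) {
    const float v = conv_out[i] + z[i];  // residual add (what the TODO's MIL "add" would emit)
    y[i] = std::max(v, 0.0f);            // activation (Relu here)
  }
  return y;
}
```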
@yuslepukhin friendly ping — the latest stack (
Merge from main, pls.
Thanks a lot Dmitri. We are making Mac computer vision go Brrrr.....! 🔥