onnx-coreml backend #2319
Pull Request Overview
This PR adds support for the CoreML execution provider to the ONNX backend, enabling hardware acceleration on Apple Silicon devices. The changes register a new "onnx-coreml" backend option and configure it to use the MLProgram model format with compute plan profiling.
- Adds COREML as a new OnnxProvider enum value
- Implements CoreML provider configuration with MLProgram format and profiling enabled
- Registers the "onnx-coreml" backend with priority 59 (see the sketch after this list)
- Updates CI pipeline to build with ONNX runtime and test the CoreML backend on macOS ARM64
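For context, the registration mentioned above presumably mirrors lc0's existing onnx backends. A minimal sketch, assuming lc0's REGISTER_NETWORK macro and MakeOnnxNetwork factory template (names taken from memory of the lc0 tree, not from this diff):

```cpp
// Sketch only: register the CoreML variant alongside the other onnx
// backends, at priority 59 so backend auto-selection can rank it.
REGISTER_NETWORK("onnx-coreml", MakeOnnxNetwork<OnnxProvider::COREML>, 59)
```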
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/neural/backends/network_onnx.cc | Adds CoreML provider enum, configuration logic, and backend registration |
| .circleci/config.yml | Adds ONNX runtime installation, build configuration, and CoreML backend testing on macOS ARM64 |
Force-pushed from 091e474 to 0f71b12.
Some preliminary tests using fp32:

fp16:

fp16 with PR #26442 applied:
Just going to add: with the current metal backend, I am getting only 62 NPS on an M4 Pro, but it has a warmup period from MPSGraph. After a 15-second warmup, it was getting ~6k NPS on the start position.
CoreML vs Metal Backend Performance Report

Date: 2026-02-07

Summary
CoreML Model Compilation Times

CoreML compiles a separate model for each batch size (32 models total). This is a one-time cost per session. Metal has no compilation overhead.

Detailed Results

Classical Network (11248.pb.gz) - 20 residual blocks, 256 filters
Notes:
SE Network (744204.pb.gz) - 15 SE residual blocks, 192 filters
Notes:
BT3 Transformer (768x15x24h-swa-2790000.pb.gz) - 768 embedding, 15 blocks, 24 heads
Notes:
BT4 Transformer (1024x15x32h-swa-6147500.pb.gz) - 1024 embedding, 15 blocks, 32 heads
Notes:
Analysis

Where CoreML Excels
Where Metal Excels
CoreML Batch Bucketing

CoreML appears to compile models for batch ranges rather than individual sizes (see the sketch below):
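The measured ranges were not captured above; purely as a hypothetical illustration of what range-based bucketing looks like (the function name and the power-of-two boundaries are my assumptions, not observed CoreML behavior):

```cpp
// Hypothetical: round a requested batch size up to a power-of-two bucket so
// only one model per bucket needs compiling. Real CoreML ranges may differ.
int BucketForBatch(int batch) {
  int bucket = 1;
  while (bucket < batch) bucket <<= 1;
  return bucket;  // e.g. 3 -> 4, 17 -> 32
}
```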
Hardware Utilization

CoreML profiling shows it distributes operations across:
For BT4, the mixed GPU/Neural Engine execution with frequent data transfers appears to create bottlenecks that pure Metal GPU execution avoids.

Recommendation
Note: Generated by Claude Code
```cpp
    case OnnxProvider::COREML: {
      std::unordered_map<std::string, std::string> provider_options;
      provider_options["ModelFormat"] = "MLProgram";
      provider_options["ProfileComputePlan"] = "1";
```
| provider_options["ProfileComputePlan"] = "1"; | |
| provider_options["MLComputeUnits"] = compute_units_; |
ONNX Runtime fails to optimize performance when this is set. It would be better to let lc0 benchmark MLComputeUnits, then add compute_units_ as a member field like this: https://github.com/borg323/lc0/pull/10/changes#diff-e18d2fce5100e036be2f7e918b15b27c2453f75cbc8d4143b9ad86bcd03317f7R212
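A sketch of that suggestion, assuming "compute_units" as the backend option name and "ALL" as its default (both guesses; the valid values come from the ORT CoreML EP documentation):

```cpp
#include <string>
#include <unordered_map>

// Sketch: build the CoreML provider options from a user-selected compute
// units setting instead of hard-coding compute-plan profiling.
std::unordered_map<std::string, std::string> MakeCoreMlOptions(
    const std::string& compute_units) {
  std::unordered_map<std::string, std::string> provider_options;
  provider_options["ModelFormat"] = "MLProgram";
  // Documented values: "CPUOnly", "CPUAndGPU", "CPUAndNeuralEngine", "ALL".
  provider_options["MLComputeUnits"] = compute_units;
  return provider_options;
}
```

compute_units_ would then be read once from the backend options in the network constructor and stored as the member field the suggestion references.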
Currently the onnxruntime CoreML provider doesn't support everything required; the following three patches are needed for both fp32 and fp16 with a fixed batch size (the default for now):
- microsoft/onnxruntime#26443 (merged)
- microsoft/onnxruntime#26442 (merged)
- microsoft/onnxruntime#26462 (merged)

For variable batch size, hopefully the fix for issue microsoft/onnxruntime#26328 is simple.
If someone wants to try it out, the default onnxruntime branch should work.
The last outstanding patch is for Gather fp16 support, which is the last kernel before the policy output, so running it on the CPU shouldn't cause a huge performance drop.