onnx-coreml backend #2319

Merged: borg323 merged 12 commits into LeelaChessZero:master from borg323:onnx-coreml on Mar 3, 2026

Conversation

@borg323 (Member) commented Oct 16, 2025

Currently the onnxruntime CoreML provider doesn't support everything required; the following three patches are needed for both fp32 and fp16 with a fixed batch size (the default for now).
microsoft/onnxruntime#26443 (merged)
microsoft/onnxruntime#26442 (merged)
microsoft/onnxruntime#26462 (merged)

For variable batch size, hopefully the fix for issue microsoft/onnxruntime#26328 is simple.

If someone wants to try it out, the default onnxruntime branch should work. The last outstanding patch is for Gather fp16 support; since Gather is the last kernel before the policy output, running it on the CPU shouldn't cause a huge performance drop.
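For reference, a minimal sketch of how a CoreML execution provider with these options can be appended through the ONNX Runtime C++ API (the two option names are the ones set in this PR; the session setup around them is illustrative, not the exact lc0 code):

```cpp
// Sketch only: appending the CoreML execution provider to an ONNX Runtime
// session with the options used by this PR. Not the exact lc0 code.
#include <string>
#include <unordered_map>
#include "onnxruntime_cxx_api.h"

Ort::Session MakeCoreMlSession(Ort::Env& env, const char* model_path) {
  Ort::SessionOptions options;
  std::unordered_map<std::string, std::string> provider_options;
  provider_options["ModelFormat"] = "MLProgram";  // use the ML Program format
  provider_options["ProfileComputePlan"] = "1";   // log per-op device placement
  options.AppendExecutionProvider("CoreML", provider_options);
  return Ort::Session(env, model_path, options);
}
```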

@borg323 requested a review from Copilot on October 16, 2025.
Copilot AI left a comment

Pull Request Overview

This PR adds support for the CoreML execution provider to the ONNX backend, enabling hardware acceleration on Apple Silicon devices. The changes register a new "onnx-coreml" backend option and configure it to use the MLProgram model format with compute plan profiling.

  • Adds COREML as a new OnnxProvider enum value
  • Implements CoreML provider configuration with MLProgram format and profiling enabled
  • Registers the "onnx-coreml" backend with priority 59
  • Updates CI pipeline to build with ONNX runtime and test the CoreML backend on macOS ARM64

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
|---|---|
| src/neural/backends/network_onnx.cc | Adds CoreML provider enum, configuration logic, and backend registration |
| .circleci/config.yml | Adds ONNX Runtime installation, build configuration, and CoreML backend testing on macOS ARM64 |


@borg323 force-pushed the onnx-coreml branch 8 times, most recently from 091e474 to 0f71b12 on October 17, 2025.
@borg323 (Member, Author) commented Nov 2, 2025

Some preliminary tests using lc0 bench with network 791556 on an Apple M3 Pro.

fp32:

```
Total time (ms) : 5217
Nodes searched  : 13762
Nodes/second    : 2637
```

fp16:

```
Total time (ms) : 5203
Nodes searched  : 20807
Nodes/second    : 3998
```

fp16 with PR26442 applied:

```
Total time (ms) : 5179
Nodes searched  : 26833
Nodes/second    : 5180
```

@john-sp (Member) commented Nov 20, 2025

Just going to add: with the current Metal backend I am getting only 62 NPS on an M4 Pro, but it has a warmup period from MPSGraph. After a 15 second warmup, it was getting ~6k NPS on the start position.

@borg323 marked this pull request as ready for review on January 28, 2026.
@ChinChangYang (Contributor) commented

CoreML vs Metal Backend Performance Report

Date: 2026-02-07
Device: Apple M3 Max
Build: lc0 v0.33.0-dev (commit 3a38df5, branch fix/pr-2319-coreml)
ONNX Runtime: v1.25.0 (RelWithDebInfo)
CoreML API: ios18 ops
Benchmark tool: backendbench --max-batch-size=32

Summary

| Network | Type | onnx-coreml (batch=32) | metal (batch=32) | CoreML Advantage |
|---|---|---|---|---|
| 11248 | Classical (20 blocks) | 4,263 nps | 3,501 nps | +22% |
| 744204 | SE (15 blocks) | 15,948 nps | 10,235 nps | +56% |
| BT3 (768x15x24h) | Transformer | 905 nps | 809 nps | +12% |
| BT4 (1024x15x32h) | Transformer | 203 nps | 487 nps | -58% (Metal wins) |

CoreML Model Compilation Times

CoreML compiles a separate model for each batch size (32 models total). This is a one-time cost per session (see the sketch below the table).

| Network | Compilation Time | Total Time (compile + bench) | Peak Memory |
|---|---|---|---|
| 11248 (Classical) | ~37s | ~62s | 1.0 GB |
| BT3 (768x15x24h) | ~15 min | ~19 min | 9.1 GB |
| BT4 (1024x15x32h) | ~33 min | ~38 min | N/A |

Metal has no compilation overhead.
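To make the cost model concrete, here is an illustrative sketch (taking the report's "one compiled model per batch size" at face value) of a lazy per-batch-size cache, where each `Compile` call stands in for the one-time CoreML compilation; the names are hypothetical, not taken from lc0 or onnxruntime:

```cpp
// Illustrative sketch of a lazy per-batch-size model cache: each batch size
// pays a one-time compile cost, after which inference reuses the cached model.
#include <map>

struct CompiledModel {
  int batch_size = 0;  // stand-in for a compiled CoreML program
};

class PerBatchModelCache {
 public:
  CompiledModel& Get(int batch_size) {
    auto it = cache_.find(batch_size);
    if (it == cache_.end()) {
      // The expensive one-time step; per the report, ~37s total for the
      // Classical net and up to ~33 min for BT4 across all 32 batch sizes.
      it = cache_.emplace(batch_size, Compile(batch_size)).first;
    }
    return it->second;
  }

 private:
  static CompiledModel Compile(int batch_size) { return {batch_size}; }
  std::map<int, CompiledModel> cache_;
};
```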

Detailed Results

Classical Network (11248.pb.gz) - 20 residual blocks, 256 filters

| Batch | onnx-coreml NPS | metal NPS | CoreML Advantage |
|---|---|---|---|
| 1 | 241 | 356 | -32% |
| 4 | 959 | 1,268 | -24% |
| 8 | 1,918 | 2,053 | -7% |
| 16 | 3,876 | 2,702 | +43% |
| 32 | 4,263 | 3,501 | +22% |

Notes:

  • Metal is faster at small batch sizes (1-8)
  • CoreML takes the lead starting around batch 12-16
  • CoreML time/batch: 4.2ms (batch 1-16), 7.5ms (batch 17-32) - shows batch bucketing (sanity-checked below)
  • Metal time/batch: 2.8ms (batch 1) to 9.1ms (batch 32) - gradual scaling
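As a sanity check on the per-batch latencies above, throughput follows directly from batch size over time per batch:

$$\text{nps} \approx \frac{\text{batch}}{t_{\text{batch}}}, \qquad \frac{16}{4.2\ \text{ms}} \approx 3810, \qquad \frac{32}{7.5\ \text{ms}} \approx 4270,$$

in good agreement with the measured 3,876 and 4,263 nps for CoreML.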

SE Network (744204.pb.gz) - 15 SE residual blocks, 192 filters

| Batch | onnx-coreml NPS | metal NPS | CoreML Advantage |
|---|---|---|---|
| 1 | 706 | 461 | +53% |
| 4 | 2,804 | 1,878 | +49% |
| 8 | 5,556 | 4,209 | +32% |
| 16 | 11,087 | 6,687 | +66% |
| 32 | 15,948 | 10,235 | +56% |

Notes:

  • CoreML is consistently faster across all batch sizes
  • Largest advantage at batch 16 (+66%)
  • CoreML time/batch: 1.4ms (batch 1-16), 2.0ms (batch 17-32)
  • Metal time/batch: 2.2ms (batch 1) to 3.1ms (batch 32)

BT3 Transformer (768x15x24h-swa-2790000.pb.gz) - 768 embedding, 15 blocks, 24 heads

| Batch | onnx-coreml NPS | metal NPS | CoreML Advantage |
|---|---|---|---|
| 1 | 49 | 190 | -74% |
| 4 | 195 | 478 | -59% |
| 8 | 388 | 629 | -38% |
| 16 | 776 | 768 | +1% |
| 32 | 905 | 809 | +12% |

Notes:

  • Metal dominates at small batch sizes
  • Crossover point around batch 16
  • CoreML has very high per-inference latency (~20ms batch 1-16, ~35ms batch 17-32)
  • Metal: 5.3ms (batch 1) to 39.5ms (batch 32)
  • CoreML uses mixed GPU + Neural Engine execution for transformers

BT4 Transformer (1024x15x32h-swa-6147500.pb.gz) - 1024 embedding, 15 blocks, 32 heads

| Batch | onnx-coreml NPS | metal NPS | CoreML Advantage |
|---|---|---|---|
| 1 | 19 | 155 | -88% |
| 4 | 75 | 335 | -78% |
| 8 | 150 | 405 | -63% |
| 16 | 307 | 480 | -36% |
| 32 | 203 | 487 | -58% |

Notes:

  • Metal is significantly faster at all batch sizes
  • CoreML performance actually degrades above batch 16 (203 nps at batch 32 vs 307 at batch 16)
  • CoreML latency: 52ms (batch 1-16), then jumps to 157ms (batch 17-32)
  • Metal latency: 6.4ms (batch 1) to 65.7ms (batch 32) - reasonable scaling
  • The large model appears to exceed CoreML's efficient execution threshold

Analysis

Where CoreML Excels

  • Small/medium convolutional networks (SE, Classical at higher batch sizes)
  • Batch 16+ for most network types
  • SE networks show the largest CoreML advantage (up to +66%)

Where Metal Excels

  • Large transformer models (BT3 at low batch, BT4 at all batch sizes)
  • Low-latency single-position evaluation (batch=1)
  • No compilation overhead - immediate startup

CoreML Batch Bucketing

CoreML appears to compile models for batch ranges rather than individual sizes:

  • Batch 1-16: one compiled model (~4ms for Classical, ~1.4ms for SE)
  • Batch 17-32: another compiled model (higher latency per batch)

This creates a latency step at batch 17.

Hardware Utilization

CoreML profiling shows it distributes operations across:

  • Apple GPU (MLGPUComputeDevice) - matmul, layer_norm, conv
  • Neural Engine (MLNeuralEngineComputeDevice) - linear, transpose, reshape
  • CPU (MLCPUComputeDevice) - gather operations

For BT4, the mixed GPU/Neural Engine execution with frequent data transfers appears to create bottlenecks that pure Metal GPU execution avoids.

Recommendation

  • For SE and Classical networks: use CoreML (significant speedup, especially at batch 16+)
  • For BT4 transformers: use Metal (2-8x faster, no compilation overhead)
  • For BT3 transformers: use Metal for low-latency play; CoreML only marginally faster at batch 32
  • If startup time matters: use Metal (CoreML compilation can take 15-33 minutes for transformers)

Note: Generated by Claude Code

```cpp
case OnnxProvider::COREML: {
  std::unordered_map<std::string, std::string> provider_options;
  provider_options["ModelFormat"] = "MLProgram";
  provider_options["ProfileComputePlan"] = "1";
```
A contributor commented:

Suggested change:

```cpp
provider_options["ProfileComputePlan"] = "1";
provider_options["MLComputeUnits"] = compute_units_;
```

ONNX Runtime fails to optimize performance with this alone. It would be better to let lc0 benchmark MLComputeUnits; then add compute_units_ as a member field like this: https://github.com/borg323/lc0/pull/10/changes#diff-e18d2fce5100e036be2f7e918b15b27c2453f75cbc8d4143b9ad86bcd03317f7R212
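A rough sketch of what that could look like, assuming a compute_units_ string member selected by benchmarking; the class and member names are hypothetical, and the MLComputeUnits values listed are the ones documented for the onnxruntime CoreML EP:

```cpp
// Hypothetical sketch: carrying a benchmarked MLComputeUnits choice into the
// CoreML provider options. Names are illustrative, not lc0's actual code.
#include <string>
#include <unordered_map>

class CoreMlOptions {
 public:
  // compute_units: "CPUOnly", "CPUAndNeuralEngine", "CPUAndGPU", or "ALL",
  // e.g. picked by timing a few inferences with each value at startup.
  explicit CoreMlOptions(std::string compute_units)
      : compute_units_(std::move(compute_units)) {}

  std::unordered_map<std::string, std::string> ProviderOptions() const {
    std::unordered_map<std::string, std::string> opts;
    opts["ModelFormat"] = "MLProgram";
    opts["MLComputeUnits"] = compute_units_;
    return opts;
  }

 private:
  std::string compute_units_;
};
```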

@borg323 merged commit 701bd83 into LeelaChessZero:master on Mar 3, 2026.
4 checks passed