onnx-coreml backend #2319

Merged: borg323 merged 12 commits into LeelaChessZero:master from borg323:onnx-coreml on Mar 3, 2026

Conversation

@borg323 (Member) commented Oct 16, 2025

Currently the onnxruntime CoreML provider doesn't support everything required; the following three patches are needed for both fp32 and fp16 with a fixed batch size (the default for now).
microsoft/onnxruntime#26443 (merged)
microsoft/onnxruntime#26442 (merged)
microsoft/onnxruntime#26462 (merged)

For variable batch size, hopefully the fix for issue microsoft/onnxruntime#26328 is simple.

If someone wants to try it out, the default onnxruntime branch should work. The last outstanding patch is for Gather fp16 support; since Gather is the last kernel before the policy output, running it on the CPU shouldn't cause a huge performance drop.
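For reference, a minimal sketch of how a CoreML execution provider with these options can be appended through the ONNX Runtime C++ API (the two option names are the ones set in this PR; the session setup around them is illustrative, not the exact lc0 code):

```cpp
// Sketch only: appending the CoreML execution provider to an ONNX Runtime
// session with the options used by this PR. Not the exact lc0 code.
#include <string>
#include <unordered_map>
#include "onnxruntime_cxx_api.h"

Ort::Session MakeCoreMlSession(Ort::Env& env, const char* model_path) {
  Ort::SessionOptions options;
  std::unordered_map<std::string, std::string> provider_options;
  provider_options["ModelFormat"] = "MLProgram";  // use the ML Program format
  provider_options["ProfileComputePlan"] = "1";   // log per-op device placement
  options.AppendExecutionProvider("CoreML", provider_options);
  return Ort::Session(env, model_path, options);
}
```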

@borg323 requested a review from Copilot on October 16, 2025.
Copilot AI left a comment

Pull Request Overview

This PR adds support for the CoreML execution provider to the ONNX backend, enabling hardware acceleration on Apple Silicon devices. The changes register a new "onnx-coreml" backend option and configure it to use the MLProgram model format with compute plan profiling.

  • Adds COREML as a new OnnxProvider enum value
  • Implements CoreML provider configuration with MLProgram format and profiling enabled
  • Registers the "onnx-coreml" backend with priority 59
  • Updates CI pipeline to build with ONNX runtime and test the CoreML backend on macOS ARM64

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
|---|---|
| src/neural/backends/network_onnx.cc | Adds CoreML provider enum, configuration logic, and backend registration |
| .circleci/config.yml | Adds ONNX Runtime installation, build configuration, and CoreML backend testing on macOS ARM64 |


@borg323 force-pushed the onnx-coreml branch 8 times, most recently from 091e474 to 0f71b12 on October 17, 2025.
@borg323 (Member, Author) commented Nov 2, 2025

Some preliminary tests using lc0 bench with network 791556 on an Apple M3 Pro.

fp32:

```
Total time (ms) : 5217
Nodes searched  : 13762
Nodes/second    : 2637
```

fp16:

```
Total time (ms) : 5203
Nodes searched  : 20807
Nodes/second    : 3998
```

fp16 with PR26442 applied:

```
Total time (ms) : 5179
Nodes searched  : 26833
Nodes/second    : 5180
```

@john-sp (Member) commented Nov 20, 2025

Just going to add: with the current Metal backend I am getting only 62 NPS on an M4 Pro, but it has a warmup period from MPSGraph. After a 15 second warmup, it was getting ~6k NPS on the start position.

@borg323 marked this pull request as ready for review on January 28, 2026.
@ChinChangYang (Contributor) commented

CoreML vs Metal Backend Performance Report

Date: 2026-02-07
Device: Apple M3 Max
Build: lc0 v0.33.0-dev (commit 3a38df5, branch fix/pr-2319-coreml)
ONNX Runtime: v1.25.0 (RelWithDebInfo)
CoreML API: ios18 ops
Benchmark tool: backendbench --max-batch-size=32

Summary

| Network | Type | onnx-coreml (batch=32) | metal (batch=32) | CoreML Advantage |
|---|---|---|---|---|
| 11248 | Classical (20 blocks) | 4,263 nps | 3,501 nps | +22% |
| 744204 | SE (15 blocks) | 15,948 nps | 10,235 nps | +56% |
| BT3 (768x15x24h) | Transformer | 905 nps | 809 nps | +12% |
| BT4 (1024x15x32h) | Transformer | 203 nps | 487 nps | -58% (Metal wins) |

CoreML Model Compilation Times

CoreML compiles a separate model for each batch size (32 models total). This is a one-time cost per session (see the sketch below the table).

| Network | Compilation Time | Total Time (compile + bench) | Peak Memory |
|---|---|---|---|
| 11248 (Classical) | ~37s | ~62s | 1.0 GB |
| BT3 (768x15x24h) | ~15 min | ~19 min | 9.1 GB |
| BT4 (1024x15x32h) | ~33 min | ~38 min | N/A |

Metal has no compilation overhead.
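To make the cost model concrete, here is an illustrative sketch (taking the report's "one compiled model per batch size" at face value) of a lazy per-batch-size cache, where each `Compile` call stands in for the one-time CoreML compilation; the names are hypothetical, not taken from lc0 or onnxruntime:

```cpp
// Illustrative sketch of a lazy per-batch-size model cache: each batch size
// pays a one-time compile cost, after which inference reuses the cached model.
#include <map>

struct CompiledModel {
  int batch_size = 0;  // stand-in for a compiled CoreML program
};

class PerBatchModelCache {
 public:
  CompiledModel& Get(int batch_size) {
    auto it = cache_.find(batch_size);
    if (it == cache_.end()) {
      // The expensive one-time step; per the report, ~37s total for the
      // Classical net and up to ~33 min for BT4 across all 32 batch sizes.
      it = cache_.emplace(batch_size, Compile(batch_size)).first;
    }
    return it->second;
  }

 private:
  static CompiledModel Compile(int batch_size) { return {batch_size}; }
  std::map<int, CompiledModel> cache_;
};
```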

Detailed Results

Classical Network (11248.pb.gz) - 20 residual blocks, 256 filters

| Batch | onnx-coreml NPS | metal NPS | CoreML Advantage |
|---|---|---|---|
| 1 | 241 | 356 | -32% |
| 4 | 959 | 1,268 | -24% |
| 8 | 1,918 | 2,053 | -7% |
| 16 | 3,876 | 2,702 | +43% |
| 32 | 4,263 | 3,501 | +22% |

Notes:

  • Metal is faster at small batch sizes (1-8)
  • CoreML takes the lead starting around batch 12-16
  • CoreML time/batch: 4.2ms (batch 1-16), 7.5ms (batch 17-32) - shows batch bucketing (sanity-checked below)
  • Metal time/batch: 2.8ms (batch 1) to 9.1ms (batch 32) - gradual scaling
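As a sanity check on the per-batch latencies above, throughput follows directly from batch size over time per batch:

$$\text{nps} \approx \frac{\text{batch}}{t_{\text{batch}}}, \qquad \frac{16}{4.2\ \text{ms}} \approx 3810, \qquad \frac{32}{7.5\ \text{ms}} \approx 4270,$$

in good agreement with the measured 3,876 and 4,263 nps for CoreML.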

SE Network (744204.pb.gz) - 15 SE residual blocks, 192 filters

| Batch | onnx-coreml NPS | metal NPS | CoreML Advantage |
|---|---|---|---|
| 1 | 706 | 461 | +53% |
| 4 | 2,804 | 1,878 | +49% |
| 8 | 5,556 | 4,209 | +32% |
| 16 | 11,087 | 6,687 | +66% |
| 32 | 15,948 | 10,235 | +56% |

Notes:

  • CoreML is consistently faster across all batch sizes
  • Largest advantage at batch 16 (+66%)
  • CoreML time/batch: 1.4ms (batch 1-16), 2.0ms (batch 17-32)
  • Metal time/batch: 2.2ms (batch 1) to 3.1ms (batch 32)

BT3 Transformer (768x15x24h-swa-2790000.pb.gz) - 768 embedding, 15 blocks, 24 heads

| Batch | onnx-coreml NPS | metal NPS | CoreML Advantage |
|---|---|---|---|
| 1 | 49 | 190 | -74% |
| 4 | 195 | 478 | -59% |
| 8 | 388 | 629 | -38% |
| 16 | 776 | 768 | +1% |
| 32 | 905 | 809 | +12% |

Notes:

  • Metal dominates at small batch sizes
  • Crossover point around batch 16
  • CoreML has very high per-inference latency (~20ms batch 1-16, ~35ms batch 17-32)
  • Metal: 5.3ms (batch 1) to 39.5ms (batch 32)
  • CoreML uses mixed GPU + Neural Engine execution for transformers

BT4 Transformer (1024x15x32h-swa-6147500.pb.gz) - 1024 embedding, 15 blocks, 32 heads

| Batch | onnx-coreml NPS | metal NPS | CoreML Advantage |
|---|---|---|---|
| 1 | 19 | 155 | -88% |
| 4 | 75 | 335 | -78% |
| 8 | 150 | 405 | -63% |
| 16 | 307 | 480 | -36% |
| 32 | 203 | 487 | -58% |

Notes:

  • Metal is significantly faster at all batch sizes
  • CoreML performance actually degrades above batch 16 (203 nps at batch 32 vs 307 at batch 16)
  • CoreML latency: 52ms (batch 1-16), then jumps to 157ms (batch 17-32)
  • Metal latency: 6.4ms (batch 1) to 65.7ms (batch 32) - reasonable scaling
  • The large model appears to exceed CoreML's efficient execution threshold

Analysis

Where CoreML Excels

  • Small/medium convolutional networks (SE, Classical at higher batch sizes)
  • Batch 16+ for most network types
  • SE networks show the largest CoreML advantage (up to +66%)

Where Metal Excels

  • Large transformer models (BT3 at low batch, BT4 at all batch sizes)
  • Low-latency single-position evaluation (batch=1)
  • No compilation overhead - immediate startup

CoreML Batch Bucketing

CoreML appears to compile models for batch ranges rather than individual sizes:

  • Batch 1-16: one compiled model (~4ms for Classical, ~1.4ms for SE)
  • Batch 17-32: another compiled model (higher latency per batch)

This creates a latency step at batch 17.

Hardware Utilization

CoreML profiling shows it distributes operations across:

  • Apple GPU (MLGPUComputeDevice) - matmul, layer_norm, conv
  • Neural Engine (MLNeuralEngineComputeDevice) - linear, transpose, reshape
  • CPU (MLCPUComputeDevice) - gather operations

For BT4, the mixed GPU/Neural Engine execution with frequent data transfers appears to create bottlenecks that pure Metal GPU execution avoids.

Recommendation

  • For SE and Classical networks: use CoreML (significant speedup, especially at batch 16+)
  • For BT4 transformers: use Metal (2-8x faster, no compilation overhead)
  • For BT3 transformers: use Metal for low-latency play; CoreML only marginally faster at batch 32
  • If startup time matters: use Metal (CoreML compilation can take 15-33 minutes for transformers)

Note: Generated by Claude Code

```cpp
case OnnxProvider::COREML: {
  std::unordered_map<std::string, std::string> provider_options;
  provider_options["ModelFormat"] = "MLProgram";
  provider_options["ProfileComputePlan"] = "1";
```
A contributor commented:

Suggested change:

```cpp
provider_options["ProfileComputePlan"] = "1";
provider_options["MLComputeUnits"] = compute_units_;
```

ONNX Runtime fails to optimize performance with this alone. It would be better to let lc0 benchmark MLComputeUnits; then add compute_units_ as a member field like this: https://github.com/borg323/lc0/pull/10/changes#diff-e18d2fce5100e036be2f7e918b15b27c2453f75cbc8d4143b9ad86bcd03317f7R212
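A rough sketch of what that could look like, assuming a compute_units_ string member selected by benchmarking; the class and member names are hypothetical, and the MLComputeUnits values listed are the ones documented for the onnxruntime CoreML EP:

```cpp
// Hypothetical sketch: carrying a benchmarked MLComputeUnits choice into the
// CoreML provider options. Names are illustrative, not lc0's actual code.
#include <string>
#include <unordered_map>

class CoreMlOptions {
 public:
  // compute_units: "CPUOnly", "CPUAndNeuralEngine", "CPUAndGPU", or "ALL",
  // e.g. picked by timing a few inferences with each value at startup.
  explicit CoreMlOptions(std::string compute_units)
      : compute_units_(std::move(compute_units)) {}

  std::unordered_map<std::string, std::string> ProviderOptions() const {
    std::unordered_map<std::string, std::string> opts;
    opts["ModelFormat"] = "MLProgram";
    opts["MLComputeUnits"] = compute_units_;
    return opts;
  }

 private:
  std::string compute_units_;
};
```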

@borg323 merged commit 701bd83 into LeelaChessZero:master on Mar 3, 2026.
4 checks passed