Conversation

@ChinChangYang
Contributor

Summary

This PR adds a new CoreML neural network backend for macOS that leverages Apple's full compute stack—CPU, GPU, and Apple Neural Engine (ANE) simultaneously—through a hybrid architecture.

Key Features

  • Hybrid Inference Architecture: Runs CoreML on CPU+ANE and MPSGraph on GPU in parallel, with adaptive batch splitting based on throughput tracking
  • Native Model Conversion: Uses the katagocoreml C++ library (https://github.com/chinchangyang/katagocoreml-cpp) for on-the-fly conversion from KataGo's .bin.gz format to CoreML .mlpackage (no Python dependency)
  • Configurable Precision: Supports FP16 (default, faster, uses Neural Engine) and FP32 (higher precision) via the useFP16Mode config option
  • Model Caching: Converted models are cached in ~/Documents/KataGo/CoreMLModels/ to avoid repeated conversion (see the loading sketch after this list)
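
As a concrete illustration of the CoreML half of the hybrid path, here is a minimal Swift sketch of loading a cached model with CPU + Neural Engine compute units. It is not the backend's actual loading code, and the cache path and file name in the usage comment are hypothetical.

```swift
import CoreML
import Foundation

// Minimal sketch (not the backend's actual loading code): compile a cached
// .mlpackage and load it with CPU + Apple Neural Engine compute units so the
// GPU stays free for the MPSGraph path. Requires macOS 13.0+.
func loadCachedModel(packageURL: URL) async throws -> MLModel {
    let config = MLModelConfiguration()
    config.computeUnits = .cpuAndNeuralEngine

    // .mlpackage files must be compiled to .mlmodelc before MLModel can load
    // them; compileModel(at:) does this on the fly.
    let compiledURL = try await MLModel.compileModel(at: packageURL)
    return try MLModel(contentsOf: compiledURL, configuration: config)
}

// Hypothetical usage with the cache directory named above:
// let url = URL(fileURLWithPath: NSHomeDirectory())
//     .appendingPathComponent("Documents/KataGo/CoreMLModels/somemodel_fp16.mlpackage")
// let model = try await loadCachedModel(packageURL: url)
```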

Performance

b18c384nbt: Achieves ~577 nnEvals/s at 16 threads on Apple Silicon, compared to ~374 with CoreML-only inference—a 54% improvement from utilizing all compute units in parallel.

Files Changed

| File | Description |
| --- | --- |
| cpp/neuralnet/coremlbackend.cpp | C++ interface: batch processing, model conversion, data marshaling |
| cpp/neuralnet/coremlbackend.h | Backend header and layer descriptors |
| cpp/neuralnet/coremlbackend.swift | CoreML model loading and hybrid inference orchestration |
| cpp/neuralnet/mpsgraphlayers.swift | MPSGraph layer implementations for GPU path |
| cpp/CMakeLists.txt | Build configuration for CoreML backend |
| Compiling.md | Build instructions |

Build Requirements

  • macOS 13.0+
  • Xcode Command Line Tools with Swift 5.9+
  • Ninja build system: brew install ninja
  • katagocoreml library: brew tap chinchangyang/katagocoreml-cpp && brew install katagocoreml
Then configure and build from the cpp directory:

```
cd cpp
cmake . -G Ninja -DUSE_BACKEND=COREML -DBUILD_DISTRIBUTED=1
ninja
```

Test Plan

  • Run ./katago runtests to verify backend integration
  • Run ./katago benchmark -model <network>.bin.gz to verify performance
  • Test with FP16 and FP32 precision modes
  • Verify model caching works correctly on repeated runs

ChinChangYang and others added 13 commits on December 31, 2025 at 17:15
Add dedicated mask buffer to fix incorrect mask offset calculation in
batched inference. The Swift code assumed mask buffer stride of H*W per
batch element, but was receiving spatial input buffer with stride of
numInputChannels*H*W, causing batch elements > 0 to read garbage data.

Changes:
- Add userInputMaskBuffer to InputBuffers with correct stride
- Copy first channel of spatial input (mask) to dedicated buffer
- Pass mask buffer to Swift instead of reusing spatial buffer

Batched winrate error: 19% → 0.037% (now matches single evaluation)
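
To make the stride mismatch concrete, here is a hedged Swift sketch of the idea; the actual fix lives in the C++ input-buffer code, and the function and parameter names here are illustrative.

```swift
// Illustrative sketch only. Given a spatial input laid out as
// [batch, channels, H, W], copy channel 0 (the board mask) of each batch
// element into a dedicated buffer laid out as [batch, H, W], so element b
// starts at b * H * W rather than b * numInputChannels * H * W.
func extractMask(spatial: [Float], batchSize: Int, numInputChannels: Int,
                 nnYLen: Int, nnXLen: Int) -> [Float] {
    let planeSize = nnYLen * nnXLen
    var mask = [Float](repeating: 0, count: batchSize * planeSize)
    for b in 0..<batchSize {
        let srcOffset = b * numInputChannels * planeSize  // channel 0 of element b
        let dstOffset = b * planeSize
        mask.replaceSubrange(dstOffset..<(dstOffset + planeSize),
                             with: spatial[srcOffset..<(srcOffset + planeSize)])
    }
    return mask
}
```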

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This update enhances the CMake configuration to support the Core ML backend alongside existing options. Key changes include:
- Updated project definition to include Core ML in backend options.
- Added necessary checks for Swift compiler version and generator type.
- Introduced a new library for Core ML and updated target properties.
- Modified output messages to reflect the selected backend during runtime.

This integration allows for improved compatibility and functionality when using Core ML for neural network evaluations.
This update adds an entry for the CoreML backend to the .gitignore file, ensuring that generated files related to the CoreML integration are not tracked by Git. This change helps maintain a cleaner repository by excluding unnecessary build artifacts.
Core ML may return non-contiguous MLMultiArray outputs after GPU computation,
especially for spatial tensors. The previous code used direct dataPointer access
with linear indexing, which read data from wrong memory locations when strides
were non-contiguous.

This fix adds stride-aware extraction that checks MLMultiArray.strides and
handles both contiguous (fast path) and non-contiguous (recursive copy) cases.
Also fixes hard-coded passChannels=2 to use numPolicyChannels.

Before: Policy KL Div ~9.19, Ownership Error ~54c
After:  Policy KL Div ~0.003, Ownership Error ~0.02c
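
Below is a hedged Swift sketch of stride-aware extraction, not the PR's exact code; it assumes a Float32 MLMultiArray, while a full implementation would also cover FP16.

```swift
import CoreML

// Copy an MLMultiArray into a dense Float array while honoring
// MLMultiArray.strides, so non-contiguous outputs returned after GPU
// computation are read correctly.
func denseCopy(_ array: MLMultiArray) -> [Float] {
    let shape = array.shape.map { $0.intValue }
    let strides = array.strides.map { $0.intValue }
    let count = shape.reduce(1, *)
    let src = array.dataPointer.assumingMemoryBound(to: Float.self)

    // Fast path: strides already describe a contiguous row-major layout.
    var contiguous = true
    var expected = 1
    for dim in stride(from: shape.count - 1, through: 0, by: -1) {
        if strides[dim] != expected { contiguous = false; break }
        expected *= shape[dim]
    }
    if contiguous {
        return Array(UnsafeBufferPointer(start: src, count: count))
    }

    // Slow path: visit every multi-dimensional index and apply the strides.
    var out = [Float](repeating: 0, count: count)
    var index = [Int](repeating: 0, count: shape.count)
    for flat in 0..<count {
        var offset = 0
        for dim in 0..<shape.count { offset += index[dim] * strides[dim] }
        out[flat] = src[offset]
        // Advance the multi-dimensional index in row-major order.
        var dim = shape.count - 1
        while dim >= 0 {
            index[dim] += 1
            if index[dim] < shape[dim] { break }
            index[dim] = 0
            dim -= 1
        }
    }
    return out
}
```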

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The CoreML model exports pass policy as "policy_pass" but the code was
looking for "policy_pass_mul2", causing the pass policy buffer to remain
at 0. This resulted in systematically inflated pass move probabilities
after softmax (up to 14% error vs reference).
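
A hedged Swift illustration of the corrected lookup; only the output name "policy_pass" comes from this fix, the function itself is illustrative.

```swift
import CoreML

// Fetch the pass-policy output by the name the converted model actually
// exports. Looking up a name the model does not export returns nil, which is
// how the original bug left the pass-policy buffer at zero.
func passPolicy(from prediction: MLFeatureProvider) -> MLMultiArray? {
    return prediction.featureValue(for: "policy_pass")?.multiArrayValue
}
```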

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The CoreML backend now respects the useFP16 config option, allowing users
to choose between FP16 (default, faster, uses Neural Engine) and FP32
(higher precision). FP16 has ~0.87% max winrate error while FP32 achieves
~0.0006% by matching the Eigen reference. Cache keys include precision
suffix to store FP16 and FP32 models separately.
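
A hedged sketch of the cache-key idea; the exact suffix format is illustrative, not the backend's actual naming scheme.

```swift
// Illustrative only: keep FP16 and FP32 conversions of the same network in
// separate cache entries by appending a precision suffix.
func cachedModelName(baseName: String, useFP16: Bool) -> String {
    return baseName + (useFP16 ? "_fp16" : "_fp32")
}
```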

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Eliminate Python dependency for CoreML model conversion by using the
native C++ katagocoreml library instead of calling Python subprocess.

Changes:
- CMakeLists.txt: Add pkg-config detection for katagocoreml library
- coremlbackend.cpp: Add CoreMLConversion namespace with native converter
  wrapper, caching logic, and directory management functions
- coremlbackend.swift: Remove CoreMLConverter and ModelCacheManager
  structs, simplify createCoreMLComputeHandle to only load pre-converted
  models

The native converter uses katagocoreml::KataGoConverter::convert() and
caches converted models with a "_native" suffix to distinguish from
previously Python-converted models.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implement a hybrid inference system that runs CoreML on CPU + Neural Engine
and MPSGraph on GPU simultaneously, with adaptive batch sizing:

- Add mpsgraphlayers.swift: Shared MPSGraph layer implementations
- Add HybridComputeHandle: Dispatches work to both backends in parallel
- Add ThroughputTracker: Adaptively adjusts batch split ratio using EMA
- Parallelize CoreML batch processing with DispatchQueue.concurrentPerform
- Optimize data copying with memcpy for inputs and outputs
- Clean up CMakeLists.txt: Remove redundant SOURCES from _swift_generate_cxx_header

Performance: Achieves 577 nnEvals/s at 16 threads (vs ~374 before),
exceeding the 500 nnEvals/s target for CPU+GPU+ANE utilization.
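
A hedged Swift sketch of the adaptive-split idea; the type, names, and constants are illustrative, not the PR's ThroughputTracker.

```swift
// Track an exponential moving average (EMA) of each path's throughput and
// split the next batch proportionally. Constants are assumed for illustration.
struct ThroughputEMA {
    private(set) var coreMLRate: Double = 1.0   // evals/s on the CoreML (CPU+ANE) path
    private(set) var mpsGraphRate: Double = 1.0 // evals/s on the MPSGraph (GPU) path
    let alpha = 0.1                             // EMA smoothing factor (assumed)

    mutating func record(coreMLEvals: Int, coreMLSeconds: Double,
                         mpsEvals: Int, mpsSeconds: Double) {
        if coreMLSeconds > 0 {
            coreMLRate = (1 - alpha) * coreMLRate + alpha * Double(coreMLEvals) / coreMLSeconds
        }
        if mpsSeconds > 0 {
            mpsGraphRate = (1 - alpha) * mpsGraphRate + alpha * Double(mpsEvals) / mpsSeconds
        }
    }

    // How many rows of a batch to send to the CoreML path; the rest go to MPSGraph.
    func coreMLShare(of batchSize: Int) -> Int {
        let ratio = coreMLRate / (coreMLRate + mpsGraphRate)
        return max(0, min(batchSize, Int((Double(batchSize) * ratio).rounded())))
    }
}
```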

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When requireExactNNLen is true (all mask values are 1), skip unnecessary
mask operations in MPSGraph layers:

- BatchNormLayer: Skip output * maskTensor multiplication
- GlobalPoolingLayer: Skip mask-1 trick for max pooling
- MaskSumLayer and derived layers: Use precomputed constants instead of
  computing from mask tensor

The optimization is enabled by passing requireExactNNLen to
MPSGraphModelHandle, which propagates it through the layer hierarchy.
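
A hedged MPSGraph sketch of the skip; the function and parameter names are illustrative.

```swift
import MetalPerformanceShadersGraph

// When requireExactNNLen guarantees that every mask value is 1, the mask
// multiplication after a layer is an identity and can be skipped.
func maskIfNeeded(graph: MPSGraph,
                  output: MPSGraphTensor,
                  maskTensor: MPSGraphTensor,
                  requireExactNNLen: Bool) -> MPSGraphTensor {
    if requireExactNNLen {
        return output  // output * 1 == output
    }
    return graph.multiplication(output, maskTensor, name: nil)
}
```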

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Document the CoreML backend as an alternative to Metal for macOS, including
Homebrew installation of the katagocoreml library dependency.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@ChinChangYang
Contributor Author

Cross-Validation Test Report for CoreML Backend

Test Configuration

  • Model: b18c384nbt-uec-20221121b.bin.gz (18-block, 384-channel model)
  • Board Size: 19x19
  • Test Positions: 2,247 positions from built-in test dataset
  • Reference: Eigen backend (CPU, FP32)
  • Test Command: ./katago testgpuerror

Hardware

  • Apple M3 Max
  • macOS with CoreML + MPSGraph hybrid backend

Results

CoreML FP32 vs Eigen FP32 Reference

| Metric | Average | 90th % | 99th % | Max | Threshold (99th / Max) |
| --- | --- | --- | --- | --- | --- |
| winrateError | 0.00006% | 0.00017% | 0.00035% | 0.00066% | 0.45% / 1.35% |
| leadError | 0.00002 | 0.00003 | 0.00011 | 0.00033 | 0.225 / 0.90 |
| scoreMeanError | 0.00002 | 0.00004 | 0.00012 | 0.00030 | 0.225 / 0.90 |
| scoreStdevError | 0.00001 | 0.00001 | 0.00004 | 0.00009 | 0.135 / 0.54 |
| topPolicyDelta | 0.00007% | 0.00017% | 0.00036% | 0.00068% | 0.45% / 1.35% |
| policyKLDiv | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0006 / 0.0012 |
| ownershipError | 0.00003c | 0.00006c | 0.00018c | 0.00174c | |

Status: PASS — CoreML FP32 matches Eigen FP32 with near-zero error.

CoreML FP16 vs Eigen FP32 Reference

| Metric | Average | 90th % | 99th % | Max | Threshold (99th / Max) |
| --- | --- | --- | --- | --- | --- |
| winrateError | 0.0997% | 0.2553% | 0.4857% | 0.9147% | 2.0% / 5.0% |
| leadError | 0.0277 | 0.0615 | 0.1754 | 0.3852 | 1.00 / 3.00 |
| scoreMeanError | 0.0332 | 0.0718 | 0.1721 | 0.3739 | 1.00 / 3.00 |
| scoreStdevError | 0.0129 | 0.0274 | 0.0658 | 0.1646 | 0.60 / 1.80 |
| topPolicyDelta | 0.0944% | 0.2124% | 0.4547% | 0.8814% | 2.50% / 6.00% |
| policyKLDiv | 0.000017 | 0.000035 | 0.000119 | 0.000604 | 0.0020 / 0.0040 |
| ownershipError | 0.0443c | 0.1000c | 0.2420c | 4.3359c | |

Status: PASS — CoreML FP16 is well within acceptable error bounds for half-precision inference.

Summary

| Configuration | Max Winrate Error | Max Policy KL Div | Result |
| --- | --- | --- | --- |
| CoreML FP32 | 0.00066% | 0.000000 | PASS |
| CoreML FP16 | 0.91% | 0.000604 | PASS |

The CoreML backend passes all validation checks:

  • FP32 mode: Numerically equivalent to Eigen reference (errors < 0.001%)
  • FP16 mode: Max winrate error of 0.91% is well below the 5% threshold, consistent with expected half-precision behavior

Conclusion

The hybrid CoreML + MPSGraph backend produces numerically correct results across 2,247 test positions when compared against the Eigen CPU reference implementation. Both FP16 and FP32 precision modes meet KataGo's accuracy requirements for neural network inference.

Comment on lines +100 to +149
```yaml
  build-macos-coreml:
    runs-on: macos-latest
    permissions:
      contents: read

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Install dependencies
        run: |
          brew install ninja zlib libzip
          brew tap chinchangyang/katagocoreml-cpp
          brew install katagocoreml
      - name: Cache CMake build
        uses: actions/cache@v4
        with:
          path: |
            cpp/CMakeCache.txt
            cpp/CMakeFiles
            cpp/build.ninja
            cpp/.ninja_deps
            cpp/.ninja_log
          key: ${{ runner.os }}-cmake-coreml-${{ hashFiles('**/CMakeLists.txt') }}
          restore-keys: |
            ${{ runner.os }}-cmake-coreml-
      - name: Configure CMake
        working-directory: cpp
        run: |
          cmake . -G Ninja -DUSE_BACKEND=COREML -DCMAKE_BUILD_TYPE=Release
      - name: Build
        working-directory: cpp
        run: |
          ninja
      - name: Run tests
        working-directory: cpp
        run: |
          ./katago runtests
      - name: Upload artifact
        if: github.event_name == 'push' && github.ref == 'refs/heads/master'
        uses: actions/upload-artifact@v4
        with:
          name: katago-macos-coreml
          path: cpp/katago
```

Contributor

Is it possible to merge the macOS builds into a single configuration? It looks like some of the build steps overlap.

Contributor Author

Absolutely! It can be merged into a single configuration. Would you like to review it? ChinChangYang#9
