Conversation

@ChinChangYang
Contributor

Summary

This PR adds a new CoreML neural network backend for macOS that leverages Apple's full compute stack—CPU, GPU, and Apple Neural Engine (ANE) simultaneously—through a hybrid architecture.

Key Features

  • Hybrid Inference Architecture: Runs CoreML on CPU+ANE and MPSGraph on GPU in parallel, with adaptive batch splitting based on throughput tracking
  • Native Model Conversion: Uses the katagocoreml C++ library (https://github.com/chinchangyang/katagocoreml-cpp) for on-the-fly conversion from KataGo's .bin.gz format to CoreML .mlpackage (no Python dependency)
  • Configurable Precision: Supports FP16 (default, faster, uses Neural Engine) and FP32 (higher precision) via the useFP16Mode config option
  • Model Caching: Converted models are cached in ~/Documents/KataGo/CoreMLModels/ to avoid repeated conversion (see the loading sketch after this list)
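
As a concrete illustration of the CoreML half of the hybrid path, here is a minimal Swift sketch of loading a cached model with CPU + Neural Engine compute units. It is not the backend's actual loading code, and the cache path and file name in the usage comment are hypothetical.

```swift
import CoreML
import Foundation

// Minimal sketch (not the backend's actual loading code): compile a cached
// .mlpackage and load it with CPU + Apple Neural Engine compute units so the
// GPU stays free for the MPSGraph path. Requires macOS 13.0+.
func loadCachedModel(packageURL: URL) async throws -> MLModel {
    let config = MLModelConfiguration()
    config.computeUnits = .cpuAndNeuralEngine

    // .mlpackage files must be compiled to .mlmodelc before MLModel can load
    // them; compileModel(at:) does this on the fly.
    let compiledURL = try await MLModel.compileModel(at: packageURL)
    return try MLModel(contentsOf: compiledURL, configuration: config)
}

// Hypothetical usage with the cache directory named above:
// let url = URL(fileURLWithPath: NSHomeDirectory())
//     .appendingPathComponent("Documents/KataGo/CoreMLModels/somemodel_fp16.mlpackage")
// let model = try await loadCachedModel(packageURL: url)
```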

Performance

b18c384nbt: Achieves ~577 nnEvals/s at 16 threads on Apple Silicon, compared to ~374 with CoreML-only inference—a 54% improvement from utilizing all compute units in parallel.

Files Changed

| File | Description |
| --- | --- |
| cpp/neuralnet/coremlbackend.cpp | C++ interface: batch processing, model conversion, data marshaling |
| cpp/neuralnet/coremlbackend.h | Backend header and layer descriptors |
| cpp/neuralnet/coremlbackend.swift | CoreML model loading and hybrid inference orchestration |
| cpp/neuralnet/mpsgraphlayers.swift | MPSGraph layer implementations for GPU path |
| cpp/CMakeLists.txt | Build configuration for CoreML backend |
| Compiling.md | Build instructions |

Build Requirements

  • macOS 13.0+
  • Xcode Command Line Tools with Swift 5.9+
  • Ninja build system: brew install ninja
  • katagocoreml library: brew tap chinchangyang/katagocoreml-cpp && brew install katagocoreml
Then configure and build from the cpp directory:

```
cd cpp
cmake . -G Ninja -DUSE_BACKEND=COREML -DBUILD_DISTRIBUTED=1
ninja
```

Test Plan

  • Run ./katago runtests to verify backend integration
  • Run ./katago benchmark -model <network>.bin.gz to verify performance
  • Test with FP16 and FP32 precision modes
  • Verify model caching works correctly on repeated runs

ChinChangYang and others added 13 commits on December 31, 2025 at 17:15
Add dedicated mask buffer to fix incorrect mask offset calculation in
batched inference. The Swift code assumed mask buffer stride of H*W per
batch element, but was receiving spatial input buffer with stride of
numInputChannels*H*W, causing batch elements > 0 to read garbage data.

Changes:
- Add userInputMaskBuffer to InputBuffers with correct stride
- Copy first channel of spatial input (mask) to dedicated buffer
- Pass mask buffer to Swift instead of reusing spatial buffer

Batched winrate error: 19% → 0.037% (now matches single evaluation)
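
To make the stride mismatch concrete, here is a hedged Swift sketch of the idea; the actual fix lives in the C++ input-buffer code, and the function and parameter names here are illustrative.

```swift
// Illustrative sketch only. Given a spatial input laid out as
// [batch, channels, H, W], copy channel 0 (the board mask) of each batch
// element into a dedicated buffer laid out as [batch, H, W], so element b
// starts at b * H * W rather than b * numInputChannels * H * W.
func extractMask(spatial: [Float], batchSize: Int, numInputChannels: Int,
                 nnYLen: Int, nnXLen: Int) -> [Float] {
    let planeSize = nnYLen * nnXLen
    var mask = [Float](repeating: 0, count: batchSize * planeSize)
    for b in 0..<batchSize {
        let srcOffset = b * numInputChannels * planeSize  // channel 0 of element b
        let dstOffset = b * planeSize
        mask.replaceSubrange(dstOffset..<(dstOffset + planeSize),
                             with: spatial[srcOffset..<(srcOffset + planeSize)])
    }
    return mask
}
```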

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This update enhances the CMake configuration to support the Core ML backend alongside existing options. Key changes include:
- Updated project definition to include Core ML in backend options.
- Added necessary checks for Swift compiler version and generator type.
- Introduced a new library for Core ML and updated target properties.
- Modified output messages to reflect the selected backend during runtime.

This integration allows for improved compatibility and functionality when using Core ML for neural network evaluations.
This update adds an entry for the CoreML backend to the .gitignore file, ensuring that generated files related to the CoreML integration are not tracked by Git. This change helps maintain a cleaner repository by excluding unnecessary build artifacts.
Core ML may return non-contiguous MLMultiArray outputs after GPU computation,
especially for spatial tensors. The previous code used direct dataPointer access
with linear indexing, which read data from wrong memory locations when strides
were non-contiguous.

This fix adds stride-aware extraction that checks MLMultiArray.strides and
handles both contiguous (fast path) and non-contiguous (recursive copy) cases.
Also fixes hard-coded passChannels=2 to use numPolicyChannels.

Before: Policy KL Div ~9.19, Ownership Error ~54c
After:  Policy KL Div ~0.003, Ownership Error ~0.02c
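
Below is a hedged Swift sketch of stride-aware extraction, not the PR's exact code; it assumes a Float32 MLMultiArray, while a full implementation would also cover FP16.

```swift
import CoreML

// Copy an MLMultiArray into a dense Float array while honoring
// MLMultiArray.strides, so non-contiguous outputs returned after GPU
// computation are read correctly.
func denseCopy(_ array: MLMultiArray) -> [Float] {
    let shape = array.shape.map { $0.intValue }
    let strides = array.strides.map { $0.intValue }
    let count = shape.reduce(1, *)
    let src = array.dataPointer.assumingMemoryBound(to: Float.self)

    // Fast path: strides already describe a contiguous row-major layout.
    var contiguous = true
    var expected = 1
    for dim in stride(from: shape.count - 1, through: 0, by: -1) {
        if strides[dim] != expected { contiguous = false; break }
        expected *= shape[dim]
    }
    if contiguous {
        return Array(UnsafeBufferPointer(start: src, count: count))
    }

    // Slow path: visit every multi-dimensional index and apply the strides.
    var out = [Float](repeating: 0, count: count)
    var index = [Int](repeating: 0, count: shape.count)
    for flat in 0..<count {
        var offset = 0
        for dim in 0..<shape.count { offset += index[dim] * strides[dim] }
        out[flat] = src[offset]
        // Advance the multi-dimensional index in row-major order.
        var dim = shape.count - 1
        while dim >= 0 {
            index[dim] += 1
            if index[dim] < shape[dim] { break }
            index[dim] = 0
            dim -= 1
        }
    }
    return out
}
```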

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The CoreML model exports pass policy as "policy_pass" but the code was
looking for "policy_pass_mul2", causing the pass policy buffer to remain
at 0. This resulted in systematically inflated pass move probabilities
after softmax (up to 14% error vs reference).
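
A hedged Swift illustration of the corrected lookup; only the output name "policy_pass" comes from this fix, the function itself is illustrative.

```swift
import CoreML

// Fetch the pass-policy output by the name the converted model actually
// exports. Looking up a name the model does not export returns nil, which is
// how the original bug left the pass-policy buffer at zero.
func passPolicy(from prediction: MLFeatureProvider) -> MLMultiArray? {
    return prediction.featureValue(for: "policy_pass")?.multiArrayValue
}
```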

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The CoreML backend now respects the useFP16 config option, allowing users
to choose between FP16 (default, faster, uses Neural Engine) and FP32
(higher precision). FP16 has ~0.87% max winrate error while FP32 achieves
~0.0006% by matching the Eigen reference. Cache keys include precision
suffix to store FP16 and FP32 models separately.
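
A hedged sketch of the cache-key idea; the exact suffix format is illustrative, not the backend's actual naming scheme.

```swift
// Illustrative only: keep FP16 and FP32 conversions of the same network in
// separate cache entries by appending a precision suffix.
func cachedModelName(baseName: String, useFP16: Bool) -> String {
    return baseName + (useFP16 ? "_fp16" : "_fp32")
}
```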

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Eliminate Python dependency for CoreML model conversion by using the
native C++ katagocoreml library instead of calling Python subprocess.

Changes:
- CMakeLists.txt: Add pkg-config detection for katagocoreml library
- coremlbackend.cpp: Add CoreMLConversion namespace with native converter
  wrapper, caching logic, and directory management functions
- coremlbackend.swift: Remove CoreMLConverter and ModelCacheManager
  structs, simplify createCoreMLComputeHandle to only load pre-converted
  models

The native converter uses katagocoreml::KataGoConverter::convert() and
caches converted models with a "_native" suffix to distinguish from
previously Python-converted models.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implement a hybrid inference system that runs CoreML on CPU + Neural Engine
and MPSGraph on GPU simultaneously, with adaptive batch sizing:

- Add mpsgraphlayers.swift: Shared MPSGraph layer implementations
- Add HybridComputeHandle: Dispatches work to both backends in parallel
- Add ThroughputTracker: Adaptively adjusts batch split ratio using EMA
- Parallelize CoreML batch processing with DispatchQueue.concurrentPerform
- Optimize data copying with memcpy for inputs and outputs
- Clean up CMakeLists.txt: Remove redundant SOURCES from _swift_generate_cxx_header

Performance: Achieves 577 nnEvals/s at 16 threads (vs ~374 before),
exceeding the 500 nnEvals/s target for CPU+GPU+ANE utilization.
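
A hedged Swift sketch of the adaptive-split idea; the type, names, and constants are illustrative, not the PR's ThroughputTracker.

```swift
// Track an exponential moving average (EMA) of each path's throughput and
// split the next batch proportionally. Constants are assumed for illustration.
struct ThroughputEMA {
    private(set) var coreMLRate: Double = 1.0   // evals/s on the CoreML (CPU+ANE) path
    private(set) var mpsGraphRate: Double = 1.0 // evals/s on the MPSGraph (GPU) path
    let alpha = 0.1                             // EMA smoothing factor (assumed)

    mutating func record(coreMLEvals: Int, coreMLSeconds: Double,
                         mpsEvals: Int, mpsSeconds: Double) {
        if coreMLSeconds > 0 {
            coreMLRate = (1 - alpha) * coreMLRate + alpha * Double(coreMLEvals) / coreMLSeconds
        }
        if mpsSeconds > 0 {
            mpsGraphRate = (1 - alpha) * mpsGraphRate + alpha * Double(mpsEvals) / mpsSeconds
        }
    }

    // How many rows of a batch to send to the CoreML path; the rest go to MPSGraph.
    func coreMLShare(of batchSize: Int) -> Int {
        let ratio = coreMLRate / (coreMLRate + mpsGraphRate)
        return max(0, min(batchSize, Int((Double(batchSize) * ratio).rounded())))
    }
}
```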

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When requireExactNNLen is true (all mask values are 1), skip unnecessary
mask operations in MPSGraph layers:

- BatchNormLayer: Skip output * maskTensor multiplication
- GlobalPoolingLayer: Skip mask-1 trick for max pooling
- MaskSumLayer and derived layers: Use precomputed constants instead of
  computing from mask tensor

The optimization is enabled by passing requireExactNNLen to
MPSGraphModelHandle, which propagates it through the layer hierarchy.
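
A hedged MPSGraph sketch of the skip; the function and parameter names are illustrative.

```swift
import MetalPerformanceShadersGraph

// When requireExactNNLen guarantees that every mask value is 1, the mask
// multiplication after a layer is an identity and can be skipped.
func maskIfNeeded(graph: MPSGraph,
                  output: MPSGraphTensor,
                  maskTensor: MPSGraphTensor,
                  requireExactNNLen: Bool) -> MPSGraphTensor {
    if requireExactNNLen {
        return output  // output * 1 == output
    }
    return graph.multiplication(output, maskTensor, name: nil)
}
```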

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Document the CoreML backend as an alternative to Metal for macOS, including
Homebrew installation of the katagocoreml library dependency.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@ChinChangYang
Contributor Author

Cross-Validation Test Report for CoreML Backend

Test Configuration

  • Model: b18c384nbt-uec-20221121b.bin.gz (18-block, 384-channel model)
  • Board Size: 19x19
  • Test Positions: 2,247 positions from built-in test dataset
  • Reference: Eigen backend (CPU, FP32)
  • Test Command: ./katago testgpuerror

Hardware

  • Apple M3 Max
  • macOS with CoreML + MPSGraph hybrid backend

Results

CoreML FP32 vs Eigen FP32 Reference

| Metric | Average | 90th % | 99th % | Max | Threshold (99th / Max) |
| --- | --- | --- | --- | --- | --- |
| winrateError | 0.00006% | 0.00017% | 0.00035% | 0.00066% | 0.45% / 1.35% |
| leadError | 0.00002 | 0.00003 | 0.00011 | 0.00033 | 0.225 / 0.90 |
| scoreMeanError | 0.00002 | 0.00004 | 0.00012 | 0.00030 | 0.225 / 0.90 |
| scoreStdevError | 0.00001 | 0.00001 | 0.00004 | 0.00009 | 0.135 / 0.54 |
| topPolicyDelta | 0.00007% | 0.00017% | 0.00036% | 0.00068% | 0.45% / 1.35% |
| policyKLDiv | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0006 / 0.0012 |
| ownershipError | 0.00003c | 0.00006c | 0.00018c | 0.00174c | |

Status: PASS — CoreML FP32 matches Eigen FP32 with near-zero error.

CoreML FP16 vs Eigen FP32 Reference

| Metric | Average | 90th % | 99th % | Max | Threshold (99th / Max) |
| --- | --- | --- | --- | --- | --- |
| winrateError | 0.0997% | 0.2553% | 0.4857% | 0.9147% | 2.0% / 5.0% |
| leadError | 0.0277 | 0.0615 | 0.1754 | 0.3852 | 1.00 / 3.00 |
| scoreMeanError | 0.0332 | 0.0718 | 0.1721 | 0.3739 | 1.00 / 3.00 |
| scoreStdevError | 0.0129 | 0.0274 | 0.0658 | 0.1646 | 0.60 / 1.80 |
| topPolicyDelta | 0.0944% | 0.2124% | 0.4547% | 0.8814% | 2.50% / 6.00% |
| policyKLDiv | 0.000017 | 0.000035 | 0.000119 | 0.000604 | 0.0020 / 0.0040 |
| ownershipError | 0.0443c | 0.1000c | 0.2420c | 4.3359c | |

Status: PASS — CoreML FP16 is well within acceptable error bounds for half-precision inference.

Summary

| Configuration | Max Winrate Error | Max Policy KL Div | Result |
| --- | --- | --- | --- |
| CoreML FP32 | 0.00066% | 0.000000 | PASS |
| CoreML FP16 | 0.91% | 0.000604 | PASS |

The CoreML backend passes all validation checks:

  • FP32 mode: Numerically equivalent to Eigen reference (errors < 0.001%)
  • FP16 mode: Max winrate error of 0.91% is well below the 5% threshold, consistent with expected half-precision behavior

Conclusion

The hybrid CoreML + MPSGraph backend produces numerically correct results across 2,247 test positions when compared against the Eigen CPU reference implementation. Both FP16 and FP32 precision modes meet KataGo's accuracy requirements for neural network inference.

Comment on lines +100 to +149
```yaml
  build-macos-coreml:
    runs-on: macos-latest
    permissions:
      contents: read

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Install dependencies
        run: |
          brew install ninja zlib libzip
          brew tap chinchangyang/katagocoreml-cpp
          brew install katagocoreml
      - name: Cache CMake build
        uses: actions/cache@v4
        with:
          path: |
            cpp/CMakeCache.txt
            cpp/CMakeFiles
            cpp/build.ninja
            cpp/.ninja_deps
            cpp/.ninja_log
          key: ${{ runner.os }}-cmake-coreml-${{ hashFiles('**/CMakeLists.txt') }}
          restore-keys: |
            ${{ runner.os }}-cmake-coreml-
      - name: Configure CMake
        working-directory: cpp
        run: |
          cmake . -G Ninja -DUSE_BACKEND=COREML -DCMAKE_BUILD_TYPE=Release
      - name: Build
        working-directory: cpp
        run: |
          ninja
      - name: Run tests
        working-directory: cpp
        run: |
          ./katago runtests
      - name: Upload artifact
        if: github.event_name == 'push' && github.ref == 'refs/heads/master'
        uses: actions/upload-artifact@v4
        with:
          name: katago-macos-coreml
          path: cpp/katago
```

Contributor

Is it possible to merge the macOS builds into a single configuration? It looks like some of the build steps overlap.

Contributor Author

Absolutely! It can be merged into a single configuration. Would you like to review it? ChinChangYang#9
