Add CoreML backend with hybrid CPU+GPU+ANE inference for macOS #1148
base: master
Conversation
Add dedicated mask buffer to fix incorrect mask offset calculation in batched inference.

The Swift code assumed a mask buffer stride of H*W per batch element, but was receiving the spatial input buffer with a stride of numInputChannels*H*W, causing batch elements > 0 to read garbage data.

Changes:
- Add userInputMaskBuffer to InputBuffers with the correct stride
- Copy the first channel of the spatial input (the mask) to the dedicated buffer
- Pass the mask buffer to Swift instead of reusing the spatial buffer

Batched winrate error: 19% → 0.037% (now matches single evaluation)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
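The fix described above amounts to materializing channel 0 of the NCHW spatial input into its own buffer with a stride of H*W per batch element. A minimal C++ sketch of that copy (function and parameter names are illustrative, not the PR's exact identifiers):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch: copy channel 0 (the board mask) of an NCHW spatial
// input into a dedicated mask buffer so the Swift side can index it as
// mask[b * H * W + pos], independent of numChannels.
void fillMaskBuffer(const float* spatialInput, float* maskBuffer,
                    size_t batchSize, size_t numChannels,
                    size_t nnXLen, size_t nnYLen) {
  const size_t area = nnXLen * nnYLen;
  for(size_t b = 0; b < batchSize; b++) {
    // Channel 0 of batch element b starts at b * numChannels * area.
    const float* src = spatialInput + b * numChannels * area;
    float* dst = maskBuffer + b * area;
    std::copy(src, src + area, dst);
  }
}
```

Reading the mask from the spatial buffer directly with a stride of `area` (instead of `numChannels * area`) is exactly the bug the commit describes: only batch element 0 lands on valid mask data.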
This update enhances the CMake configuration to support the Core ML backend alongside the existing options. Key changes include:
- Updated project definition to include Core ML in the backend options.
- Added necessary checks for Swift compiler version and generator type.
- Introduced a new library for Core ML and updated target properties.
- Modified output messages to reflect the selected backend during configuration.

This integration improves compatibility and functionality when using Core ML for neural network evaluations.
This update adds an entry for the CoreML backend to the .gitignore file, ensuring that generated files related to the CoreML integration are not tracked by Git. This change helps maintain a cleaner repository by excluding unnecessary build artifacts.
Core ML may return non-contiguous MLMultiArray outputs after GPU computation, especially for spatial tensors. The previous code used direct dataPointer access with linear indexing, which read data from the wrong memory locations when strides were non-contiguous.

This fix adds stride-aware extraction that checks MLMultiArray.strides and handles both contiguous (fast path) and non-contiguous (recursive copy) cases. Also fixes the hard-coded passChannels=2 to use numPolicyChannels.

Before: Policy KL Div ~9.19, Ownership Error ~54c
After: Policy KL Div ~0.003, Ownership Error ~0.02c
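The stride-aware extraction described above boils down to two pieces of index arithmetic: a contiguity check against the row-major layout, and a recursive gather that honors arbitrary strides. A C++ sketch of the idea, independent of the MLMultiArray API (names are illustrative):

```cpp
#include <vector>

// A tensor is contiguous (row-major) iff each dimension's stride equals the
// product of all later dimension sizes. If so, a flat memcpy-style read is safe.
bool isContiguous(const std::vector<long>& shape, const std::vector<long>& strides) {
  long expected = 1;
  for(int i = (int)shape.size() - 1; i >= 0; i--) {
    if(strides[i] != expected)
      return false;
    expected *= shape[i];
  }
  return true;
}

// Slow path: recursively walk every index, applying the per-dimension stride
// to locate each source element, and write densely into dst.
void gatherStrided(const float* src, float* dst,
                   const std::vector<long>& shape, const std::vector<long>& strides,
                   int dim, long srcOff, long& dstOff) {
  if(dim == (int)shape.size()) {
    dst[dstOff++] = src[srcOff];
    return;
  }
  for(long i = 0; i < shape[dim]; i++)
    gatherStrided(src, dst, shape, strides, dim + 1, srcOff + i * strides[dim], dstOff);
}
```

With linear indexing, a 2x2 view carved out of a wider buffer (strides {3,1}) would silently read padding elements; the gather above skips over them.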
The CoreML model exports pass policy as "policy_pass" but the code was looking for "policy_pass_mul2", causing the pass policy buffer to remain at 0. This resulted in systematically inflated pass move probabilities after softmax (up to 14% error vs the reference).
The CoreML backend now respects the useFP16 config option, allowing users to choose between FP16 (default; faster; uses the Neural Engine) and FP32 (higher precision). FP16 has ~0.87% max winrate error while FP32 achieves ~0.0006% by matching the Eigen reference. Cache keys include a precision suffix so FP16 and FP32 models are stored separately.
Eliminate the Python dependency for CoreML model conversion by using the native C++ katagocoreml library instead of calling a Python subprocess.

Changes:
- CMakeLists.txt: Add pkg-config detection for the katagocoreml library
- coremlbackend.cpp: Add CoreMLConversion namespace with a native converter wrapper, caching logic, and directory management functions
- coremlbackend.swift: Remove the CoreMLConverter and ModelCacheManager structs; simplify createCoreMLComputeHandle to only load pre-converted models

The native converter uses katagocoreml::KataGoConverter::convert() and caches converted models with a "_native" suffix to distinguish them from previously Python-converted models.
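The pkg-config detection mentioned above typically looks something like the following CMake fragment. This is a hedged sketch: the target name `katago` and variable prefix `KATAGOCOREML` are assumptions for illustration, not necessarily the PR's exact wiring.

```cmake
# Sketch: locate the Homebrew-installed katagocoreml library via pkg-config
# and attach its include paths and link flags to the katago target.
find_package(PkgConfig REQUIRED)
pkg_check_modules(KATAGOCOREML REQUIRED katagocoreml)

target_include_directories(katago PRIVATE ${KATAGOCOREML_INCLUDE_DIRS})
target_link_directories(katago PRIVATE ${KATAGOCOREML_LIBRARY_DIRS})
target_link_libraries(katago ${KATAGOCOREML_LIBRARIES})
```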
Implement a hybrid inference system that runs CoreML on the CPU + Neural Engine and MPSGraph on the GPU simultaneously, with adaptive batch sizing:
- Add mpsgraphlayers.swift: Shared MPSGraph layer implementations
- Add HybridComputeHandle: Dispatches work to both backends in parallel
- Add ThroughputTracker: Adaptively adjusts the batch split ratio using an EMA
- Parallelize CoreML batch processing with DispatchQueue.concurrentPerform
- Optimize data copying with memcpy for inputs and outputs
- Clean up CMakeLists.txt: Remove redundant SOURCES from _swift_generate_cxx_header

Performance: Achieves 577 nnEvals/s at 16 threads (vs ~374 before), exceeding the 500 nnEvals/s target for CPU+GPU+ANE utilization.
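The adaptive split above can be pictured as: track an exponential moving average of each backend's throughput, then hand each backend a share of the batch proportional to its EMA. A minimal C++ sketch of that logic (the real ThroughputTracker is Swift, and its fields and smoothing factor here are assumptions):

```cpp
#include <algorithm>

// Hypothetical sketch of EMA-based batch splitting between two backends.
struct ThroughputTracker {
  double emaCoreML = 1.0;  // evals/s estimate for the CoreML (CPU+ANE) path
  double emaMPS = 1.0;     // evals/s estimate for the MPSGraph (GPU) path
  double alpha = 0.1;      // EMA smoothing factor (illustrative value)

  // Fold a new throughput measurement into the matching backend's EMA.
  void record(bool isCoreML, double evalsPerSec) {
    double& ema = isCoreML ? emaCoreML : emaMPS;
    ema = alpha * evalsPerSec + (1.0 - alpha) * ema;
  }

  // Give CoreML a share of the batch proportional to its estimated throughput.
  int coreMLShare(int batchSize) const {
    double frac = emaCoreML / (emaCoreML + emaMPS);
    int n = (int)(frac * batchSize + 0.5);
    return std::min(std::max(n, 0), batchSize);
  }
};
```

The EMA keeps the split responsive to load changes (thermal throttling, competing work on the GPU) without oscillating on single noisy measurements.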
When requireExactNNLen is true (all mask values are 1), skip unnecessary mask operations in the MPSGraph layers:
- BatchNormLayer: Skip the output * maskTensor multiplication
- GlobalPoolingLayer: Skip the mask-1 trick for max pooling
- MaskSumLayer and derived layers: Use precomputed constants instead of computing from the mask tensor

The optimization is enabled by passing requireExactNNLen to MPSGraphModelHandle, which propagates it through the layer hierarchy.
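To see why the "mask-1 trick" is skippable: masked max pooling adds `(mask - 1) * LARGE` to each value so off-board cells can never win the max; when the mask is all ones that term is identically zero. A small scalar C++ demonstration (the constant 5000 is illustrative, not the backend's actual value):

```cpp
#include <algorithm>
#include <vector>

// Masked max: off-board cells (mask == 0) get a large value subtracted so
// they never dominate the maximum.
float maskedMax(const std::vector<float>& x, const std::vector<float>& mask) {
  const float LARGE = 5000.0f;  // illustrative penalty constant
  float m = -1e30f;
  for(size_t i = 0; i < x.size(); i++)
    m = std::max(m, x[i] + (mask[i] - 1.0f) * LARGE);
  return m;
}

// Plain max: valid whenever the mask is known to be all ones.
float plainMax(const std::vector<float>& x) {
  float m = -1e30f;
  for(float v : x)
    m = std::max(m, v);
  return m;
}
```

When requireExactNNLen guarantees an all-ones mask, the two functions agree, so the graph can emit the cheaper plain reduction and replace mask sums with the constant nnXLen * nnYLen.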
Document the CoreML backend as an alternative to Metal for macOS, including Homebrew installation of the katagocoreml library dependency.
Cross-Validation Test Report for CoreML Backend

Test Configuration
Hardware

Results

CoreML FP32 vs Eigen FP32 Reference
Status: PASS — CoreML FP32 matches Eigen FP32 with near-zero error.

CoreML FP16 vs Eigen FP32 Reference
Status: PASS — CoreML FP16 is well within acceptable error bounds for half-precision inference.

Summary
The CoreML backend passes all validation checks.

Conclusion
The hybrid CoreML + MPSGraph backend produces numerically correct results across 2,247 test positions when compared against the Eigen CPU reference implementation. Both FP16 and FP32 precision modes meet KataGo's accuracy requirements for neural network inference.
```yaml
  build-macos-coreml:
    runs-on: macos-latest
    permissions:
      contents: read

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Install dependencies
        run: |
          brew install ninja zlib libzip
          brew tap chinchangyang/katagocoreml-cpp
          brew install katagocoreml

      - name: Cache CMake build
        uses: actions/cache@v4
        with:
          path: |
            cpp/CMakeCache.txt
            cpp/CMakeFiles
            cpp/build.ninja
            cpp/.ninja_deps
            cpp/.ninja_log
          key: ${{ runner.os }}-cmake-coreml-${{ hashFiles('**/CMakeLists.txt') }}
          restore-keys: |
            ${{ runner.os }}-cmake-coreml-

      - name: Configure CMake
        working-directory: cpp
        run: |
          cmake . -G Ninja -DUSE_BACKEND=COREML -DCMAKE_BUILD_TYPE=Release

      - name: Build
        working-directory: cpp
        run: |
          ninja

      - name: Run tests
        working-directory: cpp
        run: |
          ./katago runtests

      - name: Upload artifact
        if: github.event_name == 'push' && github.ref == 'refs/heads/master'
        uses: actions/upload-artifact@v4
        with:
          name: katago-macos-coreml
          path: cpp/katago
```
Is it possible to merge macOS build into a single configuration? As I see some build steps are overlapping.
Absolutely! It can be merged into a single configuration. Would you like to review it? ChinChangYang#9
Summary
This PR adds a new CoreML neural network backend for macOS that leverages Apple's full compute stack—CPU, GPU, and Apple Neural Engine (ANE) simultaneously—through a hybrid architecture.
Key Features
- FP16/FP32 precision selectable via the useFP16Mode config option
- Converted models cached in ~/Documents/KataGo/CoreMLModels/ to avoid repeated conversion

Performance
With the b18c384nbt network: achieves ~577 nnEvals/s at 16 threads on Apple Silicon, compared to ~374 with CoreML-only inference, a 54% improvement from utilizing all compute units in parallel.
Files Changed
Build Requirements
- brew install ninja
- brew tap chinchangyang/katagocoreml-cpp && brew install katagocoreml

Test Plan
- ./katago runtests to verify backend integration
- ./katago benchmark -model <network>.bin.gz to verify performance