# docker_mlx_cpp

Give any Docker container full Apple Silicon Metal GPU access.
100+ GPU operations • LLM inference • Training • Image gen • Audio • Embeddings
Zero CUDA. Zero NVIDIA. Just Metal.
- One-Line Install
- Quick Start
- System Requirements
- Why
- Architecture
- CLI
- Model Presets
- GPU Compute (100+ ops)
- API Endpoints
- Performance Benchmarks
- Scaffold a GPU Project
- Examples
- Limitations
- Troubleshooting
- FAQ
- Contributing
## One-Line Install

```bash
curl -fsSL https://raw.githubusercontent.com/RobotFlow-Labs/docker_mlx_cpp/main/install.sh | bash
```

This installs everything: MLX, all GPU engines, the daemon, the Docker gateway. One command.
## Architecture

```
┌──────────────────────────────────────────────────────┐
│                 ANY Docker Container                 │
│       (Python, Node, Rust, Go, curl — anything)      │
│     Uses: OpenAI SDK, docker_mlx SDK, or raw HTTP    │
└────────────────────────┬─────────────────────────────┘
                         │ HTTP :8080
┌────────────────────────▼─────────────────────────────┐
│                mlx-gateway (container)               │
│   ├── /v1/*           → inference (LLM, VLM)         │
│   ├── /v1/embeddings  → embedding generation         │
│   ├── /v1/audio/*     → Whisper STT + TTS            │
│   ├── /v1/images/*    → Stable Diffusion / FLUX      │
│   ├── /train/*        → LoRA / QLoRA fine-tuning     │
│   └── /models/*       → model management             │
└────────────────────────┬─────────────────────────────┘
                         │ host.docker.internal:12435
┌────────────────────────▼─────────────────────────────┐
│         MLX Daemon (host-side, native macOS)         │
│   ├── Inference:   mlx-lm (50+ architectures)        │
│   ├── Vision:      mlx-vlm (images + video)          │
│   ├── Training:    LoRA, QLoRA, DPO                  │
│   ├── Image Gen:   Stable Diffusion, SDXL, FLUX      │
│   ├── Audio:       Whisper STT + TTS                 │
│   ├── Embeddings:  Jina v5, BGE, mlx-embeddings      │
│   └── Models:      pull, cache, convert, presets     │
└────────────────────────┬─────────────────────────────┘
                         │ Metal API
┌────────────────────────▼─────────────────────────────┐
│       Apple Silicon M1/M2/M3/M4/M5 — Metal GPU       │
│     Unified Memory • 20-30% faster than llama.cpp    │
└──────────────────────────────────────────────────────┘
```
## Why

On Linux, nvidia-docker gives containers `--gpus all` and full CUDA access. On Mac, nothing equivalent exists — the Metal GPU can't be passed into Docker's Linux VM.

docker_mlx_cpp solves this by running a host-side MLX daemon that exposes the full Apple Silicon GPU stack to any container through standard APIs. Your containers speak the OpenAI API. Your Mac does the Metal compute.

| Capability | NVIDIA Container Toolkit | docker_mlx_cpp |
|---|---|---|
| GPU from containers | `--gpus all` (CUDA) | `http://mlx-gateway:8080` (Metal) |
| LLM inference | vLLM, TGI, Triton | mlx-lm (50+ architectures) |
| Training | PyTorch + NCCL | LoRA, QLoRA, DPO via MLX |
| Image generation | Stable Diffusion (CUDA) | SD, SDXL, FLUX (Metal) |
| Audio | Whisper (CUDA) | Whisper + TTS (Metal) |
| Embeddings | sentence-transformers | mlx-embeddings, Jina v5 |
| Model format | Framework-specific | MLX Safetensors (HuggingFace) |
| Setup | nvidia-container-toolkit | pip install docker-mlx-cpp |
## Quick Start

```bash
# 1. Install
pip install -e ".[all]"

# 2. Start the MLX Daemon (host-side GPU service)
mlx-cpp serve

# 3. Start the gateway (in Docker)
docker compose up -d mlx-gateway

# 4. Pull a model
mlx-cpp models pull mlx-community/SmolLM2-360M-Instruct-4bit

# 5. Use from ANY container
docker run --rm --network mlx-network curlimages/curl:8.5.0 \
  curl -s http://mlx-gateway:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mlx-community/SmolLM2-360M-Instruct-4bit","messages":[{"role":"user","content":"Hello from Docker!"}]}'
```

## CLI

```bash
mlx-cpp serve                   # Start GPU daemon
mlx-cpp run <model> "prompt"    # Quick inference
mlx-cpp models list             # Show cached models
mlx-cpp models pull <model>     # Pull from HuggingFace
mlx-cpp health                  # Check daemon + GPU status
mlx-cpp gpu                     # Show GPU info
mlx-cpp benchmark <model>       # Performance benchmark
mlx-cpp train lora --model ...  # LoRA fine-tuning
```

## Model Presets

Use human-readable presets instead of full model IDs:
```bash
mlx-cpp run chat-small "hello"         # SmolLM2-360M (8GB Mac)
mlx-cpp run chat-default "hello"       # Llama-3.2-3B (8GB+)
mlx-cpp run code "write a function"    # Qwen2.5-Coder-7B (16GB+)
mlx-cpp run vision "describe image"    # Qwen2-VL-7B (16GB+)
```

See all presets: `models/presets.yaml`
```yaml
services:
  your-app:
    image: your-app
    environment:
      - OPENAI_BASE_URL=http://mlx-gateway:8080/v1
      - OPENAI_API_KEY=not-needed
      - OPENAI_MODEL=mlx-community/Llama-3.2-3B-Instruct-4bit
    networks:
      - mlx-network

networks:
  mlx-network:
    external: true
```

Zero code changes needed — any app using the OpenAI SDK works out of the box.
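For example, a containerized Python client needs nothing beyond the standard OpenAI SDK. A minimal sketch (the defaults mirror the compose snippet above; `main()` performs the network call, so it must run from a container on `mlx-network`):

```python
"""Point the OpenAI SDK at the gateway -- no other code changes."""
import os

# Defaults mirror the environment variables set in the compose file above.
BASE_URL = os.environ.get("OPENAI_BASE_URL", "http://mlx-gateway:8080/v1")
MODEL = os.environ.get("OPENAI_MODEL", "mlx-community/Llama-3.2-3B-Instruct-4bit")


def build_chat_request(prompt: str) -> dict:
    """Standard OpenAI chat-completions payload; nothing Metal-specific."""
    return {"model": MODEL, "messages": [{"role": "user", "content": prompt}]}


def main() -> None:
    # Imported here so the module loads even where the SDK isn't installed.
    from openai import OpenAI  # pip install openai

    client = OpenAI(base_url=BASE_URL, api_key="not-needed")
    resp = client.chat.completions.create(**build_chat_request("Hello from Docker!"))
    print(resp.choices[0].message.content)
```

The same pattern applies to any language with an OpenAI client: only the base URL changes.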
## API Endpoints

### Inference

| Endpoint | Description |
|---|---|
| `POST /v1/chat/completions` | Chat inference (LLM/VLM) |
| `POST /v1/completions` | Text completions |
| `POST /v1/embeddings` | Text embeddings |
| `GET /v1/models` | List available models |

### Audio

| Endpoint | Description |
|---|---|
| `POST /v1/audio/transcriptions` | Whisper speech-to-text |
| `POST /v1/audio/speech` | Text-to-speech |

### Images

| Endpoint | Description |
|---|---|
| `POST /v1/images/generations` | Stable Diffusion / FLUX |

### Training

| Endpoint | Description |
|---|---|
| `POST /train/lora` | Start LoRA/QLoRA fine-tuning |
| `GET /train/jobs` | List training jobs |
| `GET /train/jobs/{id}` | Get job status |

### Management

| Endpoint | Description |
|---|---|
| `POST /models/pull` | Pull model from HuggingFace |
| `POST /models/delete` | Remove cached model |
| `GET /health` | Gateway + daemon + GPU status |
| `GET /metrics` | Prometheus metrics |
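Because the `/v1` routes follow the OpenAI wire format, a plain standard-library client is enough. A minimal sketch of an embeddings call (the model name is a placeholder for whichever embedding model you have cached; the network call must run from a container on `mlx-network`):

```python
"""Call /v1/embeddings with nothing but the Python standard library."""
import json
import urllib.request

GATEWAY = "http://mlx-gateway:8080"


def embedding_request(texts: list[str], model: str) -> urllib.request.Request:
    """Build an OpenAI-style embeddings request for the gateway."""
    body = json.dumps({"model": model, "input": texts}).encode()
    return urllib.request.Request(
        f"{GATEWAY}/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts: list[str], model: str) -> list[list[float]]:
    """POST the request and pull vectors out of the OpenAI-shaped response."""
    with urllib.request.urlopen(embedding_request(texts, model)) as resp:
        data = json.load(resp)["data"]
    return [item["embedding"] for item in data]
```

Swap `urllib` for any HTTP client you prefer; the request and response shapes are the standard OpenAI ones.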
```
docker_mlx_cpp/
├── daemon/                  # Host-side MLX daemon
│   ├── mlx_daemon.py        # FastAPI main app (port 12435)
│   ├── model_manager.py     # Model pull/cache/convert
│   ├── engines/
│   │   ├── inference.py     # LLM/VLM inference (mlx-lm)
│   │   ├── training.py      # LoRA/QLoRA fine-tuning
│   │   ├── embeddings.py    # Text embeddings
│   │   ├── audio.py         # Whisper STT + TTS
│   │   └── image_gen.py     # Stable Diffusion / FLUX
│   └── com.robotflow.mlx-daemon.plist  # macOS auto-start
├── gateway/                 # Docker container gateway
│   ├── Dockerfile
│   ├── server.py            # Unified reverse proxy
│   └── requirements.txt
├── cli/
│   └── mlx_cpp.py           # CLI tool (mlx-cpp)
├── models/
│   └── presets.yaml         # Curated model presets
├── examples/
│   ├── python-client/       # Python OpenAI SDK example
│   └── curl-test.sh         # Quick smoke test
├── scripts/
│   ├── setup.sh             # First-time setup
│   └── benchmark.sh         # Performance benchmark
├── sdk/                     # Client SDKs (coming)
│   └── python/docker_mlx/
├── tests/
├── docker-compose.yml
├── pyproject.toml
└── README.md
```
Metal GPU cannot be passed into Docker containers on macOS (confirmed by Docker, Apple, and Red Hat). The VM boundary blocks it.
docker_mlx_cpp uses the same pattern as NVIDIA's container toolkit, adapted for Mac:
- MLX Daemon runs natively on macOS with direct Metal GPU access
- Gateway container routes HTTP requests from Docker network to the daemon
- Any container calls the gateway using standard OpenAI API — no special runtime needed
MLX is 20-30% faster than llama.cpp on Apple Silicon and supports the full ML stack: inference, training, image generation, audio, embeddings, and custom Metal kernels.
## System Requirements

| Requirement | Minimum |
|---|---|
| macOS | 14.0+ (Sonoma) on Apple Silicon |
| Chip | M1 / M2 / M3 / M4 / M5 (any variant) |
| RAM | 8 GB (16 GB+ recommended for larger models) |
| Docker | Docker Desktop 4.62+ |
| Python | 3.11+ |
| Intel Mac | Not supported (no Metal GPU) |
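The table above can be checked mechanically on the host. A small self-contained sketch (the thresholds come from the table; the parameters exist only so the checks are testable):

```python
"""Preflight check for the host requirements listed above."""
import platform
import sys


def preflight(machine: str = platform.machine(),
              system: str = platform.system(),
              py: tuple = tuple(sys.version_info[:2])) -> list[str]:
    """Return a list of problems; an empty list means the host looks OK."""
    problems = []
    if system != "Darwin":
        problems.append("not macOS: the MLX daemon needs a Mac host")
    if machine != "arm64":
        problems.append("not Apple Silicon: Intel Macs have no Metal GPU path")
    if py < (3, 11):
        problems.append(f"Python {py[0]}.{py[1]} is older than 3.11")
    return problems


if __name__ == "__main__":
    issues = preflight()
    print("OK" if not issues else "\n".join(issues))
```

Docker Desktop and macOS versions still need a manual check; this only covers what Python can see.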
## Performance Benchmarks

Tested on Apple M5 (24 GB), MLX 0.31.1:
| Operation | Result | Notes |
|---|---|---|
| Matmul 1024x1024 | ~95 TFLOPS | Raw GPU compute |
| Flash Attention (b=2, h=4, s=128) | 1.6 ms | scaled_dot_product_attention |
| Conv2d (3→32, 32x32) | 0.4 ms | Neural network layer |
| FFT2 128x128 | 0.5 ms | Signal processing |
| Sort 100K elements | 1.2 ms | GPU-accelerated sort |
| Softmax 1024x1024 | 1.8 ms | Activation function |
| LayerNorm (8, 64, 256) | 0.9 ms | Normalization |
| RMSNorm (8, 64, 256) | 0.4 ms | LLM normalization |
All 107 operations pass on the Metal GPU. See `tests/test_compute.py` for the full suite.
## GPU Compute (100+ ops)

Any container can run these operations on the Metal GPU:

```bash
# From any container on mlx-network:
curl -X POST http://mlx-gateway:8080/compute/eval \
  -H "Content-Type: application/json" \
  -d '{"op": "matmul", "args": {"a": {"shape": [1024, 1024]}, "b": {"shape": [1024, 1024]}}}'
```

15 categories, 107 operations: Arithmetic (16) | Linear Algebra (12) | Reductions (12) | Transforms (13) | Activations (13) | Convolutions (2) | Pooling (4) | Attention (1) | Normalization (5) | Random (6) | FFT (6) | Sorting (4) | Comparison (10) | Metal Memory (4) | Benchmarks (3)

List all ops: `GET /compute/ops`
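The same call from Python, as a stdlib-only sketch that mirrors the curl payload above (`matmul_1024` performs the network call, so run it from a container on `mlx-network`):

```python
"""POST a GPU op to /compute/eval, mirroring the curl example above."""
import json
import urllib.request

GATEWAY = "http://mlx-gateway:8080"


def compute_request(op: str, args: dict) -> urllib.request.Request:
    """Same JSON shape as the curl example: {"op": ..., "args": ...}."""
    body = json.dumps({"op": op, "args": args}).encode()
    return urllib.request.Request(
        f"{GATEWAY}/compute/eval",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def matmul_1024() -> dict:
    """Run the 1024x1024 matmul from the example and return the raw response."""
    args = {"a": {"shape": [1024, 1024]}, "b": {"shape": [1024, 1024]}}
    with urllib.request.urlopen(compute_request("matmul", args)) as resp:
        return json.load(resp)
```

Each HTTP round-trip costs ~1-5 ms, so prefer fewer, larger ops over many small ones.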
## Scaffold a GPU Project

```bash
mlx-cpp docker init my-gpu-app
cd my-gpu-app
docker compose up
```

Creates a ready-to-run Dockerfile + compose file + Python app that uses the Metal GPU from inside Docker.
## Examples

| Example | Language | What it does |
|---|---|---|
| `examples/python-client/` | Python | OpenAI SDK → Metal GPU inference |
| `examples/node-client/` | Node.js | fetch() → Metal GPU matmul + chat |
| `examples/streaming/` | Python | SSE token streaming from a container |
| `examples/gpu-test/` | Python | Tests ALL 107 GPU operations |

```bash
docker compose --profile examples up example-app   # Python
docker compose --profile gpu-test up gpu-test      # Full GPU test
```

## Limitations

- No direct Metal inside containers. Metal GPU requires macOS host access, so containers call the daemon over HTTP; this adds ~1-5 ms per call.
- SVD, QR, Cholesky, and Inverse run on the CPU (an MLX linalg constraint). All other ops run on the GPU.
- The `hard_swish` activation is not available in MLX 0.31.1.
- Single-GPU serialization. Concurrent requests from multiple containers are serialized, not run in parallel.
- No Intel Mac support. Metal GPU acceleration requires Apple Silicon.
## Troubleshooting

**`mlx-cpp serve` fails with "No module named mlx"**

```bash
pip install mlx   # Requires an Apple Silicon Mac
```

**Gateway can't reach daemon: "502 MLX Daemon unreachable"**

```bash
# Ensure the daemon is running on the host
mlx-cpp serve   # Must run on the macOS host, not in Docker
```

**"Connection refused" from a container**

```bash
# Ensure your container is on mlx-network
docker network ls | grep mlx

# Ensure the gateway is healthy
curl http://localhost:8080/health
```

**Model download hangs**

```bash
# Check HuggingFace access
python -c "from huggingface_hub import HfApi; print(HfApi().whoami())"
```

**Out of memory with large models**

```bash
# Use a smaller preset
mlx-cpp run chat-small "hello"   # 360M params, fits in 8 GB

# Or clear the cache
curl -X POST http://localhost:12435/compute/clear-cache
```

## FAQ

**Q: Can I import mlx directly inside a container?**
No. MLX requires Metal (macOS), and containers run Linux. Use the HTTP API or the docker_mlx SDK instead.

**Q: What's the overhead vs native MLX?**

~1-5 ms per HTTP round-trip. For inference (100 ms+), that's negligible; for tight loops, batch operations.

**Q: Can multiple containers share the GPU?**

Yes. Requests are serialized through the daemon: no crashes, but execution is sequential, not parallel.

**Q: Does this work on an Intel Mac?**

No. Metal GPU acceleration requires Apple Silicon (M1/M2/M3/M4/M5).

**Q: Is this production-ready?**

It's v0.1.0 — great for development, prototyping, and local ML workflows. For production, add authentication to the gateway.

**Q: How is this different from Docker Model Runner?**

Docker Model Runner is inference-only and proprietary. docker_mlx_cpp is open source and supports inference, training, image generation, audio, embeddings, and raw GPU compute.

**Q: What models are supported?**

Any model the MLX ecosystem supports: 50+ LLM architectures, VLMs, Whisper, Kokoro TTS, FLUX, and the 1000+ models on HuggingFace mlx-community.
## Contributing

See CONTRIBUTING.md for how to get started. We welcome:
- New GPU operations
- Language examples (Rust, Go, Java, etc.)
- Documentation improvements
- Bug reports and feature requests
## License

MIT — See LICENSE
Built by RobotFlow Labs — Making Mac GPUs work like NVIDIA for Docker.
