# docker_mlx_cpp

Give any Docker container full Apple Silicon Metal GPU access.
100+ GPU operations • LLM inference • Training • Image gen • Audio • Embeddings
Zero CUDA. Zero NVIDIA. Just Metal.
- One-Line Install
- Quick Start
- System Requirements
- Why
- Architecture
- CLI
- Model Presets
- GPU Compute (100+ ops)
- API Endpoints
- Performance Benchmarks
- Scaffold a GPU Project
- Examples
- Limitations
- Troubleshooting
- FAQ
- Contributing
## One-Line Install

```bash
curl -fsSL https://raw.githubusercontent.com/RobotFlow-Labs/docker_mlx_cpp/main/install.sh | bash
```

This installs everything: MLX, all GPU engines, the daemon, the Docker gateway. One command.
## Architecture

```
┌──────────────────────────────────────────────────────┐
│                 ANY Docker Container                 │
│       (Python, Node, Rust, Go, curl — anything)      │
│     Uses: OpenAI SDK, docker_mlx SDK, or raw HTTP    │
└────────────────────────┬─────────────────────────────┘
                         │ HTTP :8080
┌────────────────────────▼─────────────────────────────┐
│                mlx-gateway (container)               │
│   ├── /v1/*           → inference (LLM, VLM)         │
│   ├── /v1/embeddings  → embedding generation         │
│   ├── /v1/audio/*     → Whisper STT + TTS            │
│   ├── /v1/images/*    → Stable Diffusion / FLUX      │
│   ├── /train/*        → LoRA / QLoRA fine-tuning     │
│   └── /models/*       → model management             │
└────────────────────────┬─────────────────────────────┘
                         │ host.docker.internal:12435
┌────────────────────────▼─────────────────────────────┐
│         MLX Daemon (host-side, native macOS)         │
│   ├── Inference:   mlx-lm (50+ architectures)        │
│   ├── Vision:      mlx-vlm (images + video)          │
│   ├── Training:    LoRA, QLoRA, DPO                  │
│   ├── Image Gen:   Stable Diffusion, SDXL, FLUX      │
│   ├── Audio:       Whisper STT + TTS                 │
│   ├── Embeddings:  Jina v5, BGE, mlx-embeddings      │
│   └── Models:      pull, cache, convert, presets     │
└────────────────────────┬─────────────────────────────┘
                         │ Metal API
┌────────────────────────▼─────────────────────────────┐
│       Apple Silicon M1/M2/M3/M4/M5 — Metal GPU       │
│     Unified Memory • 20-30% faster than llama.cpp    │
└──────────────────────────────────────────────────────┘
```
## Why

On Linux, nvidia-docker gives containers `--gpus all` and full CUDA access. On Mac, nothing equivalent exists — the Metal GPU can't be passed into Docker's Linux VM.

docker_mlx_cpp solves this by running a host-side MLX daemon that exposes the full Apple Silicon GPU stack to any container through standard APIs. Your containers speak the OpenAI API. Your Mac does the Metal compute.

| Capability | NVIDIA Container Toolkit | docker_mlx_cpp |
|---|---|---|
| GPU from containers | `--gpus all` (CUDA) | `http://mlx-gateway:8080` (Metal) |
| LLM inference | vLLM, TGI, Triton | mlx-lm (50+ architectures) |
| Training | PyTorch + NCCL | LoRA, QLoRA, DPO via MLX |
| Image generation | Stable Diffusion (CUDA) | SD, SDXL, FLUX (Metal) |
| Audio | Whisper (CUDA) | Whisper + TTS (Metal) |
| Embeddings | sentence-transformers | mlx-embeddings, Jina v5 |
| Model format | Framework-specific | MLX Safetensors (HuggingFace) |
| Setup | nvidia-container-toolkit | pip install docker-mlx-cpp |
## Quick Start

```bash
# 1. Install
pip install -e ".[all]"

# 2. Start the MLX Daemon (host-side GPU service)
mlx-cpp serve

# 3. Start the gateway (in Docker)
docker compose up -d mlx-gateway

# 4. Pull a model
mlx-cpp models pull mlx-community/SmolLM2-360M-Instruct-4bit

# 5. Use from ANY container
docker run --rm --network mlx-network curlimages/curl:8.5.0 \
  curl -s http://mlx-gateway:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mlx-community/SmolLM2-360M-Instruct-4bit","messages":[{"role":"user","content":"Hello from Docker!"}]}'
```

## CLI

```bash
mlx-cpp serve                   # Start GPU daemon
mlx-cpp run <model> "prompt"    # Quick inference
mlx-cpp models list             # Show cached models
mlx-cpp models pull <model>     # Pull from HuggingFace
mlx-cpp health                  # Check daemon + GPU status
mlx-cpp gpu                     # Show GPU info
mlx-cpp benchmark <model>       # Performance benchmark
mlx-cpp train lora --model ...  # LoRA fine-tuning
```

## Model Presets

Use human-readable presets instead of full model IDs:
```bash
mlx-cpp run chat-small "hello"         # SmolLM2-360M (8GB Mac)
mlx-cpp run chat-default "hello"       # Llama-3.2-3B (8GB+)
mlx-cpp run code "write a function"    # Qwen2.5-Coder-7B (16GB+)
mlx-cpp run vision "describe image"    # Qwen2-VL-7B (16GB+)
```

See all presets: `models/presets.yaml`
```yaml
services:
  your-app:
    image: your-app
    environment:
      - OPENAI_BASE_URL=http://mlx-gateway:8080/v1
      - OPENAI_API_KEY=not-needed
      - OPENAI_MODEL=mlx-community/Llama-3.2-3B-Instruct-4bit
    networks:
      - mlx-network

networks:
  mlx-network:
    external: true
```

Zero code changes needed — any app using the OpenAI SDK works out of the box.
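For example, a containerized Python client needs nothing beyond the standard OpenAI SDK. A minimal sketch (the defaults mirror the compose snippet above; `main()` performs the network call, so it must run from a container on `mlx-network`):

```python
"""Point the OpenAI SDK at the gateway -- no other code changes."""
import os

# Defaults mirror the environment variables set in the compose file above.
BASE_URL = os.environ.get("OPENAI_BASE_URL", "http://mlx-gateway:8080/v1")
MODEL = os.environ.get("OPENAI_MODEL", "mlx-community/Llama-3.2-3B-Instruct-4bit")


def build_chat_request(prompt: str) -> dict:
    """Standard OpenAI chat-completions payload; nothing Metal-specific."""
    return {"model": MODEL, "messages": [{"role": "user", "content": prompt}]}


def main() -> None:
    # Imported here so the module loads even where the SDK isn't installed.
    from openai import OpenAI  # pip install openai

    client = OpenAI(base_url=BASE_URL, api_key="not-needed")
    resp = client.chat.completions.create(**build_chat_request("Hello from Docker!"))
    print(resp.choices[0].message.content)
```

The same pattern applies to any language with an OpenAI client: only the base URL changes.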
## API Endpoints

### Inference

| Endpoint | Description |
|---|---|
| `POST /v1/chat/completions` | Chat inference (LLM/VLM) |
| `POST /v1/completions` | Text completions |
| `POST /v1/embeddings` | Text embeddings |
| `GET /v1/models` | List available models |

### Audio

| Endpoint | Description |
|---|---|
| `POST /v1/audio/transcriptions` | Whisper speech-to-text |
| `POST /v1/audio/speech` | Text-to-speech |

### Images

| Endpoint | Description |
|---|---|
| `POST /v1/images/generations` | Stable Diffusion / FLUX |

### Training

| Endpoint | Description |
|---|---|
| `POST /train/lora` | Start LoRA/QLoRA fine-tuning |
| `GET /train/jobs` | List training jobs |
| `GET /train/jobs/{id}` | Get job status |

### Management

| Endpoint | Description |
|---|---|
| `POST /models/pull` | Pull model from HuggingFace |
| `POST /models/delete` | Remove cached model |
| `GET /health` | Gateway + daemon + GPU status |
| `GET /metrics` | Prometheus metrics |
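Because the `/v1` routes follow the OpenAI wire format, a plain standard-library client is enough. A minimal sketch of an embeddings call (the model name is a placeholder for whichever embedding model you have cached; the network call must run from a container on `mlx-network`):

```python
"""Call /v1/embeddings with nothing but the Python standard library."""
import json
import urllib.request

GATEWAY = "http://mlx-gateway:8080"


def embedding_request(texts: list[str], model: str) -> urllib.request.Request:
    """Build an OpenAI-style embeddings request for the gateway."""
    body = json.dumps({"model": model, "input": texts}).encode()
    return urllib.request.Request(
        f"{GATEWAY}/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts: list[str], model: str) -> list[list[float]]:
    """POST the request and pull vectors out of the OpenAI-shaped response."""
    with urllib.request.urlopen(embedding_request(texts, model)) as resp:
        data = json.load(resp)["data"]
    return [item["embedding"] for item in data]
```

Swap `urllib` for any HTTP client you prefer; the request and response shapes are the standard OpenAI ones.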
```
docker_mlx_cpp/
├── daemon/                  # Host-side MLX daemon
│   ├── mlx_daemon.py        # FastAPI main app (port 12435)
│   ├── model_manager.py     # Model pull/cache/convert
│   ├── engines/
│   │   ├── inference.py     # LLM/VLM inference (mlx-lm)
│   │   ├── training.py      # LoRA/QLoRA fine-tuning
│   │   ├── embeddings.py    # Text embeddings
│   │   ├── audio.py         # Whisper STT + TTS
│   │   └── image_gen.py     # Stable Diffusion / FLUX
│   └── com.robotflow.mlx-daemon.plist  # macOS auto-start
├── gateway/                 # Docker container gateway
│   ├── Dockerfile
│   ├── server.py            # Unified reverse proxy
│   └── requirements.txt
├── cli/
│   └── mlx_cpp.py           # CLI tool (mlx-cpp)
├── models/
│   └── presets.yaml         # Curated model presets
├── examples/
│   ├── python-client/       # Python OpenAI SDK example
│   └── curl-test.sh         # Quick smoke test
├── scripts/
│   ├── setup.sh             # First-time setup
│   └── benchmark.sh         # Performance benchmark
├── sdk/                     # Client SDKs (coming)
│   └── python/docker_mlx/
├── tests/
├── docker-compose.yml
├── pyproject.toml
└── README.md
```
Metal GPU cannot be passed into Docker containers on macOS (confirmed by Docker, Apple, and Red Hat). The VM boundary blocks it.
docker_mlx_cpp uses the same pattern as NVIDIA's container toolkit, adapted for Mac:
- MLX Daemon runs natively on macOS with direct Metal GPU access
- Gateway container routes HTTP requests from Docker network to the daemon
- Any container calls the gateway using standard OpenAI API — no special runtime needed
MLX is 20-30% faster than llama.cpp on Apple Silicon and supports the full ML stack: inference, training, image generation, audio, embeddings, and custom Metal kernels.
## System Requirements

| Requirement | Minimum |
|---|---|
| macOS | 14.0+ (Sonoma) on Apple Silicon |
| Chip | M1 / M2 / M3 / M4 / M5 (any variant) |
| RAM | 8 GB (16 GB+ recommended for larger models) |
| Docker | Docker Desktop 4.62+ |
| Python | 3.11+ |
| Intel Mac | Not supported (no Metal GPU) |
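The table above can be checked mechanically on the host. A small self-contained sketch (the thresholds come from the table; the parameters exist only so the checks are testable):

```python
"""Preflight check for the host requirements listed above."""
import platform
import sys


def preflight(machine: str = platform.machine(),
              system: str = platform.system(),
              py: tuple = tuple(sys.version_info[:2])) -> list[str]:
    """Return a list of problems; an empty list means the host looks OK."""
    problems = []
    if system != "Darwin":
        problems.append("not macOS: the MLX daemon needs a Mac host")
    if machine != "arm64":
        problems.append("not Apple Silicon: Intel Macs have no Metal GPU path")
    if py < (3, 11):
        problems.append(f"Python {py[0]}.{py[1]} is older than 3.11")
    return problems


if __name__ == "__main__":
    issues = preflight()
    print("OK" if not issues else "\n".join(issues))
```

Docker Desktop and macOS versions still need a manual check; this only covers what Python can see.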
## Performance Benchmarks

Tested on Apple M5 (24 GB), MLX 0.31.1:
| Operation | Result | Notes |
|---|---|---|
| Matmul 1024x1024 | ~95 TFLOPS | Raw GPU compute |
| Flash Attention (b=2, h=4, s=128) | 1.6 ms | scaled_dot_product_attention |
| Conv2d (3→32, 32x32) | 0.4 ms | Neural network layer |
| FFT2 128x128 | 0.5 ms | Signal processing |
| Sort 100K elements | 1.2 ms | GPU-accelerated sort |
| Softmax 1024x1024 | 1.8 ms | Activation function |
| LayerNorm (8, 64, 256) | 0.9 ms | Normalization |
| RMSNorm (8, 64, 256) | 0.4 ms | LLM normalization |
All 107 operations pass on the Metal GPU. See `tests/test_compute.py` for the full suite.
## GPU Compute (100+ ops)

Any container can run these operations on the Metal GPU:

```bash
# From any container on mlx-network:
curl -X POST http://mlx-gateway:8080/compute/eval \
  -H "Content-Type: application/json" \
  -d '{"op": "matmul", "args": {"a": {"shape": [1024, 1024]}, "b": {"shape": [1024, 1024]}}}'
```

15 categories, 107 operations: Arithmetic (16) | Linear Algebra (12) | Reductions (12) | Transforms (13) | Activations (13) | Convolutions (2) | Pooling (4) | Attention (1) | Normalization (5) | Random (6) | FFT (6) | Sorting (4) | Comparison (10) | Metal Memory (4) | Benchmarks (3)

List all ops: `GET /compute/ops`
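The same call from Python, as a stdlib-only sketch that mirrors the curl payload above (`matmul_1024` performs the network call, so run it from a container on `mlx-network`):

```python
"""POST a GPU op to /compute/eval, mirroring the curl example above."""
import json
import urllib.request

GATEWAY = "http://mlx-gateway:8080"


def compute_request(op: str, args: dict) -> urllib.request.Request:
    """Same JSON shape as the curl example: {"op": ..., "args": ...}."""
    body = json.dumps({"op": op, "args": args}).encode()
    return urllib.request.Request(
        f"{GATEWAY}/compute/eval",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def matmul_1024() -> dict:
    """Run the 1024x1024 matmul from the example and return the raw response."""
    args = {"a": {"shape": [1024, 1024]}, "b": {"shape": [1024, 1024]}}
    with urllib.request.urlopen(compute_request("matmul", args)) as resp:
        return json.load(resp)
```

Each HTTP round-trip costs ~1-5 ms, so prefer fewer, larger ops over many small ones.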
## Scaffold a GPU Project

```bash
mlx-cpp docker init my-gpu-app
cd my-gpu-app
docker compose up
```

Creates a ready-to-run Dockerfile + compose file + Python app that uses the Metal GPU from inside Docker.
## Examples

| Example | Language | What it does |
|---|---|---|
| `examples/python-client/` | Python | OpenAI SDK → Metal GPU inference |
| `examples/node-client/` | Node.js | fetch() → Metal GPU matmul + chat |
| `examples/streaming/` | Python | SSE token streaming from a container |
| `examples/gpu-test/` | Python | Tests ALL 107 GPU operations |

```bash
docker compose --profile examples up example-app   # Python
docker compose --profile gpu-test up gpu-test      # Full GPU test
```

## Limitations

- No direct Metal inside containers. Metal GPU requires macOS host access, so containers call the daemon over HTTP; this adds ~1-5 ms per call.
- SVD, QR, Cholesky, and Inverse run on the CPU (an MLX linalg constraint). All other ops run on the GPU.
- The `hard_swish` activation is not available in MLX 0.31.1.
- Single-GPU serialization. Concurrent requests from multiple containers are serialized, not run in parallel.
- No Intel Mac support. Metal GPU acceleration requires Apple Silicon.
## Troubleshooting

**`mlx-cpp serve` fails with "No module named mlx"**

```bash
pip install mlx   # Requires an Apple Silicon Mac
```

**Gateway can't reach daemon: "502 MLX Daemon unreachable"**

```bash
# Ensure the daemon is running on the host
mlx-cpp serve   # Must run on the macOS host, not in Docker
```

**"Connection refused" from a container**

```bash
# Ensure your container is on mlx-network
docker network ls | grep mlx

# Ensure the gateway is healthy
curl http://localhost:8080/health
```

**Model download hangs**

```bash
# Check HuggingFace access
python -c "from huggingface_hub import HfApi; print(HfApi().whoami())"
```

**Out of memory with large models**

```bash
# Use a smaller preset
mlx-cpp run chat-small "hello"   # 360M params, fits in 8 GB

# Or clear the cache
curl -X POST http://localhost:12435/compute/clear-cache
```

## FAQ

**Q: Can I import mlx directly inside a container?**
No. MLX requires Metal (macOS), and containers run Linux. Use the HTTP API or the docker_mlx SDK instead.

**Q: What's the overhead vs native MLX?**

~1-5 ms per HTTP round-trip. For inference (100 ms+), that's negligible; for tight loops, batch operations.

**Q: Can multiple containers share the GPU?**

Yes. Requests are serialized through the daemon: no crashes, but execution is sequential, not parallel.

**Q: Does this work on an Intel Mac?**

No. Metal GPU acceleration requires Apple Silicon (M1/M2/M3/M4/M5).

**Q: Is this production-ready?**

It's v0.1.0 — great for development, prototyping, and local ML workflows. For production, add authentication to the gateway.

**Q: How is this different from Docker Model Runner?**

Docker Model Runner is inference-only and proprietary. docker_mlx_cpp is open source and supports inference, training, image generation, audio, embeddings, and raw GPU compute.

**Q: What models are supported?**

Any model the MLX ecosystem supports: 50+ LLM architectures, VLMs, Whisper, Kokoro TTS, FLUX, and the 1000+ models on HuggingFace mlx-community.
## Contributing

See CONTRIBUTING.md for how to get started. We welcome:
- New GPU operations
- Language examples (Rust, Go, Java, etc.)
- Documentation improvements
- Bug reports and feature requests
## License

MIT — See LICENSE
Built by RobotFlow Labs — Making Mac GPUs work like NVIDIA for Docker.
