
docker_mlx_cpp — Metal GPU for Docker

The NVIDIA Container Toolkit — for Mac

Give any Docker container full Apple Silicon Metal GPU access.
100+ GPU operations • LLM inference • Training • Image gen • Audio • Embeddings
Zero CUDA. Zero NVIDIA. Just Metal.


One-Line Install

curl -fsSL https://raw.githubusercontent.com/RobotFlow-Labs/docker_mlx_cpp/main/install.sh | bash

This installs everything: MLX, all GPU engines, the daemon, the Docker gateway. One command.

┌──────────────────────────────────────────────────────┐
│  ANY Docker Container                                 │
│  (Python, Node, Rust, Go, curl — anything)           │
│  Uses: OpenAI SDK, docker_mlx SDK, or raw HTTP       │
└────────────────────────┬─────────────────────────────┘
                         │ HTTP :8080
┌────────────────────────▼─────────────────────────────┐
│  mlx-gateway (container)                              │
│  ├── /v1/*           → inference (LLM, VLM)          │
│  ├── /v1/embeddings  → embedding generation          │
│  ├── /v1/audio/*     → Whisper STT + TTS             │
│  ├── /v1/images/*    → Stable Diffusion / FLUX       │
│  ├── /train/*        → LoRA / QLoRA fine-tuning      │
│  └── /models/*       → model management              │
└────────────────────────┬─────────────────────────────┘
                         │ host.docker.internal:12435
┌────────────────────────▼─────────────────────────────┐
│  MLX Daemon (host-side, native macOS)                 │
│  ├── Inference:   mlx-lm (50+ architectures)         │
│  ├── Vision:      mlx-vlm (images + video)           │
│  ├── Training:    LoRA, QLoRA, DPO                   │
│  ├── Image Gen:   Stable Diffusion, SDXL, FLUX       │
│  ├── Audio:       Whisper STT + TTS                  │
│  ├── Embeddings:  Jina v5, BGE, mlx-embeddings       │
│  └── Models:      pull, cache, convert, presets      │
└────────────────────────┬─────────────────────────────┘
                         │ Metal API
┌────────────────────────▼─────────────────────────────┐
│  Apple Silicon M1/M2/M3/M4/M5 — Metal GPU           │
│  Unified Memory • 20-30% faster than llama.cpp       │
└──────────────────────────────────────────────────────┘

Why

On Linux, the NVIDIA Container Toolkit gives containers --gpus all and full CUDA access. On Mac, nothing equivalent exists — the Metal GPU can't be passed into Docker's Linux VM.

docker_mlx_cpp solves this by running a host-side MLX daemon that exposes the full Apple Silicon GPU stack to any container through standard APIs. Your containers speak OpenAI API. Your Mac does the Metal compute.

| Capability | NVIDIA Container Toolkit | docker_mlx_cpp |
|---|---|---|
| GPU from containers | --gpus all (CUDA) | http://mlx-gateway:8080 (Metal) |
| LLM inference | vLLM, TGI, Triton | mlx-lm (50+ architectures) |
| Training | PyTorch + NCCL | LoRA, QLoRA, DPO via MLX |
| Image generation | Stable Diffusion (CUDA) | SD, SDXL, FLUX (Metal) |
| Audio | Whisper (CUDA) | Whisper + TTS (Metal) |
| Embeddings | sentence-transformers | mlx-embeddings, Jina v5 |
| Model format | Framework-specific | MLX Safetensors (HuggingFace) |
| Setup | nvidia-container-toolkit | pip install docker-mlx-cpp |

Quick Start

# 1. Install (from a clone of this repo; or use the one-line installer above)
pip install -e ".[all]"

# 2. Start the MLX Daemon (host-side GPU service)
mlx-cpp serve

# 3. Start the gateway (in Docker)
docker compose up -d mlx-gateway

# 4. Pull a model
mlx-cpp models pull mlx-community/SmolLM2-360M-Instruct-4bit

# 5. Use from ANY container
docker run --rm --network mlx-network curlimages/curl:8.5.0 \
  curl -s http://mlx-gateway:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mlx-community/SmolLM2-360M-Instruct-4bit","messages":[{"role":"user","content":"Hello from Docker!"}]}'

CLI

mlx-cpp serve                    # Start GPU daemon
mlx-cpp run <model> "prompt"     # Quick inference
mlx-cpp models list              # Show cached models
mlx-cpp models pull <model>      # Pull from HuggingFace
mlx-cpp health                   # Check daemon + GPU status
mlx-cpp gpu                      # Show GPU info
mlx-cpp benchmark <model>        # Performance benchmark
mlx-cpp train lora --model ...   # LoRA fine-tuning

Model Presets

Use human-readable presets instead of full model IDs:

mlx-cpp run chat-small "hello"       # SmolLM2-360M (8GB Mac)
mlx-cpp run chat-default "hello"     # Llama-3.2-3B (8GB+)
mlx-cpp run code "write a function"  # Qwen2.5-Coder-7B (16GB+)
mlx-cpp run vision "describe image"  # Qwen2-VL-7B (16GB+)

See all presets: models/presets.yaml

Use in Your docker-compose.yml

services:
  your-app:
    image: your-app
    environment:
      - OPENAI_BASE_URL=http://mlx-gateway:8080/v1
      - OPENAI_API_KEY=not-needed
      - OPENAI_MODEL=mlx-community/Llama-3.2-3B-Instruct-4bit
    networks:
      - mlx-network

networks:
  mlx-network:
    external: true

Zero code changes needed — any app using the OpenAI SDK works out of the box.

API Endpoints

Inference (OpenAI-compatible)

| Endpoint | Description |
|---|---|
| POST /v1/chat/completions | Chat inference (LLM/VLM) |
| POST /v1/completions | Text completions |
| POST /v1/embeddings | Text embeddings |
| GET /v1/models | List available models |
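
A stdlib-only sketch of the embeddings route; the model ID is illustrative — use whichever embedding model the daemon has cached:

```python
import json
import urllib.request

def embeddings_request(model: str, texts: list[str]) -> urllib.request.Request:
    # Build an OpenAI-style /v1/embeddings request.
    body = json.dumps({"model": model, "input": texts}).encode()
    return urllib.request.Request(
        "http://mlx-gateway:8080/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# With the gateway running ("some-embedding-model" is an illustrative ID):
# data = json.load(urllib.request.urlopen(embeddings_request("some-embedding-model", ["hello"])))
# vector = data["data"][0]["embedding"]
```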

Audio (OpenAI-compatible)

| Endpoint | Description |
|---|---|
| POST /v1/audio/transcriptions | Whisper speech-to-text |
| POST /v1/audio/speech | Text-to-speech |

Image Generation (OpenAI-compatible)

| Endpoint | Description |
|---|---|
| POST /v1/images/generations | Stable Diffusion / FLUX |

Training (custom)

| Endpoint | Description |
|---|---|
| POST /train/lora | Start LoRA/QLoRA fine-tuning |
| GET /train/jobs | List training jobs |
| GET /train/jobs/{id} | Get job status |
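
A hypothetical sketch of kicking off a LoRA job over HTTP — the request field names are assumptions, so check the daemon's actual /train/lora schema:

```python
import json
import urllib.request

GATEWAY = "http://mlx-gateway:8080"

def lora_request(model: str, dataset: str) -> urllib.request.Request:
    # "model" and "dataset" are assumed field names, not a confirmed schema.
    body = json.dumps({"model": model, "dataset": dataset}).encode()
    return urllib.request.Request(
        f"{GATEWAY}/train/lora",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# With the gateway running:
# job = json.load(urllib.request.urlopen(
#     lora_request("mlx-community/Llama-3.2-3B-Instruct-4bit", "/data/train.jsonl")))
# ...then poll GET /train/jobs/{id} until the job finishes.
```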

Management

| Endpoint | Description |
|---|---|
| POST /models/pull | Pull model from HuggingFace |
| POST /models/delete | Remove cached model |
| GET /health | Gateway + daemon + GPU status |
| GET /metrics | Prometheus metrics |
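
A sketch of the management routes from a container — the "model" field name is an assumption (the CLI equivalent is mlx-cpp models pull &lt;model&gt;):

```python
import json
import urllib.request

def pull_model_request(model_id: str) -> urllib.request.Request:
    # POST /models/pull; the "model" field name is assumed, not confirmed.
    return urllib.request.Request(
        "http://mlx-gateway:8080/models/pull",
        data=json.dumps({"model": model_id}).encode(),
        headers={"Content-Type": "application/json"},
    )

# With the gateway running:
# health = json.load(urllib.request.urlopen("http://mlx-gateway:8080/health"))
```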

Project Structure

docker_mlx_cpp/
├── daemon/                      # Host-side MLX daemon
│   ├── mlx_daemon.py           # FastAPI main app (port 12435)
│   ├── model_manager.py        # Model pull/cache/convert
│   ├── engines/
│   │   ├── inference.py        # LLM/VLM inference (mlx-lm)
│   │   ├── training.py         # LoRA/QLoRA fine-tuning
│   │   ├── embeddings.py       # Text embeddings
│   │   ├── audio.py            # Whisper STT + TTS
│   │   └── image_gen.py        # Stable Diffusion / FLUX
│   └── com.robotflow.mlx-daemon.plist  # macOS auto-start
├── gateway/                     # Docker container gateway
│   ├── Dockerfile
│   ├── server.py               # Unified reverse proxy
│   └── requirements.txt
├── cli/
│   └── mlx_cpp.py              # CLI tool (mlx-cpp)
├── models/
│   └── presets.yaml             # Curated model presets
├── examples/
│   ├── python-client/           # Python OpenAI SDK example
│   └── curl-test.sh            # Quick smoke test
├── scripts/
│   ├── setup.sh                # First-time setup
│   └── benchmark.sh            # Performance benchmark
├── sdk/                         # Client SDKs (coming)
│   └── python/docker_mlx/
├── tests/
├── docker-compose.yml
├── pyproject.toml
└── README.md

How It Works

Metal GPU cannot be passed into Docker containers on macOS (confirmed by Docker, Apple, and Red Hat). The VM boundary blocks it.

docker_mlx_cpp uses the same pattern as NVIDIA's container toolkit, adapted for Mac:

  1. MLX Daemon runs natively on macOS with direct Metal GPU access
  2. Gateway container routes HTTP requests from Docker network to the daemon
  3. Any container calls the gateway using standard OpenAI API — no special runtime needed

MLX is 20-30% faster than llama.cpp on Apple Silicon and supports the full ML stack: inference, training, image generation, audio, embeddings, and custom Metal kernels.

System Requirements

| Requirement | Minimum |
|---|---|
| macOS | 14.0+ (Sonoma) on Apple Silicon |
| Chip | M1 / M2 / M3 / M4 / M5 (any variant) |
| RAM | 8 GB (16 GB+ recommended for larger models) |
| Docker | Docker Desktop 4.62+ |
| Python | 3.11+ |
| Intel Mac | Not supported (no Metal GPU) |

Performance Benchmarks

Tested on Apple M5 (24 GB), MLX 0.31.1:

| Operation | Result | Notes |
|---|---|---|
| Matmul 1024x1024 | ~95 TFLOPS | Raw GPU compute |
| Flash Attention (b=2, h=4, s=128) | 1.6 ms | scaled_dot_product_attention |
| Conv2d (3→32, 32x32) | 0.4 ms | Neural network layer |
| FFT2 128x128 | 0.5 ms | Signal processing |
| Sort 100K elements | 1.2 ms | GPU-accelerated sort |
| Softmax 1024x1024 | 1.8 ms | Activation function |
| LayerNorm (8, 64, 256) | 0.9 ms | Normalization |
| RMSNorm (8, 64, 256) | 0.4 ms | LLM normalization |

All 107 operations pass on Metal GPU. See tests/test_compute.py for the full suite.

GPU Compute (100+ ops)

Any container can run these operations on the Metal GPU:

# From any container on mlx-network:
curl -X POST http://mlx-gateway:8080/compute/eval \
  -H "Content-Type: application/json" \
  -d '{"op": "matmul", "args": {"a": {"shape": [1024, 1024]}, "b": {"shape": [1024, 1024]}}}'

15 categories, 107 operations: Arithmetic (16) | Linear Algebra (12) | Reductions (12) | Transforms (13) | Activations (13) | Convolutions (2) | Pooling (4) | Attention (1) | Normalization (5) | Random (6) | FFT (6) | Sorting (4) | Comparison (10) | Metal Memory (4) | Benchmarks (3)

List all ops: GET /compute/ops
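
The curl example above can be mirrored from Python with the stdlib alone — same payload shape, same endpoint:

```python
import json
import urllib.request

def compute_request(op: str, args: dict) -> urllib.request.Request:
    # Build a /compute/eval request matching the curl example above.
    body = json.dumps({"op": op, "args": args}).encode()
    return urllib.request.Request(
        "http://mlx-gateway:8080/compute/eval",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = compute_request(
    "matmul",
    {"a": {"shape": [1024, 1024]}, "b": {"shape": [1024, 1024]}},
)
# With the gateway running:
# result = json.load(urllib.request.urlopen(req))
```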

Scaffold a GPU Project

mlx-cpp docker init my-gpu-app
cd my-gpu-app
docker compose up

Creates a ready-to-run Dockerfile + compose + Python app that uses Metal GPU from inside Docker.

Examples

| Example | Language | What it does |
|---|---|---|
| examples/python-client/ | Python | OpenAI SDK → Metal GPU inference |
| examples/node-client/ | Node.js | fetch() → Metal GPU matmul + chat |
| examples/streaming/ | Python | SSE token streaming from container |
| examples/gpu-test/ | Python | Tests all 107 GPU operations |

docker compose --profile examples up example-app    # Python
docker compose --profile gpu-test up gpu-test       # Full GPU test

Limitations

  • No direct Metal inside containers. Metal GPU requires macOS host access. Containers call the daemon over HTTP. This adds ~1-5ms per call.
  • SVD, QR, Cholesky, Inverse run on CPU (MLX linalg constraint). All other ops run on GPU.
  • hard_swish activation not available in MLX 0.31.1.
  • Single GPU serialization. Concurrent requests from multiple containers are serialized (not parallel).
  • No Intel Mac support. Requires Apple Silicon for Metal GPU.

Troubleshooting

mlx-cpp serve fails with "No module named mlx"

pip install mlx  # Requires Apple Silicon Mac

Gateway can't reach daemon: "502 MLX Daemon unreachable"

# Ensure daemon is running on host
mlx-cpp serve  # Must run on macOS host, not in Docker

"Connection refused" from container

# Ensure your container is on mlx-network
docker network ls | grep mlx
# Ensure gateway is healthy
curl http://localhost:8080/health

Model download hangs

# Check HuggingFace access
python -c "from huggingface_hub import HfApi; print(HfApi().whoami())"

Out of memory with large models

# Use a smaller preset
mlx-cpp run chat-small "hello"  # 360M params, fits in 8GB
# Or clear cache
curl -X POST http://localhost:12435/compute/clear-cache

FAQ

Q: Can I import mlx directly inside a container? No. MLX requires Metal (macOS). Containers run Linux. Use the HTTP API or the docker_mlx SDK instead.

Q: What's the overhead vs native MLX? ~1-5ms per HTTP round-trip. For inference (100ms+), negligible. For tight loops, batch operations.

Q: Can multiple containers share the GPU? Yes. Requests are serialized through the daemon. No crashes, but sequential not parallel.

Q: Does this work on Intel Mac? No. Metal GPU acceleration requires Apple Silicon (M1/M2/M3/M4/M5).

Q: Is this production-ready? It's v0.1.0 — great for development, prototyping, and local ML workflows. For production, add authentication to the gateway.

Q: How is this different from Docker Model Runner? Docker Model Runner is inference-only and proprietary. docker_mlx_cpp is open source and supports inference + training + image gen + audio + embeddings + raw GPU compute.

Q: What models are supported? Any model the MLX ecosystem supports: 50+ LLM architectures, VLMs, Whisper, Kokoro TTS, FLUX, and all 1000+ models on HuggingFace mlx-community.

Contributing

See CONTRIBUTING.md for how to get started. We welcome:

  • New GPU operations
  • Language examples (Rust, Go, Java, etc.)
  • Documentation improvements
  • Bug reports and feature requests

License

MIT — See LICENSE


Built by RobotFlow Labs — Making Mac GPUs work like NVIDIA for Docker.
