MLX-Flash

Run AI models too large for your Mac's memory — at near-full speed.

Your MacBook has 32-48GB of RAM, but the best AI models need 100-200GB+. MLX-Flash makes them run anyway by intelligently caching the most-needed parts in RAM and streaming the rest from your SSD — so you don't have to choose between quality and what fits in memory.

How It Works (Simple Version)

Think of it like Netflix streaming: instead of downloading the entire movie before watching, you buffer what you need and stream the rest. MLX-Flash does this for AI model weights:

flowchart TB
    subgraph RAM["Your Mac's RAM (fast)"]
        HC[Hot Cache — 85%+ of active experts]
        MP[Mixed Precision — hot 4-bit, cold 2-bit]
        KV[KV Cache — optional 8-bit quantization]
    end
    subgraph CACHE["Smart Cache Layer"]
        LCP[LCP Eviction — layer-depth biased]
        PF[Speculative Prefetch — 97% accuracy]
        MM[Memory Monitor — never harms your apps]
        SPEC[Speculative Execution — predict → execute → verify]
    end
    subgraph SSD["Your Mac's SSD (big)"]
        FULL[Full model weights — even 200GB+]
        ENT[Entropy-coded storage — 65% smaller]
    end

    SSD -->|stream on demand| CACHE
    CACHE -->|cache hit: 0.08ms| RAM
    CACHE -->|cache miss: 0.6ms| SSD
    RAM -->|feed to GPU| GPU[MLX GPU Inference]

Result: A 200GB AI model runs on your 48GB Mac 2-3x faster than with naive SSD streaming.
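To see why the hit rate matters so much, the sketch below computes an expected per-expert fetch latency from the two figures in the diagram (0.08 ms on a cache hit, 0.6 ms on a miss). The function name `expected_fetch_ms` and the simple weighted-average model are assumptions made for illustration; they are not part of MLX-Flash.

```python
def expected_fetch_ms(hit_rate: float, hit_ms: float = 0.08, miss_ms: float = 0.6) -> float:
    """Weighted average of hit and miss latency for a given cache hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


# Hit rates chosen to mirror the cold, warming, and warm stages described in this README.
for rate in (0.0, 0.62, 0.85, 0.97):
    print(f"hit rate {rate:.0%}: {expected_fetch_ms(rate):.3f} ms per expert fetch")
```

Under this simplified model, every point of hit rate gained removes a share of the slower SSD path, which is why the rest of the design concentrates on prediction and eviction.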

Quick Start

# Install from PyPI
pip install mlx-flash

# Or Homebrew (includes Rust sidecar)
brew tap szibis/mlx-flash && brew install mlx-flash

# Or from source
git clone https://github.com/szibis/MLX-Flash.git
cd MLX-Flash && pip install -e ".[all]"

# Interactive chat (simplest way to use it)
mlx-flash-chat

# Start the API server (works with LM Studio, Cursor, Claude Code, Codex, OpenAI SDK)
mlx-flash --port 8080

# With KV cache quantization (45% less KV memory)
mlx-flash --port 8080 --kv-bits 8

# See what models fit your hardware
mlx-flash-browse

Performance

Measured Results

Technique	Improvement	How It Works
LCP Smart Cache	2.80x speedup	Keeps frequently used model parts in RAM, predicts what's needed next
+ Async Prefetch	2.93x speedup	Loads the next part from SSD while the GPU computes the current part
Mixed Precision	1.80x size reduction	Rarely used parts stored at lower precision (saves space, barely affects output)
Skip Fallback	2.67x speedup	When something isn't cached, gracefully skip it instead of waiting
Speculative Execution	14-42% lower TPOT	Execute predicted experts before the router confirms, verify after
Adaptive Top-K	10-30% less compute	Skip low-confidence secondary experts automatically
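The async prefetch row can be illustrated with a small, self-contained sketch: while one layer is being computed, a background thread already reads the next layer. `load_from_ssd` and `compute` are hypothetical stand-ins that only sleep; they are not MLX-Flash functions, and the timings are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_from_ssd(layer: int) -> str:
    """Hypothetical stand-in for reading one layer's experts from SSD."""
    time.sleep(0.02)
    return f"weights[{layer}]"


def compute(layer: int, weights: str) -> str:
    """Hypothetical stand-in for the GPU computing one layer."""
    time.sleep(0.02)
    return f"out[{layer}] <- {weights}"


def run_sequential(num_layers: int) -> list[str]:
    # Load, then compute, one layer at a time: SSD and GPU never overlap.
    return [compute(i, load_from_ssd(i)) for i in range(num_layers)]


def run_prefetched(num_layers: int) -> list[str]:
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_from_ssd, 0)
        for i in range(num_layers):
            weights = pending.result()  # block until this layer's weights arrive
            if i + 1 < num_layers:
                # Kick off the next read before computing the current layer.
                pending = pool.submit(load_from_ssd, i + 1)
            outputs.append(compute(i, weights))
    return outputs


def timed(fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


seq_out, seq_s = timed(run_sequential, 8)
pre_out, pre_s = timed(run_prefetched, 8)
print(f"sequential: {seq_s:.2f}s, prefetched: {pre_s:.2f}s")
```

Both variants produce the same outputs; the prefetched loop hides most of the read time behind compute, so only the first read is paid in full.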

Real Hardware Numbers (Measured on M3 Max 36GB)

Memory pressure recovery (the key result):

Model at 0.9x RAM (barely fits):
  Without optimization:    43.5 tok/s  ########
  With mixed precision:   104.5 tok/s  ####################  2.4x faster

The memory pressure cliff is razor-sharp: going 10% over the limit causes a 59% slowdown. Our 20% footprint reduction shifts the model back to full speed.

Cache warm-up (ISP-like progressive acceleration):

Token  0:  83.3ms (cold start, loading experts from SSD)
Token  8:   5.7ms (warming up, 62% cache hit)
Token 24:   0.5ms (full speed, 85%+ cache hit)
         -> 41x speedup from warm-up

Topic switching:

coding -> writing:  62ms first token (re-warming)  -> 8 tokens to recover
writing -> coding:  0.6ms first token (still cached!) -> fast immediately

Expert Streaming Performance

Expert streaming replaces MLX's QuantizedSwitchLinear with a GPU lookup table + pre-stacked tensors. The capacity_per_layer parameter controls how many experts stay in GPU memory:

Model	Total Experts	Capacity	Coverage	Throughput	Notes
Qwen3-30B-A3B	128 per layer	128 (100%)	100%	~35 tok/s	Full speed, no streaming needed
Qwen3-30B-A3B	128 per layer	64 (50%)	85%+ hit rate	~15 tok/s	After warm-up with LCP
Mixtral-8x7B	8 per layer	8 (100%)	100%	~20 tok/s	All experts fit
Mixtral-8x7B	8 per layer	4 (50%)	~95% hit rate	~12 tok/s	Most active cached

Tuning tips:

Start with capacity_per_layer = total_experts if RAM allows (no streaming overhead)
Use the --task coding warm-up profile for programming tasks (pre-loads code-relevant experts)
Enable skip-fallback with an adaptive threshold to skip low-confidence secondary experts
After ~25 tokens, LCP has learned your workload and the hit rate climbs to 85-95%
Run optimize_wired_memory_limit() before loading the model to avoid the Metal memory pressure cliff

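from mlx_lm import load
from mlx_flash_compress.expert_streaming import (
    enable_expert_streaming, enable_skip_fallback, get_warmup_experts
)

# Load the model with mlx-lm; the model ID here is an example
model, tokenizer = load("mlx-community/Qwen3-30B-A3B-4bit")

# Enable streaming at 50% capacity (64 of 128 experts per layer)
streaming = enable_expert_streaming(model, capacity_per_layer=64)
# Allow uncached low-confidence experts to be skipped instead of waited for
enable_skip_fallback(model, streaming.caches, adaptive_skip_threshold=3.0)
# Warm up the expert cache
streaming.warmup()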
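The idea of skipping low-confidence secondary experts can be sketched in plain Python. This is an illustration only: the ratio rule, the `max_ratio` parameter, and the function names are assumptions, and they do not describe how `adaptive_skip_threshold` is interpreted inside MLX-Flash.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_top_k(router_logits: list[float], k: int = 2, max_ratio: float = 3.0):
    """Pick up to k experts, dropping secondary ones the router barely wants.

    Assumed rule: a secondary expert is kept only if the top expert's weight
    is at most max_ratio times larger than its own weight.
    """
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    top = probs[ranked[0]]
    kept = [i for i in ranked if top / probs[i] <= max_ratio]
    # Renormalize so the kept weights still sum to 1.
    total = sum(probs[i] for i in kept)
    return [(i, probs[i] / total) for i in kept]


print(adaptive_top_k([4.0, 1.0, 0.0, -1.0]))  # confident router: one expert survives
print(adaptive_top_k([2.0, 1.5, 0.0, 0.0]))   # close call: both top experts are kept
```

When the router is confident, the second expert contributes little to the output, so skipping it saves a weight load and a matrix multiply at a small cost in accuracy.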

Find Your Optimal Configuration

The Tier Optimizer tells you exactly how to allocate your Mac's memory:

# For a 200GB model on a 48GB Mac
python -m mlx_flash_compress.tier_optimizer --total-ram 48 --model-gb 209

# Output: "Best: 41.5GB RAM cache, 82% of requests served from RAM → 6.4 tok/s"

It shows you the sweet spot — even dedicating just 10GB to caching gives you 54% of requests served instantly from RAM.

What's Inside

Architecture

flowchart TB
    subgraph Prediction["Expert Prediction (97%+ accuracy)"]
        RP[Residual-Stream Predictor<br/>Linear projection of hidden state]
        SM[Shadow MLP Predictor<br/>Online-trained routing MLP]
        CL[Cross-Layer Prefetch<br/>3-hop transitive co-occurrence]
    end
    subgraph CacheLayer["Smart Cache Layer"]
        LCP[LCP Eviction<br/>Layer-depth biased]
        FLE[Forward-Looking Eviction<br/>Belady-optimal approximation]
        VS[Vertical Split<br/>2x coverage in same RAM]
        EM[Expert Merging<br/>Cosine similarity clustering]
    end
    subgraph Execution["Inference Engine"]
        ES[Expert Streaming<br/>GPU lookup + pre-stacked tensors]
        SE[Speculative Execution<br/>Predict → Execute → Verify]
        SF[Skip Fallback<br/>Adaptive top-k]
        MP[Mixed Precision<br/>Hot 4-bit / Cold 2-bit]
    end
    subgraph Storage["Compressed Storage"]
        EC[Entropy Coding<br/>Huffman for uint4]
        ST[Safetensors mmap<br/>Zero-copy SSD reads]
    end

    Prediction --> CacheLayer
    CacheLayer --> Execution
    Storage --> CacheLayer

Core Modules (35 Python files)

Module	What It Does
Expert Streaming
`expert_streaming.py`	GPU lookup table + pre-stacked weights, skip-fallback, adaptive top-k, Mixtral/Qwen support
`speculative_experts.py`	Residual-stream predictor (97%+), Belady-optimal eviction, speculative execution
`advanced_prefetch.py`	Cross-layer N-hop predictor + shadow MLP for >90% prefetch accuracy
Cache Management
`lcp_cache.py`	Smart cache with layer-depth biased LCP eviction + `mx.clear_cache()`
`smart_eviction.py`	SpecMD-inspired least-stale eviction + routing predictor
`vertical_split.py`	Cache partial expert rows for 2x coverage in same RAM (MoEpic)
`expert_merging.py`	Offline expert clustering — merge similar experts for 15-30% fewer params
Compression
`entropy_coding.py`	Huffman coding for uint4 weights — 65% smaller at near-zero quality loss
`mixed_precision.py`	Hot experts at 4-bit, cold at 2-bit — 1.8x smaller, barely noticeable
`compression.py`	LZ4/ZSTD compression + Apple's native LZFSE
Memory & Hardware
`memory_manager.py`	Real-time pressure monitoring, wired memory limit, auto-release
`hardware.py`	Apple Silicon detection (M1-M5), RAM, GPU cores
`tier_optimizer.py`	Finds the perfect RAM/SSD balance for your Mac + model combo
`ssd_protection.py`	Thermal cutoff, sequential hints, zero writes
Inference & Serving
`serve.py`	OpenAI-compatible server with KV cache quantization, memory-aware hints
`chat.py`	Colorful chat CLI with web search, memory, model switching
`web_search.py`	DuckDuckGo search + persistent memory store (Perplexity-style)
`hf_calculator.py`	Model size/memory estimator for any MoE or dense model
`task_profiler.py`	Per-task expert profiles (coding/writing/math/chat) for fast warmup
Distributed
`distributed_experts.py`	Multi-Mac expert parallelism over Thunderbolt 5 RDMA
`kv_cache_sharing.py`	PT-MoE KV-cache sharing between blocks (37.5% memory savings)
`cached_inference.py`	Expert routing capture + cache simulation
`rust_bridge.py`	Python ↔ Rust Unix socket bridge
Rust Sidecar
`mlx-flash-server/`	axum HTTP/SSE proxy, mach2 memory (0.1ms), DashMap LCP, Unix socket

Client Integration

graph LR
    subgraph Clients
        LS[LM Studio]
        CU[Cursor]
        CC[Claude Code]
        SDK[OpenAI SDK]
        CD[continue.dev]
        OW[Open WebUI]
    end
    subgraph Rust["Rust Sidecar :8080"]
        AX[axum HTTP/SSE]
        MEM[Memory Monitor<br/>mach2 0.1ms]
        LCPC[LCP Cache<br/>DashMap lock-free]
    end
    subgraph Python["Python Worker :8081"]
        MLX[MLX Inference<br/>95% of work]
        GEN[generate&#40;&#41;]
    end

    Clients -->|OpenAI API| Rust
    Rust -->|proxy| Python
    Rust -.->|Unix socket| LCPC
    LCPC -.->|expert weights| Python

Using It

How	Command	Best For
Interactive chat	`mlx-flash-chat`	Chat with web search, memory, model switching
API server	`mlx-flash --port 8080`	LM Studio, Cursor, Claude Code, OpenAI SDK
API + KV quant	`mlx-flash --port 8080 --kv-bits 8`	45% less KV memory
Model calculator	`python -m mlx_flash_compress.hf_calculator`	Estimate size/memory for any model
Model browser	`mlx-flash-browse`	See what fits your hardware
Warm-up demo	`python -m mlx_flash_compress.demo_warmup`	Watch cache fill in real-time
Pressure test	`python -m mlx_flash_compress.bench_memory_pressure`	Measure memory impact

Chat commands: /models (browse catalog), /model N (switch live), /search (web search), /ask (search + answer), /remember (save facts), /memories (list), /status (memory info)

Integrations

All integrations start with running the server:

# Install
pip install mlx-flash

# Start the server
mlx-flash --port 8080 --preload

LM Studio

Start MLX-Flash: mlx-flash --port 8080 --preload
In LM Studio: Settings → Server → Add custom endpoint: http://localhost:8080/v1
Select model: local
Chat normally — LM Studio treats MLX-Flash as its backend

Cursor

Start MLX-Flash: mlx-flash --port 8080 --preload
In Cursor: Settings → Models → Add Model
- Provider: OpenAI Compatible
- API Base: http://localhost:8080/v1
- API Key: not-needed
- Model: local

Claude Code

# Terminal 1: Start server
mlx-flash --port 8080 --preload

# Terminal 2: Use with Claude Code
export OPENAI_API_BASE=http://localhost:8080/v1
export OPENAI_API_KEY=not-needed

Or add to ~/.claude/.mcp.json:

{
  "mlx-flash": {
    "command": "mlx-flash",
    "args": ["--model", "mlx-community/Qwen3-30B-A3B-4bit", "--port", "8080"]
  }
}

Codex CLI

# Start server
mlx-flash --port 8080 --preload

# Use with Codex
export OPENAI_API_BASE=http://localhost:8080/v1
export OPENAI_API_KEY=not-needed
codex "refactor this function"

Ollama (side-by-side)

# Ollama on default port (11434) for dense models
ollama serve

# MLX-Flash on 8080 for MoE models (better expert caching)
mlx-flash --port 8080 --preload

# Use Ollama for dense models, MLX-Flash for MoE models

continue.dev (VS Code / JetBrains)

Add to ~/.continue/config.json:

{
  "models": [{
    "title": "Local MoE (MLX)",
    "provider": "openai",
    "model": "local",
    "apiBase": "http://localhost:8080/v1",
    "apiKey": "not-needed"
  }]
}

Open WebUI

mlx-flash --port 8080 --preload
# In Open WebUI settings: Add connection → http://localhost:8080/v1

Python / OpenAI SDK

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

Aider (AI pair programming)

mlx-flash --port 8080 --preload
aider --openai-api-base http://localhost:8080/v1 --openai-api-key not-needed --model local

See docs/integrations.md for 18+ detailed integration guides with streaming examples, health checks, and memory monitoring.

Benchmark Suite

python -m mlx_flash_compress.bench_memory_pressure       # Memory pressure analysis (key demo)
python -m mlx_flash_compress.demo_warmup                   # ISP-like warm-up visualization
python -m mlx_flash_compress.cached_inference --multi-topic # Real routing capture
python -m mlx_flash_compress.bench --synthetic              # Quick test (no model needed)
python -m mlx_flash_compress.bench_real                     # Real Qwen MoE model test
python -m mlx_flash_compress.bench_final                    # Final comprehensive benchmark

Key Discoveries

1. Standard Compression Doesn't Work on AI Weights

We tested 6 different compression strategies on real AI model weights. Result: 1.0x compression (zero savings). At 4-bit quantization the packed data leaves general-purpose compressors nothing to exploit. Instead, we use entropy coding (Huffman), which exploits the non-uniform distribution of quantized values for 65% savings.
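The principle behind entropy coding can be shown on a toy example: when some 4-bit values occur far more often than others, a Huffman code spends fewer than 4 bits per value on average. The histogram below is synthetic and bell-shaped, which is an assumption for illustration; it is not measured from any model and does not reproduce the 65% figure.

```python
import heapq
import math


def huffman_code_lengths(freqs: dict[int, int]) -> dict[int, int]:
    """Return the Huffman code length in bits for each symbol."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    # Heap entries: (total count, unique tie-breaker, symbols in this subtree)
    heap = [(count, sym, [sym]) for sym, count in freqs.items()]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in freqs}
    tie = max(freqs) + 1
    while len(heap) > 1:
        c1, _, s1 = heapq.heappop(heap)
        c2, _, s2 = heapq.heappop(heap)
        # Every symbol under the merged node moves one level deeper.
        for sym in s1 + s2:
            lengths[sym] += 1
        heapq.heappush(heap, (c1 + c2, tie, s1 + s2))
        tie += 1
    return lengths


# Synthetic histogram over the 16 possible uint4 values (assumed shape).
counts = {
    v: max(1, round(10_000 * math.exp(-((v - 7.5) ** 2) / (2 * 1.5 ** 2))))
    for v in range(16)
}
total = sum(counts.values())
lengths = huffman_code_lengths(counts)
avg_bits = sum(counts[v] * lengths[v] for v in counts) / total
entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
print(f"fixed-width: 4.00 bits, Huffman: {avg_bits:.2f} bits, entropy bound: {entropy:.2f} bits")
```

The average Huffman length always lands between the entropy of the histogram and one bit above it, so the achievable saving depends entirely on how skewed the real weight distribution is.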

2. Smart Caching Is the #1 Win

Instead of trying to compress, we predict what's needed and pre-load it. Our prediction stack achieves 97%+ accuracy:

Residual-stream predictor (linear projection of hidden states)
Cross-layer 3-hop lookahead (transitive co-occurrence)
Forward-looking Belady-optimal eviction (never evict what you'll need)
Layer-depth bias (early layers are more valuable to cache)
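Belady's rule, which the forward-looking eviction approximates, can be sketched on a toy access trace: on a miss with a full cache, evict the entry whose next use lies farthest in the future. The sketch assumes the full future trace is known, which is only true in this toy setting; in practice the future accesses would have to come from a predictor. LRU is included as a baseline, and all names here are illustrative.

```python
from collections import OrderedDict


def belady_misses(trace: list[int], capacity: int) -> int:
    """Count misses when evicting the entry whose next use is farthest away."""
    cache: set[int] = set()
    misses = 0
    for pos, item in enumerate(trace):
        if item in cache:
            continue
        misses += 1
        if len(cache) >= capacity:
            def next_use(cached: int) -> float:
                try:
                    return trace.index(cached, pos + 1)
                except ValueError:
                    return float("inf")  # never used again: ideal victim
            cache.remove(max(cache, key=next_use))
        cache.add(item)
    return misses


def lru_misses(trace: list[int], capacity: int) -> int:
    """Count misses for a least-recently-used cache (baseline)."""
    cache: OrderedDict[int, bool] = OrderedDict()
    misses = 0
    for item in trace:
        if item in cache:
            cache.move_to_end(item)
            continue
        misses += 1
        if len(cache) >= capacity:
            cache.popitem(last=False)
        cache[item] = True
    return misses


# Toy expert-access trace with room for 3 experts in the cache.
trace = [1, 2, 3, 1, 2, 4, 1, 2, 3, 4]
print("Belady misses:", belady_misses(trace, 3))  # 5
print("LRU misses:", lru_misses(trace, 3))        # 6
```

The gap between the two numbers is the headroom a good predictor can recover: the closer the predicted future is to the real one, the closer the cache gets to the Belady result.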

3. The Brain Already Solved This Problem

MoE models work like the brain — only 0.78% of "neurons" (experts) activate per input. The brain handles this with predictive coding (pre-activating expected pathways). We implement the same principle: predict which experts are needed, speculatively execute them, and verify after the router confirms.

4. Speculate, Don't Wait

Speculative expert execution (from the MoE-SpAc paper) runs predicted experts before the router confirms them. With 97% prediction accuracy, 97% of expert computations start immediately with zero load latency. The 3% that miss are discarded and recomputed; on unified memory, this costs only ~0.1ms per wasted computation.
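The predict, execute, verify loop can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical `run_expert` callable and plain lists of expert IDs; it is not the MLX-Flash implementation.

```python
from typing import Callable


def speculative_step(
    predicted: list[int],
    actual: list[int],
    run_expert: Callable[[int], float],
) -> tuple[dict[int, float], int, int]:
    """Run predicted experts early, then reconcile with the router's choice.

    Returns (outputs for the experts the router chose,
             number of experts recomputed after a misprediction,
             number of speculative results that were thrown away).
    """
    # Predict -> execute: start work before the router has confirmed anything.
    speculative = {expert: run_expert(expert) for expert in predicted}

    # Verify: keep results the router agrees with, recompute the rest.
    outputs: dict[int, float] = {}
    recomputed = 0
    for expert in actual:
        if expert in speculative:
            outputs[expert] = speculative[expert]
        else:
            outputs[expert] = run_expert(expert)
            recomputed += 1

    wasted = len(set(predicted) - set(actual))
    return outputs, recomputed, wasted


# The predictor guessed experts 3 and 7; the router actually picked 3 and 9.
outputs, recomputed, wasted = speculative_step([3, 7], [3, 9], lambda e: e * 10.0)
print(outputs, recomputed, wasted)
```

The final outputs always match what the router asked for, so a wrong guess costs only the discarded work plus one late computation.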

Requirements

macOS with Apple Silicon (M1/M2/M3/M4/M5)
Python 3.10+
16GB+ RAM (more = better caching = faster)
For real model tests: mlx and mlx-lm packages

Project Stats

15,000+ lines of code (Python + Rust)
254 tests (222 Python + 32 Rust)
8 benchmark suites + interactive demos
10 research documents (15+ papers implemented, 60+ surveyed)
40 Python modules covering prediction, caching, compression, distributed, serving
OpenAI-compatible API server with KV cache quantization
Memory-aware inference with wired memory optimization
Rust sidecar with 0.1ms memory checks (210x faster than Python)
Lock-free LCP expert cache (DashMap) with layer-depth bias
Unix socket bridge for Python ↔ Rust expert weight streaming
15+ research techniques implemented from papers 2024-2026

Research & Techniques Implemented

graph TB
    subgraph DONE["Implemented (15+ techniques)"]
        ES[Expert Streaming<br/>GPU lookup tables]
        LCP[Layer-biased LCP<br/>FATE paper]
        RP[Residual Predictor<br/>97%+ accuracy]
        SE[Speculative Execution<br/>MoE-SpAc]
        FE[Forward Eviction<br/>MoE-SpeQ Belady]
        CL[Cross-Layer Prefetch<br/>3-hop lookahead]
        SP[Shadow MLP Predictor<br/>mlx-od-moe]
        VS[Vertical Splitting<br/>MoEpic 2x coverage]
        EM[Expert Merging<br/>DEK/EEP]
        EC[Entropy Coding<br/>EntroLLM Huffman]
        AT[Adaptive Top-K<br/>LExI paper]
        MP[Mixed Precision<br/>HOBBIT]
        KV[KV Cache 8-bit<br/>mlx-moe]
        WM[Wired Memory Limit<br/>macOS sysctl]
        MC[mx.clear_cache<br/>MLX v0.31]
    end
    subgraph BLOCKED["Blocked"]
        AMX[AMX Pipeline<br/>undocumented HW]
        MLXrs[mlx-rs<br/>macOS 26 Metal]
    end

Technique	Paper	Status
Expert streaming (GPU lookup)	HOBBIT arXiv:2411.01433	Implemented
Residual-stream predictor	Speculating Experts arXiv:2603.19289	Implemented
Speculative expert execution	MoE-SpAc arXiv:2603.09983	Implemented
Forward-looking Belady eviction	MoE-SpeQ arXiv:2511.14102	Implemented
Cross-layer 3-hop prefetch	FATE arXiv:2502.12224 / tinyserve	Implemented
Layer-depth cache bias	FATE arXiv:2502.12224	Implemented
Shadow model predictor	mlx-od-moe	Implemented
Vertical expert splitting	MoEpic paper	Implemented
Expert merging (offline)	DEK/EEP arXiv:2509.19781	Implemented
Entropy coding (Huffman uint4)	EntroLLM arXiv:2505.02380	Implemented
Adaptive top-k skipping	LExI arXiv:2509.02753	Implemented
Mixed precision per-expert	HOBBIT arXiv:2411.01433	Implemented
KV cache 8-bit quantization	mlx-moe / mlx-lm v0.31	Implemented
Wired memory optimization	macOS sysctl / mlx-moe	Implemented
`mx.clear_cache()` integration	MLX v0.31.0	Implemented
AMX dequant pipeline	amx-rs Rust crate	Blocked (undocumented HW)
mlx-rs native inference	mlx-rs v0.25.3	Blocked (macOS 26 Metal)

Competition

10+ OSS projects and 15+ papers attack the same problem. Our differentiators:

Only project with Rust sidecar + Mach syscall memory monitoring
Only Apple Silicon project with mixed precision per-expert (hot 4-bit / cold 2-bit)
Most techniques implemented: 15+ from research frontier, more than any competitor
Only project combining speculative execution + Belady eviction + residual predictor + expert merging

Competitor	Key Feature	Our Advantage
mu-hashmi/mlx-moe	Expert profiles, 10+ model families	Speculative execution, residual predictor, Rust sidecar
kqb/mlx-od-moe	Shadow model, memory-mapped experts	Cross-layer prefetch, entropy coding, expert merging
jundot/omlx	Hybrid mxfp4/mxfp8 quantization	Belady eviction, adaptive top-k, vertical splitting
HOBBIT (paper)	Nearly identical architecture	Apple Silicon native, open source

See docs/competitive-analysis.md for the full landscape.

License

MIT
