Run AI models too large for your Mac's memory — at near-full speed.
Your MacBook has 32-48GB of RAM, but the best AI models need 100-200GB+. MLX-Flash makes them run anyway by intelligently caching the most-needed parts in RAM and streaming the rest from your SSD — so you don't have to choose between quality and what fits in memory.
Think of it like Netflix streaming: instead of downloading the entire movie before watching, you buffer what you need and stream the rest. MLX-Flash does this for AI model weights:
flowchart TB
subgraph RAM["Your Mac's RAM (fast)"]
HC[Hot Cache — 85%+ of active experts]
MP[Mixed Precision — hot 4-bit, cold 2-bit]
KV[KV Cache — optional 8-bit quantization]
end
subgraph CACHE["Smart Cache Layer"]
LCP[LCP Eviction — layer-depth biased]
PF[Speculative Prefetch — 97% accuracy]
MM[Memory Monitor — never harms your apps]
SPEC[Speculative Execution — predict → execute → verify]
end
subgraph SSD["Your Mac's SSD (big)"]
FULL[Full model weights — even 200GB+]
ENT[Entropy-coded storage — 65% smaller]
end
SSD -->|stream on demand| CACHE
CACHE -->|cache hit: 0.08ms| RAM
CACHE -->|cache miss: 0.6ms| SSD
RAM -->|feed to GPU| GPU[MLX GPU Inference]
Result: A 200GB AI model runs on your 48GB Mac 2-3x faster than naive SSD streaming.
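
To make the diagram's numbers concrete, here is a minimal sketch of how the cache hit rate translates into average fetch latency. It only uses the 0.08ms hit and 0.6ms miss figures from the diagram; the function name `effective_latency_ms` is made up for this illustration and is not part of MLX-Flash.

```python
def effective_latency_ms(hit_rate: float, hit_ms: float = 0.08, miss_ms: float = 0.6) -> float:
    """Expected per-fetch latency as a weighted average of hit and miss cost."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


# Hit rates taken from elsewhere in this README (62% while warming, 85%+ warm)
for rate in (0.0, 0.62, 0.85, 0.97):
    print(f"hit rate {rate:.0%}: {effective_latency_ms(rate):.3f} ms per fetch")
```

This is a simple weighted average, so it ignores prefetch overlap and should be read as a rough upper bound rather than a measurement.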
# Install from PyPI
pip install mlx-flash
# Or Homebrew (includes Rust sidecar)
brew tap szibis/mlx-flash && brew install mlx-flash
# Or from source
git clone https://github.com/szibis/MLX-Flash.git
cd MLX-Flash && pip install -e ".[all]"

# Interactive chat (simplest way to use it)
mlx-flash-chat
# Start the API server (works with LM Studio, Cursor, Claude Code, Codex, OpenAI SDK)
mlx-flash --port 8080
# With KV cache quantization (45% less KV memory)
mlx-flash --port 8080 --kv-bits 8
# See what models fit your hardware
mlx-flash-browse

| Technique | Speedup | How It Works |
|---|---|---|
| LCP Smart Cache | 2.80x | Keeps frequently-used model parts in RAM, predicts what's needed next |
| + Async Prefetch | 2.93x | Loads next part from SSD while GPU computes current part |
| Mixed Precision | 1.80x size reduction | Rarely-used parts stored at lower quality (saves space, barely affects output) |
| Skip Fallback | 2.67x | When something isn't cached, gracefully skip it instead of waiting |
| Speculative Execution | 14-42% TPOT | Execute predicted experts before router confirms, verify after |
| Adaptive Top-K | 10-30% compute | Skip low-confidence secondary experts automatically |
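
The async prefetch row can be illustrated with a small double-buffering sketch: while one part is being computed, the next one is already loading in a background thread. `load_part` and `compute` are hypothetical stand-ins for an SSD read and GPU work; this is not the MLX-Flash implementation, only the general pattern.

```python
from concurrent.futures import ThreadPoolExecutor


def load_part(index: int) -> list[int]:
    """Stand-in for an SSD read (hypothetical): returns fake weights."""
    return [index] * 4


def compute(weights: list[int]) -> int:
    """Stand-in for GPU work on one part (hypothetical)."""
    return sum(weights)


def run_with_prefetch(num_parts: int) -> list[int]:
    """Process parts in order while the next part loads in the background."""
    results: list[int] = []
    if num_parts <= 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_part, 0)  # start loading the first part
        for i in range(num_parts):
            weights = pending.result()  # wait for the current part
            if i + 1 < num_parts:
                pending = pool.submit(load_part, i + 1)  # prefetch the next part
            results.append(compute(weights))  # runs while the prefetch is in flight
    return results


print(run_with_prefetch(4))
```

With real I/O the gain comes from the load of part `i + 1` overlapping the compute of part `i`; the stand-ins here are too fast to show a timing difference.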
Memory pressure recovery (the key result):
Model at 0.9x RAM (barely fits):
Without optimization: 43.5 tok/s ########
With mixed precision: 104.5 tok/s #################### 2.4x faster
The memory pressure cliff is razor-sharp: running 10% over the limit causes a 59% slowdown. Our 20% footprint reduction shifts the model back to full speed.
Cache warm-up (ISP-like progressive acceleration):
Token 0: 83.3ms (cold start, loading experts from SSD)
Token 8: 5.7ms (warming up, 62% cache hit)
Token 24: 0.5ms (full speed, 85%+ cache hit)
-> 41x speedup from warm-up
Topic switching:
coding -> writing: 62ms first token (re-warming) -> 8 tokens to recover
writing -> coding: 0.6ms first token (still cached!) -> instant fast
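
The warm-up and topic-switching behaviour above can be reproduced in miniature with any cache that keeps recently used experts. The sketch below uses plain LRU (not the LCP policy) and made-up expert IDs for the two topics, so it only illustrates the shape of the effect.

```python
from collections import OrderedDict


class LRUExpertCache:
    """Tiny LRU cache keyed by (layer, expert); a stand-in, not the LCP policy."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries = OrderedDict()

    def access(self, key) -> bool:
        """Return True on a hit; on a miss, load the entry and evict the oldest."""
        if key in self.entries:
            self.entries.move_to_end(key)
            return True
        self.entries[key] = True
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)
        return False


def hit_rate(cache: LRUExpertCache, trace) -> float:
    """Fraction of accesses in the trace that were served from the cache."""
    hits = sum(cache.access(key) for key in trace)
    return hits / len(trace)


# Hypothetical expert usage per topic; the topics share experts 3 and 4
coding = [(0, e) for e in (1, 2, 3, 4)]
writing = [(0, e) for e in (3, 4, 5, 6)]

cache = LRUExpertCache(capacity=6)
print("coding, cold:  ", hit_rate(cache, coding))
print("coding, warm:  ", hit_rate(cache, coding))
print("writing:       ", hit_rate(cache, writing))
print("back to coding:", hit_rate(cache, coding))
```

Switching back to a topic is fast whenever its experts survived in the cache, which is why the second switch in the trace above recovers immediately.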
Expert streaming replaces MLX's QuantizedSwitchLinear with a GPU lookup table + pre-stacked tensors. The capacity_per_layer parameter controls how many experts stay in GPU memory:
| Model | Total Experts | Capacity | Coverage | Throughput | Notes |
|---|---|---|---|---|---|
| Qwen3-30B-A3B | 128 per layer | 128 (100%) | 100% | ~35 tok/s | Full speed, no streaming needed |
| Qwen3-30B-A3B | 128 per layer | 64 (50%) | 85%+ hit rate | ~15 tok/s | After warm-up with LCP |
| Mixtral-8x7B | 8 per layer | 8 (100%) | 100% | ~20 tok/s | All experts fit |
| Mixtral-8x7B | 8 per layer | 4 (50%) | ~95% hit rate | ~12 tok/s | Most active cached |
Tuning tips:
- Start with `capacity_per_layer = total_experts` if RAM allows (no streaming overhead)
- Use `--task coding` warmup profile for programming tasks (pre-loads code-relevant experts)
- Enable skip-fallback with adaptive threshold to skip low-confidence secondary experts
- After ~25 tokens, LCP learns your workload and hit rate climbs to 85-95%
- Run `optimize_wired_memory_limit()` before loading to prevent Metal pressure cliff
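
As a rough illustration of the skip-fallback tip, the sketch below drops secondary experts whose router weight is far below the top expert's. The rule used here (skip when `top_weight / weight` exceeds a ratio) is an assumption made for this example; the actual meaning of `adaptive_skip_threshold` in MLX-Flash may differ.

```python
def adaptive_top_k(router_weights: dict[int, float], k: int, skip_ratio: float) -> list[int]:
    """Pick up to k experts, dropping secondaries far weaker than the best one.

    Assumed rule: an expert is skipped when top_weight / weight > skip_ratio.
    """
    ranked = sorted(router_weights.items(), key=lambda item: item[1], reverse=True)[:k]
    if not ranked:
        return []
    top_weight = ranked[0][1]
    kept = [ranked[0][0]]  # the top expert is always kept
    for expert_id, weight in ranked[1:]:
        if weight > 0 and top_weight / weight <= skip_ratio:
            kept.append(expert_id)
    return kept


# Made-up router weights for one token
weights = {7: 0.60, 3: 0.25, 12: 0.10, 5: 0.05}
print(adaptive_top_k(weights, k=4, skip_ratio=3.0))
```

A lower ratio skips more aggressively (less compute, more risk to output quality); a very high ratio falls back to ordinary top-k.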
from mlx_flash_compress.expert_streaming import (
enable_expert_streaming, enable_skip_fallback, get_warmup_experts
)
# Load model, enable streaming with 50% capacity
streaming = enable_expert_streaming(model, capacity_per_layer=64)
enable_skip_fallback(model, streaming.caches, adaptive_skip_threshold=3.0)
streaming.warmup()

The Tier Optimizer tells you exactly how to allocate your Mac's memory:
# For a 200GB model on a 48GB Mac
python -m mlx_flash_compress.tier_optimizer --total-ram 48 --model-gb 209
# Output: "Best: 41.5GB RAM cache, 82% of requests served from RAM → 6.4 tok/s"

It shows you the sweet spot: even dedicating just 10GB to caching gives you 54% of requests served instantly from RAM.
flowchart TB
subgraph Prediction["Expert Prediction (97%+ accuracy)"]
RP[Residual-Stream Predictor<br/>Linear projection of hidden state]
SM[Shadow MLP Predictor<br/>Online-trained routing MLP]
CL[Cross-Layer Prefetch<br/>3-hop transitive co-occurrence]
end
subgraph CacheLayer["Smart Cache Layer"]
LCP[LCP Eviction<br/>Layer-depth biased]
FLE[Forward-Looking Eviction<br/>Belady-optimal approximation]
VS[Vertical Split<br/>2x coverage in same RAM]
EM[Expert Merging<br/>Cosine similarity clustering]
end
subgraph Execution["Inference Engine"]
ES[Expert Streaming<br/>GPU lookup + pre-stacked tensors]
SE[Speculative Execution<br/>Predict → Execute → Verify]
SF[Skip Fallback<br/>Adaptive top-k]
MP[Mixed Precision<br/>Hot 4-bit / Cold 2-bit]
end
subgraph Storage["Compressed Storage"]
EC[Entropy Coding<br/>Huffman for uint4]
ST[Safetensors mmap<br/>Zero-copy SSD reads]
end
Prediction --> CacheLayer
CacheLayer --> Execution
Storage --> CacheLayer
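
The diagram describes the residual-stream predictor as a linear projection of the hidden state. A minimal pure-Python sketch of that idea follows: score every expert with a dot product and prefetch the highest-scoring ones. The projection matrix and hidden state are toy values, and `predict_experts` is a hypothetical name, not the API of `speculative_experts.py`.

```python
def predict_experts(hidden: list[float], projection: list[list[float]], top_k: int) -> list[int]:
    """Score each expert as a dot product with the hidden state; return top_k ids."""
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in projection]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


hidden_state = [0.5, -1.0, 2.0]
projection = [
    [1.0, 0.0, 0.0],  # expert 0
    [0.0, 1.0, 0.0],  # expert 1
    [0.0, 0.0, 1.0],  # expert 2
    [1.0, 1.0, 1.0],  # expert 3
]
print(predict_experts(hidden_state, projection, top_k=2))
```

In a real model the projection would be learned so that its scores approximate the router's output for an upcoming layer.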
| Module | What It Does |
|---|---|
| Expert Streaming | |
| `expert_streaming.py` | GPU lookup table + pre-stacked weights, skip-fallback, adaptive top-k, Mixtral/Qwen support |
| `speculative_experts.py` | Residual-stream predictor (97%+), Belady-optimal eviction, speculative execution |
| `advanced_prefetch.py` | Cross-layer N-hop predictor + shadow MLP for >90% prefetch accuracy |
| Cache Management | |
| `lcp_cache.py` | Smart cache with layer-depth biased LCP eviction + `mx.clear_cache()` |
| `smart_eviction.py` | SpecMD-inspired least-stale eviction + routing predictor |
| `vertical_split.py` | Cache partial expert rows for 2x coverage in same RAM (MoEpic) |
| `expert_merging.py` | Offline expert clustering: merge similar experts for 15-30% fewer params |
| Compression | |
| `entropy_coding.py` | Huffman coding for uint4 weights: 65% smaller at near-zero quality loss |
| `mixed_precision.py` | Hot experts at 4-bit, cold at 2-bit: 1.8x smaller, barely noticeable |
| `compression.py` | LZ4/ZSTD compression + Apple's native LZFSE |
| Memory & Hardware | |
| `memory_manager.py` | Real-time pressure monitoring, wired memory limit, auto-release |
| `hardware.py` | Apple Silicon detection (M1-M5), RAM, GPU cores |
| `tier_optimizer.py` | Finds the perfect RAM/SSD balance for your Mac + model combo |
| `ssd_protection.py` | Thermal cutoff, sequential hints, zero writes |
| Inference & Serving | |
| `serve.py` | OpenAI-compatible server with KV cache quantization, memory-aware hints |
| `chat.py` | Colorful chat CLI with web search, memory, model switching |
| `web_search.py` | DuckDuckGo search + persistent memory store (Perplexity-style) |
| `hf_calculator.py` | Model size/memory estimator for any MoE or dense model |
| `task_profiler.py` | Per-task expert profiles (coding/writing/math/chat) for fast warmup |
| Distributed | |
| `distributed_experts.py` | Multi-Mac expert parallelism over Thunderbolt 5 RDMA |
| `kv_cache_sharing.py` | PT-MoE KV-cache sharing between blocks (37.5% memory savings) |
| `cached_inference.py` | Expert routing capture + cache simulation |
| `rust_bridge.py` | Python ↔ Rust Unix socket bridge |
| Rust Sidecar | |
| `mlx-flash-server/` | axum HTTP/SSE proxy, mach2 memory (0.1ms), DashMap LCP, Unix socket |

graph LR
subgraph Clients
LS[LM Studio]
CU[Cursor]
CC[Claude Code]
SDK[OpenAI SDK]
CD[continue.dev]
OW[Open WebUI]
end
subgraph Rust["Rust Sidecar :8080"]
AX[axum HTTP/SSE]
MEM[Memory Monitor<br/>mach2 0.1ms]
LCPC[LCP Cache<br/>DashMap lock-free]
end
subgraph Python["Python Worker :8081"]
MLX[MLX Inference<br/>95% of work]
GEN["generate()"]
end
Clients -->|OpenAI API| Rust
Rust -->|proxy| Python
Rust -.->|Unix socket| LCPC
LCPC -.->|expert weights| Python
| How | Command | Best For |
|---|---|---|
| Interactive chat | `mlx-flash-chat` | Chat with web search, memory, model switching |
| API server | `mlx-flash --port 8080` | LM Studio, Cursor, Claude Code, OpenAI SDK |
| API + KV quant | `mlx-flash --port 8080 --kv-bits 8` | 45% less KV memory |
| Model calculator | `python -m mlx_flash_compress.hf_calculator` | Estimate size/memory for any model |
| Model browser | `mlx-flash-browse` | See what fits your hardware |
| Warm-up demo | `python -m mlx_flash_compress.demo_warmup` | Watch cache fill in real-time |
| Pressure test | `python -m mlx_flash_compress.bench_memory_pressure` | Measure memory impact |

Chat commands: `/models` (browse catalog), `/model N` (switch live), `/search` (web search), `/ask` (search + answer), `/remember` (save facts), `/memories` (list), `/status` (memory info)
All integrations start with running the server:
# Install
pip install mlx-flash
# Start the server
mlx-flash --port 8080 --preload

LM Studio

- Start MLX-Flash: `mlx-flash --port 8080 --preload`
- In LM Studio: Settings → Server → Add custom endpoint: `http://localhost:8080/v1`
- Select model: `local`
- Chat normally; LM Studio treats MLX-Flash as its backend

Cursor

- Start MLX-Flash: `mlx-flash --port 8080 --preload`
- In Cursor: Settings → Models → Add Model
  - Provider: `OpenAI Compatible`
  - API Base: `http://localhost:8080/v1`
  - API Key: `not-needed`
  - Model: `local`

Claude Code
# Terminal 1: Start server
mlx-flash --port 8080 --preload
# Terminal 2: Use with Claude Code
export OPENAI_API_BASE=http://localhost:8080/v1
export OPENAI_API_KEY=not-needed

Or add to ~/.claude/.mcp.json:
{
"mlx-flash": {
"command": "mlx-flash",
"args": ["--model", "mlx-community/Qwen3-30B-A3B-4bit", "--port", "8080"]
}
}

Codex CLI
# Start server
mlx-flash --port 8080 --preload
# Use with Codex
export OPENAI_API_BASE=http://localhost:8080/v1
export OPENAI_API_KEY=not-needed
codex "refactor this function"

Ollama (side-by-side)
# Ollama on default port (11434) for dense models
ollama serve
# MLX-Flash on 8080 for MoE models (better expert caching)
mlx-flash --port 8080 --preload
# Use Ollama for dense models, MLX-Flash for MoE models

continue.dev (VS Code / JetBrains)
Add to ~/.continue/config.json:
{
"models": [{
"title": "Local MoE (MLX)",
"provider": "openai",
"model": "local",
"apiBase": "http://localhost:8080/v1",
"apiKey": "not-needed"
}]
}

Open WebUI
mlx-flash --port 8080 --preload
# In Open WebUI settings: Add connection → http://localhost:8080/v1

Python / OpenAI SDK
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
model="local",
messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

Aider (AI pair programming)
mlx-flash --port 8080 --preload
aider --openai-api-base http://localhost:8080/v1 --openai-api-key not-needed --model local

See `docs/integrations.md` for 18+ detailed integration guides with streaming examples, health checks, and memory monitoring.
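
For tools that cannot pull in the OpenAI SDK, the same endpoint can be called with the Python standard library alone. This is a sketch that assumes the server exposes the standard OpenAI `/v1/chat/completions` route on port 8080, as the SDK example above implies; `build_chat_request` and `chat` are names invented for this example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # matches the --port 8080 examples above


def build_chat_request(prompt: str, model: str = "local") -> urllib.request.Request:
    """Build an OpenAI-style chat completion request without sending it."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer not-needed",
        },
        method="POST",
    )


def chat(prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello!"))` should behave like the SDK example.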
python -m mlx_flash_compress.bench_memory_pressure # Memory pressure analysis (key demo)
python -m mlx_flash_compress.demo_warmup # ISP-like warm-up visualization
python -m mlx_flash_compress.cached_inference --multi-topic # Real routing capture
python -m mlx_flash_compress.bench --synthetic # Quick test (no model needed)
python -m mlx_flash_compress.bench_real # Real Qwen MoE model test
python -m mlx_flash_compress.bench_final # Final comprehensive benchmark

We tested 6 different compression strategies on real AI model weights. Result: 1.0x compression (zero savings). The data is already maximally dense at 4-bit quantization. Instead, we use entropy coding (Huffman) which exploits the non-uniform distribution of quantized values for 65% savings.
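
To show why a skewed distribution of 4-bit values leaves room for entropy coding, here is a small standard-library sketch that builds Huffman code lengths and compares the average code length with the fixed 4 bits per value. The histogram is made up for illustration and is not measured from any model, so the printed numbers do not reproduce the 65% figure above.

```python
import heapq
import math
from collections import Counter


def huffman_code_lengths(counts: dict[int, int]) -> dict[int, int]:
    """Return the Huffman code length in bits for each symbol."""
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap items: (total count, unique tie-breaker, {symbol: depth so far})
    heap = [(count, sym, {sym: 0}) for sym, count in counts.items()]
    heapq.heapify(heap)
    tie = max(counts) + 1
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]


# Illustrative, made-up histogram of uint4 values clustered around 8
values = [8] * 400 + [7] * 250 + [9] * 250 + [6] * 40 + [10] * 40 + [0] * 10 + [15] * 10
counts = Counter(values)
total = sum(counts.values())
lengths = huffman_code_lengths(counts)

avg_bits = sum(counts[s] * lengths[s] for s in counts) / total
entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
print(f"entropy: {entropy:.2f} bits, Huffman: {avg_bits:.2f} bits, fixed: 4 bits")
```

The more peaked the distribution of quantized values, the further the average code length drops below 4 bits.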
Instead of trying to compress, we predict what's needed and pre-load it. Our prediction stack achieves 97%+ accuracy:
- Residual-stream predictor (linear projection of hidden states)
- Cross-layer 3-hop lookahead (transitive co-occurrence)
- Forward-looking Belady-optimal eviction (never evict what you'll need)
- Layer-depth bias (early layers are more valuable to cache)
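
Belady's rule, which the forward-looking eviction approximates, is easy to state in code: on a miss with a full cache, evict the entry whose next use lies furthest in the future. The sketch below assumes the future access trace is known exactly, which a real system can only approximate with predictions; the function names are illustrative.

```python
def belady_evict(cache: set[int], future: list[int]) -> int:
    """Pick the cached expert whose next use is furthest away (or never)."""

    def next_use(expert: int) -> int:
        return future.index(expert) if expert in future else len(future)

    # sorted() only makes tie-breaking deterministic (lowest id wins)
    return max(sorted(cache), key=next_use)


def simulate(trace: list[int], capacity: int) -> int:
    """Count cache misses when the full future trace is known."""
    cache: set[int] = set()
    misses = 0
    for step, expert in enumerate(trace):
        if expert in cache:
            continue
        misses += 1
        if len(cache) >= capacity:
            cache.remove(belady_evict(cache, trace[step + 1:]))
        cache.add(expert)
    return misses


trace = [1, 2, 3, 1, 2, 4, 1, 2, 3]
print("misses with capacity 3:", simulate(trace, capacity=3))
```

In practice the `future` list would come from the expert predictors, so eviction quality depends directly on prediction accuracy.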
MoE models work like the brain — only 0.78% of "neurons" (experts) activate per input. The brain handles this with predictive coding (pre-activating expected pathways). We implement the same principle: predict which experts are needed, speculatively execute them, and verify after the router confirms.
Speculative expert execution (from MoE-SpAc paper) runs predicted experts before the router confirms them. With 97% prediction accuracy, this means 97% of expert computations start immediately with zero load latency. The 3% misses are discarded and recomputed — on unified memory, this costs only ~0.1ms per wasted computation.
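
The predict, execute, verify loop can be sketched in a few lines. Here `run_expert` is a hypothetical stand-in for an expert's forward pass, and the predicted and routed expert IDs are made up; the point is only to show which results are reused, which are recomputed, and which are thrown away.

```python
def run_expert(expert_id: int, x: float) -> float:
    """Stand-in for an expert's forward pass (hypothetical)."""
    return (expert_id + 1) * x


def speculative_step(predicted: list[int], routed: list[int], x: float):
    """Execute predicted experts early, then verify against the router's choice."""
    speculative = {e: run_expert(e, x) for e in predicted}  # predict -> execute
    outputs: dict[int, float] = {}
    recomputed: list[int] = []
    for e in routed:  # verify
        if e in speculative:
            outputs[e] = speculative[e]  # correct prediction: reuse early result
        else:
            outputs[e] = run_expert(e, x)  # misprediction: compute after routing
            recomputed.append(e)
    wasted = [e for e in predicted if e not in routed]  # speculative work discarded
    return outputs, recomputed, wasted


outputs, recomputed, wasted = speculative_step(predicted=[2, 5], routed=[2, 7], x=1.5)
print(outputs, recomputed, wasted)
```

The output always matches what the router asked for; speculation only changes when the work happens, not the result.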
- macOS with Apple Silicon (M1/M2/M3/M4/M5)
- Python 3.10+
- 16GB+ RAM (more = better caching = faster)
- For real model tests: `mlx` and `mlx-lm` packages
- 15,000+ lines of code (Python + Rust)
- 254 tests (222 Python + 32 Rust)
- 8 benchmark suites + interactive demos
- 10 research documents (15+ papers implemented, 60+ surveyed)
- 40 Python modules covering prediction, caching, compression, distributed, serving
- OpenAI-compatible API server with KV cache quantization
- Memory-aware inference with wired memory optimization
- Rust sidecar with 0.1ms memory checks (210x faster than Python)
- Lock-free LCP expert cache (DashMap) with layer-depth bias
- Unix socket bridge for Python ↔ Rust expert weight streaming
- 15+ research techniques implemented from papers 2024-2026
graph TB
subgraph DONE["Implemented (15+ techniques)"]
ES[Expert Streaming<br/>GPU lookup tables]
LCP[Layer-biased LCP<br/>FATE paper]
RP[Residual Predictor<br/>97%+ accuracy]
SE[Speculative Execution<br/>MoE-SpAc]
FE[Forward Eviction<br/>MoE-SpeQ Belady]
CL[Cross-Layer Prefetch<br/>3-hop lookahead]
SP[Shadow MLP Predictor<br/>mlx-od-moe]
VS[Vertical Splitting<br/>MoEpic 2x coverage]
EM[Expert Merging<br/>DEK/EEP]
EC[Entropy Coding<br/>EntroLLM Huffman]
AT[Adaptive Top-K<br/>LExI paper]
MP[Mixed Precision<br/>HOBBIT]
KV[KV Cache 8-bit<br/>mlx-moe]
WM[Wired Memory Limit<br/>macOS sysctl]
MC[mx.clear_cache<br/>MLX v0.31]
end
subgraph BLOCKED["Blocked"]
AMX[AMX Pipeline<br/>undocumented HW]
MLXrs[mlx-rs<br/>macOS 26 Metal]
end
| Technique | Paper | Status |
|---|---|---|
| Expert streaming (GPU lookup) | HOBBIT arXiv:2411.01433 | Implemented |
| Residual-stream predictor | Speculating Experts arXiv:2603.19289 | Implemented |
| Speculative expert execution | MoE-SpAc arXiv:2603.09983 | Implemented |
| Forward-looking Belady eviction | MoE-SpeQ arXiv:2511.14102 | Implemented |
| Cross-layer 3-hop prefetch | FATE arXiv:2502.12224 / tinyserve | Implemented |
| Layer-depth cache bias | FATE arXiv:2502.12224 | Implemented |
| Shadow model predictor | mlx-od-moe | Implemented |
| Vertical expert splitting | MoEpic paper | Implemented |
| Expert merging (offline) | DEK/EEP arXiv:2509.19781 | Implemented |
| Entropy coding (Huffman uint4) | EntroLLM arXiv:2505.02380 | Implemented |
| Adaptive top-k skipping | LExI arXiv:2509.02753 | Implemented |
| Mixed precision per-expert | HOBBIT arXiv:2411.01433 | Implemented |
| KV cache 8-bit quantization | mlx-moe / mlx-lm v0.31 | Implemented |
| Wired memory optimization | macOS sysctl / mlx-moe | Implemented |
| `mx.clear_cache()` integration | MLX v0.31.0 | Implemented |
| AMX dequant pipeline | amx-rs Rust crate | Blocked (undocumented HW) |
| mlx-rs native inference | mlx-rs v0.25.3 | Blocked (macOS 26 Metal) |
10+ OSS projects and 15+ papers attack the same problem. Our unique differentiators:
- Only project with Rust sidecar + Mach syscall memory monitoring
- Only Apple Silicon project with mixed precision per-expert (hot 4-bit / cold 2-bit)
- Most techniques implemented: 15+ from research frontier, more than any competitor
- Only project combining speculative execution + Belady eviction + residual predictor + expert merging
| Competitor | Key Feature | Our Advantage |
|---|---|---|
| mu-hashmi/mlx-moe | Expert profiles, 10+ model families | Speculative execution, residual predictor, Rust sidecar |
| kqb/mlx-od-moe | Shadow model, memory-mapped experts | Cross-layer prefetch, entropy coding, expert merging |
| jundot/omlx | Hybrid mxfp4/mxfp8 quantization | Belady eviction, adaptive top-k, vertical splitting |
| HOBBIT (paper) | Nearly identical architecture | Apple Silicon native, open source |
See docs/competitive-analysis.md for the full landscape.
MIT