FLUX × llama.cpp — Novel integration of FLUX bytecode agents with LLM inference.
What if token sampling in language models was driven by bytecode programs running on a virtual machine? What if multiple agents, each running their own sampling strategy as FLUX bytecode, voted on each token via A2A-style consensus?
This is that experiment.
                    LLM Output Logits
                           │
                           ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Agent 0   │     │   Agent 1   │     │   Agent 2   │
│ (Conserv.)  │     │ (Creative)  │     │ (Penalty)   │
│             │     │             │     │             │
│  FLUX Byte  │     │  FLUX Byte  │     │  FLUX Byte  │
│  code:      │     │  code:      │     │  code:      │
│  logit * 2  │     │  pos-dep    │     │  freq-div   │
│             │     │  temperature│     │             │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       └───────────────────┼───────────────────┘
                           │
                    ┌──────▼──────┐
                    │  Weighted   │
                    │    Vote     │
                    │   (A2A)     │
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
                    │  Selected   │
                    │    Token    │
                    └─────────────┘
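The weighted-vote stage in the diagram reduces to a few lines of C. Below is a minimal sketch: the `flux_agent_vote` struct and `flux_weighted_vote` name are illustrative assumptions, not the project's actual API, and the scores and weights in the usage example are made up.

```c
#include <stddef.h>

/* Hypothetical agent result: a per-token score array plus an
 * A2A-style trust weight. */
typedef struct {
    const float *scores;  /* one score per candidate token */
    float weight;         /* trust weight from A2A-style scoring */
} flux_agent_vote;

/* Weighted consensus: each agent's score is scaled by its trust
 * weight, and the token with the highest combined score wins. */
static int flux_weighted_vote(const flux_agent_vote *agents,
                              size_t n_agents, size_t n_tokens)
{
    int best = 0;
    float best_score = -1e30f;
    for (size_t t = 0; t < n_tokens; t++) {
        float combined = 0.0f;
        for (size_t a = 0; a < n_agents; a++)
            combined += agents[a].weight * agents[a].scores[t];
        if (combined > best_score) {
            best_score = combined;
            best = (int)t;
        }
    }
    return best;
}
```

With the demo's 0.5/0.3/0.2 weights, a token that only the low-weight Penalty agent loves can still lose to one the Conservative agent backs, which is exactly the point of trust-weighted consensus.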
- Each agent is a FLUX bytecode program that scores candidate tokens
- Agents vote via weighted consensus (A2A-style trust scoring)
- Strategies can be swapped at runtime by loading different bytecode
- Each agent's bytecode is converted to a 128-dim embedding vector
- Opcode frequency → embedding dimension
- Enables similarity comparison between agent strategies
- Standalone (this demo): Simulated logits, pure FLUX VM sampling
- llama.cpp hook: Wire into llama_sample_token() callback
- ggml tensors: Map FLUX registers to tensor operations
- Custom models: Bytecode as a "programming layer" over any LLM
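A llama.cpp hook would need roughly the following shape: take the raw logits for one position, run every agent over every candidate, return the consensus token id. Everything here is a hypothetical shim, not llama.cpp API; `flux_score_fn` stands in for the FLUX VM, and the exact wiring depends on which llama.cpp sampling interface you target.

```c
#include <stddef.h>

/* Hypothetical agent callback: rescore one candidate token. In the real
 * system this would execute the agent's FLUX bytecode; here a plain
 * function pointer stands in for the VM. */
typedef float (*flux_score_fn)(float logit, int token_id, int pos);

typedef struct {
    flux_score_fn score;
    float weight;
} flux_swarm_agent;

/* Shim with the shape a sampling hook would need. The name and
 * signature are assumptions for illustration. */
static int flux_swarm_sample(const float *logits, int n_vocab, int pos,
                             const flux_swarm_agent *agents, size_t n_agents)
{
    int best = 0;
    float best_score = -1e30f;
    for (int t = 0; t < n_vocab; t++) {
        float s = 0.0f;
        for (size_t a = 0; a < n_agents; a++)
            s += agents[a].weight * agents[a].score(logits[t], t, pos);
        if (s > best_score) { best_score = s; best = t; }
    }
    return best;
}

/* Example agent: the "Conservative" strategy (logit * 2) from the demo. */
static float flux_agent_conservative(float logit, int token_id, int pos)
{
    (void)token_id; (void)pos;
    return logit * 2.0f;
}
```

The key design point is that the hook only needs read access to one position's logits, so swapping the function-pointer table for a bytecode interpreter changes nothing about the integration surface.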
# Standalone (no llama.cpp needed)
gcc -std=c11 -Wall -O2 -DFLUX_STANDALONE -o flux-llama src/flux_llama.c -lm
./flux-llama
# With llama.cpp (requires llama.cpp installed)
gcc -std=c11 -Wall -O2 -I/path/to/llama.cpp/include \
  -o flux-llama src/flux_llama.c -lm -lllama

📊 Setting up 3-agent inference swarm...
Agent 0 (Conservative): weight=0.5 — boosts high-logit tokens
Agent 1 (Creative): weight=0.3 — position-dependent temperature
Agent 2 (Penalty): weight=0.2 — penalizes high-frequency tokens
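The three demo strategies can be written out as plain scoring functions. Only "logit * 2" is given explicitly above; the temperature schedule, the penalty formula, and the `freq[]` interface are guesses at what the bytecode computes, so treat them as illustrative, not as the shipped behavior.

```c
/* Conservative (weight 0.5): boost high-logit tokens by doubling them. */
static float score_conservative(float logit, int token_id, int pos)
{
    (void)token_id; (void)pos;
    return logit * 2.0f;
}

/* Creative (weight 0.3): position-dependent temperature -- later
 * positions get a hotter (flatter) distribution. The linear schedule
 * here is an assumption. */
static float score_creative(float logit, int token_id, int pos)
{
    (void)token_id;
    float temp = 1.0f + 0.05f * (float)pos;  /* assumed schedule */
    return logit / temp;
}

/* Penalty (weight 0.2): divide by how often the token has already been
 * emitted. freq[] would be maintained by the generation loop; this
 * interface is an assumption. */
static float score_penalty(float logit, int token_id, int pos,
                           const int *freq)
{
    (void)pos;
    return logit / (1.0f + (float)freq[token_id]);
}
```

Note how the Penalty strategy directly targets the degenerate "the the the..." loop shown in the sample output: each repeat shrinks that token's score until another candidate wins the vote.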
📝 Generating text (20 positions) via swarm consensus:
the the the the the the the the the the the sea sea sea sea sea ...
- Agent-driven creativity: Different sampling strategies create different "voices"
- Evolutionary optimization: Bytecode can be mutated and selected for quality
- Transparent decisions: You can disassemble exactly why a token was chosen
- Composable: Mix and match agent strategies like LEGO blocks
- Fast: FLUX VM runs at 48K+ ops/sec on ARM — negligible overhead vs LLM inference
- Real llama.cpp integration (sampling callback hook)
- GPU-accelerated FLUX VM (CUDA) for batch scoring
- Evolutionary agent optimization (mutate bytecode, select by output quality)
- Bytecode embeddings as features for model fine-tuning
- Multi-model swarms (different base models, FLUX coordination layer)
MIT — SuperInstance (DiGennaro et al.)