This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
quantum-llama.cpp is a modified fork of llama.cpp that integrates Quantum Random Number Generators (QRNGs) into token generation. The core idea: "the output is co-authored by the universe itself."
Key difference from upstream: Uses true quantum randomness instead of pseudo-random number generation for sampling.
```bash
# Standard build (MUST use -DLLAMA_CURL=OFF for quantum features)
cmake -B build -DLLAMA_CURL=OFF
cmake --build build --config Release -j

# With CUDA
cmake -B build -DLLAMA_CURL=OFF -DGGML_CUDA=ON
cmake --build build --config Release -j

# Debug build
cmake -B build -DCMAKE_BUILD_TYPE=Debug -DLLAMA_CURL=OFF
cmake --build build
```

Built binaries are placed in `build/bin/`.
```bash
# Run test suite
ctest --test-dir build --output-on-failure -j

# Server tests (requires Python venv)
cd tools/server/tests
source ../../../.venv/bin/activate
./tests.sh
```

Always format before committing:

```bash
git clang-format
```

Key style rules:
- 4-space indentation, 120-column limit
- Pointer/reference style: `void * ptr`, `int & a`
- `snake_case` for functions, variables, types
- Enum values: `ENUM_NAME_VALUE` (uppercase with prefix)
- `src/` - Main llama library (`llama-*.cpp/h` modules)
- `include/llama.h` - Public C API
- `ggml/` - Vendored tensor library
- `tools/` - Executables (`llama-cli`, `llama-server`, etc.)
- `common/` - Shared utilities
- `libpsirngclient/` - Git submodule for gRPC QRNG client
- `psirngclient-manager.cpp/h` - Singleton managing QRNG connections
- `anu-qrng-client.cpp/h` - HTTP client for ANU/Qbert QRNG API (hex16, z-score-based sampling)
- `llama-sampling.cpp` - Sampling pipeline (integration point for quantum RNG, descending-probability CDF)
```
Token Logits
      ↓
Calculate Entropy (normalized 0-1)
      ↓
[entropy < 0.50?] ─YES─→ GREEDY (no QRNG) → Done
      │
      NO
      ↓
Apply EDT Temperature: T = T₀ × 0.8^(θ/entropy)
      ↓
QRNG API call (hex16, length=1024, size=10)
      ↓
Compute mean of ~20K uint8 values
      ↓
Z-score: z = (mean - 127.5) / 0.51433
      ↓
Uniform float: u = Φ(z) = 0.5 × (1 + erf(z/√2))
      ↓
Descending-probability CDF sampling (highest prob first)
      ↓
Done
```
Key principle: Each token selection makes a fresh API call. No buffering - this preserves temporal correlation between consciousness and token selection.
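The z-score and CDF stages of the pipeline can be sketched as below. Function names are illustrative, not the repository's actual symbols; the constants 127.5 and 0.51433 are the ones stated in the pipeline diagram.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <functional>
#include <vector>

// Mean of the raw uint8 samples returned by one QRNG API call.
double mean_of_samples(const std::vector<uint8_t> & samples) {
    double sum = 0.0;
    for (uint8_t s : samples) sum += s;
    return sum / samples.size();
}

// Map the sample mean to a uniform float in (0, 1) via the normal CDF.
// 127.5 is the expected mean of uniform uint8 data; 0.51433 is the
// standard error of the mean for ~20K samples (value from the diagram).
double quantum_uniform(double mean) {
    double z = (mean - 127.5) / 0.51433;
    return 0.5 * (1.0 + std::erf(z / std::sqrt(2.0)));
}

// Walk the CDF with probabilities sorted in descending order, so small
// u values (negative z) land on high-probability tokens. Returns the
// index into the sorted order; real code would track token ids.
size_t sample_descending_cdf(std::vector<double> probs, double u) {
    std::sort(probs.begin(), probs.end(), std::greater<double>());
    double cum = 0.0;
    for (size_t i = 0; i < probs.size(); ++i) {
        cum += probs[i];
        if (u < cum) return i;
    }
    return probs.size() - 1;  // guard against rounding at u ≈ 1
}
```

A mean exactly at 127.5 gives z = 0 and u = 0.5, landing mid-distribution; means below 127.5 push u toward 0 and hence toward the highest-probability tokens.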
Bluer colors = shift toward high-probability tokens, redder colors = shift toward low-probability tokens.
| Z-Score Range | Color | Meaning |
|---|---|---|
| N/A (greedy) | Grey | Deterministic (no QRNG) |
| |z| < 1.0 | White | Near expected mean |
| z ∈ [-2, -1) | Light Blue | Mild high-prob shift |
| z < -2 | Blue (vivid) | Strong high-prob shift |
| z ∈ (1, 2] | Pink | Mild low-prob shift |
| z > 2 | Red | Strong low-prob shift |
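The bucketing in the table above can be sketched as a small classifier. The function name is hypothetical, and the treatment of the exact boundary z = -1 (which the table leaves ambiguous) is a choice made here:

```cpp
#include <string>

// Map a z-score to the display color from the table above
// (hypothetical helper; the repo's actual mapping may differ).
std::string z_color(double z) {
    if (z < -2.0) return "blue";        // strong shift toward high-prob tokens
    if (z < -1.0) return "light blue";  // mild high-prob shift
    if (z <= 1.0) return "white";       // near expected mean
    if (z <= 2.0) return "pink";        // mild low-prob shift
    return "red";                       // strong low-prob shift
}
```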
- entropy < 0.50 → Greedy sampling (no API call, saves bandwidth)
- entropy ≥ 0.50 → EDT temperature + QRNG sampling
- Typically saves 50-80% of API calls for predictable text
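A minimal sketch of the adaptive gate, assuming "normalized entropy" means Shannon entropy divided by log(vocab size) so it lands in [0, 1] (an interpretation of the doc, not the repo's exact code):

```cpp
#include <cmath>
#include <vector>

// Shannon entropy of a probability distribution, normalized to [0, 1]
// by dividing by the maximum possible entropy log(n).
double normalized_entropy(const std::vector<double> & probs) {
    if (probs.size() < 2) return 0.0;  // single-token dist is fully determined
    double h = 0.0;
    for (double p : probs) {
        if (p > 0.0) h -= p * std::log(p);
    }
    return h / std::log((double) probs.size());
}

// Below the threshold the distribution is predictable enough to pick
// the argmax token and skip the QRNG API call entirely.
bool use_greedy(const std::vector<double> & probs, double threshold = 0.50) {
    return normalized_entropy(probs) < threshold;
}
```

A uniform distribution has normalized entropy 1.0 and always goes to QRNG; a sharply peaked one (e.g. one token at 0.97) falls well below 0.50 and is sampled greedily.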
- Formula: `T = T₀ × 0.8^(θ/entropy)`
- Defaults: `T₀ = 2.0`, `θ = 1.0`
- Higher entropy → higher temperature (more creative exploration)
- Lower entropy → lower temperature (more focused selection)
- At max entropy (1.0): T ≈ 1.6
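The formula above is a one-liner; this sketch (hypothetical function name) uses the documented defaults:

```cpp
#include <cmath>

// Entropy-Dependent Temperature: T = T0 * 0.8^(theta / entropy).
// Defaults match the doc (T0 = 2.0, theta = 1.0).
double edt_temperature(double entropy, double t0 = 2.0, double theta = 1.0) {
    return t0 * std::pow(0.8, theta / entropy);
}
```

At entropy 1.0 this gives 2.0 × 0.8 = 1.6 (the documented maximum); at entropy 0.5 it gives 2.0 × 0.8² = 1.28, i.e. lower entropy yields a cooler, more focused temperature.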
Two QRNG providers are supported: ANU (default) and Qbert. Select with `--qrng-api`.
- Get your FREE API key at: https://quantumnumbers.anu.edu.au/
- Set the environment variable:

```bash
# Linux/Mac
export ANU_API_KEY="your-api-key-here"

# Windows CMD
set ANU_API_KEY=your-api-key-here

# PowerShell
$env:ANU_API_KEY="your-api-key-here"
```

- Run:

```bash
./build/bin/llama-cli -m model.gguf -p "prompt" -n 128 -no-cnv
```

Qbert is an invite-only QRNG API by Cipherstone. The request/response format is identical to ANU's.
- Set the environment variable:

```bash
# Linux/Mac
export QBERT_API_KEY="your-api-key-here"

# Windows CMD
set QBERT_API_KEY=your-api-key-here

# PowerShell
$env:QBERT_API_KEY="your-api-key-here"
```

- Run with `--qrng-api qbert`:

```bash
./build/bin/llama-cli -m model.gguf -p "prompt" -n 128 -no-cnv --qrng-api qbert
```

| Argument | Description | Default |
|---|---|---|
| `--qrng-api {anu,qbert}` | Select QRNG API provider | `anu` |
| `--quantum-verbose` | Show entropy/temperature for each token | off |
| `--quantum-statistics` | Print sampling statistics at end | off |
| `--quantum-entropy-threshold N` | Entropy cutoff for greedy vs QRNG | 0.50 |
| `--quantum-edt-t0 N` | EDT upper-bound temperature | 2.0 |
| `--quantum-edt-theta N` | EDT entropy sensitivity | 1.0 |
| `--no-quantum-adaptive-sampling` | Always use QRNG (no greedy) | - |
| `--no-quantum-edt` | Use fixed temperature instead of EDT | - |
```bash
export PSIRNG_HOST=192.0.2.10
export PSIRNG_GRPC_PORT=50051
export PSIRNG_CERT_PATH=/path/to/cert.pem
./build/bin/llama-cli -m model.gguf -p "prompt"
```

- NEVER buffer quantum random data - each token selection MUST use fresh quantum data from a new API call. Buffering destroys the temporal correlation between consciousness and token selection.
- Never use `-DLLAMA_CURL=ON` - it is incompatible with quantum features
- Clone with `--recurse-submodules` to get libpsirngclient
- Avoid adding third-party dependencies
- Use basic C++ patterns, avoid fancy STL constructs
- Vertical alignment for readability
- Tensor storage is row-major (dim 0=columns, 1=rows, 2=matrices)
- Matrix multiplication: `C = ggml_mul_mat(ctx, A, B)` means C^T = AB^T
- Pattern: `<class>_<method>` where method is `<action>_<noun>`
- Examples: `llama_model_init()`, `llama_sampler_chain_remove()`
- Optimize for longest common prefix: `number_small`, `number_big` (not `small_number`)
```bash
# Benchmark
./build/bin/llama-bench -m model.gguf

# Evaluate perplexity
./build/bin/llama-perplexity -m model.gguf -f dataset.txt

# Test backend ops
./build/bin/test-backend-ops
```