A pure C++ inference engine for Google's Gemma 2 2B, demonstrating the core mechanics of LLMs.
No abstraction layers from PyTorch or ONNX.
- Zero runtime dependencies: written in standard C++17. No PyTorch and no external math libraries are required for inference
- Gemma 2 Architecture: implements key architectural details:
  - RMSNorm with (1+w) scaling
  - Logit soft-capping (attention & final)
  - GeGLU activation
  - RoPE (rotary positional embeddings)
- Advanced sampling: implements top-P (nucleus) sampling (p=0.9) for coherent text generation
- Low memory footprint: uses `fseek` to stream weights from disk, allowing the 2B/9B models to run on systems with limited RAM (though disk I/O becomes the bottleneck)
- C++ Compiler: `g++` (GCC), Clang, or MSVC
- Python 3.x: required only for exporting the weights
- Disk Space: ~5GB for the model weights
Install the necessary libraries to download and convert the model:

```bash
pip install torch transformers huggingface_hub accelerate
```

Run the provided scripts to download Gemma 2 2B from Hugging Face and convert it to Aura's binary format:
```bash
# Downloads model (~5GB) and converts weights
python export_gemma.py

# Exports the tokenizer vocabulary
python export_tokenizer.py
```

This will create `gemma_weights.bin` and `tokenizer.bin` in your directory.
Compile `main.cpp`. Using `-O3` and `-march=native` is recommended for best performance:

```bash
g++ -O3 -march=native main.cpp -o aura
```

Start the engine:

```bash
./aura
```

Possible usage:
```
init aura engine...
ready...
chat mode enabled.
user: The capital of Switzerland is
```
Note on performance: since the engine streams weights from disk for every token, initial prompt processing will be slow (limited by your disk speed). Once generation starts, it produces tokens at a steady pace.
- `main.cpp`: the core C++ inference engine
- `export_gemma.py`: Python script to export Hugging Face weights to raw binary
- `export_tokenizer.py`: Python script to export the tokenizer vocabulary
MIT License. The Gemma 2 model weights are subject to Google's Gemma Terms of Use.