Aura

A pure C++ inference engine for Google's Gemma 2 2B, demonstrating the core mechanics of LLMs.

No abstraction layers from PyTorch or ONNX.

Features

  • Zero runtime dependencies: written in standard C++17; no PyTorch and no external math libraries are required for inference
  • Gemma 2 architecture: implements the key architectural details:
    • RMSNorm with (1 + w) scaling
    • Logit soft-capping (attention & final)
    • GeGLU activation
    • RoPE (rotary positional embeddings)
  • Advanced sampling: implements top-p (nucleus) sampling (p = 0.9) for coherent text generation
  • Low memory footprint: uses fseek to stream weights from disk, allowing the 2B/9B models to run on systems with limited RAM (though disk I/O becomes the bottleneck)
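Two of these details can be sketched in a few lines. This is an illustrative simplification over plain float vectors, not Aura's actual code; the names `rms_norm` and `soft_cap` are assumptions:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// RMSNorm with Gemma 2's (1 + w) weight scaling:
//   y_i = x_i / rms(x) * (1 + w_i)
// The learned weight w is stored zero-centered, hence the (1 + w) term.
std::vector<float> rms_norm(const std::vector<float>& x,
                            const std::vector<float>& w,
                            float eps = 1e-6f) {
    float ss = 0.0f;
    for (float v : x) ss += v * v;
    float inv_rms = 1.0f / std::sqrt(ss / x.size() + eps);
    std::vector<float> y(x.size());
    for (size_t i = 0; i < x.size(); ++i)
        y[i] = x[i] * inv_rms * (1.0f + w[i]);
    return y;
}

// Logit soft-capping: squashes a logit into (-cap, cap) via
// cap * tanh(logit / cap). Gemma 2 uses cap = 50.0 for attention
// logits and cap = 30.0 for the final output logits.
float soft_cap(float logit, float cap) {
    return cap * std::tanh(logit / cap);
}
```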

Quick start

1. Prerequisites

  • C++ Compiler: g++ (GCC), Clang, or MSVC
  • Python 3.x: required only for exporting weights
  • Disk Space: ~5GB for the model weights

2. Install dependencies (Python)

Install the necessary Python libraries to download and convert the model:

pip install torch transformers huggingface_hub accelerate

3. Export weights

Run the provided scripts to download Gemma 2 2B from Hugging Face and convert it to Aura's binary format:

# Downloads model (~5GB) and converts weights
python export_gemma.py

# Exports the tokenizer vocabulary
python export_tokenizer.py

This will create gemma_weights.bin and tokenizer.bin in the current directory.

4. Build the engine

Compile main.cpp. Using -O3 and -march=native is recommended for best performance:

g++ -O3 -march=native main.cpp -o aura

5. Run Aura

Start the engine:

./aura

Example session:

init aura engine...
ready...
chat mode enabled.
user: The capital of Switzerland is

Note on performance: since the engine streams weights from disk for every token, initial prompt processing will be slow (limited by your disk speed). Once generation starts, tokens are produced at a steady pace.
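The streaming approach works by seeking to each tensor's byte offset and reading only what the current step needs, instead of loading the whole file into RAM. A minimal sketch of that fseek/fread pattern, assuming a flat file of raw float32 tensors (the function name and layout are illustrative, not Aura's actual loader):

```cpp
#include <cassert>
#include <cstdio>
#include <vector>

// Illustrative: read one tensor's floats on demand from an open weight
// file. The offset for each tensor depends on the export format.
std::vector<float> read_tensor(std::FILE* f, long offset, size_t count) {
    std::vector<float> buf(count);
    std::fseek(f, offset, SEEK_SET);              // jump to the tensor's bytes
    std::fread(buf.data(), sizeof(float), count, f); // read only this tensor
    return buf;
}
```

The trade-off is exactly the one noted above: RAM usage stays near zero, but every token pays the cost of re-reading the weights from disk.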

Architecture

  • main.cpp: the core C++ inference engine
  • export_gemma.py: Python script to export Hugging Face weights to raw binary
  • export_tokenizer.py: Python script to export the tokenizer vocabulary

License

MIT License. The Gemma 2 model weights are subject to Google's Gemma Terms of Use.
