A minimal implementation for instruction tuning and inference with LLaMA2 on a single NVIDIA A100 GPU. Applied techniques include Low-rank Adaptation (LoRA), Automatic Mixed-Precision Training, Gradient Scaling, Gradient Accumulation, and Gradient Checkpointing.
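Of the techniques above, LoRA is what makes single-GPU tuning feasible: the pretrained weight is frozen and only a low-rank update is trained. A minimal sketch (the class and hyperparameters here are illustrative, not the repo's actual code, which may use a library such as `peft`):

```python
# Hedged sketch of Low-rank Adaptation: the frozen base weight W gets a
# trainable low-rank update B @ A, so only r * (in + out) parameters train.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)           # freeze the pretrained layer
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        # B is zero-initialized, so training starts from the pretrained behavior.
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
# With r=8 and a 4096x4096 layer: 2 * 8 * 4096 = 65,536 trainable parameters
# versus ~16.8M frozen ones.
```

Wrapping only the attention projections this way is what shrinks the trainable-parameter count from billions to the ~2M shown in the results table below.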
To download the LLaMA2 weights and tokenizer, visit the Meta website and accept the license.

## Instructions
Tested on:
- gcc/11.3.0
- cuda/11.8.0
- python/3.9.12
- pytorch/2.1.0
### Inference

Change `model_path`, `tokenizer_path`, and `lora_weights_path` in `inference.py`, then run:

```
python inference.py
```

### Finetuning

```
python finetune.py
```
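The mixed-precision, gradient-scaling, and gradient-accumulation pieces of the finetuning loop can be sketched as follows (a minimal stand-in model and illustrative names, not the repo's actual training code):

```python
# Hedged sketch: an AMP training step with gradient scaling and accumulation.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(16, 4).to(device)      # stand-in for the LoRA-wrapped LLaMA2
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))
accum_steps = 4                          # optimizer steps every 4 micro-batches

data = [(torch.randn(8, 16), torch.randn(8, 4)) for _ in range(8)]
for step, (x, y) in enumerate(data):
    x, y = x.to(device), y.to(device)
    # Autocast runs the forward pass in reduced precision where it is safe.
    with torch.autocast(device_type=device,
                        dtype=torch.float16 if device == "cuda" else torch.bfloat16):
        loss = nn.functional.mse_loss(model(x), y) / accum_steps
    # GradScaler scales the loss so fp16 gradients do not underflow.
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```

Dividing the loss by `accum_steps` keeps the effective gradient equal to that of one large batch, which is why accumulation adds almost no memory in the table below.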
Results with `n_layers = 8` (number of transformer blocks; default is 32) and `epochs = 5`:
| Configuration | Trainable Parameters | GPU Memory Usage (MiB) | Training Time (seconds) |
|---|---|---|---|
| Original | 1,881,214,976 | 38,401 | / |
| + Low-rank Adaptation | 2,097,152 | 10,377 | 70.31 |
| + Auto-mix-precision Training & Gradient Scaling | 2,097,152 | 13,079 | 25.96 |
| + Gradient Accumulation | 2,097,152 | 13,089 | 25.12 |
| + Gradient Checkpointing | 2,097,152 | 9,409 | 45.98 |