NatBrian/chatgpt-from-scratch

Building a Modern Large Language Model from Scratch

This repository contains a comprehensive, self-contained Jupyter notebook designed to teach the fundamental architecture and training mechanics of modern Large Language Models (LLMs).

It provides a step-by-step implementation of a Generative Pre-trained Transformer (GPT), transitioning from basic language modeling concepts to a fully functional model incorporating state-of-the-art architectural components found in systems like Llama 3.

Project Overview

The primary objective of this project is educational: to demystify the internal operations of LLMs by building one from first principles. Rather than relying on high-level abstractions from libraries like transformers, this notebook implements every component using only Python and PyTorch.

Key Learning Objectives

  • Tokenization: Implementation of Byte Pair Encoding (BPE) for efficient text representation.
  • Transformer Architecture: A detailed construction of the Transformer block, including:
    • Causal Self-Attention: From naive loops to efficient matrix operations.
    • Multi-Head Attention: Parallelizing attention mechanisms.
  • Modern Architectural Enhancements: Integrating components standard in 2024/2025 LLM architectures:
    • RMSNorm (Root Mean Square Normalization) for improved training stability.
    • RoPE (Rotary Positional Embeddings) for better relative position handling.
    • SwiGLU (Swish-Gated Linear Unit) for enhanced expressivity in feed-forward layers.
  • Training Infrastructure: Developing a robust training loop with mixed-precision training (float16), gradient accumulation, and checkpointing.
  • Inference & Alignment: Implementing generation strategies (temperature, top-k sampling) and introducing Reinforcement Learning from Human Feedback (RLHF) concepts.
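To make the central objective concrete, here is a minimal sketch of single-head causal self-attention using efficient matrix operations, in the spirit of the "naive loops to matrix operations" progression above. The function name and tensor shapes are illustrative, not the notebook's exact implementation:

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention over a (T, C) sequence."""
    T, C = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v                       # project to queries/keys/values
    scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)   # scaled dot-product scores
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool))     # lower-triangular causal mask
    scores = scores.masked_fill(~mask, float("-inf"))         # block attention to future tokens
    weights = F.softmax(scores, dim=-1)                       # rows sum to 1
    return weights @ v                                        # weighted sum of values

torch.manual_seed(0)
T, C, H = 4, 8, 8
x = torch.randn(T, C)
w_q, w_k, w_v = (torch.randn(C, H) for _ in range(3))
out = causal_self_attention(x, w_q, w_k, w_v)
```

Because of the mask, changing a later token can never affect the output at earlier positions, which is exactly the property that lets the model be trained to predict the next token.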

Target Audience

This resource is intended for software engineers, data scientists, and researchers seeking a deep, code-level understanding of LLMs.

Prerequisites:

  • Proficiency in Python programming.
  • Foundational knowledge of PyTorch (tensor operations, basic neural network modules).
  • A Google account (for execution on Google Colab).

Notebook Structure

The chatgpt_zero_to_hero.ipynb notebook follows a progressive structure:

  1. Stage 1: The Simplest Language Model (Bigram): Establishing a baseline with a simple probabilistic model.
  2. Stage 2: Tokenization: Implementing Byte Pair Encoding (BPE) from scratch.
  3. Stage 3: Attention (The Heart of Transformers): Deriving self-attention mathematically and implementing it efficiently.
  4. Stage 4: Modern Components (Llama 3): Upgrading the architecture with RMSNorm, RoPE, and SwiGLU.
  5. Stage 5: Assembling the Full GPT Model: Composing the full architecture (~30M parameters).
  6. Stage 6: Training: Training on the FineWeb-Edu dataset using a T4 GPU.
  7. Stage 7: Inference & Chat: Implementing the generation loop and chat interface.
  8. Stage 8: Instruction Fine-Tuning (SFT): Teaching the model to follow instructions.
  9. Stage 9: Alignment (RLHF & DPO): Aligning model outputs with human intent using Direct Preference Optimization.

Usage Instructions

This project is optimized for the Google Colab Free Tier.

1. Environment Setup

Open the notebook in Google Colab, either directly from the repository or by uploading chatgpt_zero_to_hero.ipynb to your Google Drive.

2. Hardware Acceleration

Ensure a GPU runtime is selected to enable efficient training:

  • Navigate to Runtime > Change runtime type.
  • Select T4 GPU.

3. Execution

Execute the notebook cells sequentially. The notebook is designed to be idempotent and self-contained.

  • Runtime: Approximately 2 hours for a complete end-to-end run.
  • Persistence: The notebook implements automatic checkpointing to Google Drive, allowing training to resume seamlessly in case of session interruption.
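A checkpointing scheme like the one described above can be sketched as follows. The function names and the local path are hypothetical (in the notebook the path would point at a mounted Google Drive folder); only the save/load pattern is the point:

```python
import torch

def save_checkpoint(model, optimizer, step, path):
    """Persist model and optimizer state so training can resume after an interruption."""
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
    }, path)

def load_checkpoint(model, optimizer, path):
    """Restore state in place and return the step to resume from."""
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]

# Tiny demo with a stand-in model and a local path.
model = torch.nn.Linear(4, 4)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
save_checkpoint(model, opt, step=100, path="ckpt.pt")
resumed_step = load_checkpoint(model, opt, "ckpt.pt")
```

Saving the optimizer state alongside the weights matters: AdamW keeps per-parameter moment estimates, and resuming without them briefly destabilizes training.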

Scope and Limitations

  • Educational Focus: The resulting model (~30M parameters) is designed for pedagogical clarity and rapid iteration, not for production-grade performance or reasoning capabilities comparable to commercial LLMs.
  • Implementation Depth: While the architecture is modern, the scale is deliberately constrained to fit within free cloud resources.

Attribution and References

This project is inspired by the educational materials of Andrej Karpathy.

Future Directions

Upon completion, learners are encouraged to:

  1. Scale: Increase model size (e.g., to 124M parameters) by adjusting the COLAB_MODE flag.
  2. Data: Retrain the model on domain-specific datasets.
  3. Deep Dive: Explore production codebases such as llama.cpp or Hugging Face transformers with a grounded understanding of their internal mechanics.

About

A comprehensive, self-contained notebook to build a modern GPT from scratch. Implements Llama 3 architecture (RMSNorm, RoPE, SwiGLU) and trains on FineWeb-Edu using PyTorch. Educational, clean, and hackable.
