asayad1/CatastrophicForgetting


Factual Forgetting in GPT-2 Models

This project studies catastrophic forgetting in GPT-2 models when fine-tuned on dialog generation tasks (DailyDialog). We measure how fine-tuning affects factual recall using a factual QA benchmark derived from the LAMA T-REx dataset, and track changes across fine-tuning checkpoints.


Installation & running the pipeline

Follow these steps to run the experiment pipeline:

# 1. Create a virtual environment
python -m venv venv

# 2. Activate the virtual environment (Windows)
. venv\Scripts\activate

# (If on Mac/Linux)
source venv/bin/activate

# 3. Install requirements.txt
pip install -r requirements.txt 

# 4. Run the pipeline from the root directory
python main.py

To run a less computationally intensive version of the pipeline, in which only GPT-2 small is fine-tuned, rename simple_config.json to config.json and then run python main.py.

Pipeline Overview

The workflow consists of five main steps:

1. Build the factual benchmark

Create a dataset of factual prompts and correct answers from LAMA T-REx and annotate examples by subject frequency (rare vs common). Output: data/benchmark/
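As a rough illustration of the frequency annotation, the sketch below labels each benchmark example as rare or common by counting how often its subject occurs. The field names (`subject`, `prompt`, `answer`), the `annotate_by_frequency` helper, and the threshold are assumptions made for illustration, not the repository's actual schema; the project may derive subject frequency from a different source.

```python
from collections import Counter


def annotate_by_frequency(examples, rare_threshold=2):
    """Label each example 'rare' or 'common' by how often its subject occurs.

    Hypothetical helper: counting subjects inside the benchmark itself and the
    threshold value are assumptions, not the repository's method.
    """
    counts = Counter(ex["subject"] for ex in examples)
    annotated = []
    for ex in examples:
        label = "rare" if counts[ex["subject"]] < rare_threshold else "common"
        # Copy the example and add the frequency label without mutating the input.
        annotated.append({**ex, "frequency": label})
    return annotated


examples = [
    {"subject": "Paris", "prompt": "Paris is the capital of", "answer": "France"},
    {"subject": "Paris", "prompt": "Paris is located in", "answer": "France"},
    {"subject": "Tuvalu", "prompt": "The capital of Tuvalu is", "answer": "Funafuti"},
]
annotated = annotate_by_frequency(examples)
```

Keeping the label on each example lets later stages split accuracy by frequency group without recomputing the counts.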

2. Evaluate baseline pretrained models

Evaluate pretrained GPT-2 and GPT-2-medium on the benchmark to determine what they know before fine-tuning. Output: data/baseline/

3. Fine-tune models on DailyDialog

Fine-tune multiple GPT-2 variants (e.g., small, medium, medium-6-epochs). Output: checkpoints in checkpoints/...

4. Evaluate fine-tuned checkpoints

Evaluate each checkpoint on the same factual benchmark and store per-example correctness. Output: data/finetuned/...
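Per-example correctness could be scored along the following lines. The matching rule (a case-insensitive check that the generated continuation starts with the gold answer), the helper names, and the record fields are assumptions; the repository may use a different criterion.

```python
def is_correct(generated: str, answer: str) -> bool:
    """Return True if the generated continuation begins with the gold answer.

    Assumed matching rule, chosen for illustration only.
    """
    return generated.strip().lower().startswith(answer.strip().lower())


def score_examples(records):
    """Attach a boolean 'correct' flag to each record and return overall accuracy."""
    scored = [{**r, "correct": is_correct(r["generated"], r["answer"])} for r in records]
    accuracy = sum(r["correct"] for r in scored) / len(scored) if scored else 0.0
    return scored, accuracy


# Hypothetical model outputs for two benchmark prompts.
records = [
    {"prompt": "Paris is the capital of", "answer": "France", "generated": " France."},
    {"prompt": "The capital of Tuvalu is", "answer": "Funafuti", "generated": " a small island"},
]
scored, accuracy = score_examples(records)
```

Storing the flag per example, rather than only the aggregate accuracy, is what makes it possible to compare the same fact before and after fine-tuning.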

5. Compute metrics and plot results

Compute and visualize:

  • Knowledge Retention Rate (KRR)
  • Fact Frequency Robustness (FFR)
  • Representation Drift (cosine similarity of hidden states)

Output: plots and metric CSVs in evaluation/
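This README names the three metrics without defining them, so the sketch below rests on stated assumptions: KRR is taken to be the share of facts the baseline answered correctly that a checkpoint still answers correctly, FFR the ratio of KRR on rare facts to KRR on common facts, and drift the cosine similarity between two hidden-state vectors for the same prompt. All function names are hypothetical, and the repository's definitions may differ.

```python
import math


def knowledge_retention_rate(baseline, finetuned):
    """Assumed KRR: share of baseline-correct facts still correct after fine-tuning."""
    known = [i for i, ok in enumerate(baseline) if ok]
    if not known:
        return 0.0
    return sum(finetuned[i] for i in known) / len(known)


def fact_frequency_robustness(baseline, finetuned, labels):
    """Assumed FFR: KRR on rare facts divided by KRR on common facts."""
    def krr_for(group):
        idx = [i for i, lab in enumerate(labels) if lab == group]
        return knowledge_retention_rate([baseline[i] for i in idx],
                                        [finetuned[i] for i in idx])
    common = krr_for("common")
    return krr_for("rare") / common if common else float("nan")


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy per-example correctness flags and frequency labels.
baseline = [True, True, True, True, False]
finetuned = [True, False, True, True, True]
labels = ["rare", "rare", "common", "common", "common"]

krr = knowledge_retention_rate(baseline, finetuned)
ffr = fact_frequency_robustness(baseline, finetuned, labels)
sim = cosine_similarity([1.0, 0.0], [1.0, 1.0])
```

Under these assumed definitions, an FFR below 1 would indicate that rare facts are forgotten more readily than common ones, and a cosine similarity that falls across checkpoints would indicate growing representation drift.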


Configuration

All settings live in config.json, including:

  • Fine-tuning hyperparameters
  • Which models to train and evaluate
  • Where output files are written

This allows experiments to be run and compared without modifying source code.
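A configuration file of this kind might be consumed as sketched below. The key names (`models`, `finetune`, `output_dirs`) and all values are hypothetical and shown only to illustrate the pattern; the config.json and simple_config.json shipped with the repository define the actual schema.

```python
import json

# Hypothetical config contents; the real config.json may use different keys.
EXAMPLE_CONFIG = """
{
  "models": ["gpt2"],
  "finetune": {"epochs": 3, "learning_rate": 5e-5, "batch_size": 8},
  "output_dirs": {"benchmark": "data/benchmark/", "evaluation": "evaluation/"}
}
"""


def load_config(text):
    """Parse config JSON and check that the assumed top-level keys are present."""
    config = json.loads(text)
    missing = {"models", "finetune", "output_dirs"} - set(config)
    if missing:
        raise KeyError(f"config is missing keys: {sorted(missing)}")
    return config


config = load_config(EXAMPLE_CONFIG)
```

Validating the keys at load time surfaces a malformed config before any expensive fine-tuning starts.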


Outputs

| Stage | Output Location | Description |
| --- | --- | --- |
| Benchmark | data/benchmark/ | Prompts & answers for factual evaluation |
| Baseline Eval | data/baseline/ | Accuracy & correctness flags |
| Fine-tuned Eval | data/finetuned/ | Per-checkpoint performance |
| Metrics & Plots | evaluation/ | KRR, FFR, drift curves & CSVs |

About

This project investigates catastrophic forgetting in LLMs. This is my submission to the JHU ChatGPT From Scratch course.
