This project studies catastrophic forgetting in GPT-2 models that are fine-tuned on a dialog generation task (DailyDialog). We measure how fine-tuning affects factual recall using a factual QA benchmark derived from the LAMA T-REx dataset, and we track how recall changes across fine-tuning checkpoints.
Follow these steps to run the experiment pipeline:
# 1. Create a virtual environment
python -m venv venv
# 2. Activate the virtual environment (Windows)
venv\Scripts\activate
# (If on Mac/Linux)
source venv/bin/activate
# 3. Install requirements.txt
pip install -r requirements.txt
# 4. Run the pipeline from the root directory
python main.py

To run a less computationally intensive version of the pipeline, in which only GPT-2 small is fine-tuned, rename simple_config.json to config.json and then run the pipeline.
The workflow consists of five main steps:

1. Create a dataset of factual prompts and correct answers from LAMA T-REx, and annotate each example by subject frequency (rare vs. common).
   Output: data/benchmark/
2. Evaluate the pretrained GPT-2 and GPT-2-medium models on the benchmark to determine what they know before fine-tuning.
   Output: data/baseline/
3. Fine-tune multiple GPT-2 variants (e.g., small, medium, medium-6-epochs).
   Output: checkpoints inside checkpoints/...
4. Evaluate each checkpoint on the same factual benchmark and store per-example correctness.
   Output: data/finetuned/...
5. Compute and visualize:
   - Knowledge Retention Rate (KRR)
   - Fact Frequency Robustness (FFR)
   - Representation Drift (cosine similarity of hidden states)

   Output: plots and metric CSVs in evaluation/
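
The metrics listed above can be illustrated with a small, dependency-free sketch. It is not the project's implementation: the function names are hypothetical, and it assumes that KRR is the share of facts answered correctly at baseline that are still answered correctly after fine-tuning, that frequency robustness is studied by computing that rate separately for each frequency label, and that drift is measured as the cosine similarity between two hidden-state vectors for the same prompt.

```python
import math


def knowledge_retention_rate(baseline_correct, finetuned_correct):
    """Assumed KRR: of the facts correct at baseline, the share still correct."""
    known = [i for i, ok in enumerate(baseline_correct) if ok]
    if not known:
        return float("nan")  # nothing was known, so retention is undefined
    retained = sum(1 for i in known if finetuned_correct[i])
    return retained / len(known)


def retention_by_frequency(baseline_correct, finetuned_correct, frequency_labels):
    """Compute the retention rate separately per frequency label (e.g. rare, common)."""
    rates = {}
    for label in sorted(set(frequency_labels)):
        idx = [i for i, lab in enumerate(frequency_labels) if lab == label]
        rates[label] = knowledge_retention_rate(
            [baseline_correct[i] for i in idx],
            [finetuned_correct[i] for i in idx],
        )
    return rates


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return float("nan")  # similarity is undefined for a zero vector
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy per-example correctness flags, not real results.
    baseline = [True, True, False, True]
    finetuned = [True, False, False, True]
    labels = ["rare", "rare", "common", "common"]
    print(knowledge_retention_rate(baseline, finetuned))
    print(retention_by_frequency(baseline, finetuned, labels))
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

How the per-label retention rates are combined into a single FFR value depends on the project's own definition, so the sketch stops at the per-label rates.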
All settings live in config.json, including:
- Fine-tuning hyperparameters
- Which models to train and evaluate
- Where output files are written
This allows experiments to be run and compared without modifying source code.
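
As an illustration of the config-driven setup described above, the following sketch reads a JSON config and lists the runs it describes. The key names `models` and `output_dir` are assumptions made for this example; the actual keys in config.json may differ.

```python
import json
from pathlib import Path


def load_config(path="config.json"):
    """Read the JSON config file into a plain dictionary."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def describe_runs(config):
    """List each configured model with the directory its results go to.

    "models" and "output_dir" are hypothetical key names used for illustration.
    """
    models = config.get("models", [])
    output_dir = config.get("output_dir", "evaluation/")
    return [f"{name} -> {output_dir}" for name in models]


if __name__ == "__main__":
    for line in describe_runs(load_config()):
        print(line)
```

Because every run is described by the file rather than by code, two experiments can be compared by diffing their config files.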
| Stage | Output Location | Description |
|---|---|---|
| Benchmark | data/benchmark/ | Prompts & answers for factual evaluation |
| Baseline Eval | data/baseline/ | Accuracy & correctness flags |
| Fine-tuned Eval | data/finetuned/ | Per-checkpoint performance |
| Metrics & Plots | evaluation/ | KRR, FFR, drift curves & CSVs |