Factual Forgetting in GPT-2 Models

This project studies catastrophic forgetting in GPT-2 models when fine-tuned on dialog generation tasks (DailyDialog). We measure how fine-tuning affects factual recall using a factual QA benchmark derived from the LAMA T-REx dataset, and track changes across fine-tuning checkpoints.

Installation & running the pipeline

Follow these steps to run the experiment pipeline:

# 1. Create a virtual environment
python -m venv venv

# 2. Activate the virtual environment (Windows)
venv\Scripts\activate

# 2. Activate the virtual environment (macOS/Linux)
source venv/bin/activate

# 3. Install the dependencies listed in requirements.txt
pip install -r requirements.txt

# 4. Run the pipeline from the root directory
python main.py

To run a less computationally intensive version of the pipeline, in which only GPT-2 small is fine-tuned, rename `simple_config.json` to `config.json` (replacing the existing `config.json`) and run the pipeline as above.

Pipeline Overview

The workflow consists of five main steps:

1. Build the factual benchmark

Create a dataset of factual prompts and correct answers from LAMA T-REx, and annotate each example by subject frequency (rare vs. common). Output: data/benchmark/
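The frequency annotation can be sketched as follows. This is a minimal illustration rather than the project's actual code: the record fields (`subject`, `prompt`, `answer`), the `subject_counts` mapping, the function name, and the cutoff of 100 are all assumptions made for the example.

```python
from typing import Dict, List


def annotate_by_frequency(
    examples: List[dict],
    subject_counts: Dict[str, int],
    rare_threshold: int = 100,  # hypothetical cutoff, not taken from the project
) -> List[dict]:
    """Return copies of the examples with a 'frequency' label of 'rare' or 'common'."""
    annotated = []
    for ex in examples:
        # Subjects missing from the counts table are treated as count 0, i.e. rare.
        count = subject_counts.get(ex["subject"], 0)
        label = "rare" if count < rare_threshold else "common"
        annotated.append({**ex, "frequency": label})
    return annotated


examples = [
    {"subject": "Paris", "prompt": "Paris is the capital of", "answer": "France"},
    {"subject": "Tuvalu", "prompt": "The capital of Tuvalu is", "answer": "Funafuti"},
]
counts = {"Paris": 5000, "Tuvalu": 12}  # made-up counts for illustration only
labeled = annotate_by_frequency(examples, counts)
```

The function copies each record instead of mutating it, so the unlabeled benchmark stays available for other thresholds.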

2. Evaluate baseline pretrained models

Evaluate the pretrained GPT-2 and GPT-2-medium models on the benchmark to determine what they know before fine-tuning. Output: data/baseline/
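One way to score a model's continuation against the gold answer is a normalized prefix match. The sketch below is an assumption about how correctness could be decided, not a description of the project's scoring rule; `normalize`, `is_correct`, and `accuracy` are hypothetical helper names, and the generated strings are hard-coded stand-ins for model output.

```python
import re
from typing import List


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def is_correct(generated: str, answer: str) -> bool:
    """True if the continuation's first words equal the answer's words."""
    gen_tokens = normalize(generated).split()
    ans_tokens = normalize(answer).split()
    # Comparing whole tokens avoids counting "Franceville" as a match for "France".
    return len(ans_tokens) > 0 and gen_tokens[: len(ans_tokens)] == ans_tokens


def accuracy(flags: List[bool]) -> float:
    """Share of examples flagged correct; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


# Stand-ins for model continuations of "Paris is the capital of".
flags = [
    is_correct(" France.", "France"),
    is_correct(" a city in Europe", "France"),
]
baseline_accuracy = accuracy(flags)
```

Keeping the per-example flags alongside the aggregate accuracy is what later makes retention comparisons against fine-tuned checkpoints possible.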

3. Fine-tune models on DailyDialog

Fine-tune multiple GPT-2 variants (e.g., small, medium, medium-6-epochs). Output: checkpoints inside checkpoints/...

4. Evaluate fine-tuned checkpoints

Evaluate each checkpoint on the same factual benchmark and store per-example correctness. Output: data/finetuned/...
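Evaluating checkpoints in training order requires sorting them by step rather than by name. The sketch below assumes checkpoint directories are named `checkpoint-<step>`, as the Hugging Face `Trainer` does by default; this README does not state the naming scheme, so treat both the pattern and the `list_checkpoints` helper as hypothetical.

```python
import re
import tempfile
from pathlib import Path
from typing import List


def list_checkpoints(run_dir: Path) -> List[Path]:
    """Return checkpoint directories under run_dir, sorted by training step."""
    pattern = re.compile(r"checkpoint-(\d+)$")  # assumed naming scheme
    found = []
    for child in run_dir.iterdir():
        match = pattern.match(child.name)
        if child.is_dir() and match:
            found.append((int(match.group(1)), child))
    # Sort numerically so that step 500 comes before step 1000.
    return [path for _, path in sorted(found, key=lambda item: item[0])]


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    for name in ["checkpoint-1000", "checkpoint-500", "logs"]:
        (run_dir / name).mkdir()
    ordered = [p.name for p in list_checkpoints(run_dir)]
```

A plain alphabetical sort would place `checkpoint-1000` before `checkpoint-500`, which would scramble any curve plotted over checkpoints.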

5. Compute metrics and plot results

Compute and visualize:

Knowledge Retention Rate (KRR)
Fact Frequency Robustness (FFR)
Representation Drift (cosine similarity of hidden states)

Output: plots and metric CSVs in evaluation/
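The README names these metrics without defining them, so the sketch below shows one plausible reading, labeled as an assumption: KRR as the fraction of baseline-correct examples that stay correct after fine-tuning, a per-frequency breakdown of that rate of the kind FFR presumably summarizes, and drift as cosine similarity between a baseline and a fine-tuned hidden-state vector. All function names and the toy data are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def knowledge_retention_rate(baseline: List[bool], finetuned: List[bool]) -> float:
    """Assumed KRR: share of baseline-correct examples still correct afterwards."""
    known = [i for i, ok in enumerate(baseline) if ok]
    if not known:
        return 0.0
    return sum(1 for i in known if finetuned[i]) / len(known)


def retention_by_frequency(
    baseline: List[bool], finetuned: List[bool], labels: List[str]
) -> Dict[str, float]:
    """Assumed breakdown: KRR computed separately for each frequency label."""
    result = {}
    for label in sorted(set(labels)):
        idx = [i for i, value in enumerate(labels) if value == label]
        result[label] = knowledge_retention_rate(
            [baseline[i] for i in idx], [finetuned[i] for i in idx]
        )
    return result


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy per-example correctness flags, for illustration only.
baseline = [True, True, True, False, True]
finetuned = [True, False, True, True, False]
labels = ["common", "rare", "common", "rare", "rare"]

krr = knowledge_retention_rate(baseline, finetuned)
per_label = retention_by_frequency(baseline, finetuned, labels)
```

Under this reading, examples the baseline already got wrong are excluded from KRR, so a checkpoint is not rewarded or penalized for facts the pretrained model never knew.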

Configuration

All settings live in config.json, including:

Fine-tuning hyperparameters
Which models to train and evaluate
Where output files are written

This allows experiments to be run and compared without modifying source code.
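As an illustration of what such a file might hold and how a script could read it, here is a minimal sketch. The key names (`models`, `finetune`, `paths`) and the hyperparameter values are invented for the example and may not match the real config.json; only the directory names come from this README.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical config contents; the real config.json keys may differ.
example_config = {
    "models": ["gpt2"],
    "finetune": {"epochs": 3, "learning_rate": 5e-5, "batch_size": 8},
    "paths": {
        "benchmark": "data/benchmark/",
        "baseline": "data/baseline/",
        "finetuned": "data/finetuned/",
        "checkpoints": "checkpoints/",
        "evaluation": "evaluation/",
    },
}


def load_config(path: Path) -> dict:
    """Read a JSON config file into a dictionary."""
    with path.open(encoding="utf-8") as f:
        return json.load(f)


# Round-trip through a temporary file to show the loader in use.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text(json.dumps(example_config, indent=2), encoding="utf-8")
    loaded = load_config(config_path)
```

Because every stage reads the same dictionary, switching between a full run and a reduced one only requires swapping the file.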

Outputs

| Stage | Output Location | Description |
| --- | --- | --- |
| Benchmark | `data/benchmark/` | Prompts & answers for factual evaluation |
| Baseline Eval | `data/baseline/` | Accuracy & correctness flags |
| Fine-tuned Eval | `data/finetuned/` | Per-checkpoint performance |
| Metrics & Plots | `evaluation/` | KRR, FFR, drift curves & CSVs |
