A pre-training pipeline for RoBERTa masked language models on custom datasets, built with PyTorch and HuggingFace Transformers. Phi-3 causal language model training is also supported.
## Features

- Tokenize HuggingFace datasets with sliding-window stride support
- Pre-train RoBERTa (MLM) or Phi-3 (causal LM) from scratch
- Custom data collator that excludes special tokens from masking
- Weights & Biases integration (optional)
- Checkpoint resume support
- Ready-to-use SLURM scripts for HPC clusters
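The collator's core rule — special tokens are never masking candidates — can be sketched in plain Python. This is a simplified illustration of the idea, not the code in `custom_data_collator.py`; the helper name is hypothetical:

```python
import random

MASK_ID = 50264  # <mask> token id in the roberta-base vocabulary

def select_mask_positions(special_tokens_mask, masking_prob=0.15, rng=None):
    """Pick ~masking_prob of the non-special positions to mask."""
    rng = rng or random.Random(0)
    # Only positions where special_tokens_mask is 0 are candidates
    candidates = [i for i, is_special in enumerate(special_tokens_mask) if not is_special]
    n_mask = max(1, round(len(candidates) * masking_prob))
    return sorted(rng.sample(candidates, n_mask))

input_ids = [0, 11, 12, 13, 14, 15, 2]  # 0 and 2 are RoBERTa's <s>/</s> ids
special   = [1, 0, 0, 0, 0, 0, 1]       # so positions 0 and 6 are never chosen
positions = select_mask_positions(special, masking_prob=0.4)
masked = [MASK_ID if i in positions else t for i, t in enumerate(input_ids)]
```

A full MLM collator (such as HuggingFace's `DataCollatorForLanguageModeling`) additionally replaces 80% of the selected positions with the mask token, 10% with random tokens, and leaves 10% unchanged; the point here is only the candidate filter.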
## Requirements

- Python 3.11
- UV package manager
## Installation

```bash
git clone https://github.com/scilons/roberta-pretrain.git
cd roberta-pretrain
uv venv --python 3.11
source .venv/bin/activate
uv sync
```

## Dataset tokenization

Tokenize a HuggingFace dataset and save it to disk:
```bash
python -m roberta_pretrain.commands.dataset \
    --dataset-with-range "dataset_name:0-100" \
    --tokenizer "FacebookAI/roberta-base" \
    --output-dir ./dataset \
    --max-length 510
```

Arguments:
- `--dataset-with-range` — HuggingFace dataset name with percentage range (e.g. `"my-org/my-dataset:0-100"` for 100%, `"my-org/my-dataset:0-50"` for the first half)
- `--tokenizer` — HuggingFace tokenizer name or local path
- `--output-dir` — Directory to write the tokenized dataset
- `--max-length` — Maximum token sequence length. Must be at least 2 less than `max_position_embeddings` in your training config
- `--stride` — (optional, default 0) Sliding window stride for long documents
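The `--stride` option follows the HuggingFace tokenizer convention: consecutive windows over a long document overlap by `stride` tokens. A minimal sketch of that windowing (not the actual implementation in `dataset.py`):

```python
def chunk_with_stride(token_ids, max_length, stride=0):
    """Split token_ids into windows of max_length, overlapping by `stride` tokens."""
    if stride >= max_length:
        raise ValueError("stride must be smaller than max_length")
    step = max_length - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # last window already covers the tail
    return chunks

# 10 tokens, windows of 4, overlapping by 2:
windows = chunk_with_stride(list(range(10)), max_length=4, stride=2)
# -> [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

With `stride=0` (the default) the windows are disjoint; a positive stride repeats context at window boundaries so no token loses its neighbours.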
## Training

```bash
python -m roberta_pretrain.commands.train \
    --seed 42 \
    --config configs/training.toml \
    --tokenizer "FacebookAI/roberta-base" \
    --datasets ./dataset \
    --checkpoint ./checkpoints \
    --output ./model
```

Arguments:
- `--seed` — Random seed for reproducibility
- `--config` — Path to a TOML training config (see Configuration)
- `--tokenizer` — HuggingFace tokenizer name or local path
- `--datasets` — One or more paths to tokenized datasets (accepts multiple paths)
- `--checkpoint` — Directory for training checkpoints
- `--output` — Directory to save the final model
- `--masking-prob` — (optional, default 0.15) MLM masking probability
- `--resume` — (optional) Resume training from the latest checkpoint
- `--is-phi3` — (optional) Train a Phi-3 causal LM instead of RoBERTa
- `--wandb-project` — (optional) Weights & Biases project name (enables W&B logging)
- `--wandb-entity` — (optional) Weights & Biases entity/team name
Use the 3-layer debug config to verify everything works before launching a full run:
```bash
python -m roberta_pretrain.commands.train \
    --seed 42 \
    --config configs/3-layer-roberta.toml \
    --tokenizer "FacebookAI/roberta-base" \
    --datasets ./dataset \
    --checkpoint ./checkpoints \
    --output ./model
```

## Configuration

Training is configured via TOML files with two sections:
```toml
[model]
max_position_embeddings = 514  # must be >= max_length + 2

[training]
per_device_train_batch_size = 256
num_train_epochs = 1
max_steps = 240000
warmup_steps = 2400
learning_rate = 1e-4
weight_decay = 0.1
# ... see configs/training.toml for all options
```

Available configs:
- `configs/training.toml` — Production RoBERTa config (based on the original paper)
- `configs/3-layer-roberta.toml` — Small 3-layer model for debugging
- `configs/phi3/03B.toml` — Phi-3 3B config
- `configs/phi3/debug.toml` — Small Phi-3 for debugging
## SLURM scripts

The `slurms/` directory contains example job scripts for running on HPC clusters. They were written for the DFKI Pegasus cluster using enroot containers, but can be adapted to any SLURM environment.
Dataset tokenization:
```bash
sbatch slurms/202507/datasets/512_roberta_base.slurm
```

Training:

```bash
sbatch slurms/202507/train/512/train.slurm
```

To adapt the SLURM scripts to your cluster:
- Update `--container-image` to your PyTorch container image (or remove the container options if not using enroot)
- Update `--partition` to match your cluster's available partitions
- Adjust `--container-workdir` and data paths to your environment
- Set your `HF_TOKEN` and `WANDB_API_KEY` environment variables as needed
## Development

Lint, format, and type-check:

```bash
ruff check .
black .
mypy .
```

Run the tests:

```bash
pytest
pytest tests/test_foo.py -k "test_name"  # run a single test
```

## Project structure

```
src/roberta_pretrain/
├── commands/
│   ├── dataset.py              # CLI: tokenize datasets
│   └── train.py                # CLI: run pre-training
├── custom_data_collator.py     # MLM collator excluding special tokens
├── dataset.py                  # Dataset loading and tokenization logic
├── train.py                    # RobertaTrainer and Phi3Trainer classes
└── utils.py                    # Logging utilities
configs/                        # TOML training configurations
slurms/                         # Example SLURM job scripts
```
## License

Apache License 2.0 — see LICENSE for details.