- Python 3.11+
Install uv (assuming bash):

```shell
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
```

Create the virtual env and sync with the requirements:
```shell
cd tokenizers
uv venv --python 3.11
source .venv/bin/activate
uv sync
```

Run the tokenizer training script with the following command:
```shell
./.venv/bin/python \
    -m roberta.train_tokenizer \
    --data-dir /netscratch/lfoppiano/scilons/datasets/texts_pq_4-deduped-Eng_Latn \
    --vocab-size 50265 \
    --output-dir /netscratch/lfoppiano/scilons/tokenizers/sciroberta-tokenizer-50k
```

Command line options:
```
usage: train_tokenizer.py [-h] [--data-dir DATA_DIR] [--text-column TEXT_COLUMN]
                          [--vocab-size VOCAB_SIZE] [--output-dir OUTPUT_DIR]

Train a tokenizer on a dataset of text files.

options:
  -h, --help            show this help message and exit
  --data-dir DATA_DIR   Directory containing the dataset files.
  --text-column TEXT_COLUMN
                        Column name containing the text data.
  --vocab-size VOCAB_SIZE
                        Size of the vocabulary.
  --output-dir OUTPUT_DIR
                        Directory to save the tokenizer.
```
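For intuition about what the `--vocab-size` option controls, here is a minimal, self-contained sketch of byte-pair-encoding training, the algorithm family behind tokenizers like this one. This is an illustration only: the corpus, the `train_bpe` helper, and the merge count are made-up examples, not the script's actual API.

```python
# Illustrative BPE training sketch (NOT the roberta.train_tokenizer code):
# repeatedly merge the most frequent adjacent symbol pair until the
# merge budget (which bounds the vocabulary size) is exhausted.
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn `num_merges` BPE merge rules from a list of words."""
    # Each word starts as a tuple of single-character symbols.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere it occurs.
        rewritten = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            rewritten[tuple(out)] += freq
        words = rewritten
    return merges

merges = train_bpe(["lower", "lowest", "low", "low"], num_merges=3)
print(merges)  # → [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

A real run works the same way at scale: `--vocab-size 50265` caps how many merged symbols (plus base characters and special tokens) end up in the vocabulary.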
See the Slurm example in train-tokenizer.md.