scilons/tokenisers

tokenizers

Requirements

  • Python 3.11+

Setup

Install uv (assuming a bash shell):

curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

Create the virtual environment and sync the dependencies:

cd tokenizers
uv venv --python 3.11
source .venv/bin/activate
uv sync

Usage

Run the tokenizer training script with the following command:

./.venv/bin/python \
    -m roberta.train_tokenizer \
    --data-dir /netscratch/lfoppiano/scilons/datasets/texts_pq_4-deduped-Eng_Latn \
    --vocab-size 50265 \
    --output-dir /netscratch/lfoppiano/scilons/tokenizers/sciroberta-tokenizer-50k
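The training code itself is not shown in this README. As a rough illustration of what byte-pair-encoding (BPE) tokenizer training does with a vocabulary budget such as `--vocab-size`, here is a minimal from-scratch sketch; the function name `train_bpe`, the toy corpus, and the merge count are all illustrative, and the real script presumably delegates this to a tokenizer library rather than implementing it by hand:

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn BPE merges from a word-frequency dict (word -> count).

    Each merge adds one new symbol to the vocabulary, so the number of
    merges is roughly (target vocab size - base alphabet size).
    """
    # Represent each word as a tuple of symbols (single characters to start).
    vocab = {tuple(w): c for w, c in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for sym, count in vocab.items():
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = {}
        for sym, count in vocab.items():
            out, i = [], 0
            while i < len(sym):
                if i + 1 < len(sym) and (sym[i], sym[i + 1]) == best:
                    out.append(sym[i] + sym[i + 1])
                    i += 2
                else:
                    out.append(sym[i])
                    i += 1
            new_vocab[tuple(out)] = count
        vocab = new_vocab
    return merges

merges = train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 4)
```

A production run differs mainly in scale: the dataset is streamed from Parquet files, the symbol alphabet is byte-level, and the merge loop runs until the vocabulary reaches the requested size (50,265 above, matching RoBERTa's vocabulary).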

Command line options:

usage: train_tokenizer.py [-h] [--data-dir DATA_DIR] [--text-column TEXT_COLUMN] [--vocab-size VOCAB_SIZE] [--output-dir OUTPUT_DIR]

Train a tokenizer on a dataset of text files.

options:
  -h, --help            show this help message and exit
  --data-dir DATA_DIR   Directory containing the dataset files.
  --text-column TEXT_COLUMN
                        Column name containing the text data.
  --vocab-size VOCAB_SIZE
                        Size of the vocabulary.
  --output-dir OUTPUT_DIR
                        Directory to save the tokenizer.
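The help text above maps onto a standard `argparse` parser. The following is a hypothetical reconstruction of that interface for readers scripting around it; the defaults shown here are assumptions, not values taken from the actual script:

```python
import argparse

def build_parser():
    # Hypothetical sketch of the parser behind train_tokenizer.py.
    # Option names mirror the --help output; defaults are illustrative.
    p = argparse.ArgumentParser(
        description="Train a tokenizer on a dataset of text files.")
    p.add_argument("--data-dir",
                   help="Directory containing the dataset files.")
    p.add_argument("--text-column", default="text",
                   help="Column name containing the text data.")
    p.add_argument("--vocab-size", type=int, default=50265,
                   help="Size of the vocabulary.")
    p.add_argument("--output-dir",
                   help="Directory to save the tokenizer.")
    return p

args = build_parser().parse_args(
    ["--data-dir", "/tmp/data", "--vocab-size", "50265"])
```

Note that hyphenated option names become underscored attributes (`args.vocab_size`, `args.data_dir`) after parsing.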

See the Slurm example in train-tokenizer.md.
