- Python 3.11+
Install uv (assuming bash):

```shell
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
```

Create the virtual env and sync with the requirements:
```shell
cd tokenizers
uv venv --python 3.11
source .venv/bin/activate
uv sync
```

Run the tokenizer training script with the following command:
```shell
./.venv/bin/python \
    -m roberta.train_tokenizer \
    --data-dir /netscratch/lfoppiano/scilons/datasets/texts_pq_4-deduped-Eng_Latn \
    --vocab-size 50265 \
    --output-dir /netscratch/lfoppiano/scilons/tokenizers/sciroberta-tokenizer-50k
```

Command line options:
```
usage: train_tokenizer.py [-h] [--data-dir DATA_DIR] [--text-column TEXT_COLUMN]
                          [--vocab-size VOCAB_SIZE] [--output-dir OUTPUT_DIR]

Train a tokenizer on a dataset of text files.

options:
  -h, --help            show this help message and exit
  --data-dir DATA_DIR   Directory containing the dataset files.
  --text-column TEXT_COLUMN
                        Column name containing the text data.
  --vocab-size VOCAB_SIZE
                        Size of the vocabulary.
  --output-dir OUTPUT_DIR
                        Directory to save the tokenizer.
```
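For intuition about what the `--vocab-size` option controls, here is a minimal, self-contained sketch of byte-pair-encoding training, the algorithm family behind tokenizers like this one. This is an illustration only: the corpus, the `train_bpe` helper, and the merge count are made-up examples, not the script's actual API.

```python
# Illustrative BPE training sketch (NOT the roberta.train_tokenizer code):
# repeatedly merge the most frequent adjacent symbol pair until the
# merge budget (which bounds the vocabulary size) is exhausted.
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn `num_merges` BPE merge rules from a list of words."""
    # Each word starts as a tuple of single-character symbols.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere it occurs.
        rewritten = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            rewritten[tuple(out)] += freq
        words = rewritten
    return merges

merges = train_bpe(["lower", "lowest", "low", "low"], num_merges=3)
print(merges)  # → [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

A real run works the same way at scale: `--vocab-size 50265` caps how many merged symbols (plus base characters and special tokens) end up in the vocabulary.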
See the Slurm example in train-tokenizer.md.