Changes from all commits

Commits (32)
d070d2e
refactor pipeline. resolve task <-> sandbox relationship
May 31, 2021
e12a87e
combine two logger-initting methods
May 31, 2021
80187b2
Merge pull request #1 from youscan/infra
May 31, 2021
e1388b5
Update README.md
dchaplinsky Aug 3, 2021
a495a50
Merge pull request #1 from lang-uk/dchaplinsky-patch-1
dchaplinsky Aug 3, 2021
3670dac
Merge pull request #3 from lang-uk/master
terpiljenya Aug 5, 2021
f01b9d7
Merge remote-tracking branch 'origin/adaptation' into gpt
koren-v Aug 28, 2021
8c4adf9
ukr-gpt, FromIterableTextDataset, GroupTextForCasualLMDataset
koren-v Aug 28, 2021
fd92a6b
MinHashLSH deduplication, Wiki
koren-v Aug 31, 2021
e4ae992
MinHashLSH deduplication, Wiki
koren-v Aug 31, 2021
e41d5b1
configs
koren-v Aug 31, 2021
3513970
updated with experiments
koren-v Sep 6, 2021
1e1f1f6
saving PreTrainedTokenizer
koren-v Sep 6, 2021
4300cfd
new configs
koren-v Sep 6, 2021
e889a3f
Iterable[Iterable[str]] -> Iterable[str] (should be merged outside)
koren-v Sep 6, 2021
acf89ed
writing to single/train-val files
koren-v Sep 6, 2021
6d56975
steps
koren-v Sep 6, 2021
d6f5d73
pre-commit fixes
koren-v Sep 6, 2021
13af9e1
change paths, save_total_limit, comment lsh dependency
koren-v Sep 7, 2021
66b25c8
separate data step: saving input ids
koren-v Sep 9, 2021
ae77dae
separate MinHashLSHDeduplicator
koren-v Sep 9, 2021
5d0566e
convert tokenizer to transformers format
koren-v Sep 9, 2021
2b4103a
removed unused class
koren-v Sep 10, 2021
7a34448
updated config
koren-v Sep 10, 2021
ae0a8c9
map -> submit & as_completed
koren-v Sep 10, 2021
0b1f91d
IterableDataset
koren-v Sep 10, 2021
5bd1dc0
pre-commit fixes
koren-v Sep 10, 2021
eb8aac7
update paths
koren-v Sep 10, 2021
c3fdeb5
fix max length
koren-v Sep 13, 2021
ba2569c
more frequent eval
koren-v Sep 13, 2021
f9f7c43
make validation dataset not IterableDataset
koren-v Sep 16, 2021
5d9c4d2
add logs to .gitignore
koren-v Sep 16, 2021
7 changes: 7 additions & 0 deletions .gitignore
@@ -91,6 +91,13 @@ ENV/
.idea/
.mypy_cache/
apex/
LSH/
/data/
results/
outputs/
lab/
credentials

# logs
logs/
mlruns/
2 changes: 1 addition & 1 deletion .isort.cfg
@@ -6,4 +6,4 @@ use_parentheses=True
 line_length=119
 skip_glob=venv/*,stubs/*
 known_first_party = language_model
-known_third_party = ds_shared,pynlple,setuptools,tokenizers,torch,transformers
+known_third_party = bs4,datasets,ds_shared,more_itertools,numpy,pynlple,setuptools,tokenizers,torch,transformers,wget
31 changes: 2 additions & 29 deletions .pre-commit-config.yaml
@@ -1,41 +1,14 @@
 repos:
-- repo: https://github.com/asottile/seed-isort-config
-  rev: v1.9.1
+- repo: git@github.com:youscan/python-codestyle.git
+  rev: pre_commit_version
   hooks:
   - id: seed-isort-config
-- repo: https://github.com/pre-commit/mirrors-isort
-  rev: v4.3.21
-  hooks:
   - id: isort
-    args: ["-rc"]
-- repo: https://github.com/psf/black
-  rev: 19.3b0
-  hooks:
   - id: black
-    args: ["--line-length=119"]
-- repo: https://github.com/pre-commit/pre-commit-hooks
-  rev: v2.3.0
-  hooks:
   - id: trailing-whitespace
   - id: check-yaml
   - id: check-json
   - id: end-of-file-fixer
   - id: requirements-txt-fixer
-- repo: https://github.com/pycqa/flake8
-  rev: 3.8.2
-  hooks:
   - id: flake8
-    additional_dependencies: [
-      flake8-bugbear==20.1.4,
-      flake8-builtins==1.5.3,
-      flake8-debugger==3.2.1,
-      flake8-isort==3.0.0,
-      isort==4.3.21,
-    ]
-    args: ["--config=setup.cfg"]
-- repo: https://github.com/pre-commit/mirrors-mypy
-  rev: v0.761
-  hooks:
   - id: mypy
-    args: ["--config=setup.cfg"]
-    exclude: configs/
4 changes: 2 additions & 2 deletions README.md
@@ -28,8 +28,8 @@ Ukrainian Roberta is released via [HuggingFace Transformers library](https://hug
 ```python
 from transformers import pipeline, RobertaForMaskedLM, RobertaTokenizer
 
-model = RobertaForMaskedLM.from_pretrained("ukr-roberta-base")
-tokenizer = RobertaTokenizer.from_pretrained("ukr-roberta-base")
+model = RobertaForMaskedLM.from_pretrained("youscan/ukr-roberta-base")
+tokenizer = RobertaTokenizer.from_pretrained("youscan/ukr-roberta-base")
 
 fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
 fill_mask("Тарас Шевченко – великий українсьский <mask>.")
10 changes: 10 additions & 0 deletions configs/cyr/gpt/README.md
@@ -0,0 +1,10 @@
# Steps:

1) `python run.py --task configs/cyr/gpt/load_data/wiki.py`
2) `python -m wikiextractor.WikiExtractor outputs/cyr/gpt/load_data/wiki/ukwiki-latest-pages-articles.xml.bz2 -o outputs/cyr/gpt/load_data/wiki/ukwiki-latest-pages-articles -b 1M --no-templates`
3) `python run.py --task configs/cyr/gpt/load_data/in-house.py`
4) `python run.py --task configs/cyr/gpt/extract_texts/train-validation-open-data.py`
5) `python run.py --task configs/cyr/gpt/extract_texts/in-house-data.py`
6) `python run.py --task configs/cyr/gpt/train_tokenizer/ukr-gpt.py`
7) `python run.py --task configs/cyr/gpt/train_tokenizer/convert-to-transformers.py`
8) `shuf outputs/cyr/gpt/extract_texts/train-validation-open-data/train.txt -o outputs/cyr/gpt/extract_texts/train-validation-open-data/train_shuffled.txt`
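Note: the eight steps above must run in this order, since each stage consumes the previous stage's outputs/ directory. For illustration only, a minimal driver that chains them and stops at the first failure (a hypothetical convenience script, not part of this PR) could look like:

```python
import subprocess

# The pipeline stages from the steps above, in dependency order.
STEPS = [
    "python run.py --task configs/cyr/gpt/load_data/wiki.py",
    "python -m wikiextractor.WikiExtractor"
    " outputs/cyr/gpt/load_data/wiki/ukwiki-latest-pages-articles.xml.bz2"
    " -o outputs/cyr/gpt/load_data/wiki/ukwiki-latest-pages-articles -b 1M --no-templates",
    "python run.py --task configs/cyr/gpt/load_data/in-house.py",
    "python run.py --task configs/cyr/gpt/extract_texts/train-validation-open-data.py",
    "python run.py --task configs/cyr/gpt/extract_texts/in-house-data.py",
    "python run.py --task configs/cyr/gpt/train_tokenizer/ukr-gpt.py",
    "python run.py --task configs/cyr/gpt/train_tokenizer/convert-to-transformers.py",
    "shuf outputs/cyr/gpt/extract_texts/train-validation-open-data/train.txt"
    " -o outputs/cyr/gpt/extract_texts/train-validation-open-data/train_shuffled.txt",
]

for cmd in STEPS:
    subprocess.run(cmd, shell=True, check=True)  # check=True aborts on the first failing stage
```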
22 changes: 22 additions & 0 deletions configs/cyr/gpt/extract_texts/in-house-data.py
@@ -0,0 +1,22 @@
from pynlple.processing.preprocessor import (
    HtmlTagReplacer,
    MultiLetterReplacer,
    MultiNonLetterReplacer,
    StackingPreprocessor,
    URLReplacer,
)

from language_model.data.extract import ExtractTextsFromData, FromLoadedYsDataSource

YS_FOLDER_PATHS = ["outputs/cyr/gpt/load_data/in-house"]


preprocessor = StackingPreprocessor(
    [HtmlTagReplacer(), URLReplacer(), MultiNonLetterReplacer(include_digits=False), MultiLetterReplacer()]
)

ys_train = FromLoadedYsDataSource(source_folder_paths=YS_FOLDER_PATHS)

task = ExtractTextsFromData(
    text_source=ys_train, preprocessor=preprocessor, seeds=100, char_ngram=20, bands=20, min_jaccard=0.9
)
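The seeds/char_ngram/bands/min_jaccard arguments configure MinHash-LSH near-duplicate filtering: bands controls how the 100 hash permutations are split into LSH bands, and texts whose estimated Jaccard similarity over character 20-grams exceeds 0.9 are treated as duplicates. The PR installs mattilyra/LSH for this (see requirements_installation.sh); as a rough sketch of what the parameters mean, written against the datasketch library purely for illustration and not the dependency this task actually uses:

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 100     # "seeds": hash permutations per signature
CHAR_NGRAM = 20    # shingle length, in characters
MIN_JACCARD = 0.9  # estimated-similarity threshold for near-duplicates

def shingles(text: str, n: int = CHAR_NGRAM) -> set:
    # Overlapping character n-grams of the text.
    return {text[i : i + n] for i in range(max(1, len(text) - n + 1))}

texts = [
    "перший документ про щось цікаве",
    "перший документ про щось цікаве",  # near-duplicate, will be dropped
    "зовсім інший текст",
]

lsh = MinHashLSH(threshold=MIN_JACCARD, num_perm=NUM_PERM)
kept = []
for idx, text in enumerate(texts):
    mh = MinHash(num_perm=NUM_PERM)
    for gram in shingles(text):
        mh.update(gram.encode("utf-8"))
    if lsh.query(mh):  # signature collides with an already-kept text
        continue
    lsh.insert(str(idx), mh)
    kept.append(text)
```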
28 changes: 28 additions & 0 deletions configs/cyr/gpt/extract_texts/train-validation-open-data.py
@@ -0,0 +1,28 @@
from itertools import chain

from datasets import load_dataset
from pynlple.processing.preprocessor import (
    HtmlTagReplacer,
    MultiLetterReplacer,
    MultiNonLetterReplacer,
    StackingPreprocessor,
    URLReplacer,
)

from language_model.data.extract import PostWikiExtractorDataSource, RandomSplitTextsFromData

WIKI_EXTRACTED_PATH = "outputs/cyr/gpt/load_data/wiki/ukwiki-latest-pages-articles"


preprocessor = StackingPreprocessor(
    [HtmlTagReplacer(), URLReplacer(), MultiNonLetterReplacer(include_digits=False), MultiLetterReplacer()]
)

oscar_train = (item["text"] for item in load_dataset("oscar", "unshuffled_deduplicated_uk", split="train"))
cc100_train = (item["text"] for item in load_dataset("cc100", lang="uk", split="train"))
wiki_train = (item["text"] for item in PostWikiExtractorDataSource(WIKI_EXTRACTED_PATH))


task = RandomSplitTextsFromData(
    text_source=chain(oscar_train, cc100_train, wiki_train), preprocessor=preprocessor, test_size=5_000
)
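Note that chain() keeps all three corpora lazy: OSCAR, CC-100 and the extracted Wikipedia are streamed one text at a time through the preprocessor rather than materialized in memory (load_dataset itself memory-maps its downloaded Arrow cache), which matters at this corpus scale.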
27 changes: 27 additions & 0 deletions configs/cyr/gpt/extract_vectors/vectorize-train.py
@@ -0,0 +1,27 @@
import os

from transformers import PreTrainedTokenizerFast

from language_model.data.extract import ExtractVectorsFromTexts, LineByLineSource, ShuffledSources

TOKENIZER_PATH = "outputs/cyr/gpt/train_tokenizer/convert-to-transformers/tokenizer/"

IN_HOUSE_TRAIN_DATA_PATH = "outputs/cyr/gpt/extract_texts/in-house-data/texts.txt"
OPEN_TRAIN_DATA_PATH = "outputs/cyr/gpt/extract_texts/train-validation-open-data/train_shuffled.txt"
MODEL_MAX_LENGTH = 1024

# data
train_data_source = ShuffledSources(
    (text for text in LineByLineSource(IN_HOUSE_TRAIN_DATA_PATH)),
    (text for text in LineByLineSource(OPEN_TRAIN_DATA_PATH)),
)

os.environ["TOKENIZERS_PARALLELISM"] = "true"

task = ExtractVectorsFromTexts(
    data_source=train_data_source,
    tokenizer=PreTrainedTokenizerFast.from_pretrained(TOKENIZER_PATH),
    block_size=MODEL_MAX_LENGTH,
    process_batch_size=100_000,
    workers=18,
)
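ExtractVectorsFromTexts itself is not part of this diff; judging from block_size and the collator used in train_model, it presumably tokenizes the stream in parallel batches and packs the resulting ids into fixed-length blocks. A minimal sketch of that packing step, under that assumption:

```python
from typing import Iterable, Iterator, List

def pack_into_blocks(tokenized: Iterable[List[int]], block_size: int = 1024) -> Iterator[List[int]]:
    # Concatenate token-id sequences and emit fixed-size blocks (GPT-style text grouping).
    buffer: List[int] = []
    for ids in tokenized:
        buffer.extend(ids)
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            del buffer[:block_size]
    # Any incomplete tail shorter than block_size is dropped.

blocks = list(pack_into_blocks([[1, 2, 3], [4, 5], [6] * 2046], block_size=1024))
assert all(len(b) == 1024 for b in blocks) and len(blocks) == 2
```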
22 changes: 22 additions & 0 deletions configs/cyr/gpt/extract_vectors/vectorize-validation.py
@@ -0,0 +1,22 @@
import os

from transformers import PreTrainedTokenizerFast

from language_model.data.extract import ExtractVectorsFromTexts, LineByLineSource

TOKENIZER_PATH = "outputs/cyr/gpt/train_tokenizer/convert-to-transformers/tokenizer/"

OPEN_VALIDATION_DATA_PATH = "outputs/cyr/gpt/extract_texts/train-validation-open-data/validation.txt"
MODEL_MAX_LENGTH = 1024

# data
validation_data_source = LineByLineSource(OPEN_VALIDATION_DATA_PATH)
os.environ["TOKENIZERS_PARALLELISM"] = "true"

task = ExtractVectorsFromTexts(
    data_source=validation_data_source,
    tokenizer=PreTrainedTokenizerFast.from_pretrained(TOKENIZER_PATH),
    block_size=MODEL_MAX_LENGTH,
    process_batch_size=100_000,
    workers=18,
)
9 changes: 9 additions & 0 deletions configs/cyr/gpt/load_data/in-house.py
@@ -0,0 +1,9 @@
from language_model.data.load import YSDataDownloadTask
from language_model.data.processing import LightweightMention

task = YSDataDownloadTask(
    credentials_path="credentials",
    topic_id=275648,
    query={"from": "2019-01-01", "to": "2021-09-01", "sanitize": False, "dedup": False},
    mention_processor=LightweightMention(),
)
3 changes: 3 additions & 0 deletions configs/cyr/gpt/load_data/wiki.py
@@ -0,0 +1,3 @@
from language_model.data.load import WikiDownloadTask

task = WikiDownloadTask(url="https://dumps.wikimedia.org/ukwiki/latest/ukwiki-latest-pages-articles.xml.bz2")
74 changes: 74 additions & 0 deletions configs/cyr/gpt/train_model/ukr-gpt.py
@@ -0,0 +1,74 @@
from transformers import (
    GPT2Config,
    GPT2LMHeadModel,
    IntervalStrategy,
    PreTrainedTokenizerFast,
    Trainer,
    TrainingArguments,
)

from language_model.data.dataset import (
    DataCollatorForGroupTextForCasualLMDataset,
    FromInputIdsDataset,
    FromInputIdsIterableDataset,
)
from language_model.modelling.trainer import TransformersTrainTask

TOKENIZER_PATH = "outputs/cyr/gpt/train_tokenizer/convert-to-transformers/tokenizer/"

TRAIN_IDS_PATH = "outputs/cyr/gpt/extract_vectors/vectorize-train/processed_batch.jsonl"
VALIDATION_IDS_PATH = "outputs/cyr/gpt/extract_vectors/vectorize-validation/processed_batch.jsonl"
MODEL_MAX_LENGTH = 1024


# tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained(TOKENIZER_PATH)
# model
model_config = GPT2Config(vocab_size=len(tokenizer), bos_token_id=tokenizer.bos_token_id)
model = GPT2LMHeadModel(model_config)


# data
train_dataset = FromInputIdsIterableDataset(TRAIN_IDS_PATH)
valid_dataset = FromInputIdsDataset(VALIDATION_IDS_PATH)
data_collator = DataCollatorForGroupTextForCasualLMDataset(MODEL_MAX_LENGTH)


training_args = TrainingArguments(
    do_train=True,
    do_eval=True,
    evaluation_strategy=IntervalStrategy.STEPS,
    eval_steps=20_000,
    num_train_epochs=5,
    per_device_train_batch_size=4,  # overall bs = 4 * 16 * num_gpus (GPT2 used 512)
    gradient_accumulation_steps=16,
    per_device_eval_batch_size=4,
    output_dir="checkpoints",
    overwrite_output_dir=False,
    save_steps=20_000,
    save_total_limit=10,
    prediction_loss_only=False,
    learning_rate=0.0002,  # (was manually tuned in GPT2 on held-out validation)
    warmup_ratio=0.004,
    fp16=True,
    logging_dir="logs",
    seed=42,
    lr_scheduler_type="cosine",  # type: ignore
    logging_first_step=True,
    logging_steps=500,
    label_names=["labels"],
    load_best_model_at_end=True,
    group_by_length=False,
    report_to=["mlflow"],
    dataloader_num_workers=1,  # because of IterableDataset that reads from one opened file
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    data_collator=data_collator,
)

task = TransformersTrainTask(trainer=trainer)
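To spell out the batch-size comment in this config: with per_device_train_batch_size=4 and gradient_accumulation_steps=16, each optimizer step sees 4 × 16 = 64 sequences per GPU, so GPT-2's reference batch of 512 is matched when training on 8 GPUs (4 × 16 × 8 = 512).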
22 changes: 22 additions & 0 deletions configs/cyr/gpt/train_tokenizer/convert-to-transformers.py
@@ -0,0 +1,22 @@
from transformers import PreTrainedTokenizerFast

from language_model.tokenization.factory import FAST_TOKENIZER_DEFAULT_FILE_NAME
from language_model.tokenization.tasks import PreTrainedTokenizerFastSavingTask

TOKENIZER_PATH = f"outputs/cyr/gpt/train_tokenizer/ukr-gpt/{FAST_TOKENIZER_DEFAULT_FILE_NAME}"

IN_HOUSE_TRAIN_DATA_PATH = "outputs/cyr/gpt/extract_texts/in-house-data/texts.txt"
OPEN_TRAIN_DATA_PATH = "outputs/cyr/gpt/extract_texts/train-validation-open-data/train_shuffled.txt"
MODEL_MAX_LENGTH = 1024


# tokenizer
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file=TOKENIZER_PATH, model_max_length=MODEL_MAX_LENGTH, padding_side="right"
)
tokenizer.add_special_tokens({"bos_token": "<|endoftext|>"})
# `pad_token` is not actually used during training, since DataCollatorForGroupTextForCasualLMDataset packs
# sequences up to max_length; it is set here only to avoid an error inside that collator.
tokenizer.pad_token = tokenizer.bos_token

task = PreTrainedTokenizerFastSavingTask(pretrained_fast_tokenizer=tokenizer)
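Once the saving task has run, downstream configs reload the tokenizer with from_pretrained. A quick round-trip sanity check (assuming the task writes to the tokenizer/ path used by the vectorize configs):

```python
from transformers import PreTrainedTokenizerFast

tok = PreTrainedTokenizerFast.from_pretrained(
    "outputs/cyr/gpt/train_tokenizer/convert-to-transformers/tokenizer/"
)
ids = tok("Тарас Шевченко – великий український поет.")["input_ids"]
# Byte-level BPE decodes losslessly, so this should reproduce the input
# (possibly modulo a leading space from add_prefix_space).
print(len(ids), tok.decode(ids))
```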
23 changes: 23 additions & 0 deletions configs/cyr/gpt/train_tokenizer/ukr-gpt.py
@@ -0,0 +1,23 @@
from itertools import islice

from tokenizers import Tokenizer, decoders, models, pre_tokenizers, processors, trainers

from language_model.data.extract import LineByLineSource
from language_model.tokenization.trainer import TrainTokenizerTask

tokenizer = Tokenizer(models.BPE())

tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)


TRAIN_DATA_PATH = "outputs/cyr/gpt/extract_texts/train-validation-open-data/train.txt"
NUM_TRAIN_LINES = 1_000_000
TRAIN_SAMPLING_STEP = 200
train_data_source = islice(
    (line for i, line in enumerate(LineByLineSource(TRAIN_DATA_PATH)) if i % TRAIN_SAMPLING_STEP == 0), NUM_TRAIN_LINES
)
trainer = trainers.BpeTrainer(vocab_size=50264, special_tokens=["<|endoftext|>"])

task = TrainTokenizerTask(tokenizer=tokenizer, iterator=train_data_source, trainer=trainer)
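On the sampling above: keeping every 200th line and then capping with islice means the BPE trainer sees at most 1,000,000 lines, and the NUM_TRAIN_LINES cap only binds if train.txt holds at least 200 million lines; otherwise the sample is roughly len(train.txt) / 200 lines.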
23 changes: 0 additions & 23 deletions configs/ukr/train_model/ukr-roberta-base.py

This file was deleted.

11 changes: 0 additions & 11 deletions configs/ukr/train_tokenizer/ukr-roberta-base.py

This file was deleted.

8 changes: 8 additions & 0 deletions requirements.txt
@@ -1,4 +1,12 @@
bs4==0.0.1
Cython==3.0.0a9
datasets==1.11.0
lxml==4.6.3
more-itertools==8.9.0
numpy==1.19.5
pyNlple==0.7.5
tokenizers==0.10.1
torch==1.8.1
transformers==4.4.2
wget==3.2
wikiextractor==3.0.4
6 changes: 6 additions & 0 deletions requirements_installation.sh
@@ -9,6 +9,12 @@ pip install -r requirements.dev.txt
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
)

(
git clone https://github.com/mattilyra/LSH || { echo "Failed to download and install LSH"; exit 1; }
cd LSH && \
python setup.py install
)

pip install -e .

pre-commit install
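The cloned LSH package compiles Cython extensions during python setup.py install, which is presumably why Cython is pinned in requirements.txt; per commit 13af9e1 ("comment lsh dependency"), it is installed from source here rather than via pip.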