AlenaResiko/UML

Repo and Drive Layout

  • Data & Preprocessing

Raw corpora sit under data/: music/ includes Taylor Swift lyric dumps grouped by album, scientific/ stores fetched arXiv sources/metadata, and huggingface/ is where datasets listed in data/huggingface/REGISTRY.json are materialized.

Programmatic ingestion is handled in uml_project/data/scientific/scientific.py (download, tar extraction, title parsing, and JSON export of LaTeX sources) and in the notebook /Users/uml_project/data/pre_processing/scentific_aboba.ipynb.
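The ingestion steps named above (tar extraction, title parsing, JSON export) could look roughly like the sketch below. This is not the code in scientific.py: the function names `parse_title`, `read_tex_sources`, and `export_record`, as well as the output record shape, are hypothetical, and the download step is left out so the example stays offline.

```python
import io
import json
import re
import tarfile

# Matches the start of a LaTeX title command, with an optional [short title].
TITLE_RE = re.compile(r"\\title\s*(?:\[[^\]]*\])?\s*\{")


def parse_title(tex):
    """Return the argument of the first title command, or None if absent.

    Braces are counted so nested commands inside the title are kept whole.
    Escaped braces are not treated specially in this sketch.
    """
    match = TITLE_RE.search(tex)
    if match is None:
        return None
    depth, start = 1, match.end()
    for i in range(start, len(tex)):
        if tex[i] == "{":
            depth += 1
        elif tex[i] == "}":
            depth -= 1
            if depth == 0:
                # Collapse line breaks and repeated spaces inside the title.
                return " ".join(tex[start:i].split())
    return None  # unbalanced braces


def read_tex_sources(archive_bytes):
    """Map member name to decoded text for every .tex file in a tar archive."""
    sources = {}
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".tex"):
                handle = tar.extractfile(member)
                if handle is not None:
                    sources[member.name] = handle.read().decode("utf-8", errors="replace")
    return sources


def export_record(archive_bytes):
    """Serialize the parsed title and raw sources as a JSON string (assumed schema)."""
    sources = read_tex_sources(archive_bytes)
    title = next((t for t in map(parse_title, sources.values()) if t), None)
    return json.dumps({"title": title, "sources": sources}, ensure_ascii=False)
```

Reading members through `extractfile` rather than unpacking to disk avoids writing untrusted archive paths to the filesystem, which may or may not match what the project does.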

Sentence-level cleaning is implemented in uml_project/data/pre_processing/sentence.py (a spaCy-based sentencizer with lyric-specific filtering, deduplication, and LaTeX-aware branches), with the supporting LaTeX scrubber in uml_project/data/pre_processing/latex_helper.py (removes math/environment blocks and inline commands, then normalizes prose before re-splitting). Self-supervised utilities such as synthetic label generation sit in uml_project/data/self_supervised/synthetic_labels.py.
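To illustrate the scrub-then-split flow described above, here is a minimal standard-library sketch. It is an assumption-laden stand-in, not the contents of latex_helper.py or sentence.py: the regexes, the function names `scrub_latex`, `split_sentences`, and `clean_sentences`, and the `min_words` threshold are all hypothetical, and a naive punctuation-based splitter replaces the spaCy sentencizer so the example runs without extra dependencies.

```python
import re

# Environments and math that carry no prose (assumed list, not the project's).
ENV_RE = re.compile(
    r"\\begin\{(equation|align|figure|table)\*?\}.*?\\end\{\1\*?\}", re.DOTALL
)
DISPLAY_MATH_RE = re.compile(r"\$\$.*?\$\$|\\\[.*?\\\]", re.DOTALL)
INLINE_MATH_RE = re.compile(r"\$[^$]*\$")
# Commands whose argument is not prose and should vanish entirely.
DROP_RE = re.compile(r"\\(?:cite[a-z]*|ref|eqref|label)\*?(?:\[[^\]]*\])?\{[^{}]*\}")
# Commands such as \textbf{...}: keep the argument, drop the wrapper.
UNWRAP_RE = re.compile(r"\\[a-zA-Z]+\*?(?:\[[^\]]*\])?\{([^{}]*)\}")
BARE_COMMAND_RE = re.compile(r"\\[a-zA-Z]+\*?")


def scrub_latex(text):
    """Strip math, environments, and commands, then normalize whitespace."""
    for pattern in (ENV_RE, DISPLAY_MATH_RE, INLINE_MATH_RE):
        text = pattern.sub(" ", text)
    text = DROP_RE.sub("", text)
    previous = None
    while previous != text:  # repeat so nested wrappers unwrap from the inside out
        previous = text
        text = UNWRAP_RE.sub(r"\1", text)
    text = BARE_COMMAND_RE.sub(" ", text)
    text = " ".join(text.split())
    return re.sub(r"\s+([.,;:!?])", r"\1", text)  # no space before punctuation


def split_sentences(text):
    """Naive splitter on sentence-final punctuation (stand-in for spaCy)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def clean_sentences(raw, min_words=4):
    """Scrub, split, drop short fragments, and deduplicate while keeping order."""
    seen, kept = set(), []
    for sentence in split_sentences(scrub_latex(raw)):
        key = sentence.lower()
        if len(sentence.split()) >= min_words and key not in seen:
            seen.add(key)
            kept.append(sentence)
    return kept
```

Sentences that lose their math, such as "The loss is.", tend to come out as short fragments, which is one reason a minimum-length filter after scrubbing can be useful.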

Results are in https://drive.google.com/drive/u/0/folders/1tlR0zWuS-A_NvsfKnnS6VFsQlovlsrnX

Running doc (paper analysis, suggested benchmarks + eval, random): https://docs.google.com/document/d/1iEbmiGagz64P1kkLeR32FaTMiEfaba8-CwHGpI4GJw4/edit?usp=sharing

Hyperparameters

train_test_split = 0.8
seed = 42
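A minimal sketch of how these two values could drive a reproducible split, assuming that 0.8 is the training fraction and that the seed controls the shuffle. The function name `split_dataset` is hypothetical and the project may split its data differently.

```python
import random

TRAIN_TEST_SPLIT = 0.8  # assumed to be the fraction of items used for training
SEED = 42


def split_dataset(items, train_fraction=TRAIN_TEST_SPLIT, seed=SEED):
    """Shuffle a copy of items with a fixed seed and cut it into train/test lists."""
    shuffled = list(items)
    # A private Random instance leaves the global random state untouched.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Calling `split_dataset(sentences)` twice on the same input returns the same partition, which keeps evaluation numbers comparable across runs.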

About

Sentence embedding
