Setup Finemath-4plus dataset for training

Initial thoughts on what we want to do to set it up. These are up for discussion.
<https://huggingface.co/datasets/HuggingFaceTB/finemath>
- [x] Create a script that uses the standard llm-foundry library as much as possible to prepare the dataset in MDS format (format still to be confirmed), potentially running on Modal.
- [x] Push to a private HF dataset repo
- [ ] Set up a Modal function to load the dataset into our datasets volume
- [x] Set up a train/valid 90/10 split
- [x] Set up 1M, 10M, 100M, 1B, and full-size versions of the dataset, with a fixed seed so the subsets are reproducible. (Are these appropriate sizes?)
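
For discussion, here is a minimal sketch of how the seeded 90/10 split and the sized subsets could fit together. It is not the actual preparation script: `split_indices`, `nested_subsets`, and the default seed are hypothetical, and it assumes the sizes count documents, which the list above does not specify (they may be meant as tokens). It uses only the Python standard library.

```python
import random


def split_indices(n, valid_fraction=0.1, seed=42):
    """Shuffle range(n) with a fixed seed and return (train, valid) index lists."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    indices = list(range(n))
    rng.shuffle(indices)
    n_valid = int(n * valid_fraction)
    return indices[n_valid:], indices[:n_valid]


def nested_subsets(train_indices, sizes):
    """Map each subset name to a prefix of the shuffled train indices.

    Taking prefixes means every smaller subset is contained in the larger
    ones. A size of None means "full"; sizes above the available count are
    clamped to the full list.
    """
    subsets = {}
    for name, size in sizes.items():
        if size is None:
            subsets[name] = list(train_indices)
        else:
            subsets[name] = train_indices[:min(size, len(train_indices))]
    return subsets


# Assumption: sizes are document counts; the names mirror the list above.
SIZES = {"1m": 1_000_000, "10m": 10_000_000, "100m": 100_000_000,
         "1b": 1_000_000_000, "full": None}
```

Calling `split_indices(n)` twice with the same seed yields the same split, and because the subsets are prefixes of one shuffled order, a run on the smallest subset sees only data that is also in every larger subset.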

Questions
- Do we want to pre-tokenize? If so, with which tokenizer? Will this be a problem if we want to train different models?
- Do we want to keep the tokenized and non-tokenized versions? I think this makes sense.
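
One possible way to keep both versions side by side is to key each tokenized copy by tokenizer name, so models with different tokenizers do not collide. The sketch below is only an illustration of that idea: `dataset_path`, the directory layout, and the tokenizer name in the usage note are all assumptions, not something already decided here.

```python
from pathlib import PurePosixPath


def dataset_path(root, size, split, tokenizer=None):
    """Return a hypothetical storage path for one dataset variant.

    With tokenizer=None the path points at the raw text copy; otherwise it
    points at a tokenized copy stored under that tokenizer's name.
    """
    base = PurePosixPath(root) / size / split
    if tokenizer is None:
        return str(base / "raw")
    # Tokenizer names may contain "/", so flatten them into a single
    # directory name.
    safe_name = tokenizer.replace("/", "--")
    return str(base / "tokenized" / safe_name)
```

For example, `dataset_path("finemath-4plus", "10m", "train")` gives the raw path, while passing `tokenizer="EleutherAI/gpt-neox-20b"` (an arbitrary example name) gives a separate tokenized path under the same size and split.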

Setup Finemath-4plus dataset for training #42
