Setup Finemath-4plus dataset for training #42

Description

@matdmiller

Initial thoughts on how to set up the dataset. All of this is up for discussion.
https://huggingface.co/datasets/HuggingFaceTB/finemath

  • Create a script that uses the standard llm-foundry library as much as possible to prepare the dataset in MDS format (format still to be confirmed), potentially running it on Modal
  • Push the prepared dataset to a private HF dataset repo
  • Set up a Modal function to load the dataset into our datasets volume
  • Set up a 90/10 train/valid split
  • Set up 1M, 10M, 100M, 1B, and full-size versions of the dataset, with a fixed seed so we can reproduce them. (Are these appropriate sizes?)
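
To make the split and subset items concrete, here is a minimal sketch of one way to get a reproducible 90/10 split and nested subsets. It is not tied to llm-foundry or MDS. It assumes the sizes are counted in samples (they could equally be meant as tokens), and the function names, the seed value, and the subset names in the usage lines are all hypothetical.

```python
import random


def split_indices(num_samples, seed=42, valid_fraction=0.1):
    """Shuffle sample indices with a fixed seed, then split into train/valid."""
    indices = list(range(num_samples))
    # A dedicated Random instance keeps the shuffle independent of global state.
    random.Random(seed).shuffle(indices)
    num_valid = int(num_samples * valid_fraction)
    return indices[num_valid:], indices[:num_valid]


def nested_subsets(train_indices, sizes):
    """Take prefixes of the shuffled train indices.

    Because every subset is a prefix of the same shuffled list, each smaller
    subset is fully contained in every larger one.
    """
    subsets = {}
    for name, size in sizes.items():
        subsets[name] = train_indices[: min(size, len(train_indices))]
    subsets["full"] = list(train_indices)
    return subsets


if __name__ == "__main__":
    # Toy sizes for illustration only; the real sizes are the ones listed above.
    train, valid = split_indices(1000, seed=42)
    subsets = nested_subsets(train, {"small": 10, "medium": 100})
    print(len(train), len(valid), {k: len(v) for k, v in subsets.items()})
```

Taking prefixes of a single seeded shuffle means the smaller versions are strict subsets of the larger ones, which may make scaling comparisons easier to interpret. Sampling each size independently would also work if that property is not wanted.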

Questions

  • Do we want to pre-tokenize? If so, with which tokenizer? Will this be a problem if we want to train different models?
  • Do we want to keep both the tokenized and non-tokenized versions? I think this makes sense.
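
If we do keep both versions, one possible layout (a sketch only, not a decided convention) is to key each tokenized copy by tokenizer name so that several tokenizers can sit next to the raw text. The `variant_path` helper, the directory names, and the tokenizer name in the usage lines are all hypothetical.

```python
from pathlib import PurePosixPath
from typing import Optional


def variant_path(root: str, size: str, split: str,
                 tokenizer: Optional[str] = None) -> PurePosixPath:
    """Build the directory for one dataset variant.

    Raw text goes under `raw/`. Tokenized copies go under
    `tokenized/<tokenizer>/`, with `/` in the tokenizer name replaced so it
    stays a single path component.
    """
    if tokenizer is None:
        kind = "raw"
    else:
        kind = "tokenized/" + tokenizer.replace("/", "--")
    return PurePosixPath(root) / kind / size / split


if __name__ == "__main__":
    # Hypothetical volume root and tokenizer name, for illustration only.
    print(variant_path("/datasets/finemath-4plus", "10m", "train"))
    print(variant_path("/datasets/finemath-4plus", "10m", "train",
                       tokenizer="example-org/example-tokenizer"))
```

With a layout like this, training a model that uses a different tokenizer would only add a new `tokenized/<tokenizer>/` directory built from the raw copy, without touching the existing ones.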
