The main idea (still to be confirmed) is the following user workflow:
- The user adds raw data files (csv plus npy for embeddings; to be extended to other formats later)
- The user defines a schema for variable types
- The library converts the raw data files into a format suitable for loading the data into memory
- The dataset instance can return a native tf.data.Dataset or torch.utils.data.DataLoader in order to train models with this dataset
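The workflow above could look roughly like the sketch below. This is a hypothetical illustration, not the final API: the `BioDataset` class name, the schema-as-dict convention, and the toy data are all assumptions.

```python
# Hypothetical sketch of the intended user-facing flow: raw csv + npy
# embeddings, a user-defined schema of column types, and an in-memory
# dataset object. Names here (BioDataset, schema dict) are assumptions.
import csv
import io

import numpy as np


class BioDataset:
    """Pairs a tabular csv file with an npy embedding matrix."""

    def __init__(self, csv_text, embeddings, schema):
        # schema maps column names to declared types, e.g. {"label": int};
        # each csv value is cast through the type declared by the user.
        reader = csv.DictReader(io.StringIO(csv_text))
        self.rows = [
            {col: schema[col](value) for col, value in row.items()}
            for row in reader
        ]
        self.embeddings = embeddings  # np.ndarray, one row per csv row

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, i):
        return self.rows[i], self.embeddings[i]


# Toy data standing in for the raw files the user would provide.
raw_csv = "sequence,label\nACGT,1\nTTGA,0\n"
emb = np.zeros((2, 4), dtype=np.float32)
ds = BioDataset(raw_csv, emb, schema={"sequence": str, "label": int})
```

An object with `__len__` and `__getitem__` like this one is already shaped like a map-style PyTorch dataset, which would make the `to_torch_dataset()` conversion step a thin wrapper.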
For the last point, there are mainly two options:
- convert the dataset (csv with npy files) to hdf5, then use Apache Arrow or vaex to load it in memory
- or, if we want native tf/torch tensors in the end: convert the datasets into Parquet, then use petastorm
Brainstorming has been done in a Notion doc. The next step is to investigate the different options properly.
Other points:
- TensorFlow and PyTorch should not be hard dependencies of the project; we should list them in environment.yaml rather than requirements.txt
- The user needs to be able to use the biodatasets package with either PyTorch or TF installed, so we need to handle import errors in both to_torch_dataset() and to_tf_dataset() and raise a clear message telling the user which library to install when they call one of these functions.