The main idea (still to be confirmed) is the following user workflow:
- The user adds raw data files (csv plus npy for embeddings; to be extended to other formats later)
- The user defines a schema for variable types
- The library converts the raw data files into a format suitable for loading the data into memory
- The dataset instance can return a native tf.data.Dataset or torch.utils.data.DataLoader in order to train models with this dataset
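The workflow above could look roughly like the sketch below. This is a hypothetical illustration, not the final API: the `BioDataset` class name, the schema-as-dict convention, and the toy data are all assumptions.

```python
# Hypothetical sketch of the intended user-facing flow: raw csv + npy
# embeddings, a user-defined schema of column types, and an in-memory
# dataset object. Names here (BioDataset, schema dict) are assumptions.
import csv
import io

import numpy as np


class BioDataset:
    """Pairs a tabular csv file with an npy embedding matrix."""

    def __init__(self, csv_text, embeddings, schema):
        # schema maps column names to declared types, e.g. {"label": int};
        # each csv value is cast through the type declared by the user.
        reader = csv.DictReader(io.StringIO(csv_text))
        self.rows = [
            {col: schema[col](value) for col, value in row.items()}
            for row in reader
        ]
        self.embeddings = embeddings  # np.ndarray, one row per csv row

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, i):
        return self.rows[i], self.embeddings[i]


# Toy data standing in for the raw files the user would provide.
raw_csv = "sequence,label\nACGT,1\nTTGA,0\n"
emb = np.zeros((2, 4), dtype=np.float32)
ds = BioDataset(raw_csv, emb, schema={"sequence": str, "label": int})
```

An object with `__len__` and `__getitem__` like this one is already shaped like a map-style PyTorch dataset, which would make the `to_torch_dataset()` conversion step a thin wrapper.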
For the last point, there are mainly two options:
- convert the dataset (csv with npy files) to hdf5, then use Apache Arrow or vaex to load it in memory
- or, if we want native tf/torch tensors in the end: convert the datasets into Parquet, then use petastorm
Brainstorming has been done in a Notion doc. The next step is to investigate the different options properly.
Other points:
- TensorFlow and PyTorch should not be hard dependencies of the project; we should list them in environment.yaml rather than requirements.txt
- The user needs to be able to use the biodatasets package with either PyTorch or TF installed, so we need to handle import errors in both to_torch_dataset() and to_tf_dataset() and raise a clear message telling the user which library to install when they call one of these functions.