ElixirDatasets is a library for accessing and managing datasets from the Hugging Face Hub in Elixir. Inspired by the Python datasets library, it brings dataset management to the Elixir ecosystem and integrates with Explorer DataFrames.
- 🚀 Easy Access to Hugging Face Hub - Load thousands of datasets with a single function call
- 📊 Explorer Integration - Automatic conversion to Explorer DataFrames for data manipulation
- 💾 Smart Caching - Intelligent local caching to avoid redundant downloads
- 🌊 Streaming Support - Process large datasets without loading everything into memory
- 📤 Upload Datasets - Publish your own datasets to Hugging Face Hub
- 🔒 Private Repositories - Full support for authentication and private datasets
- 🎯 Multiple Formats - Support for CSV, Parquet, and JSONL files
Add elixir_datasets to your list of dependencies in mix.exs:
def deps do
  [
    {:elixir_datasets, "~> 0.1.0"}
  ]
end

{:ok, [train_df]} = ElixirDatasets.load_dataset(
  {:hf, "cornell-movie-review-data/rotten_tomatoes"},
  split: "train"
)
{:ok, datasets} = ElixirDatasets.load_dataset({:local, "./data"})
{:ok, stream} = ElixirDatasets.load_dataset(
  {:hf, "stanfordnlp/imdb", subdir: "plain_text"},
  split: "train",
  streaming: true
)

stream |> Enum.take(100) |> IO.inspect()

All examples can be found in the examples directory.
- examples/usage_examples.livemd - Comprehensive usage examples of the elixir_datasets API
- examples/integration_examples.livemd - Examples demonstrating integration with other Elixir libraries like Nx, Axon, and Bumblebee
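Because a streamed dataset is an Elixir enumerable, it can be filtered and batched lazily with the standard Stream module. The sketch below uses a generated stand-in stream so it runs without network access. The row shape (a map with "text" and "label" keys) is an assumption for illustration; the rows produced by ElixirDatasets may look different.

```elixir
# Stand-in for the stream returned by load_dataset/2 with streaming: true.
# Assumption: each element is a map of column name to value; the real row
# shape produced by ElixirDatasets may differ.
rows =
  Stream.map(1..1_000_000, fn i ->
    %{"text" => "review #{i}", "label" => rem(i, 2)}
  end)

# Nothing is computed yet: Stream functions only build a lazy pipeline.
batches =
  rows
  |> Stream.filter(fn row -> row["label"] == 1 end)
  |> Stream.chunk_every(32)

# Enum.take/2 pulls just enough rows to fill two batches, then stops,
# so the remaining rows are never materialized.
first_two = Enum.take(batches, 2)

IO.inspect(length(first_two), label: "batches")
IO.inspect(hd(hd(first_two)), label: "first row")
```

Keeping every step in Stream until the final Enum call is what keeps memory use bounded by the batch size rather than the dataset size.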
- ELIXIR_DATASETS_CACHE_DIR - Custom cache directory
- ELIXIR_DATASETS_OFFLINE - Enable offline mode ("1" or "true")
- HF_TOKEN - Authentication token for private datasets
- [🚧 In-progress] HF_DEBUG - Enable debug logging ("1" or "true")
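These variables can also be set from inside a script or Livebook session with the standard System module. This is a minimal sketch, assuming the library reads the variables when a dataset is loaded rather than at compile time; the cache path used here is a hypothetical example.

```elixir
# Assumption: ElixirDatasets reads these variables at call time, so they
# must be set before load_dataset/2 is invoked.
# The cache location below is a hypothetical example path.
cache_dir = Path.join(System.tmp_dir!(), "elixir_datasets_cache")
File.mkdir_p!(cache_dir)

System.put_env("ELIXIR_DATASETS_CACHE_DIR", cache_dir)
System.put_env("ELIXIR_DATASETS_OFFLINE", "1")

# HF_TOKEN is a secret, so it is read from the environment rather than
# hard-coded in source.
if System.get_env("HF_TOKEN") in [nil, ""] do
  IO.puts("HF_TOKEN is not set; private datasets will likely be unavailable")
end
```

Exporting the same variables in the shell before starting `iex` or `mix` avoids any question about when they are read.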
Full documentation is available on HexDocs. Documentation covering the current status of under-development features is hosted on GitHub Pages. Documentation can be generated locally using:
mix docs

Run the test suite with:

MIX_ENV=test mix test

This project is licensed under the MIT License - see the LICENSE file for details.
Copyright (c) 2025 Radosław Rolka, Weronika Wojtas