Dataset Toolbox

A dataset toolbox for preparing and analyzing conversational datasets, including CSV splitting, CSV → Parquet conversion, dataset statistics, dialogue-turn filtering, turn-based filtering, token and turn analysis, Parquet cleaning and sorting, HuggingFace–style metadata generation, and batched chain insertion into PostgreSQL — with Rich progress, multiprocessing, and 32 GB-RAM-friendly batching.

Install

python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
git clone https://github.com/mookiezi/dataset-toolbox
cd dataset-toolbox
pip install -U pip
pip install -r requirements.txt

Python 3.10+ recommended.

Features

File	Purpose	Example Usage
`chains.sh`	Batch insert reply chains into PostgreSQL using recursive CTEs. Deduplicates, merges turns, and writes to `chains` table.	`chmod +x chains.sh && ./chains.sh` (Edit `PGUSER` and `DB` inside script.)
`splitcsv.py`	Split a CSV into N parts, writing each to `<parent>/<N>/split.csv`. Uses Polars for fast IO.	`python splitcsv.py -p data/dump.csv -s 10`
`combineall.py`	Recursively combine multiple CSVs into one. Estimates safe chunksize by RAM, streams in batches, shows Rich progress bar.	`python combineall.py -p data_folder -o combined.csv --max-mem-gb 32`
`filterturns.py`	Filter rows by number of message blocks (turns). Saves filtered CSV (`{min}to{max}.csv`) and a histogram (`*_turn_hist.png`).	`python filterturns.py -p mydata.csv -min 2 -max 39` → `mydata2to39.csv`
`dropcols.py`	Remove identifier columns (`id`, `guild_id`, `channel_id`) from a CSV. Writes a new `_pure.csv` file alongside the input.	`python dropcols.py -p mydata.csv` → `mydata_pure.csv`
`stats.py`	Compute token counts and dataset statistics (tokens, turns, chars, words). Uses Hugging Face tokenizer with batch processing and Rich progress bar.	`python stats.py -p mydata.csv -m NousResearch/Hermes-3-Llama-3.1-8B -b 1024` → `mydata_stats.csv`
`tokens.py`	Generate detailed token statistics for a CSV (`text` col). Computes descriptive stats, histograms, assistant blocks, and saves a log file.	`python tokens.py -p mydata.csv` → `mydata_tokenstats.txt`
`turnstats.py`	Generate statistics on im_start blocks (turns) from CSV. Saves distribution table (`_turn_table.txt`) and histogram (`_turn_hist.png`).	`python turnstats.py -p mydata.csv`
`par.py`	Convert CSV → Parquet with Zstandard compression. Skips malformed lines, prompts before overwrite.	`python par.py -p mydata.csv -o mydata.parquet`
`sortpar.py`	Rank dataset by turns and character count. Computes char bonus, effective turns, and composite score to sort rows.	`python sort.py -p train.parquet` → `train_sort.parquet`
`cleanpar.py`	Drop unnecessary columns (`assistant_turns`, `__index_level_0__`) from Parquet and restore row order. Writes `train.parquet`.	`python cleanpar.py -p mydata.parquet` → `train.parquet`
`parjson.py`	Generate Hugging Face–style `dataset_infos.json` from a Parquet file. Reads metadata only (no full load).	`python parjson.py -p train.parquet -o dataset_infos.json`

`chains.sh`

Batch insert reply chains into PostgreSQL.

Uses recursive CTEs to build two-author chains.
Deduplicates messages, merges same-author turns.
Tracks batch progress in root_id_progress.
Inserts results into a chains table in ChatML format.
Shows progress (inserts processed, elapsed time, ETA).

Usage:

chmod +x chains.sh
./chains.sh

Output (per batch):

🚀 Starting batched chain insert runner...
Running batch 12...
📦 Finished batch in 95s
Running batch 13...
📦 Finished batch in 102s
No more batches left. Exiting.

Configure PGUSER and DB at the top of the script for your environment.

`splitcsv.py` — Split CSV into N Parts

Split a large CSV into evenly sized parts, writing each to <parent>/<N>/split.csv.
Uses Polars for fast IO and Rich printing for timing.

CLI

-p/--path              Input CSV path (with or without .csv, required)
-s/--split-count       Number of parts to split into (default: 10)

Usage

python splitcsv.py -p data/dump.csv -s 10

Output

Loaded 1,234,567 rows — splitting into 20 parts
[1/20] Writing data/1/split.csv... done in 0.42s
[2/20] Writing data/2/split.csv... done in 0.37s
...

`combineall.py`

Combine multiple CSV files into one large CSV safely.

Recursively finds all .csv files in a folder.
Estimates safe chunksize based on available RAM, or accepts --chunksize.
Streams files in chunks to avoid OOM errors.

Usage:

python combineall.py -p data_folder -o combined.csv --max-mem-gb 32

Output:

Auto chunksize based on 32GB RAM: 250000 rows
Found 12 CSV files. Combining into combined.csv
Combining ██████████████ 12/12 • 00:55 • 00:00
Combined CSV saved to combined.csv

`dropcols.py`

Remove identifier columns from a CSV quickly using Polars.

Drops id, guild_id, and channel_id if present.
Skips malformed rows with ignore_errors=True.
Writes a new _pure.csv file alongside the input.

Usage:

Output:

Saved without id, guild_id, and channel_id → mydata_pure.csv

`stats.py`

Compute token counts and basic stats for a CSV dataset.

Counts tokens, turns, assistant turns, characters, and words.
Uses any Hugging Face tokenizer (default: NousResearch/Hermes-3-Llama-3.1-8B).
Processes in batches with a Rich progress bar.

Usage:

python stats.py -p mydata.csv -m NousResearch/Hermes-3-Llama-3.1-8B -b 1024

Output:

`mydata.csv` → `mydata_stats.csv`

`filterturns.py`

Compute token/turn statistics and filter rows by ChatML message-block count.

Counts occurrences of <|im_start|> per row to estimate dialogue turns.
Filters rows with counts within [--min, --max] inclusive.
Saves a filtered CSV and a histogram PNG.
Exits with error if the input CSV lacks a text column.

Usage:

python turns.py -p mydata.csv --min 2 --max 39

`par.py`

Convert CSV → Parquet with Zstandard compression.

Skips malformed CSV lines.
Prompts before overwriting existing files.
Lightweight and fast.

Usage:

python par.py -p mydata.csv -o mydata.parquet

Output:

[ok]Saved 176,131 rows → mydata.parquet

`sortpar.py`

Rank a Parquet dataset by turns and character count.

Computes a char_count column from text length.
Adds a capped character bonus (0–20) scaled between median and 95th percentile.
Calculates an effective_turns value (max of turns vs. 5 + char bonus, capped at 25).
Creates a composite sort_score = effective_turns × 1,000,000 + char_count.
Sorts by this score and saves a new _sort.parquet file.

Usage:

python sort.py -p train.parquet

Output:

Written sorted dataset to train_sort.parquet

`cleanpar.py`

Clean a Parquet file by dropping extra columns and restoring original row order.

Drops assistant_turns and __index_level_0__ if present.
Attaches a temporary row index to preserve input order.
Sorts back to the original order and removes the index.
Saves as train.parquet with Zstandard compression.

Usage:

python cleanpar.py -p mydata.parquet

Output:

Wrote train.parquet with 176,131 rows, preserved order.

`parjson.py`

Generate a Hugging Face–style dataset_infos.json from a Parquet file.

Reads Parquet footer/schema only (no full load).
Extracts row count, size, and primary column.
Prompts before overwriting.

Usage:

python parjson.py -p train.parquet -o dataset_infos.json

Output:

Generated dataset_infos.json
──────────────────────────────────────────
Field              Value
──────────────────────────────────────────
Rows (num_examples)   N examples
Size (bytes)          N bytes
Column name           id
Config                default
Split                 train
──────────────────────────────────────────
[ok]Wrote: dataset_infos.json

`tokens.py`

Generate detailed token statistics for a CSV dataset.

Tokenizes the text column in batches using Hugging Face tokenizer (default: Hermes-3-Llama-3.1-8B).
Computes descriptive stats (min, max, mean, median, std, skew, kurtosis, percentiles, histograms).
Counts assistant message blocks (ChatML or DeepHermes markers).
Logs results to <stem>_tokenstats.txt.

Usage:

python tokens.py -p mydata.csv

Output (excerpt):

Stats for text:
  min: 3
  max: 2048
  mean: 87.3
  median: 74.0
  std: 56.1
  ...
  assistant_blocks: 25

Total tokens across all columns: 15,382,921
Total assistant blocks: 142,883

Output:

Filtered rows (>= 2 message blocks): 15,392
Saved to: mydata2to8.csv
Histogram saved to: mydata_turn_hist.png

`turnstats.py`

Generate message-block statistics and a histogram for a CSV dataset.

Counts <|im_start|> occurrences per row to measure dialogue turns.
Produces a full frequency distribution table (including zero-filled bins).
Saves the table to <base>_turn_table.txt.
Plots and saves a histogram as <base>_turn_hist.png.
Prints the distribution table to the console.

Usage:

python turnstats.py -p mydata.csv

Output:

Score Distribution
───────────────────────────
Range        Count
───────────────────────────
0            153
1            841
2            392
...
───────────────────────────
Score table saved to: mydata_turn_table.txt
Histogram saved to: mydata_turn_hist.png

Cross-Platform Notes

Unix/macOS: after chmod +x script.py, you can run ./script.py … thanks to the shebang.
Windows: run as python script.py …; shebang is ignored by default shell, but code is fully supported.

License

This project is licensed under the MIT License.
See the LICENSE file for details.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Dataset Toolbox

Install

Features

`chains.sh`

`splitcsv.py` — Split CSV into N Parts

`combineall.py`

`dropcols.py`

`stats.py`

`filterturns.py`

`par.py`

`sortpar.py`

`cleanpar.py`

`parjson.py`

`tokens.py`

`turnstats.py`

Cross-Platform Notes

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 49 Commits
LICENSE		LICENSE
README.md		README.md
chains.sh		chains.sh
cleanpar.py		cleanpar.py
combineall.py		combineall.py
dropcols.py		dropcols.py
filterturns.py		filterturns.py
par.py		par.py
parjson.py		parjson.py
requirements.txt		requirements.txt
sortpar.py		sortpar.py
splitcsv.py		splitcsv.py
stats.py		stats.py
tokens.py		tokens.py
turnstats.py		turnstats.py

Folders and files

Latest commit

History

Repository files navigation

Dataset Toolbox

Install

Features

chains.sh

splitcsv.py — Split CSV into N Parts

combineall.py

dropcols.py

stats.py

filterturns.py

par.py

sortpar.py

cleanpar.py

parjson.py

tokens.py

turnstats.py

Cross-Platform Notes

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`chains.sh`

`splitcsv.py` — Split CSV into N Parts

`combineall.py`

`dropcols.py`

`stats.py`

`filterturns.py`

`par.py`

`sortpar.py`

`cleanpar.py`

`parjson.py`

`tokens.py`

`turnstats.py`

Packages