A toolbox for preparing and analyzing conversational datasets. It covers CSV splitting and combining, CSV → Parquet conversion, dataset statistics, turn-based filtering, token and turn analysis, Parquet cleaning and sorting, Hugging Face–style metadata generation, and batched chain insertion into PostgreSQL. The scripts use Rich progress output, multiprocessing, and batching sized for machines with 32 GB of RAM.

    python -m venv venv
    source venv/bin/activate  # Windows: venv\Scripts\activate
    git clone https://github.com/mookiezi/dataset-toolbox
    cd dataset-toolbox
    pip install -U pip
    pip install -r requirements.txt

Python 3.10+ recommended.

| File | Purpose | Example Usage |
|---|---|---|
| `chains.sh` | Batch insert reply chains into PostgreSQL using recursive CTEs. Deduplicates, merges turns, and writes to the `chains` table. | `chmod +x chains.sh && ./chains.sh` (edit `PGUSER` and `DB` inside the script) |
| `splitcsv.py` | Split a CSV into N parts, writing each to `<parent>/<N>/split.csv`. Uses Polars for fast IO. | `python splitcsv.py -p data/dump.csv -s 10` |
| `combineall.py` | Recursively combine multiple CSVs into one. Estimates a safe chunksize from RAM, streams in batches, shows a Rich progress bar. | `python combineall.py -p data_folder -o combined.csv --max-mem-gb 32` |
| `filterturns.py` | Filter rows by number of message blocks (turns). Saves a filtered CSV (`{min}to{max}.csv`) and a histogram (`*_turn_hist.png`). | `python filterturns.py -p mydata.csv -min 2 -max 39` → `mydata2to39.csv` |
| `dropcols.py` | Remove identifier columns (`id`, `guild_id`, `channel_id`) from a CSV. Writes a new `_pure.csv` file alongside the input. | `python dropcols.py -p mydata.csv` → `mydata_pure.csv` |
| `stats.py` | Compute token counts and dataset statistics (tokens, turns, chars, words). Uses a Hugging Face tokenizer with batch processing and a Rich progress bar. | `python stats.py -p mydata.csv -m NousResearch/Hermes-3-Llama-3.1-8B -b 1024` → `mydata_stats.csv` |
| `tokens.py` | Generate detailed token statistics for a CSV (`text` column). Computes descriptive stats, histograms, assistant blocks, and saves a log file. | `python tokens.py -p mydata.csv` → `mydata_tokenstats.txt` |
| `turnstats.py` | Generate statistics on `im_start` blocks (turns) from a CSV. Saves a distribution table (`*_turn_table.txt`) and a histogram (`*_turn_hist.png`). | `python turnstats.py -p mydata.csv` |
| `par.py` | Convert CSV → Parquet with Zstandard compression. Skips malformed lines, prompts before overwrite. | `python par.py -p mydata.csv -o mydata.parquet` |
| `sortpar.py` | Rank dataset by turns and character count. Computes char bonus, effective turns, and a composite score to sort rows. | `python sort.py -p train.parquet` → `train_sort.parquet` |
| `cleanpar.py` | Drop unnecessary columns (`assistant_turns`, `__index_level_0__`) from Parquet and restore row order. Writes `train.parquet`. | `python cleanpar.py -p mydata.parquet` → `train.parquet` |
| `parjson.py` | Generate a Hugging Face–style `dataset_infos.json` from a Parquet file. Reads metadata only (no full load). | `python parjson.py -p train.parquet -o dataset_infos.json` |

Batch insert reply chains into PostgreSQL.
- Uses recursive CTEs to build two-author chains.
- Deduplicates messages, merges same-author turns.
- Tracks batch progress in `root_id_progress`.
- Inserts results into a `chains` table in ChatML format.
- Shows progress (inserts processed, elapsed time, ETA).

Usage:

    chmod +x chains.sh
    ./chains.sh

Output (per batch):

    🚀 Starting batched chain insert runner...
    Running batch 12...
    📦 Finished batch in 95s
    Running batch 13...
    📦 Finished batch in 102s
    No more batches left. Exiting.

Configure `PGUSER` and `DB` at the top of the script for your environment.

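
The deduplicate, merge, and ChatML steps listed above can be sketched in Python. This is an illustration only, not the script's SQL: the `to_chatml` function, the `(message_id, author, text)` tuple layout, the newline used to join merged messages, and the rule that one named author becomes `user` while the other becomes `assistant` are all assumptions made for the example.

```python
from itertools import groupby


def to_chatml(messages, user_author):
    """Render a two-author chain as ChatML text.

    messages: chronological list of (message_id, author, text) tuples.
    user_author: hypothetical choice of which author maps to the "user" role.
    """
    seen = set()
    unique = []
    for message_id, author, text in messages:
        if message_id in seen:  # drop duplicate messages by id
            continue
        seen.add(message_id)
        unique.append((author, text))

    blocks = []
    # groupby merges runs of consecutive messages from the same author
    for author, run in groupby(unique, key=lambda m: m[0]):
        role = "user" if author == user_author else "assistant"
        body = "\n".join(text for _, text in run)
        blocks.append(f"<|im_start|>{role}\n{body}<|im_end|>")
    return "\n".join(blocks)


chain = [
    (1, "alice", "hi"),
    (2, "alice", "anyone here?"),
    (2, "alice", "anyone here?"),  # duplicate row, removed by id
    (3, "bob", "yes"),
    (4, "alice", "great"),
]
chatml = to_chatml(chain, user_author="alice")
print(chatml)
```

The two consecutive messages from the first author collapse into one block, so the five input rows produce three ChatML blocks.
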
Split a large CSV into evenly sized parts, writing each to <parent>/<N>/split.csv.
Uses Polars for fast IO and Rich printing for timing.
CLI:

    -p/--path         Input CSV path (with or without .csv, required)
    -s/--split-count  Number of parts to split into (default: 10)

Usage:

    python splitcsv.py -p data/dump.csv -s 10

Output:

    Loaded 1,234,567 rows — splitting into 20 parts
    [1/20] Writing data/1/split.csv... done in 0.42s
    [2/20] Writing data/2/split.csv... done in 0.37s
    ...

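
The even split into `<parent>/<N>/split.csv` can be sketched with the standard library. Note the swap: the script uses Polars, while this sketch uses the `csv` module so it runs without extra dependencies. The `split_csv` name, the repeated header in every part, and the choice to give leftover rows to the first parts are assumptions.

```python
import csv
from pathlib import Path


def split_csv(path, parts):
    """Split a CSV into `parts` files at <parent>/<N>/split.csv (N starts at 1)."""
    src = Path(path)
    with src.open(newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)  # loads everything; fine for a sketch

    # Assumption: the first `extra` parts each get one additional row.
    base, extra = divmod(len(rows), parts)
    written = []
    start = 0
    for n in range(1, parts + 1):
        size = base + (1 if n <= extra else 0)
        out = src.parent / str(n) / "split.csv"
        out.parent.mkdir(parents=True, exist_ok=True)
        with out.open("w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(header)  # assumption: header repeated per part
            writer.writerows(rows[start:start + size])
        written.append((out, size))
        start += size
    return written
```

Calling `split_csv("data/dump.csv", 10)` would write `data/1/split.csv` through `data/10/split.csv` and return each output path with its row count.
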
Combine multiple CSV files into one large CSV safely.
- Recursively finds all `.csv` files in a folder.
- Estimates safe chunksize based on available RAM, or accepts `--chunksize`.
- Streams files in chunks to avoid OOM errors.

Usage:

    python combineall.py -p data_folder -o combined.csv --max-mem-gb 32

Output:

    Auto chunksize based on 32GB RAM: 250000 rows
    Found 12 CSV files. Combining into combined.csv
    Combining ██████████████ 12/12 • 00:55 • 00:00
    Combined CSV saved to combined.csv

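
A rough sketch of the two ideas above, a RAM-based chunksize and chunked streaming, using only the standard library. The heuristic in `estimate_chunksize` (a fixed fraction of the memory budget divided by an assumed row size) is hypothetical and is not meant to reproduce the script's numbers. `combine_csvs` also assumes every input file shares the same header.

```python
import csv
from pathlib import Path


def estimate_chunksize(max_mem_gb, bytes_per_row=2048, fraction=0.25):
    """Hypothetical heuristic: spend `fraction` of the budget on one chunk."""
    budget = max_mem_gb * 1024**3 * fraction
    return max(1, int(budget // bytes_per_row))


def combine_csvs(folder, out_path, chunksize):
    """Stream every .csv under `folder` into `out_path`, `chunksize` rows at a time."""
    target = Path(out_path).resolve()
    # Collect inputs before opening the output so it is never read as an input.
    files = sorted(p for p in Path(folder).rglob("*.csv") if p.resolve() != target)
    total = 0
    header_written = False
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in files:
            with path.open(newline="", encoding="utf-8") as f:
                reader = csv.reader(f)
                header = next(reader, None)
                if header is None:  # empty file
                    continue
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                chunk = []
                for row in reader:
                    chunk.append(row)
                    if len(chunk) >= chunksize:
                        writer.writerows(chunk)
                        total += len(chunk)
                        chunk = []
                writer.writerows(chunk)  # flush the final partial chunk
                total += len(chunk)
    return total
```

Only one chunk of rows is held in memory at a time, which is what keeps peak usage bounded regardless of the total input size.
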
Remove identifier columns from a CSV quickly using Polars.
- Drops `id`, `guild_id`, and `channel_id` if present.
- Skips malformed rows with `ignore_errors=True`.
- Writes a new `_pure.csv` file alongside the input.

Usage:

    python dropcols.py -p mydata.csv

Output:

    Saved without id, guild_id, and channel_id → mydata_pure.csv

Compute token counts and basic stats for a CSV dataset.
- Counts tokens, turns, assistant turns, characters, and words.
- Uses any Hugging Face tokenizer (default: `NousResearch/Hermes-3-Llama-3.1-8B`).
- Processes in batches with a Rich progress bar.

Usage:

    python stats.py -p mydata.csv -m NousResearch/Hermes-3-Llama-3.1-8B -b 1024

Output:

`mydata.csv` → `mydata_stats.csv`

Compute token/turn statistics and filter rows by ChatML message-block count.
- Counts occurrences of `<|im_start|>` per row to estimate dialogue turns.
- Filters rows with counts within `[--min, --max]` inclusive.
- Saves a filtered CSV and a histogram PNG.
- Exits with error if the input CSV lacks a `text` column.
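
The counting and filtering rule in the list above is small enough to show directly. This sketch works on in-memory rows rather than a CSV file and skips the histogram. `count_turns` and `filter_rows` are hypothetical names, and raising `KeyError` stands in for the script's error exit.

```python
MARKER = "<|im_start|>"


def count_turns(text):
    """Estimate dialogue turns as the number of ChatML start markers."""
    return text.count(MARKER)


def filter_rows(rows, min_turns, max_turns):
    """Keep rows whose turn count lies in [min_turns, max_turns], inclusive."""
    if rows and "text" not in rows[0]:
        raise KeyError("input rows lack a 'text' column")
    return [r for r in rows if min_turns <= count_turns(r["text"]) <= max_turns]


rows = [
    {"text": "<|im_start|>user\nhi<|im_end|>"},
    {"text": "<|im_start|>user\nhi<|im_end|>\n<|im_start|>assistant\nhello<|im_end|>"},
    {"text": "no markers here"},
]
kept = filter_rows(rows, min_turns=2, max_turns=39)
print(len(kept))
```

Only the second row has two markers, so it is the only one kept with a minimum of 2.
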

Usage:

    python turns.py -p mydata.csv --min 2 --max 39

Convert CSV → Parquet with Zstandard compression.
- Skips malformed CSV lines.
- Prompts before overwriting existing files.
- Lightweight and fast.

Usage:

    python par.py -p mydata.csv -o mydata.parquet

Output:

    [ok]Saved 176,131 rows → mydata.parquet

Rank a Parquet dataset by turns and character count.
- Computes a `char_count` column from text length.
- Adds a capped character bonus (0–20) scaled between median and 95th percentile.
- Calculates an `effective_turns` value (max of turns vs. 5 + char bonus, capped at 25).
- Creates a composite `sort_score` = effective_turns × 1,000,000 + char_count.
- Sorts by this score and saves a new `_sort.parquet` file.

Usage:

    python sort.py -p train.parquet

Output:

    Written sorted dataset to train_sort.parquet

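
The scoring steps above can be worked through in plain Python. Several details are not stated in this README and are assumptions here: the bonus is taken to scale linearly from 0 at the median to 20 at the 95th percentile, the cap of 25 is applied to the final `effective_turns` value, percentiles use linear interpolation, and rows are sorted with the highest score first. The `turns` field and the `percentile` and `score_rows` helpers are hypothetical.

```python
def percentile(sorted_vals, q):
    """Linearly interpolated percentile of a sorted list, q in [0, 100]."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def score_rows(rows):
    """rows: list of dicts with 'text' (str) and 'turns' (int)."""
    counts = sorted(len(r["text"]) for r in rows)
    median = percentile(counts, 50)
    span = percentile(counts, 95) - median
    scored = []
    for r in rows:
        char_count = len(r["text"])
        # Fraction of the way from the median to p95, clamped to [0, 1].
        frac = 0.0 if span <= 0 else min(1.0, max(0.0, (char_count - median) / span))
        char_bonus = 20 * frac
        effective_turns = min(25, max(r["turns"], 5 + char_bonus))
        sort_score = effective_turns * 1_000_000 + char_count
        scored.append({**r, "char_count": char_count,
                       "effective_turns": effective_turns,
                       "sort_score": sort_score})
    # Assumption: descending order, best-scoring rows first.
    return sorted(scored, key=lambda r: r["sort_score"], reverse=True)


rows = [
    {"text": "a" * 10, "turns": 2},
    {"text": "a" * 20, "turns": 2},
    {"text": "a" * 1000, "turns": 2},
    {"text": "a" * 30, "turns": 30},
]
ranked = score_rows(rows)
print([r["char_count"] for r in ranked])
```

Because `effective_turns` is multiplied by 1,000,000, it dominates the score, and `char_count` only breaks ties between rows with the same effective turn value.
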
Clean a Parquet file by dropping extra columns and restoring original row order.
- Drops `assistant_turns` and `__index_level_0__` if present.
- Attaches a temporary row index to preserve input order.
- Sorts back to the original order and removes the index.
- Saves as `train.parquet` with Zstandard compression.

Usage:

    python cleanpar.py -p mydata.parquet

Output:

    Wrote train.parquet with 176,131 rows, preserved order.

Generate a Hugging Face–style dataset_infos.json from a Parquet file.
- Reads Parquet footer/schema only (no full load).
- Extracts row count, size, and primary column.
- Prompts before overwriting.

Usage:

    python parjson.py -p train.parquet -o dataset_infos.json

Output:

    Generated dataset_infos.json
    ──────────────────────────────────────────
    Field                 Value
    ──────────────────────────────────────────
    Rows (num_examples)   N examples
    Size (bytes)          N bytes
    Column name           id
    Config                default
    Split                 train
    ──────────────────────────────────────────
    [ok]Wrote: dataset_infos.json

Generate detailed token statistics for a CSV dataset.
- Tokenizes the `text` column in batches using a Hugging Face tokenizer (default: Hermes-3-Llama-3.1-8B).
- Computes descriptive stats (min, max, mean, median, std, skew, kurtosis, percentiles, histograms).
- Counts assistant message blocks (ChatML or DeepHermes markers).
- Logs results to `<stem>_tokenstats.txt`.

Usage:

    python tokens.py -p mydata.csv

Output (excerpt):

    Stats for text:
    min: 3
    max: 2048
    mean: 87.3
    median: 74.0
    std: 56.1
    ...
    assistant_blocks: 25
    Total tokens across all columns: 15,382,921
    Total assistant blocks: 142,883
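
A few of the descriptive statistics named above can be computed from a list of per-row token counts with the standard library. This sketch skips tokenization, skew, kurtosis, and histograms. The `describe` helper is hypothetical, and the choice of sample standard deviation and inclusive percentiles is an assumption, since this README does not say which variants the script uses.

```python
import statistics


def describe(token_counts):
    """Summarize a list of at least two per-row token counts."""
    vals = sorted(token_counts)
    # 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(vals, n=100, method="inclusive")
    return {
        "min": vals[0],
        "max": vals[-1],
        "mean": statistics.fmean(vals),
        "median": statistics.median(vals),
        "std": statistics.stdev(vals),  # assumption: sample std (n - 1)
        "p95": cuts[94],
    }


summary = describe([1, 2, 3, 4, 5])
for name, value in summary.items():
    print(f"{name}: {value}")
```
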

Output:

    Filtered rows (>= 2 message blocks): 15,392
    Saved to: mydata2to8.csv
    Histogram saved to: mydata_turn_hist.png

Generate message-block statistics and a histogram for a CSV dataset.
- Counts `<|im_start|>` occurrences per row to measure dialogue turns.
- Produces a full frequency distribution table (including zero-filled bins).
- Saves the table to `<base>_turn_table.txt`.
- Plots and saves a histogram as `<base>_turn_hist.png`.
- Prints the distribution table to the console.

Usage:

    python turnstats.py -p mydata.csv

Output:

    Score Distribution
    ───────────────────────────
    Range       Count
    ───────────────────────────
    0           153
    1           841
    2           392
    ...
    ───────────────────────────
    Score table saved to: mydata_turn_table.txt
    Histogram saved to: mydata_turn_hist.png

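
The zero-filled distribution table can be sketched as follows. The `turn_distribution` name is hypothetical, and starting the bins at 0 is an assumption based on the sample output above. Plotting and file output are left out.

```python
from collections import Counter

MARKER = "<|im_start|>"


def turn_distribution(texts):
    """Return (turn_count, rows) pairs for every bin from 0 to the maximum."""
    counts = Counter(text.count(MARKER) for text in texts)
    if not counts:
        return []
    # Bins with no rows are included with a count of 0.
    return [(n, counts.get(n, 0)) for n in range(max(counts) + 1)]


texts = [
    "",
    "<|im_start|>a",
    "<|im_start|>a<|im_start|>b<|im_start|>c",
]
table = turn_distribution(texts)
for turns, rows in table:
    print(f"{turns:<12}{rows}")
```

No sample row has exactly two markers, so the table still lists bin 2 with a count of 0.
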
- Unix/macOS: after `chmod +x script.py`, you can run `./script.py …` thanks to the shebang.
- Windows: run as `python script.py …`; the shebang is ignored by the default shell, but the code is fully supported.

This project is licensed under the MIT License.
See the LICENSE file for details.
