Benchmark tooling for two emotion-audio annotation subsets and CLAP-style model evaluation against them.
- **emolia-emo** — 3-level ordinal rating (`not_present` / `weakly_present` / `strongly_present`) per `(file, queried_emotion, task_type)`. Used to score whether a clip expresses a queried emotion. Variables are self-explanatory emotion names.
- **emolia-dim** — binary `yes`/`no` rating per `(file, dimension, level, polarity)` against a written rubric in `dataset/emolia-dim/variables.json`. Used to score whether a clip matches a specific level of a perceptual dimension (e.g. `TEMP` level 5 = "fast").
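To make the item keys concrete, here is a minimal sketch of grouping each subset's annotation rows into benchmark items. The column names mirror the key tuples above but are assumptions, not verified against the CSV headers.

```python
import pandas as pd

# Assumed column names; check annotations/<subset>/annotations.csv for the real header.
emo = pd.read_csv("annotations/emolia-emo/annotations.csv")
dim = pd.read_csv("annotations/emolia-dim/annotations.csv")

# One benchmark item = one key tuple; each item should carry up to 3 rater rows.
emo_items = emo.groupby(["file", "queried_emotion", "task_type"])
dim_items = dim.groupby(["file", "dimension", "level", "polarity"])
print(len(emo_items), "emolia-emo items,", len(dim_items), "emolia-dim items")
```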
```
annotations_raw/                      # gitignored; real usernames live here
  emolia-emo/{annotations.csv, users.csv}
  emolia-dim/{annotations.csv, users.csv}
annotations/                          # committed; usernames anonymized to user_0, user_1, …
  emolia-emo/{annotations.csv, users.csv}
  emolia-dim/{annotations.csv, users.csv}
dataset/
  emolia-emo/data/<Emotion>_best/<stem>.{mp3,json}
  emolia-dim/data/<DIM>/<level>/<polarity>/sample_NN.{mp3,json}
  emolia-dim/variables.json           # rubric for prompts
analysis_outputs/<subset>/
  benchmark_labels.csv  per_*_summary.csv  summary.json  incomplete_items.csv
analysis_outputs/report.md            # combined paper-ready summary
benchmark_outputs/<subset>/
  predictions.csv  metrics_by_*.csv  summary.json  report.md
```
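The metadata encoded in these paths can be recovered mechanically. A hypothetical sketch (not code from this repo), assuming the layout shown above:

```python
from pathlib import Path

def parse_dim_path(p: Path) -> dict:
    # dataset/emolia-dim/data/<DIM>/<level>/<polarity>/sample_NN.mp3
    dimension, level, polarity = p.parts[-4], p.parts[-3], p.parts[-2]
    return {"stem": p.stem, "dimension": dimension, "level": level, "polarity": polarity}

def parse_emo_path(p: Path) -> dict:
    # dataset/emolia-emo/data/<Emotion>_best/<stem>.mp3
    return {"stem": p.stem, "emotion": p.parent.name.removesuffix("_best")}

# The concrete <polarity> folder name below is a placeholder.
print(parse_dim_path(Path("dataset/emolia-dim/data/TEMP/5/<polarity>/sample_01.mp3")))
print(parse_emo_path(Path("dataset/emolia-emo/data/Sadness_best/EN_B00025_S06526_W000000.mp3")))
```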
```
uv venv --python 3.13      # only needed once
uv run anonymize.py        # then any other entry point
```

All scripts use `uv run`.
`annotations_raw/` is gitignored because usernames are still in there. Run the anonymizer any time the raw CSVs are replaced:

```
uv run anonymize.py
```

This rewrites `annotations/<subset>/annotations.csv` and `annotations/<subset>/users.csv` with usernames replaced by `user_0`, `user_1`, … (sorted username order; the mapping is saved to `annotations_raw/<subset>/_anon_map.csv`).
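A minimal sketch of that mapping rule, assuming a `username` column (the real anonymize.py also rewrites users.csv and handles both subsets):

```python
import pandas as pd

raw = pd.read_csv("annotations_raw/emolia-emo/annotations.csv")

# Sorted usernames map deterministically to user_0, user_1, …
mapping = {name: f"user_{i}" for i, name in enumerate(sorted(raw["username"].unique()))}
raw["username"] = raw["username"].map(mapping)
raw.to_csv("annotations/emolia-emo/annotations.csv", index=False)

# Keep the mapping next to the raw data so real usernames never leave that directory.
pd.DataFrame(list(mapping.items()), columns=["username", "anon"]).to_csv(
    "annotations_raw/emolia-emo/_anon_map.csv", index=False
)
```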
```
uv run analysis.py
```

For each subset this writes to `analysis_outputs/<subset>/`:

- `benchmark_labels.csv` — one row per item with the majority-vote target (`majority_present`), per-rating vote counts, `n_raters`, `all_agree_binary`, and `benchmark_bucket` (`unanimous_*`, `majority_*`, `single_rater_*`).
- `per_*_summary.csv` — slice tables (task type / emotion / dimension / polarity).
- `incomplete_items.csv` — items lacking 3 raters.
- `summary.json` — machine-readable summary including `human_upper_bound_binary` (mean pairwise exact agreement) and a rater-coverage histogram.

It also writes a single combined paper-ready report to `analysis_outputs/report.md`. Numbers there are formatted for direct use in a methods section: total annotations, annotators / demographics, kappa / Fleiss kappa, per-task and per-emotion / per-dimension breakdowns.
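The two quantities the benchmark later keys on can be sketched like this, assuming each rater's rating has already been collapsed to a binary present / not-present vote:

```python
from itertools import combinations

def majority_present(votes: list[bool]) -> bool:
    # Majority-vote target for one item.
    return sum(votes) > len(votes) / 2

def pairwise_exact_agreement(votes: list[bool]) -> float:
    # Fraction of rater pairs that gave the same answer on this item;
    # averaged over items this gives human_upper_bound_binary.
    pairs = list(combinations(votes, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

item_votes = [True, True, False]              # 3 raters on one item
print(majority_present(item_votes))           # True
print(pairwise_exact_agreement(item_votes))   # 0.333…
```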
Sham mode (deterministic fake similarity; no audio is read):

```
uv run benchmark.py
```

Remote endpoint mode:

```
uv run benchmark.py --endpoint http://127.0.0.1:8765/v1/similarity
```

Useful flags:

- `--subset emolia-emo|emolia-dim|both` (default `both`).
- `--require-3-raters` — restrict to items with full 3-rater coverage.
- `--limit N` — quick smoke test on the first N rows.
- `--threshold 0.0` — predict positive if similarity ≥ threshold.
- `--no-audio-send` — with `--endpoint`, send only the stem + text in the JSON payload (the server reads the files itself).

For each subset, `benchmark.py` writes to `benchmark_outputs/<subset>/`: `predictions.csv`, `metrics_by_<task_type|polarity>.csv`, `metrics_by_benchmark_bucket.csv`, `summary.json`, `report.md`.
The report starts and ends with a score rubric:
| Band | Balanced accuracy | Notes |
|---|---|---|
| Bad | < 0.55 | At or below random; model isn't learning |
| Weak | 0.55 – 0.65 | Some signal, far from human |
| Medium | 0.65 – 0.75 | Useful but lossy; decent training target |
| Good | 0.75 – 0.85 | Strong CLAP-style performance |
| Excellent | ≥ 0.85 | Approaches human inter-rater agreement |
Headline metric: balanced accuracy on `majority_present`. The unanimous subset (`benchmark_bucket` starting with `unanimous_`) is the cleanest target to optimize against.
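Balanced accuracy is the mean of per-class recall, so 0.5 is chance regardless of class balance. A quick sketch (sklearn is an assumption here; benchmark.py may compute it differently):

```python
from sklearn.metrics import balanced_accuracy_score

y_true = [1, 1, 1, 0, 0]   # majority_present
y_pred = [1, 1, 0, 0, 1]   # similarity >= threshold
# Recall is 2/3 on the positives and 1/2 on the negatives,
# so balanced accuracy ≈ 0.583 — the "Weak" band above.
print(balanced_accuracy_score(y_true, y_pred))
```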
`benchmark.py` posts JSON to your endpoint:

```json
{
  "text": "Speech audio in which the speaker expresses or conveys Sadness.",
  "audio_filename": "EN_B00025_S06526_W000000.mp3",
  "audio_base64": "<base64 mp3 bytes>"
}
```

The server should return one of `{"similarity": …}`, `{"score": …}`, or `{"logit": …}`. The local stub server `sham_clap_server.py` implements this contract for testing.
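For a real model, any HTTP server that accepts this payload and returns one of those keys will work. A hypothetical sketch using FastAPI (an assumption; not necessarily how sham_clap_server.py is implemented):

```python
import base64
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class SimilarityRequest(BaseModel):
    text: str
    audio_filename: str
    audio_base64: str | None = None   # absent when --no-audio-send is used

@app.post("/v1/similarity")
def similarity(req: SimilarityRequest) -> dict:
    # Decode the clip if it was sent; otherwise load it from audio_filename yourself.
    audio_bytes = base64.b64decode(req.audio_base64) if req.audio_base64 else None
    # Replace this constant with your model's text-audio similarity score.
    return {"similarity": 0.0}
```

Serve it with e.g. `uvicorn my_server:app --port 8765` (module name hypothetical) and point `--endpoint` at it.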
```
uv run sham_clap_server.py --port 8765
uv run benchmark.py --endpoint http://127.0.0.1:8765/v1/similarity
```

```
# Drop new raw CSVs into annotations_raw/<subset>/, then:
uv run anonymize.py
uv run analysis.py
uv run benchmark.py
```