Benchmark tooling for two emotion-audio annotation subsets and CLAP-style model evaluation against them.
- **emolia-emo** — 3-level ordinal rating (`not_present` / `weakly_present` / `strongly_present`) per `(file, queried_emotion, task_type)`. Used to score whether a clip expresses a queried emotion. Variables are self-explanatory emotion names.
- **emolia-dim** — binary `yes`/`no` rating per `(file, dimension, level, polarity)` against a written rubric in `dataset/emolia-dim/variables.json`. Used to score whether a clip matches a specific level of a perceptual dimension (e.g. `TEMP` level 5 = "fast").
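To make the item keys concrete, here is a minimal sketch of grouping each subset's annotation rows into benchmark items. The column names mirror the key tuples above but are assumptions, not verified against the CSV headers.

```python
import pandas as pd

# Assumed column names; check annotations/<subset>/annotations.csv for the real header.
emo = pd.read_csv("annotations/emolia-emo/annotations.csv")
dim = pd.read_csv("annotations/emolia-dim/annotations.csv")

# One benchmark item = one key tuple; each item should carry up to 3 rater rows.
emo_items = emo.groupby(["file", "queried_emotion", "task_type"])
dim_items = dim.groupby(["file", "dimension", "level", "polarity"])
print(len(emo_items), "emolia-emo items,", len(dim_items), "emolia-dim items")
```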
```
annotations_raw/                      # gitignored; real usernames live here
  emolia-emo/{annotations.csv, users.csv}
  emolia-dim/{annotations.csv, users.csv}
annotations/                          # committed; usernames anonymized to user_0, user_1, …
  emolia-emo/{annotations.csv, users.csv}
  emolia-dim/{annotations.csv, users.csv}
dataset/
  emolia-emo/data/<Emotion>_best/<stem>.{mp3,json}
  emolia-dim/data/<DIM>/<level>/<polarity>/sample_NN.{mp3,json}
  emolia-dim/variables.json           # rubric for prompts
analysis_outputs/<subset>/
  benchmark_labels.csv  per_*_summary.csv  summary.json  incomplete_items.csv
analysis_outputs/report.md            # combined paper-ready summary
benchmark_outputs/<subset>/
  predictions.csv  metrics_by_*.csv  summary.json  report.md
```
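The metadata encoded in these paths can be recovered mechanically. A hypothetical sketch (not code from this repo), assuming the layout shown above:

```python
from pathlib import Path

def parse_dim_path(p: Path) -> dict:
    # dataset/emolia-dim/data/<DIM>/<level>/<polarity>/sample_NN.mp3
    dimension, level, polarity = p.parts[-4], p.parts[-3], p.parts[-2]
    return {"stem": p.stem, "dimension": dimension, "level": level, "polarity": polarity}

def parse_emo_path(p: Path) -> dict:
    # dataset/emolia-emo/data/<Emotion>_best/<stem>.mp3
    return {"stem": p.stem, "emotion": p.parent.name.removesuffix("_best")}

# The concrete <polarity> folder name below is a placeholder.
print(parse_dim_path(Path("dataset/emolia-dim/data/TEMP/5/<polarity>/sample_01.mp3")))
print(parse_emo_path(Path("dataset/emolia-emo/data/Sadness_best/EN_B00025_S06526_W000000.mp3")))
```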
```
uv venv --python 3.13      # only needed once
uv run anonymize.py        # then any other entry point
```

All scripts use `uv run`.
`annotations_raw/` is gitignored because usernames are still in there. Run the anonymizer any time the raw CSVs are replaced:

```
uv run anonymize.py
```

This rewrites `annotations/<subset>/annotations.csv` and `annotations/<subset>/users.csv` with usernames replaced by `user_0`, `user_1`, … (sorted username order; the mapping is saved to `annotations_raw/<subset>/_anon_map.csv`).
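A minimal sketch of that mapping rule, assuming a `username` column (the real anonymize.py also rewrites users.csv and handles both subsets):

```python
import pandas as pd

raw = pd.read_csv("annotations_raw/emolia-emo/annotations.csv")

# Sorted usernames map deterministically to user_0, user_1, …
mapping = {name: f"user_{i}" for i, name in enumerate(sorted(raw["username"].unique()))}
raw["username"] = raw["username"].map(mapping)
raw.to_csv("annotations/emolia-emo/annotations.csv", index=False)

# Keep the mapping next to the raw data so real usernames never leave that directory.
pd.DataFrame(list(mapping.items()), columns=["username", "anon"]).to_csv(
    "annotations_raw/emolia-emo/_anon_map.csv", index=False
)
```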
```
uv run analysis.py
```

For each subset this writes to `analysis_outputs/<subset>/`:

- `benchmark_labels.csv` — one row per item with the majority-vote target (`majority_present`), per-rating vote counts, `n_raters`, `all_agree_binary`, and `benchmark_bucket` (`unanimous_*`, `majority_*`, `single_rater_*`).
- `per_*_summary.csv` — slice tables (task type / emotion / dimension / polarity).
- `incomplete_items.csv` — items lacking 3 raters.
- `summary.json` — machine-readable summary including `human_upper_bound_binary` (mean pairwise exact agreement) and a rater-coverage histogram.

It also writes a single combined paper-ready report to `analysis_outputs/report.md`. Numbers there are formatted for direct use in a methods section: total annotations, annotators / demographics, kappa / Fleiss kappa, per-task and per-emotion / per-dimension breakdowns.
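The two quantities the benchmark later keys on can be sketched like this, assuming each rater's rating has already been collapsed to a binary present / not-present vote:

```python
from itertools import combinations

def majority_present(votes: list[bool]) -> bool:
    # Majority-vote target for one item.
    return sum(votes) > len(votes) / 2

def pairwise_exact_agreement(votes: list[bool]) -> float:
    # Fraction of rater pairs that gave the same answer on this item;
    # averaged over items this gives human_upper_bound_binary.
    pairs = list(combinations(votes, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

item_votes = [True, True, False]              # 3 raters on one item
print(majority_present(item_votes))           # True
print(pairwise_exact_agreement(item_votes))   # 0.333…
```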
Sham mode (deterministic fake similarity; no audio is read):

```
uv run benchmark.py
```

Remote endpoint mode:

```
uv run benchmark.py --endpoint http://127.0.0.1:8765/v1/similarity
```

Useful flags:

- `--subset emolia-emo|emolia-dim|both` (default `both`).
- `--require-3-raters` — restrict to items with full 3-rater coverage.
- `--limit N` — quick smoke test on the first N rows.
- `--threshold 0.0` — predict positive if similarity ≥ threshold.
- `--no-audio-send` — with `--endpoint`, send only the stem + text in the JSON payload (the server reads the files itself).

For each subset, `benchmark.py` writes to `benchmark_outputs/<subset>/`: `predictions.csv`, `metrics_by_<task_type|polarity>.csv`, `metrics_by_benchmark_bucket.csv`, `summary.json`, `report.md`.
The report starts and ends with a score rubric:
| Band | Balanced accuracy | Notes |
|---|---|---|
| Bad | < 0.55 | At or below random; model isn't learning |
| Weak | 0.55 – 0.65 | Some signal, far from human |
| Medium | 0.65 – 0.75 | Useful but lossy; decent training target |
| Good | 0.75 – 0.85 | Strong CLAP-style performance |
| Excellent | ≥ 0.85 | Approaches human inter-rater agreement |
Headline metric: balanced accuracy on `majority_present`. The unanimous subset (`benchmark_bucket` starting with `unanimous_`) is the cleanest target to optimize against.
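Balanced accuracy is the mean of per-class recall, so 0.5 is chance regardless of class balance. A quick sketch (sklearn is an assumption here; benchmark.py may compute it differently):

```python
from sklearn.metrics import balanced_accuracy_score

y_true = [1, 1, 1, 0, 0]   # majority_present
y_pred = [1, 1, 0, 0, 1]   # similarity >= threshold
# Recall is 2/3 on the positives and 1/2 on the negatives,
# so balanced accuracy ≈ 0.583 — the "Weak" band above.
print(balanced_accuracy_score(y_true, y_pred))
```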
`benchmark.py` posts JSON to your endpoint:

```json
{
  "text": "Speech audio in which the speaker expresses or conveys Sadness.",
  "audio_filename": "EN_B00025_S06526_W000000.mp3",
  "audio_base64": "<base64 mp3 bytes>"
}
```

The server should return one of `{"similarity": …}`, `{"score": …}`, or `{"logit": …}`. The local stub server `sham_clap_server.py` implements this contract for testing.
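For a real model, any HTTP server that accepts this payload and returns one of those keys will work. A hypothetical sketch using FastAPI (an assumption; not necessarily how sham_clap_server.py is implemented):

```python
import base64
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class SimilarityRequest(BaseModel):
    text: str
    audio_filename: str
    audio_base64: str | None = None   # absent when --no-audio-send is used

@app.post("/v1/similarity")
def similarity(req: SimilarityRequest) -> dict:
    # Decode the clip if it was sent; otherwise load it from audio_filename yourself.
    audio_bytes = base64.b64decode(req.audio_base64) if req.audio_base64 else None
    # Replace this constant with your model's text-audio similarity score.
    return {"similarity": 0.0}
```

Serve it with e.g. `uvicorn my_server:app --port 8765` (module name hypothetical) and point `--endpoint` at it.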
```
uv run sham_clap_server.py --port 8765
uv run benchmark.py --endpoint http://127.0.0.1:8765/v1/similarity
```

```
# Drop new raw CSVs into annotations_raw/<subset>/, then:
uv run anonymize.py
uv run analysis.py
uv run benchmark.py
```