An end-to-end audio sentiment analysis pipeline using Whisper for speech-to-text and transformer-based sentiment classification.
Audio → Whisper (STT) → Sentiment Classification
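The two-stage flow above can be sketched in a few lines. This is a minimal illustration, assuming the `openai-whisper` and `transformers` packages are installed; the function names, the `base` Whisper size, and the label mapping are illustrative, not the repo's actual API.

```python
def transcribe(audio_path: str, model_size: str = "base") -> str:
    """Whisper speech-to-text for one clip (lazy import keeps module load light)."""
    import whisper  # provided by the openai-whisper package
    model = whisper.load_model(model_size)
    return model.transcribe(audio_path)["text"].strip()

def normalize_label(label: str) -> str:
    """Map raw classifier labels (e.g. 'LABEL_2') to sentiment names."""
    mapping = {"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive"}
    return mapping.get(label, label.lower())

def classify(text: str) -> dict:
    """Sentiment for one transcript using the Cardiff RoBERTa checkpoint."""
    from transformers import pipeline
    clf = pipeline("sentiment-analysis",
                   model="cardiffnlp/twitter-roberta-base-sentiment-latest")
    out = clf(text)[0]
    return {"label": normalize_label(out["label"]), "score": out["score"]}

def analyze_clip(audio_path: str) -> dict:
    """Audio -> transcript -> sentiment, mirroring the diagram above."""
    return classify(transcribe(audio_path))
```

In practice the Whisper and RoBERTa models would be loaded once and reused across clips rather than reloaded per call.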
Model: `cardiffnlp/twitter-roberta-base-sentiment-latest`
- Accuracy: 0.967
- Macro F1: 0.967
- Avg Latency: 2.83 sec/clip
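Average latency here is wall-clock seconds per clip for the full pipeline. A minimal way to measure it, where `process` stands in for a hypothetical per-clip function (not the repo's actual script):

```python
import time

def average_latency(process, clips) -> float:
    """Mean wall-clock seconds per clip for a callable `process`."""
    total = 0.0
    for clip in clips:
        start = time.perf_counter()
        process(clip)
        total += time.perf_counter() - start
    return total / len(clips)
```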
- Validation Accuracy: 0.923
- Validation Macro F1: 0.915
- Test Accuracy: 1.000
- Test Macro F1: 1.000
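Macro F1, reported above, is the unweighted mean of per-class F1 scores, so minority classes count as much as majority ones. A self-contained reference implementation over the three sentiment classes (equivalent to scikit-learn's `f1_score(..., average="macro")`):

```python
def macro_f1(y_true, y_pred, labels=("negative", "neutral", "positive")) -> float:
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for lab in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p == lab)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != lab and p == lab)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p != lab)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # Per-class F1 is the harmonic mean of precision and recall.
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(labels)
```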
The fine-tuned model achieved perfect scores on the current test split, but the lower validation scores point to overfitting on the small dataset; larger real-world audio datasets are needed before making stronger generalization claims.
- `data/` — cleaned dataset and train/val/test splits
- `outputs/` — leaderboard and evaluation summaries
- `reports/` — model selection summary
- `src/` — data generation, evaluation, and fine-tuning scripts
```shell
python src/fine_tune_cardiff_roberta.py
```