This repository contains MATLAB and Python scripts for feature extraction and classification of sustained vowel voice signals (pathological vs. healthy) using the HUPA database. The focus is on Perturbation, Regularity, Noise (PRN), and Complexity features.
To run the scripts, you need the HUPA database.
After downloading and organising the data, the expected structure inside this repository is:
HUPA-Voice-Analysis/
├── toolboxes/
│   ├── AVCA-ByO-master/
│   ├── covarep-master/
│   ├── fastdfa/
│   ├── hctsa-main/
│   ├── hurst estimators/
│   ├── ME-master/
│   └── rpde/
├── data/
│   ├── HUPA_db/
│   │   ├── healthy/
│   │   │   ├── 50 kHz/    ← mono .wav files at / resampled to 50 kHz
│   │   │   └── 25 kHz/    ← mono .wav files resampled to 25 kHz
│   │   ├── pathological/
│   │   │   ├── 50 kHz/    ← mono .wav files at / resampled to 50 kHz
│   │   │   └── 25 kHz/    ← mono .wav files resampled to 25 kHz
│   │   ├── HUPA_db.xlsx
│   │   └── README.md
│   ├── HUPA_voice_features_PRN_CPP_25kHz.csv
│   ├── HUPA_voice_features_PRN_CPP_50kHz.csv
│   ├── HUPA_Python_Results_Summary_25kHz.csv
│   ├── HUPA_Python_Results_Summary_50kHz.csv
│   ├── reviewer_analysis/
│   │   ├── ReviewerAnalysis_25kHz_<Group>.csv
│   │   └── ReviewerAnalysis_50kHz_<Group>.csv
│   ├── subtype_error_audit/    ← subtype audit on the held-out test split
│   │   └── SubtypeAudit_<fs>_<Group>_<ModelKey>.csv
│   └── subtype_error_audit_oof/    ← subtype audit using OOF predictions across the full dataset
│       └── SubtypeAudit_OOF_<fs>_<Group>_<ModelKey>.csv
├── figures/
│   ├── ROC_HUPA_25kHz_Python.png
│   ├── ROC_HUPA_25kHz_Python.pdf
│   ├── ROC_HUPA_50kHz_Python.png
│   ├── ROC_HUPA_50kHz_Python.pdf
│   ├── ROC_HUPA_25kHz_MATLAB.png
│   ├── ROC_HUPA_25kHz_MATLAB.pdf
│   ├── ROC_HUPA_50kHz_MATLAB.png
│   ├── ROC_HUPA_50kHz_MATLAB.pdf
│   └── confusion_matrices/
│       └── CM_<fs>_<Group>_<ModelKey>.png
├── HUPA_Features_Extraction.m
├── HUPA_PRN_GridSearch_ROC.m
├── HUPA_Python_GridSearch.py
├── requirements.txt
└── README.md
- The healthy/ folder contains recordings from healthy speakers.
- The pathological/ folder contains recordings from patients with different laryngeal pathologies.
- Each condition is available at 50 kHz and 25 kHz (all files are mono).
- Inside data/HUPA_db/ there is a spreadsheet, HUPA_db.xlsx, describing all speakers and recordings (age, sex, GRBAS scores, pathology codes, etc.), together with a local README.md in the same folder that documents the database structure and metadata fields.
New (metadata used by the scripts).
The file HUPA_db.xlsx is also used to:
- add metadata columns to the exported feature CSVs (e.g., Sex and Pathology code), and
- map Pathology code values to human-readable pathology names using the worksheet Pathology classification.
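The code-to-name mapping described above can be sketched in Python with pandas. This is only an illustration: the lookup column names `Code` and `Pathology`, the output column `Pathology name`, and the toy rows are assumptions, since the actual layout of the Pathology classification worksheet is not shown here.

```python
import pandas as pd


def add_pathology_names(features: pd.DataFrame, classification: pd.DataFrame) -> pd.DataFrame:
    """Attach a human-readable pathology name to every row of a feature table."""
    # "Code" and "Pathology" are hypothetical column names for the lookup worksheet.
    code_to_name = dict(zip(classification["Code"], classification["Pathology"]))
    annotated = features.copy()
    # Codes missing from the lookup table become NaN instead of raising an error.
    annotated["Pathology name"] = annotated["Pathology code"].map(code_to_name)
    return annotated


# With the real spreadsheet, the lookup table could be read along these lines:
# classification = pd.read_excel("data/HUPA_db/HUPA_db.xlsx",
#                                sheet_name="Pathology classification")

# Toy stand-ins (placeholder values, not real HUPA entries).
features = pd.DataFrame({
    "FileName": ["a.wav", "b.wav", "c.wav"],
    "Label": [0, 1, 1],
    "Pathology code": [0, 12, 99],
})
classification = pd.DataFrame({
    "Code": [0, 12],
    "Pathology": ["Healthy", "Placeholder name"],
})

annotated = add_pathology_names(features, classification)
print(annotated)
```

Leaving unknown codes as NaN makes gaps in the lookup table visible rather than silently relabelling them.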
The feature-extraction script, HUPA_Features_Extraction.m:
- Loads .wav files from:
  - data/HUPA_db/healthy/50 kHz/
  - data/HUPA_db/pathological/50 kHz/
  - data/HUPA_db/healthy/25 kHz/
  - data/HUPA_db/pathological/25 kHz/
- Extracts:
  - AVCA PRN features (Perturbation, Regularity, Noise)
  - Nonlinear/complexity features (depending on AVCA configuration)
  - CPP (Cepstral Peak Prominence) using Covarep
- Saves two CSV files, one per sampling frequency, in the data/ folder:
  - HUPA_voice_features_PRN_CPP_50kHz.csv
  - HUPA_voice_features_PRN_CPP_25kHz.csv
Each CSV includes:
- One row per audio file
- Columns:
  - All AVCA PRN (and complexity) features
  - CPP
  - FileName
  - Label (0 = healthy, 1 = pathological)
New (metadata columns for reproducible modelling).
The exported feature CSVs also include:
- Sex (string)
- Pathology code (integer; 0 for healthy, >0 for pathological subtypes)
These columns are taken from data/HUPA_db/HUPA_db.xlsx and are intended to support stratified analyses and subtype-level audits.
For each CSV, the MATLAB script HUPA_PRN_GridSearch_ROC.m:
- Loads HUPA_voice_features_PRN_CPP_50kHz.csv or HUPA_voice_features_PRN_CPP_25kHz.csv.
- Defines feature sets:
  - Noise
  - Perturbation (including CPP and jitter/shimmer)
  - Tremor
  - Complexity / nonlinear measures
  - All features (union of all feature blocks)
- Cleans the data:
  - Removes all-NaN / constant columns
  - Imputes remaining NaNs (median)
- Splits the data:
  - 80% Train (for hyperparameter optimisation via 5-fold CV)
  - 20% independent Test set
- Trains and tunes:
  - Logistic Regression (fitclinear)
  - SVM (RBF) (fitcsvm + fitPosterior)
  - Random Forest (TreeBagger)
  - MLP (fitcnet, if available)
- Evaluates models on the Test set and computes AUC.
- Plots ROC curves organised by classifier (2×2 subplots). Each subplot corresponds to one classifier, and each ROC curve within a subplot corresponds to one feature set (Noise, Perturbation, Tremor, Complexity, All).
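As a rough Python analogue of the cleaning step listed above (drop all-NaN and constant columns, then fill remaining gaps with the column median), the following sketch uses pandas. The helper name `clean_features` and the toy column names are illustrative assumptions, not taken from the MATLAB script.

```python
import numpy as np
import pandas as pd


def clean_features(X: pd.DataFrame) -> pd.DataFrame:
    """Drop uninformative columns, then impute remaining NaNs with column medians."""
    # Remove columns that contain no observed value at all.
    X = X.dropna(axis=1, how="all")
    # Remove constant columns; nunique() ignores NaN by default.
    X = X.loc[:, X.nunique() > 1]
    # Fill the remaining gaps with each column's median (NaNs are skipped).
    return X.fillna(X.median())


# Toy feature table with hypothetical column names.
raw = pd.DataFrame({
    "jitter": [0.1, np.nan, 0.3],
    "empty": [np.nan, np.nan, np.nan],
    "const": [1.0, 1.0, 1.0],
    "cpp": [10.0, 12.0, np.nan],
})

cleaned = clean_features(raw)
print(cleaned)
```

Here `empty` and `const` are dropped, and the missing `jitter` and `cpp` values are replaced by 0.2 and 11.0 respectively.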
Additional outputs (reviewer-oriented analysis). Using thresholds selected from out-of-fold (OOF) predictions via Youden’s J, the script also:
- saves test-set confusion matrices to figures/confusion_matrices/,
- reports sex-stratified AUC (test and OOF when Sex is available),
- writes subtype-level false negative audits:
  - on the held-out test set (data/subtype_error_audit/),
  - and using OOF predictions across the full dataset (data/subtype_error_audit_oof/).
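Threshold selection via Youden's J (J = sensitivity + specificity − 1, equivalently TPR − FPR) can be illustrated with a small NumPy sketch. The function names, the tie-breaking rule, and the toy scores below are assumptions made for illustration; the repository scripts may implement this differently.

```python
import numpy as np


def youden_threshold(y_true, scores):
    """Return (threshold, J) maximising J = TPR - FPR over the observed scores."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    best_thr, best_j = None, -np.inf
    for thr in np.unique(scores):          # candidate thresholds, ascending
        pred = scores >= thr               # predicted pathological
        tpr = np.mean(pred[y_true])        # sensitivity
        fpr = np.mean(pred[~y_true])       # 1 - specificity
        if tpr - fpr > best_j:             # ties keep the lowest threshold
            best_thr, best_j = thr, tpr - fpr
    return float(best_thr), float(best_j)


def confusion_metrics(y_true, scores, thr):
    """Sensitivity, specificity and balanced accuracy at a fixed threshold."""
    y_true = np.asarray(y_true).astype(bool)
    pred = np.asarray(scores, dtype=float) >= thr
    sens = float(np.mean(pred[y_true]))
    spec = float(np.mean(~pred[~y_true]))
    return {"Sensitivity": sens, "Specificity": spec, "BalancedAcc": (sens + spec) / 2}


# Toy out-of-fold scores: the threshold is chosen here...
oof_y = [0, 0, 0, 0, 1, 1, 1, 1]
oof_scores = [0.1, 0.2, 0.3, 0.6, 0.5, 0.7, 0.8, 0.9]
thr, j = youden_threshold(oof_y, oof_scores)

# ...and then applied unchanged to the held-out test scores.
test_y = [0, 0, 1, 1]
test_scores = [0.2, 0.55, 0.45, 0.9]
metrics = confusion_metrics(test_y, test_scores, thr)
print(thr, j, metrics)
```

Choosing the threshold on OOF predictions and only then applying it to the test split keeps the test-set confusion matrix free of threshold tuning.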
The script saves one ROC figure per sampling rate:
- figures/ROC_HUPA_50kHz_MATLAB.png and .pdf
- figures/ROC_HUPA_25kHz_MATLAB.png and .pdf
A Python implementation using scikit-learn reproduces the MATLAB analysis for both sampling frequencies.
The script expects the two CSVs generated by MATLAB:
- data/HUPA_voice_features_PRN_CPP_50kHz.csv
- data/HUPA_voice_features_PRN_CPP_25kHz.csv
Each CSV may include Sex and Pathology code. If present, the script will run the reviewer-oriented analyses described below.
For each sampling frequency (50 kHz, 25 kHz):
- Loads the corresponding CSV.
- Defines feature sets:
  - Noise, Perturbation, Tremor, Complexity, and All features.
- Uses a common train–test split:
  - 80% Train, 20% Test, stratified by label.
- For each feature set, runs a GridSearchCV with 5-fold CV and AUC as the scoring metric, over:
  - Logistic Regression
  - SVM (RBF)
  - Random Forest
  - MLP

  Each model is wrapped in a Pipeline with:
  - SimpleImputer(strategy="median")
  - StandardScaler (except Random Forest, which only uses imputation)
- Evaluates the best model (per algorithm) on the hold-out Test set.
- Plots ROC curves organised by classifier (2×2 subplots). Each subplot corresponds to one classifier, and ROC curves within a subplot correspond to feature sets (Noise, Perturbation, Tremor, Complexity, All). Figures are saved to figures/:
  - figures/ROC_HUPA_50kHz_Python.png and .pdf
  - figures/ROC_HUPA_25kHz_Python.png and .pdf
- Saves a summary CSV with all models and feature sets:
  - data/HUPA_Python_Results_Summary_50kHz.csv
  - data/HUPA_Python_Results_Summary_25kHz.csv

Each summary file contains, for every combination of feature set and model:
- Group
- Model
- Test_AUC
- CV_AUC_Mean
- Best_Params
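The workflow above (stratified 80/20 split, a Pipeline per model, GridSearchCV with 5-fold CV scored by AUC, then evaluation on the held-out test set) might look roughly like the sketch below. It runs on synthetic data, covers only two of the four classifiers for brevity, and the hyperparameter grids, random seeds, and step names are assumptions rather than the values used in HUPA_Python_GridSearch.py.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one feature set (not HUPA data).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Common split: 80% train, 20% test, stratified by label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# One pipeline per model; the grids below are illustrative assumptions.
models = {
    "LogReg": (
        Pipeline([
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler()),
            ("clf", LogisticRegression(max_iter=1000)),
        ]),
        {"clf__C": [0.1, 1.0, 10.0]},
    ),
    "RandomForest": (
        # Random Forest uses imputation only, without scaling.
        Pipeline([
            ("imputer", SimpleImputer(strategy="median")),
            ("clf", RandomForestClassifier(random_state=0)),
        ]),
        {"clf__n_estimators": [50, 100]},
    ),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = []
for name, (pipeline, grid) in models.items():
    search = GridSearchCV(pipeline, grid, scoring="roc_auc", cv=cv)
    search.fit(X_train, y_train)
    # The refitted best estimator is scored once on the held-out test set.
    test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
    results.append({
        "Group": "All",
        "Model": name,
        "Test_AUC": test_auc,
        "CV_AUC_Mean": search.best_score_,
        "Best_Params": search.best_params_,
    })

for row in results:
    print(row)
```

Keeping the imputer and scaler inside the Pipeline means they are refitted within each CV fold, so validation folds do not influence the preprocessing statistics.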
Additional outputs (reviewer-oriented analysis).
For each feature set, the script also creates:
- data/reviewer_analysis/ReviewerAnalysis_<fs>_<Group>.csv with:
  - test AUC and CV AUC,
  - sex-stratified AUC on test (AUC_by_Sex_Test) when Sex is available,
  - sex-stratified AUC using OOF predictions (AUC_by_Sex_OOF) when Sex is available,
  - Youden threshold selected from OOF predictions,
  - test confusion-matrix metrics (Sensitivity, Specificity, BalancedAcc),
  - paths to the confusion-matrix figure and subtype audit files.
- subtype-level audits:
  - data/subtype_error_audit/SubtypeAudit_<fs>_<Group>_<ModelKey>.csv (test split),
  - data/subtype_error_audit_oof/SubtypeAudit_OOF_<fs>_<Group>_<ModelKey>.csv (OOF full dataset).
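To make the sex-stratified AUC and the subtype-level false-negative audit more concrete, here is a minimal pandas/scikit-learn sketch. The `Score` column, the output column names (`N`, `FalseNegatives`, `FN_Rate`), the sex labels, and the pathology codes in the toy table are hypothetical; the real audit files may be organised differently.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score


def auc_by_sex(df: pd.DataFrame) -> dict:
    """AUC computed separately for each value of the Sex column."""
    out = {}
    for sex, group in df.groupby("Sex"):
        # AUC is undefined when a subgroup contains only one class.
        if group["Label"].nunique() < 2:
            out[sex] = float("nan")
        else:
            out[sex] = roc_auc_score(group["Label"], group["Score"])
    return out


def subtype_false_negatives(df: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Per pathology code: number of cases and how many fall below the threshold."""
    patho = df[df["Label"] == 1].copy()
    patho["Missed"] = patho["Score"] < threshold   # pathological but predicted healthy
    audit = patho.groupby("Pathology code")["Missed"].agg(N="count", FalseNegatives="sum")
    audit["FN_Rate"] = audit["FalseNegatives"] / audit["N"]
    return audit.reset_index()


# Toy predictions (placeholder codes and scores, not HUPA results).
preds = pd.DataFrame({
    "Sex": ["F", "F", "F", "M", "M", "M"],
    "Label": [0, 1, 1, 0, 1, 1],
    "Pathology code": [0, 3, 3, 0, 3, 7],
    "Score": [0.2, 0.9, 0.4, 0.6, 0.7, 0.3],
})

aucs = auc_by_sex(preds)
audit = subtype_false_negatives(preds, threshold=0.5)
print(aucs)
print(audit)
```

The same two helpers could be applied either to test-split predictions or to OOF predictions over the full dataset, which is the distinction between the two audit folders.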
- MATLAB (R2020b or newer recommended)
- Statistics and Machine Learning Toolbox
- Deep Learning Toolbox (optional, for fitcnet)
Place these libraries inside toolboxes/:
- AVCA-ByO: Essential for P, R, N features.
- Covarep: Used for CPP feature extraction.
- Hurst Estimators: Implementation to compute the Hurst exponent.
- RPDE: Code to compute Recurrence Period Density Entropy (Little et al., 2007).
- FastDFA: Implementation to compute Detrended Fluctuation Analysis (Little et al., 2006).
- HCTSA: Highly Comparative Time-Series Analysis (used for D2 and LLE).
- ME (Markovian Entropies): Functions for the computation of entropies from Markov Models.
Compatibility Note for Newer MATLAB Versions
Many of these toolboxes were developed years ago. If you are using a recent version of MATLAB (e.g., R2020b+), please be aware of the following:
- Legacy Code: You may need to manually update small parts of the external toolboxes to fix deprecated functions.
- Path Conflicts: The script HUPA_Features_Extraction.m already handles a known conflict with Covarep (it removes backcompatibility_2015 to avoid breaking the built-in audioread).
- Debugging: If you encounter "function not found" or "input argument" errors inside these toolboxes, check that their internal paths are correctly added and that they support your MATLAB version.
Install dependencies via:

    pip install -r requirements.txt

Optional (Windows stability). If parallel jobs cause issues on Windows, set the number of jobs to 1 before running:

    set HUPA_N_JOBS=1
    python HUPA_Python_GridSearch.py

[Add here the reference to the HUPA database and the related publication, once finalised.]