Deepfake Audio Detection Benchmark

A neutral, public benchmark for evaluating audio deepfake detection systems on a diverse, format-rich dataset.

Why this benchmark? Self-reported deepfake detection scores are often unreliable due to overfitting on public test sets and selective reporting. This project hosts a fixed evaluation set with private gold-standard labels held by Podonos, scoring submissions in a verifiable, apples-to-apples manner.

Leaderboard

16 systems evaluated, sorted by accuracy: 7 commercial APIs (ranks 1–7) and 9 open-source baselines (ranks 8–16). Sources for each system are listed under Models Evaluated.

#	System	N	Rej%	Acc%	F1	FPR%	FNR%	Lat(ms)	RTF
1	Resemble AI	4524	0.0%	98.05%	0.981	2.5%	1.4%	1,164	0.40
2	Whispeak	4524	0.0%	97.70%	0.977	2.9%	1.7%	1,052	0.39
3	Aurigin AI	4524	0.0%	96.75%	0.967	1.5%	5.0%	980	0.33
4	Pindrop	4524	0.0%	95.05%	0.951	6.2%	3.7%	282	0.076
5	Corsound AI	3875	14.3%	87.79%	0.865	1.0%	23.1%	180	0.035
6	Hive	4524	0.0%	83.53%	0.808	2.4%	30.5%	881	0.34
7	Reality Defender	3745	17.2%	71.27%	0.770	53.7%	3.6%	5,718	1.52
8	Wav2Vec2 (2019 LA)	4524	0.0%	62.89%	0.514	13.4%	60.8%	622	0.14
9	AST (ASVspoof 5)	4524	0.0%	56.83%	0.657	69.0%	17.4%	5	0.0017
10	Wav2Vec2 (2024 mix)	4524	0.0%	55.55%	0.499	33.1%	55.8%	219	0.056
11	Deepfake-V2 (W2V2-base)	4524	0.0%	53.03%	0.162	3.1%	90.9%	94	0.027
12	AST (VoxCelebSpoof)	4524	0.0%	50.99%	0.048	0.5%	97.5%	8	0.0030
13	RawNet2 (2019 LA)	4524	0.0%	50.66%	0.430	35.9%	62.7%	94	0.035
14	LCNN-LFCC (2019 LA)	4524	0.0%	50.00%	0.667	100.0%	0.0%	23	0.0056
15	AASIST (2019 LA)	4524	0.0%	48.17%	0.486	52.6%	51.1%	322	0.11
16	AASIST3 (ASVspoof 5)	4524	0.0%	47.63%	0.029	6.3%	98.4%	363	0.13

Legend:

N — number of evaluated audio files
Rej% — % of files the API rejected (NOT_APPLICABLE / errors)
Acc% — overall accuracy
FPR% — false positive rate (real flagged as fake)
FNR% — false negative rate (fake missed)
Lat(ms) — average per-file inference latency
RTF — real-time factor: prediction-time / audio-duration (lower is better)

Observations

Top tier — four commercial APIs clear 95 %:

Four production APIs separate themselves from the rest, all above 95 % accuracy with F1 ≥ 0.95. The choice between them comes down to which error you can least afford and how fast you need an answer.

Resemble AI — 98.05 % accuracy, F1 0.981, FPR 2.5 %, FNR 1.4 %. Best at catching fakes: only ~1 in 70 deepfakes slips past it. Choose Resemble when missing a deepfake is worse than a false alarm: fraud / KYC voice verification, content provenance, or anywhere letting a synthetic voice through is the high-cost outcome.
Whispeak — 97.70 % accuracy, F1 0.977, a balanced 2.9 % FPR / 1.7 % FNR. The most symmetric error profile in the top tier — strong on both real and fake audio without leaning either way.
Aurigin AI — 96.75 % accuracy, F1 0.967, FPR 1.5 % (the lowest in the top tier), FNR 5.0 %. Best of the four at protecting real audio: only ~1 in 65 genuine clips is wrongly flagged. Choose Aurigin when false alarms on real audio are worse than missed fakes: content moderation at scale, automated takedowns, journalist verification.
Pindrop — 95.05 % accuracy, F1 0.951, and by far the fastest of the four top-tier APIs: ~282 ms/file (RTF 0.076), roughly 4× faster than Resemble/Whispeak and ~20× faster than Reality Defender. Errors lean toward false positives (FPR 6.2 % vs FNR 3.7 %). The pick when latency budget is tight and you can tolerate a slightly higher false-alarm rate.

Mid tier — accurate but with a catch:

Corsound AI — 87.8 % accuracy with the lowest FPR of any commercial system (1.0 %), but a high 23.1 % FNR (misses ~1 in 4 fakes) and it rejects 14.3 % of files (it declines clips below its minimum duration). Very conservative: it almost never false-flags real audio, at the cost of letting fakes through.
Hive — 83.5 % accuracy, FPR 2.4 % (very low), FNR 30.5 % (high). Same conservative shape as Corsound — rarely cries wolf, misses about a third of fakes.

Bottom tier (commercial) — high false-alarm rates:

Reality Defender — 71.3 % accuracy with FPR 53.7 % (false-flags over half of all real audio). Also rejects 17.2 % of files — its engine cannot evaluate audio shorter than ~1.5 s — and runs slower than real-time (RTF 1.52).

Open-source baselines — none generalize to modern TTS:

All nine open-source models land in the 48–63 % band — near random for binary classification. This holds regardless of training era: models trained on legacy ASVspoof 2019 LA and ones trained on newer ASVspoof 5 / VoxCelebSpoof both collapse on this distribution, which is dominated by current commercial voice-cloning systems (ElevenLabs, F5-TTS, Chatterbox, …).
Wav2Vec2 (2019 LA) is the strongest open baseline (62.9 %), reflecting the value of self-supervised audio representations.
Several models are effectively degenerate — they collapse to predicting one class: LCNN-LFCC flags everything as fake (100 % FPR), while AST (VoxCelebSpoof) (97.5 % FNR), AASIST3 (98.4 % FNR), and Deepfake-V2 (90.9 % FNR) flag almost everything as real. Their headline accuracy is an artifact of the 50/50 class balance, not detection skill.
Takeaway: off-the-shelf academic checkpoints are not a substitute for a production detector on real-world, format-diverse, modern-TTS audio.
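
The link between the error rates and headline accuracy can be checked by hand. Because the set is split 50/50 and the open-source models reject nothing, accuracy is approximately 1 - (FPR + FNR) / 2. The sketch below applies that identity to rates copied from the leaderboard. The function names and the 0.9 cut-off for calling a model "degenerate" are illustrative assumptions, not part of the benchmark code.

```python
def balanced_accuracy(fpr: float, fnr: float) -> float:
    """Accuracy on a 50/50 real/fake set, given error rates as fractions."""
    return 1.0 - (fpr + fnr) / 2.0


def is_degenerate(fpr: float, fnr: float, threshold: float = 0.9) -> bool:
    """True when nearly all of one class is mislabeled (threshold is an assumption)."""
    return fpr >= threshold or fnr >= threshold


# (FPR, FNR) as fractions, copied from the leaderboard rows.
rows = {
    "LCNN-LFCC (2019 LA)": (1.000, 0.000),
    "AST (VoxCelebSpoof)": (0.005, 0.975),
    "AASIST3 (ASVspoof 5)": (0.063, 0.984),
    "Wav2Vec2 (2019 LA)": (0.134, 0.608),
}

for name, (fpr, fnr) in rows.items():
    acc = balanced_accuracy(fpr, fnr)
    print(f"{name}: accuracy={acc:.3f} degenerate={is_degenerate(fpr, fnr)}")
```

A model that always answers "fake" has FPR 1.0 and FNR 0.0, which gives exactly 0.5. That is the LCNN-LFCC row.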

Latency / RTF:

Reality Defender's RTF > 1.0 means it is slower than real-time — a 5-second clip takes ~7.6 s to process. Not viable for streaming.
Every other detector runs faster than real-time (RTF < 1). Among the four top-tier systems, Pindrop is the fastest (RTF 0.076); Resemble, Whispeak, and Aurigin, along with Hive, land around RTF 0.33–0.40. Corsound is faster still (RTF 0.035) but sits in the mid tier.
The open-source models run on local hardware (no network round-trip), so their low RTF reflects pure compute cost — but at this accuracy that speed buys little.
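
The RTF arithmetic above is a single multiplication. A minimal sketch, assuming RTF is simply processing time divided by audio duration as stated in the legend; the helper name is hypothetical.

```python
def processing_seconds(rtf: float, audio_seconds: float) -> float:
    """Expected processing time implied by a real-time factor."""
    return rtf * audio_seconds


# RTF values copied from the leaderboard, applied to a 5-second clip.
for system, rtf in [("Reality Defender", 1.52), ("Pindrop", 0.076)]:
    seconds = processing_seconds(rtf, 5.0)
    verdict = "slower than real-time" if rtf > 1.0 else "faster than real-time"
    print(f"{system}: {seconds:.2f} s for a 5 s clip ({verdict})")
```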

Error Profile

Accuracy vs Real-Time Factor

Dataset

4,524 audio files spanning six formats: .mp3, .wav, .flac, .ogg, .m4a, .webm
Class balance: 50/50 (real / fake)
Real audio drawn from three established public corpora:
- VCTK — 110 English speakers, multiple accents
- LJSPEECH — single-speaker, ~24 hours of public-domain audiobook recordings
- LibriTTS-360 — the train-clean-360 subset of LibriTTS, 904 speakers
Synthetic audio: ~25 commercial TTS / voice-cloning models including Chatterbox, ElevenLabs, Microsoft F5-TTS, and others
Quality verification: All synthetic audio is round-trip transcribed with OpenAI Whisper to ensure the TTS system synthesized the intended utterance, before format conversion.
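
A round-trip check of this kind might look like the sketch below. The acceptance rule (exact match after text normalization) and the `transcribe` callable are assumptions for illustration; DATASET.md is the reference for what the benchmark actually does. In practice `transcribe` would wrap a Whisper call, and a stub is used here so the example runs without a model.

```python
import re
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9' ]+", " ", text.lower())
    return " ".join(text.split())


def round_trip_ok(intended: str, audio_path: str,
                  transcribe: Callable[[str], str]) -> bool:
    """True when the ASR transcript matches the text the TTS was asked to say."""
    return normalize(transcribe(audio_path)) == normalize(intended)


def stub_asr(audio_path: str) -> str:
    """Stand-in for a real speech recognizer (hypothetical transcripts)."""
    transcripts = {"a.wav": "Hello, world!", "b.wav": "Hello word"}
    return transcripts[audio_path]


print(round_trip_ok("hello world", "a.wav", stub_asr))
print(round_trip_ok("hello world", "b.wav", stub_asr))
```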

See DATASET.md for full construction details.

Telephony Tracks (New)

Most real-world fraud, KYC, and call-center audio never arrives as a studio file. It comes over a phone or mobile/VoIP link: band-limited, resampled, and compressed by a low-bitrate speech codec. We provide two telephony tracks that stress detectors under those conditions. Each track contains the same 4,524 clips with the same hidden labels as the main benchmark, degraded to channel grade, so scores are directly comparable to the studio leaderboard above:

Narrowband, 8 kHz (2G/3G): dataset_8k_nb/
Wideband, 16 kHz (4G/5G): dataset_16k_wb/

In each track, every clip is decoded, resampled to the track rate with a high-quality anti-aliased resampler, band-pass filtered to the channel passband, passed through one randomly assigned codec (full encode then decode, so it picks up that codec's real compression artifacts), and written as 16-bit mono WAV. The per-file codec assignment is seeded, stratified across source formats, and kept private (like the labels). The two tracks use independent permutations, so their file orders do not line up with each other or with the studio set. Both tracks are released as audio only; the codec pipeline is kept private to preserve benchmark integrity.
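
The actual codec pipeline and assignment are private, so the following is only a sketch of what a seeded assignment stratified across source formats could look like. The codec labels, function name, and round-robin scheme are assumptions, not the benchmark's method.

```python
import random
from collections import Counter, defaultdict

# Illustrative labels for the narrowband codec pool.
NB_CODECS = ["g711_ulaw", "g711_alaw", "amr_nb", "g729", "gsm_fr"]


def assign_codecs(files, codecs, seed):
    """Map each filename to one codec, balanced within each source format."""
    rng = random.Random(seed)  # fixed seed makes the assignment reproducible
    by_format = defaultdict(list)
    for name in sorted(files):
        by_format[name.rsplit(".", 1)[-1]].append(name)

    assignment = {}
    for fmt in sorted(by_format):
        group = by_format[fmt]
        rng.shuffle(group)
        offset = rng.randrange(len(codecs))  # vary which codec gets the remainder
        for i, name in enumerate(group):
            assignment[name] = codecs[(i + offset) % len(codecs)]
    return assignment


formats = ["mp3", "wav", "flac", "ogg", "m4a", "webm"]
files = [f"{i}.{ext}" for i, ext in enumerate(formats * 10)]
a = assign_codecs(files, NB_CODECS, seed=0)
print(Counter(a.values()))
```

Stratifying by format keeps any one codec from being concentrated on, say, `.mp3` sources. A scheme like this also produces near-equal per-codec file counts, which is what the tables below show.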

PESQ is measured against the clean track-rate reference (higher is better); Whisper-WER is the word-error rate of the codec'd clip versus the clean-reference transcript (lower means intelligibility is preserved).
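
For reference, word-error rate is the word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. The sketch below is a plain dynamic-programming version. It assumes both transcripts are already normalized and whitespace-tokenized, and it is not necessarily the exact scoring used to produce the tables.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on the mat"))
print(word_error_rate("the cat sat", "the bat sat"))
```

WER can exceed 100 % when the hypothesis contains many inserted words, which is one reason a median is a useful companion to the mean.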

Narrowband track: 8 kHz (2G/3G)

Codec pool: landline (G.711 μ-law / A-law), 2G/3G mobile (GSM-FR, AMR-NB), and VoIP (G.729). Band-pass 300 to 3400 Hz. 4,524 clips, ~6 hours, mean 4.80 s.

Codec	Bitrate	Files	PESQ-NB (mean) ↑	Whisper-WER (mean) ↓
G.711 μ-law	64 kbit/s	905	4.44	5.3%
G.711 A-law	64 kbit/s	903	4.44	4.4%
AMR-NB	12.2 kbit/s	904	4.07	8.4%
G.729	8 kbit/s	906	3.71	6.4%
GSM-FR	13 kbit/s	906	3.51	11.0%

Wideband track: 16 kHz (4G/5G)

Codec pool: the 4G/5G mobile wideband codecs EVS-WB and AMR-WB (G.722.2), each at two bitrates. Band-pass 50 to 7000 Hz. 4,524 clips, ~6 hours, mean 4.80 s.

Codec	Bitrate	Files	PESQ-WB (mean) ↑	Whisper-WER (mean) ↓
EVS-WB	24.4 kbit/s	1129	4.05	2.0%
AMR-WB	23.85 kbit/s	1133	3.72	3.2%
EVS-WB	13.2 kbit/s	1130	3.69	4.0%
AMR-WB	12.65 kbit/s	1132	3.27	4.0%

In both tracks the PESQ ordering is the expected one (higher bitrate and newer codecs score higher, and EVS edges AMR-WB at matched rates). Median word-error rates are ~0%, confirming the clips remain intelligible after degradation, so the detection task stays fair.

Submit your results

Run your detector over a track's folder (filenames are 0.wav, 1.wav, ...) and submit a predictions.csv as described in Submission Format below. Each track is scored against its own private gold standard. The telephony leaderboards open as submissions arrive; the studio leaderboard above is already live.

How to Reproduce

1. Clone and install

git clone https://github.com/podonos/audio-dfd-benchmark.git
cd audio-dfd-benchmark
pip install -r requirements.txt

# Install ffmpeg (for audio conversion)
# macOS:  brew install ffmpeg
# Linux:  apt-get install ffmpeg

2. Convert audio to 16 kHz mono WAV (open-source models only)

python scripts/convert_audio.py

This populates dataset_wav16k/ with 4,524 normalized WAV files.
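
For readers who want to convert a single file by hand, the sketch below builds an equivalent ffmpeg call. It is not the contents of scripts/convert_audio.py; the 16-bit PCM output and the output filename are assumptions.

```python
import subprocess
from pathlib import Path


def ffmpeg_command(src: Path, dst: Path, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg call that resamples to mono 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y",           # overwrite the output if it exists
        "-i", str(src),
        "-ar", str(sample_rate),  # output sample rate in Hz
        "-ac", "1",               # mono
        "-c:a", "pcm_s16le",      # 16-bit PCM (assumed bit depth)
        str(dst),
    ]


def convert(src: Path, dst: Path) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(ffmpeg_command(src, dst), check=True)


cmd = ffmpeg_command(Path("dataset/0.flac"), Path("dataset_wav16k/0.wav"))
print(" ".join(cmd))
```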

3. Run open-source models

Each open-source model uses publicly available pre-trained checkpoints.

python scripts/run_aasist.py     # AASIST  (clovaai/aasist)
python scripts/run_rawnet2.py    # RawNet2 (MattyB95/pre_trained_DF_RawNet2)
python scripts/run_wav2vec2.py   # Wav2Vec2 SSL (Gustking/wav2vec2-large-xlsr-deepfake-audio-classification)
python scripts/run_lcnn.py       # LCNN-LFCC (MattyB95/pre_trained_DF_LFCC-LCNN)

Each script writes results/predictions_<model>.csv with columns: filename, label, confidence, latency_ms, audio_duration_sec.
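
A quick way to sanity-check one of these files before scoring is to summarize its latency columns. The sketch below reads the columns named above. Averaging per-file RTF values (rather than dividing total latency by total duration) is an assumption about how the leaderboard figure is aggregated.

```python
import csv
import io
from statistics import mean


def latency_summary(csv_text: str) -> dict:
    """Mean latency and mean per-file RTF from a predictions CSV."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    latencies = [float(r["latency_ms"]) for r in rows]
    rtfs = [
        float(r["latency_ms"]) / 1000.0 / float(r["audio_duration_sec"])
        for r in rows
    ]
    return {
        "files": len(rows),
        "mean_latency_ms": mean(latencies),
        "mean_rtf": mean(rtfs),
    }


# Made-up rows in the documented column layout.
sample = """filename,label,confidence,latency_ms,audio_duration_sec
0.wav,real,0.91,100.0,5.0
1.wav,fake,0.77,300.0,4.0
"""
print(latency_summary(sample))
```

To run it on a real file, pass the text of a file under results/ to `latency_summary`.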

4. Run commercial APIs (optional, requires API keys)

Set the relevant API keys as environment variables:

export RESEMBLE_API_KEY="<your-key>"
export HIVE_API_KEY="<your-key>"
export REALITY_DEFENDER_API_KEY="<your-key>"
export AURIGIN_API_KEY="<your-key>"

python scripts/run_commercial_apis.py                   # all APIs
python scripts/run_commercial_apis.py --api resemble    # one API
python scripts/run_commercial_apis.py --api hive --limit 100

5. Compute metrics

python scripts/compute_metrics.py

Outputs the per-model breakdown including per-format accuracy and the leaderboard.

Note: Computing metrics requires the gold-standard labels CSV. The labels are kept private to maintain the benchmark's integrity. Email your predictions.csv to hello@podonos.com for scoring.

Models Evaluated

Commercial APIs

Vendor	Product / Model	Docs / Product page
Resemble AI	DETECT-3B Omni	https://docs.resemble.ai/detect
Whispeak	Voice Biometric Authentication (anti-spoofing)	https://whispeak.io/voice-authentication/
Aurigin AI	Apollo deepfake detection	https://docs.aurigin.ai
Pindrop	Pindrop Pulse	https://www.pindrop.com/product/pindrop-pulse/
Corsound AI	Deepfake Detect	https://apis.corsound.ai/
Hive	AI-generated audio detection	https://docs.thehive.ai/docs/ai-generated-audio-detection
Reality Defender	RealAPI	https://docs.realitydefender.com

Open-source baselines

Two generations are included: legacy models trained on ASVspoof 2019 LA, and modern models trained on the newer ASVspoof 5 / VoxCelebSpoof corpora. Neither generation generalizes to the modern commercial TTS in this benchmark.

Model	Source	Architecture	Training data
Wav2Vec2 (2019 LA)	Gustking/wav2vec2-large-xlsr-deepfake-audio-classification	SSL XLSR + fine-tuned classifier	ASVspoof 2019 LA
AASIST (2019 LA)	clovaai/aasist	Graph attention on raw waveform	ASVspoof 2019 LA
RawNet2 (2019 LA)	MattyB95/pre_trained_DF_RawNet2	End-to-end CNN on raw waveform	ASVspoof 2019 LA
LCNN-LFCC (2019 LA)	MattyB95/pre_trained_DF_LFCC-LCNN	Lightweight CNN, LFCC frontend	ASVspoof 2019 LA
AASIST3 (ASVspoof 5)	lab260/AASIST3	Graph attention on raw waveform	ASVspoof 5
AST (ASVspoof 5)	MattyB95/AST-ASVspoof5-Synthetic-Voice-Detection	Audio Spectrogram Transformer	ASVspoof 5
AST (VoxCelebSpoof)	MattyB95/AST-VoxCelebSpoof-Synthetic-Voice-Detection	Audio Spectrogram Transformer	VoxCelebSpoof
Wav2Vec2 (2024 mix)	garystafford/wav2vec2-deepfake-voice-detector	Wav2Vec2 + fine-tuned classifier	2024 real/fake mix
Deepfake-V2 (W2V2-base)	MelodyMachine/Deepfake-audio-detection-V2	Wav2Vec2-base audio classifier	mixed real/fake

These checkpoints are standard academic references. Their near-random accuracy on this benchmark reflects the generalization gap between their training attacks (ASVspoof / VoxCelebSpoof) and the modern commercial voice-cloning systems represented here.

Metrics

For each model we report:

Metric	Definition
Accuracy	(TP + TN) / Total
F1 score	2 · Precision · Recall / (Precision + Recall), with fake as the positive class
FPR (False Positive Rate)	FP / (FP + TN) — real flagged as fake
FNR (False Negative Rate)	FN / (FN + TP) — fake missed
Latency	Mean round-trip time per file (ms)
Real-time factor (RTF)	Latency / audio duration
Per-format performance	Same metrics broken down by `.mp3`, `.wav`, `.flac`, `.ogg`, `.m4a`, `.webm`
Rejection ratio	(NOT_APPLICABLE + errors) / attempted calls

We deliberately do not report Equal Error Rate (EER), since EER assumes an oracle threshold that cannot be set in production.
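
The definitions in the table translate directly into code. The sketch below is not scripts/compute_metrics.py; it assumes fake is the positive class, that a rejected file is represented as `None`, and that accuracy is computed over evaluated (non-rejected) files, which matches the N column on the leaderboard.

```python
from typing import Optional, Sequence


def detection_metrics(gold: Sequence[str],
                      pred: Sequence[Optional[str]]) -> dict:
    """Accuracy, F1, FPR, FNR and rejection ratio; fake is the positive class."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp = tn = fp = fn = rejected = 0
    for g, p in zip(gold, pred):
        if p is None:                 # rejected by the detector
            rejected += 1
        elif g == "fake" and p == "fake":
            tp += 1
        elif g == "fake":
            fn += 1
        elif p == "real":
            tn += 1
        else:
            fp += 1

    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f1_den = precision + recall
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "f1": 2 * precision * recall / f1_den if f1_den else 0.0,
        "fpr": ratio(fp, fp + tn),
        "fnr": ratio(fn, fn + tp),
        "rejection": ratio(rejected, len(gold)),
    }


# A detector that flags everything as fake on a balanced toy set.
demo = detection_metrics(["real"] * 4 + ["fake"] * 4, ["fake"] * 8)
print(demo)
```

The toy case gives accuracy 0.5, F1 of about 0.667, FPR 1.0 and FNR 0.0, the same pattern as the LCNN-LFCC row on the leaderboard.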

Submission Format

Produce a CSV file predictions.csv with three columns: filename, label, and latency_ms (per-file inference time in milliseconds):

filename,label,latency_ms
0.flac,real,247.09
1.webm,fake,493.58
2.mp3,real,289.01
...

Labels must be exactly real or fake (lowercase). latency_ms lets us report the Lat(ms) and RTF columns on the leaderboard; if you cannot measure it, leave the column blank.
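
A minimal writer that enforces these rules is sketched below. The function name and the input tuple layout are illustrative assumptions; only the column names and label values come from the format above.

```python
import csv
import io

VALID_LABELS = {"real", "fake"}


def write_predictions(rows, out) -> int:
    """Write (filename, label, latency_ms or None) tuples to a text stream."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["filename", "label", "latency_ms"])
    count = 0
    for filename, label, latency_ms in rows:
        if label not in VALID_LABELS:
            raise ValueError(
                f"{filename}: label must be 'real' or 'fake', got {label!r}"
            )
        # An unmeasured latency is written as a blank field.
        latency = "" if latency_ms is None else f"{latency_ms:.2f}"
        writer.writerow([filename, label, latency])
        count += 1
    return count


buf = io.StringIO()
write_predictions([("0.flac", "real", 247.09), ("1.webm", "fake", None)], buf)
print(buf.getvalue())
```

To produce a file on disk, pass `open("predictions.csv", "w", newline="")` in place of the `StringIO` buffer.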

Submit one CSV per dataset:

Studio track: run over dataset/; filenames keep their original extension (e.g. 0.flac).
Narrowband telephony track (8 kHz): run over dataset_8k_nb/; filenames are 0.wav, 1.wav, ...
Wideband telephony track (16 kHz): run over dataset_16k_wb/; filenames are 0.wav, 1.wav, ...

The three datasets reuse the same numeric filenames (0, 1, ...), but each is independently shuffled, so the same number refers to a different clip in each track and a prediction file for one track will not score on another.

Email your predictions.csv to hello@podonos.com for scoring against the private gold standard.

Related Work

License

This benchmark code is released under the MIT License (see LICENSE).

The audio dataset is provided for research and benchmarking purposes only. Source corpora retain their respective licenses (VCTK: ODC-BY; LJSPEECH: public domain; LibriTTS: CC BY 4.0). Synthetic samples are generated under the terms of each TTS vendor's API ToS.
