Modulate’s Model Benchmarks

Name: Modulate Velma Voice AI Benchmarks
Creator: Modulate

Compare audio-native Velma to LLMs

Conversation Understanding Benchmark — Accuracy vs. Cost

Evaluates a model's ability to identify conversation types, topics, speaker roles and key behaviors.

[Chart: accuracy score vs. inference cost. The best corner is highest accuracy, lowest cost. The inference cost axis runs from $0.01 to $1.50.]

Models plotted: velma-2-fast, velma-2, grok-4.1-fast-non-reasoning, grok-4.1-fast-reasoning, gemini-2-flash-lite, deepseek-v3.1, gemini-2-flash, deepseek-v3.2, gemini-3-flash-min, deepseek-r1, gemini-3-flash-med, gemini-2.5-pro, gemini-3-pro, grok-3, nova-3-intelligence, scribe-v2, grok-4-heavy, gpt-5-mini, gpt-5.2-pro, gpt-5.2

Compare Transcribe to the competition

Transcription Benchmark (Accuracy vs. Price)

Average Word Error Rate (WER) across Earnings-22 and VoxPopuli datasets
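For readers unfamiliar with the metric, here is a minimal sketch of how WER is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function names are hypothetical, and the plain whitespace tokenization and the unweighted mean across datasets are assumptions; the page does not say how text was normalized or how the two datasets were averaged.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()   # assumption: whitespace tokenization, no normalization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words (standard Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def average_wer(per_dataset_wer: dict) -> float:
    """Unweighted mean of per-dataset WERs (assumed averaging scheme)."""
    return sum(per_dataset_wer.values()) / len(per_dataset_wer)
```

A transcript with one substituted word and one dropped word against a six-word reference scores 2/6, or about 33% WER.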

[Chart: average word error rate vs. cost per hour. The best corner is lowest WER, lowest cost. The cost axis runs from $0.00 to $0.40; the only WER axis value that survives is 13%.]

Models plotted: modulate-transcribe, scribe-v2, assemblyai-universal-2, assemblyai-universal-3-pro, speechmatics-enhanced, google-gemini-2.5-pro, gpt-4o-transcribe, google-chirp-2, deepgram-nova-3, openai-whisper-large-v3

Speech-to-Text Transcription Pricing (Batch)

Provider	Model	Price
Modulate	(not listed)	$0.03 / hr
xAI	grok-stt	$0.10 / hr
AssemblyAI	universal-3-pro	$0.21 / hr
ElevenLabs	scribe-v2	$0.22 / hr
Speechmatics	enhanced	$0.24 / hr
Deepgram	nova-3	$0.31 / hr
OpenAI	gpt-4o-transcribe	$0.36 / hr

Speech-to-Text Transcription Pricing (Streaming)

Provider	Model	Price
Modulate	(not listed)	$0.06 / hr
xAI	grok	$0.20 / hr
Speechmatics	enhanced	$0.24 / hr
Deepgram	nova-3	$0.35 / hr
OpenAI	gpt-4o-transcribe	$0.36 / hr
ElevenLabs	scribe-v2	$0.39 / hr
AssemblyAI	universal-3-pro	$0.45 / hr

Hugging Face’s Deepfake Speech Leaderboard

Modulate is the top-ranked deepfake detection model on Hugging Face's Speech Deepfake Arena, the leading independent benchmark.

Compare Deepfake Detect to the competition

Modulate is #1 on 🤗 Hugging Face

Modulate is the top-ranked deepfake detection model on Hugging Face's Speech Arena Leaderboard, the leading independent benchmark. With an average Equal Error Rate of just 1.1%, Modulate catches 133% more deepfakes than the next best.

System	Date Added	Num Params (M)	Pooled EER	Average EER ↓
🥇Modulate-VELMA-2-Syntheti	11/03/2026	316.000	1.586	1.104
🥈Resemble-Detect-3B-Omni	14/10/2025	3000.000	2.099	2.570
🥉Hiya-Authenticity-Verific	13/02/2026	1000.000	2.324	2.113
DLMSL-SpeakSure-v0.1	27/10/2025	658.630	6.142	3.954
Whispeak	20/08/2025	98.900	8.060	3.049

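EER (Equal Error Rate) is the foundational performance metric used to evaluate how accurately a model distinguishes genuine human speech from AI-generated audio. It is the error rate at the decision threshold where the false acceptance rate equals the false rejection rate, so lower is better.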
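To make the metric concrete, the sketch below estimates EER from two lists of detector scores by sweeping candidate thresholds and picking the one where the false acceptance rate and false rejection rate are closest. It is an illustration only: the function name is hypothetical, the "higher score means more likely genuine" convention is an assumption, and the leaderboard's own evaluation code may interpolate or pool scores differently.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate EER as a fraction in [0, 1].

    Assumed convention: a higher score means "more likely genuine speech".
    """
    if not genuine_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")

    best_gap = float("inf")
    eer = None
    # Every observed score is a candidate decision threshold.
    for threshold in sorted(set(genuine_scores) | set(spoof_scores)):
        # False rejection: genuine audio scored below the threshold.
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        # False acceptance: spoofed audio scored at or above the threshold.
        far = sum(s >= threshold for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # At the crossing point FAR and FRR are (nearly) equal; report their mean.
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

With made-up scores such as genuine = [0.9, 0.8, 0.7, 0.3] and spoof = [0.1, 0.2, 0.4, 0.6], the two error rates meet at a threshold of 0.6, where one of four genuine clips is rejected and one of four spoofed clips is accepted, giving an EER of 25%.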

Modulate Catches 99% of all Deepfakes

Catch 2x more deepfakes and flag 48% fewer false positives vs. next-best. Source: 🤗 Hugging Face Leaderboard.

Provider	Model	Accuracy
Modulate	velma-deepfake-detect	98.9%
Hiya	authenticity-verific	97.9%
Resemble AI	resemble-detect-3b	97.4%
Whispeak	whispeak	96.9%
Deep Learning	dlmsl-speaksure-v0.1	96.0%
DF Arena	df-arena-500m-v1	94.2%
DF Arena	df-arena-1b-v1	94.1%
Syntra	syntra-detector	93.9%
Momenta	momenta	92.9%
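The accuracy figures above appear to line up with 100 minus the Average EER column of the leaderboard table, and the headline comparisons appear to be ratios of those EER values. The sketch below checks that reading using only numbers from this page. The interpretation is an assumption, not something the page states, and the function and variable names are hypothetical.

```python
# Average EER (%) copied from the leaderboard table on this page.
average_eer = {
    "Modulate": 1.104,
    "Hiya": 2.113,
    "Resemble": 2.570,
    "Whispeak": 3.049,
    "DLMSL": 3.954,
}


def implied_accuracy(eer_percent: float) -> float:
    """Assumed relationship: accuracy (%) = 100 - average EER (%)."""
    return 100.0 - eer_percent


def error_ratio(other_eer: float, own_eer: float) -> float:
    """How many times larger the other system's error rate is."""
    return other_eer / own_eer


for name, eer in average_eer.items():
    print(f"{name}: implied accuracy {implied_accuracy(eer):.1f}%")

# Hiya has the next-lowest Average EER; Resemble holds the second row of the table.
hiya_ratio = error_ratio(average_eer["Hiya"], average_eer["Modulate"])
resemble_ratio = error_ratio(average_eer["Resemble"], average_eer["Modulate"])
reduction_vs_hiya = 1 - average_eer["Modulate"] / average_eer["Hiya"]

print(f"Hiya error rate is {hiya_ratio:.2f}x Modulate's")
print(f"Resemble error rate is {resemble_ratio - 1:.0%} higher than Modulate's")
print(f"Modulate's error rate is {reduction_vs_hiya:.0%} lower than Hiya's")
```

Under this reading, the roughly 2x and 48% figures would compare against Hiya, while the 133% figure would compare against Resemble. Strictly, these are ratios of error rates rather than counts of deepfakes caught.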

Detect Deepfakes for just $0.25 / hr

Fraud protection at scale, at a price that levels the playing field vs. scammers.

Provider	Price
Modulate Deepfake-Detect	$0.25 / hr
Resemble AI Enterprise	$29 / hr
Other Providers	$30 to $120 / hr
Resemble AI Self-Serve	$144 / hr
