Modulate’s Model Benchmarks

Meet Velma

#1 in accuracy. #1 in cost-effectiveness. Unlimited insights.

Most “voice AI” stacks treat audio as nothing more than words: they transcribe what is said and hand the text to a language model.

Velma is different. It’s the first production Ensemble Listening Model (ELM): a voice-native model that understands conversations the way humans do, with full awareness of how something is said, not just what is said.

Velma is available as the engine behind Modulate's enterprise platform and as direct APIs for transcription and deepfake detection, with full voice analytics coming soon.

See how Velma outperforms LLMs.

Conversation Understanding Benchmark — Accuracy vs. Cost
Tests models' ability to recognize key conversational behaviors, including aggression, policy violations, complaints, deception, and more.
[Chart: accuracy score (axis ticks 0 to 5) plotted against inference cost (axis ticks $0.01 to $1.50), with a region labeled "Highest accuracy, lowest cost".]
Models plotted: velma-2-fast, velma-2, gpt-4o-mini, grok-4.1-fast-non-reasoning, grok-4.1-fast-reasoning, gemini-2-flash-lite, deepseek-v3.1, gemini-2-flash, deepseek-v3.2, gemini-3-flash-min, deepseek-r1, gemini-3-flash-med, gpt-4o, gemini-2.5-pro, gemini-3-pro, grok-3, nova-3-intelligence, scribe-v2, grok-4-heavy, gpt-5-mini, gpt-4-turbo, gpt-5.2-pro, gpt-5.2.

Compare Velma Transcribe to the competition

Transcription Benchmark (Accuracy vs. Price)
Average Word Error Rate (WER) across Earnings-22 and VoxPopuli datasets
[Chart: average Word Error Rate (axis ticks 8% to 12%) plotted against cost per 1000 minutes of audio (axis ticks $0 to $9), with a region labeled "Lowest WER, lowest cost".]
Models plotted: modulate-velma-2, elevenlabs-scribe-v2, google-gemini-2.5-pro, assemblyai-universal, speechmatics-enhanced, gladia-solaria-1, openai-gpt-4o-transcribe, google-chirp-2, speechmatics-standard, openai-whisper-large-v3, deepgram-nova-3.
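For readers unfamiliar with the metric, Word Error Rate is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the model's output, divided by the number of reference words. The sketch below illustrates that standard definition only. The function names, the lowercase-and-split normalization, and the unweighted mean across datasets are assumptions made for illustration; the page does not describe Modulate's actual evaluation pipeline.

```python
from typing import Dict


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def average_wer(per_dataset_wer: Dict[str, float]) -> float:
    """Unweighted mean of per-dataset WERs (an assumed averaging scheme)."""
    return sum(per_dataset_wer.values()) / len(per_dataset_wer)
```

A call such as `word_error_rate("the cat sat on the mat", "the cat sat on mat")` counts one deleted word out of six reference words.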
Transcription API Cost Comparison among STT Leaders
Provider | Model | Price
Modulate | modulate-velma-2 | $0.03 / hr
AssemblyAI | universal | $0.15 / hr
Deepgram | nova-2 | $0.26 / hr
ElevenLabs | scribe-v2 | $0.40 / hr
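The benchmark chart quotes cost per 1000 minutes of audio, while the comparison above quotes price per hour. As a rough aid for putting the two on the same scale, the following sketch converts the hourly list prices shown above. The function and variable names are illustrative, and the conversion assumes simple linear pricing with no volume tiers or minimums.

```python
def cost_per_1000_minutes(price_per_hour: float) -> float:
    """Convert an hourly price to the cost of 1000 minutes of audio."""
    # 1000 minutes is 1000 / 60 hours; assumes linear pricing.
    return price_per_hour * 1000 / 60


# Hourly list prices as quoted in the cost comparison on this page.
HOURLY_PRICES = {
    "Modulate": 0.03,
    "AssemblyAI": 0.15,
    "Deepgram": 0.26,
    "ElevenLabs": 0.40,
}

for provider, price in HOURLY_PRICES.items():
    print(f"{provider}: ${cost_per_1000_minutes(price):.2f} per 1000 minutes")
```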

Hugging Face’s Deepfake Speech Leaderboard

Modulate is the top-ranked deepfake detection model on Hugging Face's Speech Deepfake Arena, the leading independent benchmark.

Compare Velma Deepfake Detect to the competition

Modulate is #1 on 🤗 Hugging Face

Modulate is the top-ranked deepfake detection model on Hugging Face's Speech Arena Leaderboard, the leading independent benchmark. With an Equal Error Rate of just 1.1%, Modulate catches 133% more deepfakes than the next best.
System | Date Added | Num Params (M) | Pooled EER | Average EER ↓
🥇 Modulate-VELMA-2-Syntheti | 11/03/2026 | 316.000 | 1.586 | 1.104
🥈 Resemble-Detect-3B-Omni | 14/10/2025 | 3000.000 | 2.099 | 2.570
🥉 Hiya-Authenticity-Verific | 13/02/2026 | 1000.000 | 2.324 | 2.113
DLMSL-SpeakSure-v0.1 | 27/10/2025 | 658.630 | 6.142 | 3.954
Whispeak | 20/08/2025 | 98.900 | 8.060 | 3.049
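EER (Equal Error Rate) is the foundational performance metric used to evaluate how accurately a model can distinguish between genuine human speech and AI-generated audio. It is the error rate at the decision threshold where the false-acceptance rate and the false-rejection rate are equal; lower is better.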
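To make the metric concrete, here is a minimal sketch of how an EER can be estimated from detector scores. It assumes each sample gets a score where higher means "more likely fake", sweeps every observed score as a candidate threshold, and reports the error rate at the threshold where the false-positive rate and the miss rate are closest. The function name, the score convention, and the brute-force sweep are illustrative assumptions, not the leaderboard's or Modulate's implementation.

```python
from typing import Sequence


def equal_error_rate(genuine_scores: Sequence[float],
                     fake_scores: Sequence[float]) -> float:
    """Estimate EER, assuming higher scores mean 'more likely fake'."""
    if not genuine_scores or not fake_scores:
        raise ValueError("both score lists must be non-empty")

    # Candidate thresholds: every observed score, plus one above the maximum
    # so that "flag nothing" is also considered.
    thresholds = sorted(set(genuine_scores) | set(fake_scores))
    thresholds.append(thresholds[-1] + 1.0)

    best_gap = None
    best_eer = None
    for threshold in thresholds:
        # A sample is flagged as fake when its score is >= threshold.
        false_positive_rate = (
            sum(s >= threshold for s in genuine_scores) / len(genuine_scores)
        )
        miss_rate = sum(s < threshold for s in fake_scores) / len(fake_scores)
        gap = abs(false_positive_rate - miss_rate)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            # With finite data the two rates rarely match exactly,
            # so report their midpoint at the closest threshold.
            best_eer = (false_positive_rate + miss_rate) / 2
    return best_eer
```

The sweep is quadratic in the number of samples, which is fine for a small illustration; a production evaluation would typically sort the scores once and interpolate between adjacent thresholds.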

Modulate Catches 99% of all Deepfakes

Catch 2x more deepfakes and flag 48% fewer false positives vs. the next-best model. Source: 🤗 Hugging Face Leaderboard.
Provider | Model | Accuracy
Modulate | velma-deepfake-detect | 98.9%
Hiya | authenticity-verific | 97.9%
Resemble AI | resemble-detect-3b | 97.4%
Whispeak | whispeak | 96.9%
Deep Learning | dlmsl-speaksure-v0.1 | 96.0%
DF Arena | df-arena-500m-v1 | 94.2%
DF Arena | df-arena-1b-v1 | 94.1%
Syntra | syntra-detector | 93.9%
Momenta | momenta | 92.9%

Detect Deepfakes for just $0.25 / hr

Fraud protection at scale, at a price that levels the playing field vs. scammers.
Provider | Price
Modulate Deepfake-Detect | $0.25 / hr
Resemble AI Enterprise | $29 / hr
Other Providers | $30 to $120 / hr
Resemble AI Self-Serve | $144 / hr