Modulate
Modulate · Transcription

Speech-to-Text API for Real-World Audio.

Most transcription models are trained on clean, structured recordings. Modulate's model is trained on 500M+ hours of real conversations: noisy calls, crosstalk, accents, and all.

"This transcription model might be the best stuff I've seen. I've never seen another model with realtime diarization (that works?!)"
Nick Leonard
CEO, VoiceRun
Free tier · no card needed
Get immediate API access
400 hours free, then $0.03/hr. Start transcribing in minutes.
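The pricing line above implies a simple formula for the bill once the free hours are used up. Here is a minimal sketch, assuming the 400 free hours are applied once, before any paid usage, and that every further hour is billed at $0.03/hr. `billable_usd`, `FREE_HOURS`, and `RATE_PER_HOUR` are hypothetical names, not part of any Modulate SDK.

```bash
# Hypothetical estimate of the charge after the free allowance.
# Assumption: the 400 free hours are used first, then every extra hour is
# billed at $0.03/hr. Check the docs for the actual billing rules.
FREE_HOURS=400
RATE_PER_HOUR=0.03

billable_usd() {
  # $1 = total audio hours processed; prints USD owed with 2 decimals
  awk -v h="$1" -v free="$FREE_HOURS" -v rate="$RATE_PER_HOUR" 'BEGIN {
    over = h - free            # hours beyond the free allowance
    if (over < 0) over = 0     # usage inside the free tier costs nothing
    printf "%.2f\n", over * rate
  }'
}
```

Under these assumptions, `billable_usd 1000` prints `18.00` (600 paid hours at $0.03), and `billable_usd 350` prints `0.00`.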

Everything you need. One API.

Batch and real-time streaming share the same API key. Capabilities that usually require separate models are built in: no extra calls, no extra cost.

Real-time diarization — speaker-separated transcripts out of the box
70+ languages and dialects supported
Emotion and accent detection — 20+ emotions, 20+ accents, direct from audio
PII/PHI redaction included — no add-on or post-processing step
REST batch and WebSocket streaming — same key, same endpoint pattern
```bash
curl -X POST \
  https://api.modulate.ai/v1/transcribe \
  -H "Authorization: Bearer $API_KEY" \
  -F "audio=@call.wav" \
  -F "model=modulate-transcribe" \
  -F "diarize=true" \
  -F "emotion=true"
```

Response: 200 OK · JSON · 312 ms

```json
{
  "transcript": "Thanks for calling, how can I help—",
  "speakers": 2,
  "emotion": "neutral",
  "accent": "southern_us",
  "duration_ms": 4210,
  "cost_usd": 0.000035
}
```
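The `cost_usd` value in the sample response is consistent with straight per-duration billing at $0.03/hr: 4,210 ms is about 0.00117 hours, and 0.00117 × $0.03 ≈ $0.000035. The sketch below reproduces that arithmetic. `cost_for_ms` is a hypothetical helper, and linear per-millisecond billing is an assumption inferred from the sample, not a documented rule.

```bash
# Hypothetical helper: estimated charge for one clip.
# Assumption: billing is linear in duration (inferred from the sample response).
cost_for_ms() {
  # $1 = duration in milliseconds, $2 = rate in USD per audio hour
  # 1 hour = 3,600,000 ms; print 6 decimals to match the sample's cost_usd.
  awk -v ms="$1" -v rate="$2" 'BEGIN { printf "%.6f\n", ms / 3600000 * rate }'
}
```

Running `cost_for_ms 4210 0.03` prints `0.000035`, matching the response above.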

Flexible. Private. Scalable.

Designed to fit your stack today and grow with your usage tomorrow.

Flexibility
REST and WebSocket on one key. Batch and real-time streaming share the same API key and endpoint pattern.
Enrichments are per-request flags. Diarization, emotion, accent, and PII tagging are all optional; pay only for what you enable.
Speed-cost tradeoffs built in. English Fast costs $0.025/hr for high-throughput pipelines; Multilingual costs $0.03/hr for full feature support.
SDKs in Python and Node.js, with async and concurrent patterns documented for production workloads.
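To see what the speed-cost tradeoff means in dollars, multiply monthly audio hours by each tier's rate. This is a minimal sketch using the two rates quoted above. `tier_cost` and `compare_tiers` are hypothetical names, and the volume you pass in is purely illustrative.

```bash
# Rates quoted on this page, in USD per audio hour.
ENGLISH_FAST_RATE=0.025   # English Fast, high-throughput pipelines
MULTILINGUAL_RATE=0.03    # Multilingual, full feature support

tier_cost() {
  # $1 = audio hours, $2 = USD per hour; prints total USD with 2 decimals
  awk -v h="$1" -v r="$2" 'BEGIN { printf "%.2f\n", h * r }'
}

compare_tiers() {
  # $1 = audio hours per month; prints one line per tier
  printf 'English Fast: $%s\n' "$(tier_cost "$1" "$ENGLISH_FAST_RATE")"
  printf 'Multilingual: $%s\n' "$(tier_cost "$1" "$MULTILINGUAL_RATE")"
}
```

For example, `compare_tiers 10000` prints `English Fast: $250.00` and `Multilingual: $300.00`, a difference of $50 per 10,000 hours.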
Privacy
Audio deleted after 35 days. Deletion is permanent and automatic, with no manual cleanup required.
No data sold. Ever. Audio processed via the API is used solely to deliver the service.
Enterprise opt-out from model training. Customers with an annual commitment can opt out entirely.
DPA available, covering GDPR, UK GDPR, and Standard Contractual Clauses for EU/UK data transfers.
Scale
Limits raisable on request. Concurrency and monthly caps can be increased for high-volume workloads.
Real-time usage dashboard. Monitor audio hours and concurrent connections against your limits at any time.
Built for batch pipelines. Semaphore and connection-pool patterns are documented for processing large file backlogs concurrently.
100 MB per batch request, with automatic silence trimming and format normalization.
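For a large backlog, the two practical concerns are staying under the per-request size limit and bounding how many uploads run at once. The page says semaphore and connection-pool patterns are documented; this shell sketch swaps in `xargs -P` as the concurrency bound instead. The endpoint and form fields are copied from the curl example above. Several details are assumptions: `within_batch_limit`, `transcribe_one`, `transcribe_dir`, and `MAX_JOBS` are hypothetical names, "100 MB" is read as 100 × 1024 × 1024 bytes, and the concurrency of 4 is arbitrary, not a documented limit.

```bash
# Hypothetical batch helper, not part of any Modulate SDK.
# Assumption: "100 MB" means 100 * 1024 * 1024 bytes; confirm in the docs.
MAX_BYTES=$((100 * 1024 * 1024))
MAX_JOBS=${MAX_JOBS:-4}   # illustrative concurrency bound, not a documented limit

# Succeed only if $1 is a regular file small enough for one batch request.
within_batch_limit() {
  [ -f "$1" ] || return 1
  local size
  size=$(wc -c < "$1")
  [ "$size" -le "$MAX_BYTES" ]
}

# Upload one file with the endpoint and form fields from the curl example;
# the JSON response is saved next to the input as <file>.json.
transcribe_one() {
  curl -sS --fail -X POST https://api.modulate.ai/v1/transcribe \
    -H "Authorization: Bearer $API_KEY" \
    -F "audio=@$1" \
    -F "model=modulate-transcribe" \
    -o "$1.json"
}

# Transcribe every .wav under directory $1 with at most MAX_JOBS uploads in
# flight. xargs -P is the concurrency limit, standing in for a semaphore.
transcribe_dir() {
  export -f within_batch_limit transcribe_one
  export MAX_BYTES API_KEY
  find "$1" -type f -name '*.wav' -print0 |
    xargs -0 -n 1 -P "$MAX_JOBS" bash -c '
      if within_batch_limit "$1"; then
        transcribe_one "$1" || echo "failed: $1" >&2
      else
        echo "skipped (over 100 MB): $1" >&2
      fi' _
}
```

With `API_KEY` set, `MAX_JOBS=8 transcribe_dir ./calls` would upload every WAV under `./calls` with up to eight requests in flight. Oversized files are reported on stderr instead of being sent.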

Where accuracy meets cost.

Word error rate (WER) vs. cost per hour across the Earnings-22 and VoxPopuli benchmarks. Lower-left is better.

[Figure: word error rate (%) vs. cost per hour ($) for ElevenLabs, Speechmatics, AssemblyAI, xAI, Deepgram, OpenAI, Google Chirp, and Modulate; the lower-left corner is the best zone. Modulate: $0.03/hr, 9.35% WER.]

Transparent. On-demand. No lock-in.

No contracts, no volume minimums. Pay only for what you process.

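Batch · REST API (USD per hour of audio processed)

| Provider | Model | USD / hour |
|---|---|---|
| Modulate (lowest cost) | modulate-transcribe | $0.03 |
| xAI | grok-stt | $0.10 |
| AssemblyAI | universal-3 Pro | $0.21 |
| ElevenLabs | scribe v2 | $0.22 |
| Deepgram | nova-3 | $0.31 |
| OpenAI | gpt-4o-transcribe | $0.36 |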
Streaming · WebSocket (USD per hour of live audio)

| Provider | Model | USD / hour |
|---|---|---|
| Modulate (lowest cost) | modulate-transcribe | $0.06 |
| xAI | grok | $0.20 |
| Speechmatics | enhanced | $0.24 |
| Deepgram | nova-3 | $0.35 |
| OpenAI | gpt-4o-transcribe | $0.36 |
| AssemblyAI | universal-3-pro | $0.45 |
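At these list prices, the gap grows linearly with volume. This minimal sketch multiplies a monthly volume by each batch rate in the table. `batch_spend` is a hypothetical helper with prices copied from this page. List prices can change, so treat the output as an estimate.

```bash
# Estimated monthly spend at the batch list prices in the table.
# Prices (USD per audio hour) are copied from this page and may change.
batch_spend() {
  # $1 = audio hours per month; prints "<provider> $<total>" per provider
  awk -v h="$1" '{ printf "%-12s $%.2f\n", $1, $2 * h }' <<'EOF'
Modulate 0.03
xAI 0.10
AssemblyAI 0.21
ElevenLabs 0.22
Deepgram 0.31
OpenAI 0.36
EOF
}
```

Running `batch_spend 1000` prints one line per provider, from `Modulate $30.00` to `OpenAI $360.00`.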

400 free hours. No credit card required.

Start transcribing in under 5 minutes. Full docs included.

No commitment. No sales call. Scales to hundreds of hours.