Velma API

A drop-in understanding layer for your voice stack.

Transcription discards the voice signals that carry the true intent and meaning of a conversation. Velma is the only voice-native AI model that uses audio signals to understand conversations in all of their depth.
The Velma API turns any voice conversation into a structured representation that you define: tell Velma what to surface, and it does. What you build on top is limited only by your imagination.

DEFAULT ENDPOINTS

What Velma returns by default

Emotion detection and analysis

Velma detects emotions to better understand meaning in a conversation.

Speaker identification & diarization

Industry-leading accuracy at knowing who said what — even with overlapping speakers and noisy audio.

Topic categorization

Automatically classifies what each conversation is about — billing, retention, support, sales, complaints, and more.

Conversation summary

A clean, concise summary of every audio file, with higher accuracy than a transcription-plus-LLM pipeline.

150+ key behaviors & events

Fraud, churn, compliance violations, harassment, escalation, and dozens more — detected the moment they happen.

Accent detection

Recognizes regional and global accents so analysis stays accurate across diverse populations.

Deepfake detection

Ranked #1 on Hugging Face leaderboards. Detects synthetic and cloned voices in real time with 98.9% accuracy.

Speaker dynamics

Detects who's leading the conversation, who's interrupting, turn-taking patterns, and dominance — the social shape of every call.

Want more than the defaults?

Define your own custom behaviors: describe what to find, and Velma surfaces it.

CUSTOM BEHAVIORS

Detect anything using natural language

Describe what matters to you.
Velma uses every audio signal — not just words — to surface it accurately.
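
As a rough illustration of describing behaviors in natural language, the sketch below packages plain-language descriptions into a JSON payload. The `CustomBehavior` class, the `custom_behaviors` field, and the example behaviors are all hypothetical names chosen for illustration; they are not taken from Velma's documented API.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CustomBehavior:
    """A behavior described in plain language (hypothetical schema)."""
    name: str
    description: str


def behaviors_payload(behaviors: list[CustomBehavior]) -> str:
    """Serialize behaviors to JSON, rejecting empty or duplicate entries."""
    seen = set()
    for b in behaviors:
        if not b.description.strip():
            raise ValueError(f"behavior {b.name!r} needs a description")
        if b.name in seen:
            raise ValueError(f"duplicate behavior name {b.name!r}")
        seen.add(b.name)
    # "custom_behaviors" is an assumed field name, not a documented one.
    return json.dumps({"custom_behaviors": [asdict(b) for b in behaviors]})


print(behaviors_payload([
    CustomBehavior("refund_threat", "Caller threatens to cancel unless refunded"),
    CustomBehavior("sarcastic_thanks", "Caller thanks the agent in a sarcastic tone"),
]))
```

Validating descriptions on the client side keeps an empty or duplicated definition from silently producing no results.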

Velma vs. competitors

Audio-native capabilities from a better architecture.

Most "voice AI" stacks are transcription pipelines with an LLM bolted on top. Velma is voice-native, built from the audio signal up.

Build with Velma

Build on top of audio understanding, not transcription

Smarter voice agents

AI agents that read voice signals, not just words, to give better responses.

AI voice guardrails

Monitor what your LLM-powered voice agent is saying — and how callers are reacting to it.

Emotion-driven apps

Personalize every interaction in real time — route, respond, and adapt based on how the caller actually feels.

Conversation analytics

Replace your STT/ASR layer with better conversational insights.

Live coaching tools

Real-time agent assist that surfaces what to say next, based on how the call is going.

Anything you can imagine

Ask Velma to find anything in a conversation, and it does. The only limit on what you build is what you can describe.
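
To make the emotion-driven routing idea above concrete, here is a minimal sketch of choosing a call queue from detected emotions. The emotion labels, the `score` field, the queue names, and the 0.7 threshold are assumptions for illustration only; the actual shape of Velma's output may differ.

```python
def route_call(emotions: list[dict], threshold: float = 0.7) -> str:
    """Pick a queue from detected emotions (labels and scores are assumed)."""
    scores = {e["label"]: e["score"] for e in emotions}
    # Check the most urgent condition first so it wins when several apply.
    if scores.get("angry", 0.0) >= threshold:
        return "escalation"
    if scores.get("confused", 0.0) >= threshold:
        return "guided_support"
    return "standard"


print(route_call([{"label": "angry", "score": 0.82}]))     # escalation
print(route_call([{"label": "confused", "score": 0.75}]))  # guided_support
print(route_call([{"label": "angry", "score": 0.40}]))     # standard
```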

Where Velma fits

A drop-in layer for your voice stack

Get started in minutes

Understand a conversation in a few lines of code
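
A minimal sketch of what those few lines might look like, using only the Python standard library. The endpoint URL, the `audio_url` request field, and the response fields (`topic`, `summary`, `emotions`) are placeholders assumed for illustration; consult the official API reference for the real names.

```python
import json
import urllib.request

# Hypothetical endpoint: a placeholder, not Velma's documented URL.
API_URL = "https://api.example.com/v1/conversations"


def build_request(api_key: str, audio_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST asking for analysis of one recording."""
    body = json.dumps({"audio_url": audio_url}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def summarize(result: dict) -> str:
    """Pull a few of the default outputs out of an assumed response shape."""
    emotions = ", ".join(e["label"] for e in result.get("emotions", []))
    return f"{result.get('topic', 'unknown')}: {result.get('summary', '')} [{emotions}]"


# Sending would be urllib.request.urlopen(build_request(key, url)) followed by
# json.load() on the response; a hand-written sample stands in for it here.
sample = {
    "topic": "billing",
    "summary": "Caller disputes a duplicate charge.",
    "emotions": [{"label": "frustrated"}, {"label": "relieved"}],
}
print(summarize(sample))
```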

Ready to get started?

Start building with Velma.

Grab an API key or try the playground to see Velma understand a real conversation.

More from Modulate

Explore Modulate's other leading voice models

Audio-native APIs built for real-time performance — designed to drop right into your stack.

Velma Transcribe

Real-time and batch transcription with speaker diarization. Lowest cost, lowest error rate.

Deepfake Detection

Synthetic voice detection, batch and streaming. #1 on Hugging Face leaderboards.

PII/PHI Redaction

Auto-redact sensitive content from both transcripts and audio. Compliance-ready.

Music Detection

Detect music vs. speech in any audio stream. Real-time and batch.