Velma API
A drop-in understanding layer for your voice stack.
Transcription discards the voice signals that carry the true intent and meaning of a conversation. Velma is the only voice-native AI model that uses audio signals to understand conversations in their full depth.
The Velma API turns any voice conversation into a structured representation that you define: tell Velma what to surface, and it does. What you build on top is limited only by your imagination.
DEFAULT ENDPOINTS
What Velma returns by default
Emotions detected and analyzed
Velma detects emotions to better understand meaning in a conversation.
Speaker identification & diarization
Industry-leading accuracy at knowing who said what — even with overlapping speakers and noisy audio.
Topic categorization
Automatically classifies what each conversation is about — billing, retention, support, sales, complaints, and more.
Conversation summary
A clean, concise summary of every audio file — with higher accuracy than transcription + LLMs.
150+ key behaviors & events
Fraud, churn, compliance violations, harassment, escalation, and dozens more — detected the moment they happen.
Accent detection
Recognizes regional and global accents so analysis stays accurate across diverse populations.
Deepfake detection
#1 ranked on Hugging Face. Detects synthetic and cloned voices with 98.9% accuracy in real time.
Speaker dynamics
Detects who's leading the conversation, who's interrupting, turn-taking patterns, and dominance — the social shape of every call.
Want more than the defaults?
Define your own custom behaviors — describe what to find, and Velma surfaces it.
CUSTOM BEHAVIORS
Detect anything using natural language
Describe what matters to you.
Velma uses every audio signal — not just words — to surface it accurately.
Velma vs. competitors
Audio-native capabilities from a better architecture.
Most "voice AI" stacks are transcription pipelines with an LLM bolted on top. Velma is voice-native — built from the audio signal up.
Build with Velma
Build on top of audio understanding, not transcription
Smarter voice agents
AI agents that understand voice signals for better responses.
AI voice guardrails
Monitor what your LLM-powered voice agent is saying — and how callers are reacting to it.
Emotion-driven apps
Personalize every interaction in real time — route, respond, and adapt based on how the caller actually feels.
Conversation analytics
Replace your STT/ASR layer with better conversational insights.
Live coaching tools
Real-time agent assist that surfaces what to say next, based on how the call is going.
Anything you can imagine
Ask Velma to find anything in a conversation, and it does. The only limit on what you build is what you can describe.
Where Velma fits
A drop-in layer for your voice stack
Get started in minutes
Understand a conversation in a few lines of code
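As a rough illustration of the "describe what to find" idea, the sketch below assembles a request body that pairs an audio file with a custom behavior written in plain language. It makes no network call, and every name in it (`AnalysisRequest`, `CustomBehavior`, `audio_url`, `custom_behaviors`, the example URL) is a hypothetical placeholder rather than the actual Velma API, so consult the official API reference for the real endpoint and field names.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class CustomBehavior:
    # Hypothetical shape: a short name plus a plain-language description.
    name: str
    description: str


@dataclass
class AnalysisRequest:
    # Hypothetical fields; the real Velma API may name these differently.
    audio_url: str
    custom_behaviors: list[CustomBehavior] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses, so behaviors serialize too.
        return json.dumps(asdict(self), indent=2)


def build_request(audio_url: str, behaviors: dict[str, str]) -> AnalysisRequest:
    """Turn {name: description} pairs into a request object."""
    return AnalysisRequest(
        audio_url=audio_url,
        custom_behaviors=[CustomBehavior(n, d) for n, d in behaviors.items()],
    )


request = build_request(
    "https://example.com/calls/sample.wav",  # placeholder URL
    {"cancellation_intent": "The caller sounds like they are about to cancel."},
)
print(request.to_json())
```

Keeping each behavior as a name plus a free-text description mirrors the page's claim that custom behaviors are defined in natural language. How the request is actually sent and authenticated is left out because the source does not specify it.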
More from Modulate
Explore Modulate's other leading voice models
Audio-native APIs built for real-time performance — designed to drop right into your stack.
Ready to get started?
Start building with Velma.
Grab an API key or try the playground to see Velma understand a real conversation.