Transcription discards signals like emotion, tone and other audio cues that carry what a conversation actually means. Velma is a voice-native model that listens to the audio itself.
Velma turns voice conversations into signals and behaviors you can act on — out of the box, no LLM needed. The future of voice AI is built with Velma.
Describe in plain English. Velma uses audio + text together for higher accuracy.
Possible via prompt engineering. Accuracy limited to what words alone can reveal.
Out-of-the-box behaviors
50 by default, 100 more as templates — fraud, churn, compliance & escalation
None. Each requires prompt engineering + ongoing maintenance.
Speaker diarization
Industry-leading, handles overlap and noise
Varies; overlap is a common failure
Integration complexity
Drop-in. Send audio, receive structured JSON. A few lines of code.
Manage STT + LLM separately, plus custom logic to enrich context.
Cost
Starting at $0.75/hr
$2.50–$10/hr
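To put the per-hour figures in context, here is a rough cost comparison. It is a sketch only: the 1,000-hour monthly workload and the function name are hypothetical, the $0.75/hr figure is a "starting at" price, and the $2.50 to $10/hr range is taken to describe a separately managed STT + LLM pipeline. Actual pricing may vary with volume or plan.

```python
def monthly_cost(hours: float, rate_per_hour: float) -> float:
    """Return the dollar cost of processing `hours` of audio at a flat hourly rate."""
    return hours * rate_per_hour


HOURS = 1_000  # hypothetical monthly workload, not a figure from this page

velma_cost = monthly_cost(HOURS, 0.75)      # "starting at" rate quoted above
pipeline_low = monthly_cost(HOURS, 2.50)    # low end of the quoted range
pipeline_high = monthly_cost(HOURS, 10.00)  # high end of the quoted range

print(f"Velma (starting rate): ${velma_cost:,.2f}")
print(f"STT + LLM pipeline:    ${pipeline_low:,.2f} to ${pipeline_high:,.2f}")
```

At this assumed volume, the quoted rates work out to $750 per month versus $2,500 to $10,000 per month.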
Build with Velma
Build on top of audio understanding, not transcription
Smarter voice agents
AI agents that understand voice signals for better responses.
AI voice guardrails
Monitor what your LLM-powered voice agent is saying — and how callers are reacting to it.
Emotion-driven apps
Personalize every interaction in real time — route, respond, and adapt based on how the caller actually feels.
Conversation analytics
Replace your STT/ASR layer with better conversational insights.
Live coaching tools
Real-time agent assist that surfaces what to say next, based on how the call is going.
Anything you can imagine
Ask Velma to find anything in a conversation, and it does. The only limit on what you build is what you can describe.
Where Velma fits
A drop-in layer for your voice stack
Audio in
Telephony / SIP
Voice agents
Recordings
Browser / Mobile
Understanding layer
Velma API
REST + WebSocket
Your application
Real-time alerts
Agent assist
Dashboards
Data warehouse
Drop Velma into any voice pipeline. The underlying model handles the rest.
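As an illustration of the "your application" side of this layout, the sketch below routes one structured result to the four destinations listed above. The JSON shape (`call_id`, `live`, `behaviors` with `name` and `confidence`), the behavior names, and the threshold are all assumptions made for the example; they are not Velma's documented schema.

```python
import json

ALERT_BEHAVIORS = {"fraud", "escalation"}  # hypothetical behavior names
ALERT_THRESHOLD = 0.8                      # hypothetical confidence cutoff


def route(result: dict) -> list[str]:
    """Decide which downstream destinations should receive one result."""
    # In this sketch every result is stored and charted.
    destinations = ["data_warehouse", "dashboards"]
    # Raise an alert if any watched behavior is detected confidently enough.
    for behavior in result.get("behaviors", []):
        if (behavior["name"] in ALERT_BEHAVIORS
                and behavior["confidence"] >= ALERT_THRESHOLD):
            destinations.append("real_time_alerts")
            break
    # Agent assist only makes sense while the call is still in progress.
    if result.get("live", False):
        destinations.append("agent_assist")
    return destinations


# Made-up sample payload, used only to exercise the routing logic.
sample = json.loads("""
{"call_id": "demo-001", "live": true,
 "behaviors": [{"name": "escalation", "confidence": 0.91},
               {"name": "churn", "confidence": 0.40}]}
""")
print(route(sample))
# → ['data_warehouse', 'dashboards', 'real_time_alerts', 'agent_assist']
```

Keeping the routing rules in one small function means the alerting threshold and the list of watched behaviors can change without touching the code that receives results.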
Velma is the #1 model for Conversation Understanding
Conversation Understanding Benchmark: Accuracy vs. Cost
Evaluates a model's ability to identify conversation types, topics, speaker roles and key behaviors. Methodology ↗
[Chart: accuracy score plotted against inference cost, with the region of highest accuracy and lowest cost marked. Models shown: velma-2-fast, velma-2, grok-4.1-fast-non-reasoning, grok-4.1-fast-reasoning, gemini-2-flash-lite, deepseek-v3.1, gemini-2-flash, deepseek-v3.2, gemini-3-flash-min, deepseek-r1, gemini-3-flash-med, gemini-2.5-pro, gemini-3-pro, grok-3, nova-3-intelligence, scribe-v2, grok-4-heavy, gpt-5-mini, gpt-5.2-pro, gpt-5.2.]
Get started in minutes
Drop-in by design — three steps, one API
1. Send audio
Point Velma at a file or a live stream, or connect the platform you already use (Five9, Genesys, Teams, Twilio, SIP). One endpoint, no pipeline to assemble.
2. Velma analyzes
A single voice-native model does all the work, with no separate transcription, LLM, or enrichment services to wire together and keep in sync.
3. Output, where and how you like it
Structured JSON that you can stream live, load into your warehouse, or use to trigger alerts. You decide where it goes.
It really is this short — streaming, start to finish:
# 1 · open a connection
# 2 · stream audio
# 3 · read results
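The original code sample did not survive on this page, so what follows is only a sketch of the three-step shape (open a connection, stream audio, read results). It uses an in-memory `FakeConnection` stand-in rather than Velma's real client; the class, its methods, the chunk size, the assumed audio format, and the result fields are all hypothetical and chosen so the example runs without a network.

```python
import json
from typing import Iterator

CHUNK_BYTES = 3200  # 100 ms of 16 kHz, 16-bit mono PCM (assumed audio format)


class FakeConnection:
    """In-memory stand-in for a streaming connection (hypothetical, no network)."""

    def __init__(self) -> None:
        self.bytes_received = 0
        self.closed = False

    def send(self, chunk: bytes) -> None:
        self.bytes_received += len(chunk)

    def results(self) -> Iterator[str]:
        # A real service would emit results as audio arrives; this stub emits
        # one made-up JSON message summarising what it received.
        yield json.dumps({"bytes": self.bytes_received, "behaviors": []})

    def close(self) -> None:
        self.closed = True


def chunks(audio: bytes, size: int = CHUNK_BYTES) -> Iterator[bytes]:
    """Split raw audio into fixed-size chunks; the last one may be shorter."""
    for start in range(0, len(audio), size):
        yield audio[start:start + size]


def stream(audio: bytes, conn: FakeConnection) -> list[dict]:
    # 1 · the connection is opened by the caller and passed in
    for chunk in chunks(audio):  # 2 · stream audio
        conn.send(chunk)
    parsed = [json.loads(msg) for msg in conn.results()]  # 3 · read results
    conn.close()
    return parsed


one_second_of_silence = bytes(32000)  # 1 s at the assumed format
print(stream(one_second_of_silence, FakeConnection()))
# → [{'bytes': 32000, 'behaviors': []}]
```

In a real integration the stand-in would be replaced by the vendor's WebSocket client, and results would typically be read concurrently while audio is still being sent rather than after the last chunk.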