Velma API

Understand the true meaning of every conversation

Transcription discards signals like emotion, tone and other audio cues that carry what a conversation actually means. Velma is a voice-native model that listens to the audio itself.
Velma turns voice conversations into signals and behaviors you can act on — out of the box, no LLM needed. The future of voice AI is built with Velma.
MEET VELMA

Audio-native AI that identifies and escalates your risks

AUDIO IN
Live streams
Recordings
Third-party integrations
Velma
Understand the full conversation
Audio-native Ensemble Listening Model (ELM)
API OUTPUT

Behaviors detected

187 alerts today · 33 critical · 40 escalated
Just now

Executive Impersonation Scam · Fraud · Confidence 93%

Caller claiming to be CEO solicited payment credentials using urgent payment language.

Alerted Agent
4 min ago

Identity Verification Fraud · Fraud · Confidence 91%

DOB, address, and PIN mismatched across three verification prompts.

Manual Verification
11 min ago

Security Protocol Bypass · Compliance Violation · Confidence 95%

Agent skipped two verification steps and granted full account access to unverified caller.

Urgent Review
19 min ago

Unresolved Billing Dispute · Compliance · Confidence 98%

Customer referenced unresolved billing dispute from prior call; fix never applied to account.

Claims Review
22 min ago

Threat-Based Harassment · Agent Safety · Confidence 96%

Caller repeated physical threat toward support agent after refund denial; warning issued, threats continued.

Escalated
Conversation analysis
LIVE
Summary
Customer called about a refund on a recent charge and pushed for a better price; agent offered alternatives and a partial discount.
Speaker topics
Customer
Pricing questions
Competitor comparison
Sales Rep
Discount options
Alternative plans
Retention offer
Emotions by speaker
Customer
Frustrated
Disappointed
Anxious
Hopeful
Sales Rep
Calm
Confident
Interested
Concerned
6 Key behaviors detected
Churn risk
Churn Risk
Customer compared pricing with a competitor and referenced canceling.
Retention opportunity
Coaching
Agent offered an alternative plan before escalation — coachable save.
+ 4 more behaviors detected
Deepfake
Synthetic
Accents
American
PII / PHI
2 tags
Sentiment
Negative
Full Transcript
Diarized
Summary
Caller claimed a billing error and pressed the agent to redirect payment to a new account without completing verification.
Speaker topics
Caller
Billing error claim
Payment redirect
Urgency
Agent
Identity verification
Account lookup
Policy
Emotions by speaker
Caller
Stressed
Anxious
Frustrated
Afraid
Confused
Agent
Concerned
Calm
Confident
8 Key behaviors detected
Payment fraud attempt
Fraud Risk
Urgent payment language used to redirect funds to an unverified account.
Verification bypass
Compliance
Caller pushed to skip two required identity checks.
+ 6 more behaviors detected
Deepfake
Synthetic
Accents
Eastern European
+1
PII / PHI
4 tags
Sentiment
Negative
Full Transcript
Diarized
Summary
Prospect confirmed they were ready to move forward; rep kept introducing new features after the verbal commitment.
Speaker topics
Prospect
Ready to sign
Onboarding timeline
Budget
Sales Rep
Add-on features
Upsell
Contract terms
Emotions by speaker
Prospect
Interested
Confused
Bored
Surprised
Tired
Sales Rep
Excited
Confident
Proud
Hopeful
5 Key behaviors detected
Post-commitment overselling
Coaching
Rep introduced two extra features after the prospect already agreed.
Buyer hesitation
Escalation
Confusion cues rose as scope expanded past the original ask.
+ 3 more behaviors detected
Deepfake
Authentic
Accents
British
PII / PHI
1 tag
Sentiment
Mixed
Full Transcript
Diarized
THE VELMA DIFFERENCE

Transcription captures words.
Velma captures meaning.

Words are just the surface. Velma hears the full picture.

Word-based transcription discards the true meaning of a conversation.
Velma leverages acoustic signals to understand conversations like a human.

THE INDUSTRY STANDARD

Transcription + LLM pipeline
Voice signals discarded
Tone, emotion, hesitation, stress, speaker dynamics, intent, sarcasm and many more
WHAT TRANSCRIPTION CAPTURES
1 layer
Words · Captured · The literal transcript
Intent and behavior · Lost · Misunderstands intent and vulnerability
Tone and emotion · Lost · Loses anger, frustration, fear, joy, sarcasm
Prosody · Lost · Ignores pauses or unique delivery
Speaker dynamics · Lost · Overlooks interruptions and side comments
Deception and stress cues · Lost · Misses hesitation and vocal anxiety
Acoustic authenticity · Lost · Cannot catch deepfakes or spoofing

VELMA BY MODULATE

Voice-native AI
Voice signals analyzed
Tone, emotion, intent, rhythm, context, accents, deepfakes, sarcasm, vocal biomarkers and more.
WHAT VELMA CAPTURES
7 layers
Words · Captured · Best-in-class transcription accuracy
Intent and behavior · Captured · Any behavior detectable in real time
Tone and emotion · Captured · 20+ emotions from the acoustic signal
Prosody · Captured · Pitch, rhythm, emphasis, pacing
Speaker dynamics · Captured · Multi-speaker diarization and patterns
Deception and stress cues · Captured · Vocal stress, lying, coercion signals
Acoustic authenticity · Captured · #1 deepfake detection on Hugging Face
API OUTPUT

What every Velma API call returns

One call, one model — structured output, no pipeline to assemble and no fine-tuning.

Emotions detected and analyzed

Velma detects emotions to better understand meaning in a conversation.

Speaker identification

Identifies each speaker's role — even with overlapping speakers and noisy audio.

150+ behaviors with reasoning

Fraud, churn, compliance violations, harassment, escalation, and dozens more — detected the moment they happen.

Diarized transcript

Every word, every speaker, every timestamp — clean enough to drop straight into your pipeline.

Custom behaviors you define

Describe what matters to you, in natural language. Velma uses audio signals to surface it.

Conversation topics & sentiment

Surfaces what the conversation is about and how each speaker feels about it — per topic, per speaker.

Conversation summary

A clean, concise summary of every audio file — with higher accuracy than transcription + LLMs.

Accent detection

Recognizes regional and global accents so analysis stays accurate across diverse populations.

Deepfake detection

#1 ranked on Hugging Face. Flags synthetic and cloned voices in real time at a 1.1% equal error rate.

PII / PHI tags

Sensitive details tagged inline, ready to redact.
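
As a rough illustration of the outputs listed above, the sketch below shows how an application might model one result in Python. The field names and JSON shape are assumptions made for this example; the page does not publish the actual response schema, so treat every key as hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Behavior:
    name: str                # e.g. "Verification bypass"
    category: str            # e.g. "Compliance"
    confidence: float = 0.0  # assumed here to be a 0.0-1.0 score
    reasoning: str = ""


@dataclass
class ConversationResult:
    # All field names below are hypothetical, chosen to mirror the list above.
    summary: str = ""
    sentiment: str = ""
    deepfake: str = ""
    accents: list[str] = field(default_factory=list)
    pii_tags: list[str] = field(default_factory=list)
    topics: dict[str, list[str]] = field(default_factory=dict)    # speaker -> topics
    emotions: dict[str, list[str]] = field(default_factory=dict)  # speaker -> emotions
    behaviors: list[Behavior] = field(default_factory=list)


def parse_result(payload: dict[str, Any]) -> ConversationResult:
    """Build a ConversationResult from a decoded JSON object, tolerating missing keys."""
    behaviors = [
        Behavior(
            name=item.get("name", ""),
            category=item.get("category", ""),
            confidence=float(item.get("confidence", 0.0)),
            reasoning=item.get("reasoning", ""),
        )
        for item in payload.get("behaviors", [])
    ]
    return ConversationResult(
        summary=payload.get("summary", ""),
        sentiment=payload.get("sentiment", ""),
        deepfake=payload.get("deepfake", ""),
        accents=list(payload.get("accents", [])),
        pii_tags=list(payload.get("pii_tags", [])),
        topics=dict(payload.get("topics", {})),
        emotions=dict(payload.get("emotions", {})),
        behaviors=behaviors,
    )


# Example payload built from values shown on this page; the keys are made up.
result = parse_result({
    "sentiment": "Negative",
    "deepfake": "Synthetic",
    "accents": ["American"],
    "behaviors": [{"name": "Verification bypass", "category": "Compliance"}],
})
print(result.behaviors[0].name)  # Verification bypass
```

Parsing into typed objects at the edge keeps the rest of an application independent of whatever the real key names turn out to be.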
BEHAVIORS

Define the risks that matter to your business. Velma hears them in the audio.

Tell Velma what matters — edit any behavior or write your own, all in plain language.
Velma uses every audio signal to detect them accurately.

You can also upload SOPs, compliance docs, or playbooks to specify exactly what Velma should catch.
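
A minimal sketch of what defining behaviors in plain language could look like from code. The JSON keys (`use_default_behaviors`, `custom_behaviors`) and the `build_config` helper are hypothetical; the page only states that behaviors are written in natural language, not how the configuration is encoded.

```python
import json


def build_config(custom_behaviors: dict[str, str], use_defaults: bool = True) -> str:
    """Serialize plain-language behavior definitions into a JSON string.

    The payload shape is an assumption for illustration, not a documented format.
    """
    for name, description in custom_behaviors.items():
        if not name.strip() or not description.strip():
            raise ValueError("each behavior needs a non-empty name and description")
    return json.dumps({
        "use_default_behaviors": use_defaults,
        "custom_behaviors": [
            {"name": name, "description": description}
            for name, description in custom_behaviors.items()
        ],
    })


# Hypothetical behavior, worded in plain English as the page suggests.
config = build_config({
    "Verification bypass": (
        "Agent grants account access before the caller completes identity verification."
    ),
})
print(config)
```

Validating names and descriptions before sending keeps an empty definition from silently reaching the service.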

Velma vs. industry standard

Audio-native capabilities from a better architecture.

Native input
Velma API: Audio
Industry Standard: Text

Architecture
Velma API: Voice-native Ensemble Listening Model (ELM)
Industry Standard: Transcription + LLM pipeline

What's in the model?
Velma API: 100+ specialized sub-models, each optimized for a specific signal or task
Industry Standard: A transcript without audio signals + text-based LLM

Emotion detection
Velma API: Understands emotion from audio, not word choice. 20+ emotions.
Industry Standard: None built-in. Requires a separate SER model.

Speech signals
Velma API: Tone, emotion, prosody, rhythm, vocal stress.
Industry Standard: None built-in.

Non-speech signals
Velma API: Laughing, shouting, crying, hesitation, pitch, pacing
Industry Standard: Invisible

Deepfake detection
Velma API: 98.9% accuracy, #1 on Hugging Face, same API call
Industry Standard: Not a feature. Separate model + pipeline stage.

Custom behaviors
Velma API: Describe in plain English. Velma uses audio + text together for higher accuracy.
Industry Standard: Possible via prompt engineering. Accuracy limited to what words alone can reveal.

Out-of-box behaviors
Velma API: 50 by default, 100 more as templates, covering fraud, churn, compliance & escalation
Industry Standard: None. Each requires prompt engineering + ongoing maintenance.

Speaker diarization
Velma API: Industry-leading, handles overlap and noise
Industry Standard: Varies; overlap is a common failure

Integration complexity
Velma API: Drop-in. Send audio, receive structured JSON. A few lines of code.
Industry Standard: Manage STT + LLM separately, plus custom logic to enrich context.

Cost
Velma API: Starting at $0.75/hr
Industry Standard: $2.50–$10/hr
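
To put the hourly prices in context, here is a small worked calculation. The rates come from the comparison above; the monthly volume of 1,000 audio hours is a made-up figure, and "starting at" pricing may not be the rate that applies at that volume.

```python
VELMA_RATE = 0.75           # USD per audio hour ("starting at" price quoted above)
PIPELINE_RATE_LOW = 2.50    # USD per audio hour, low end of the quoted range
PIPELINE_RATE_HIGH = 10.00  # USD per audio hour, high end of the quoted range


def monthly_cost(audio_hours: float, rate_per_hour: float) -> float:
    """Return the cost in USD of processing audio_hours at a flat hourly rate."""
    return round(audio_hours * rate_per_hour, 2)


hours = 1_000  # hypothetical monthly volume
print(monthly_cost(hours, VELMA_RATE))          # 750.0
print(monthly_cost(hours, PIPELINE_RATE_LOW))   # 2500.0
print(monthly_cost(hours, PIPELINE_RATE_HIGH))  # 10000.0
```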
Build with Velma

Build on top of audio understanding, not transcription

Smarter voice agents

AI agents that understand voice signals for better responses.

AI voice guardrails

Monitor what your LLM-powered voice agent is saying — and how callers are reacting to it.

Emotion-driven apps

Personalize every interaction in real time — route, respond, and adapt based on how the caller actually feels.

Conversation analytics

Replace your STT/ASR layer with better conversational insights.

Live coaching tools

Real-time agent assist that surfaces what to say next, based on how the call is going.

Anything you can imagine

Ask Velma to find anything in a conversation, and it does. The only limit on what you build is what you can describe.
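
As one example of the emotion-driven use case above, the sketch below routes a call based on detected emotions and sentiment. The emotion and sentiment labels are taken from the sample output on this page; the queue names and the escalation policy are hypothetical.

```python
# Emotions that, under this made-up policy, send a negative call to a senior agent.
ESCALATE_EMOTIONS = {"Frustrated", "Afraid", "Stressed"}


def choose_route(emotions: list[str], sentiment: str) -> str:
    """Pick a destination queue from per-caller emotions and overall sentiment."""
    if sentiment == "Negative" and ESCALATE_EMOTIONS & set(emotions):
        return "senior_agent"
    if sentiment == "Negative":
        return "retention_queue"
    return "standard_queue"


print(choose_route(["Frustrated", "Anxious"], "Negative"))  # senior_agent
print(choose_route(["Disappointed"], "Negative"))           # retention_queue
print(choose_route(["Interested"], "Mixed"))                # standard_queue
```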
Where Velma fits

A drop-in layer for your voice stack

Audio in
Telephony / SIP
Voice agents
Recordings
Browser / Mobile
Understanding layer
Velma API
REST + WebSocket
Your application
Real-time alerts
Agent assist
Dashboards
Data warehouse
Drop Velma into any voice pipeline. The underlying model handles the rest.
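
The fan-out from the understanding layer to your application can be sketched as a small event router. This is an illustration only: the `type` key and the event shapes are assumptions, and the sinks here are plain lists standing in for an alerting system and a data warehouse.

```python
from typing import Any, Callable

Event = dict[str, Any]
Sink = Callable[[Event], None]


class EventRouter:
    """Fan out each event to every sink registered for its type."""

    def __init__(self) -> None:
        self._sinks: dict[str, list[Sink]] = {}

    def register(self, event_type: str, sink: Sink) -> None:
        self._sinks.setdefault(event_type, []).append(sink)

    def dispatch(self, event: Event) -> int:
        """Deliver the event and return how many sinks received it."""
        sinks = self._sinks.get(event.get("type", ""), [])
        for sink in sinks:
            sink(event)
        return len(sinks)


alerts: list[Event] = []     # stand-in for real-time alerts
warehouse: list[Event] = []  # stand-in for a data warehouse

router = EventRouter()
router.register("behavior", alerts.append)
router.register("behavior", warehouse.append)
router.register("summary", warehouse.append)

router.dispatch({"type": "behavior", "name": "Payment fraud attempt"})
router.dispatch({"type": "summary", "text": "Caller claimed a billing error."})
print(len(alerts), len(warehouse))  # 1 2
```

Keeping the routing table in one place means a new destination, such as a dashboard, is one `register` call rather than a change to the event loop.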

Velma is the #1 model
for Conversation Understanding

Conversation Understanding Benchmark — Accuracy vs. Cost
Evaluates a model's ability to identify conversation types, topics, speaker roles and key behaviors. Methodology ↗
[Chart: accuracy score plotted against inference cost, with the "highest accuracy, lowest cost" region marked. Models shown: velma-2-fast, velma-2, grok-4.1-fast-non-reasoning, grok-4.1-fast-reasoning, gemini-2-flash-lite, deepseek-v3.1, gemini-2-flash, deepseek-v3.2, gemini-3-flash-min, deepseek-r1, gemini-3-flash-med, gemini-2.5-pro, gemini-3-pro, grok-3, nova-3-intelligence, scribe-v2, grok-4-heavy, gpt-5-mini, gpt-5.2-pro, gpt-5.2.]
Get started in minutes

Drop-in by design — three steps, one API

1. Send audio
Point Velma at a file or a live stream, or connect the platform you already use (Five9, Genesys, Teams, Twilio, SIP). One endpoint, no pipeline to assemble.
2. Velma analyzes
A single voice-native model does all the work: no separate transcription, LLM, or enrichment services to wire together and keep in sync.
3. Output, where and how you like it
Structured JSON: stream it live, drop it in your warehouse, or trigger alerts. You decide where it goes.
It really is this short — streaming, start to finish:

# 1 · open a connection   2 · stream audio   3 · read results
# Assumes the third-party `websockets` package; config, audio_chunk and handle are yours to define.
from websockets.sync.client import connect

with connect("wss://modulate-developer-apis.com/api/velma-2-streaming?api_key=…") as ws:
    ws.send(config)            # what to detect, or just use the default package
    ws.send(audio_chunk)       # stream your audio
    for event in ws:           # clips, behaviors, topics, summary…
        handle(event)
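
The snippet above sends a single `audio_chunk`; in practice a recording or live feed is sent as a sequence of chunks. Below is a minimal sketch of a chunking helper. The 3,200-byte default corresponds to 100 ms of 16 kHz, 16-bit mono PCM, which is an assumption for illustration; the page does not state which audio format or chunk size the endpoint expects.

```python
import io
from typing import BinaryIO, Iterator


def iter_chunks(stream: BinaryIO, chunk_bytes: int = 3200) -> Iterator[bytes]:
    """Yield successive chunks from a binary stream until it is exhausted.

    The final chunk may be shorter than chunk_bytes.
    """
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    while True:
        chunk = stream.read(chunk_bytes)
        if not chunk:
            break
        yield chunk


# Demo with 8,000 bytes of silence standing in for real audio.
demo = io.BytesIO(b"\x00" * 8000)
sizes = [len(chunk) for chunk in iter_chunks(demo)]
print(sizes)  # [3200, 3200, 1600]
```

Each yielded chunk could then be passed to `ws.send(...)` in a loop, with an open file object in place of the in-memory demo stream.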

Start building with Velma.

Grab an API key or try the playground to see Velma understand a real conversation.
More from Modulate

Explore Modulate's other leading voice models

Audio-native APIs built for real-time performance — designed to drop right into your stack.

Deepfake Detection

Synthetic voice detection, batch and streaming. #1 on Hugging Face leaderboards.

Transcription

Real-time and batch transcription with speaker diarization. Lowest cost, lowest error rate.

PII/PHI Redaction

Auto-redact sensitive content from both transcripts and audio. Compliance-ready.

Music Detection

Detect music vs. speech in any audio stream. Real-time and batch.