Velma API

Understand the true meaning of every conversation

Transcription discards signals like emotion, tone and other audio cues that carry what a conversation actually means. Velma is a voice-native model that listens to the audio itself.
Velma turns voice conversations into signals and behaviors you can act on — out of the box, no LLM needed. The future of voice AI is built with Velma.
MEET VELMA

Audio-native AI that identifies and escalates your risks

AUDIO IN
Live streams
Recordings
Third-party integrations
Velma
Understand the full conversation
Audio-native Ensemble Listening Model (ELM)
API OUTPUT

Behaviors detected

187 alerts today · 33 critical · 40 escalated
Just now

Executive Impersonation Scam · Fraud · Confidence 93%

Caller claiming to be CEO solicited payment credentials using urgent payment language.

Alerted Agent
4 min ago

Identity Verification Fraud · Fraud · Confidence 91%

DOB, address, and PIN mismatched across three verification prompts.

Manual Verification
11 min ago

Security Protocol Bypass · Compliance Violation · Confidence 95%

Agent skipped two verification steps and granted full account access to unverified caller.

Urgent Review
19 min ago

Unresolved Billing Dispute · Compliance · Confidence 98%

Customer referenced unresolved billing dispute from prior call; fix never applied to account.

Claims Review
22 min ago

Threat-Based Harassment · Agent Safety · Confidence 96%

Caller repeated physical threat toward support agent after refund denial; warning issued, threats continued.

Escalated
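
An alert like those in the feed above could be modeled and triaged in application code roughly as follows. This is a minimal sketch: the `BehaviorAlert` class, its field names, and the 0.95 threshold are illustrative assumptions, not Velma's documented output schema. Only the sample values come from the feed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BehaviorAlert:
    """Hypothetical shape for one detected behavior (not an official schema)."""
    behavior: str
    category: str
    confidence: float  # 0.0 to 1.0
    action: str


# Sample values copied from the alert feed shown above.
ALERTS = [
    BehaviorAlert("Executive Impersonation Scam", "Fraud", 0.93, "Alerted Agent"),
    BehaviorAlert("Identity Verification Fraud", "Fraud", 0.91, "Manual Verification"),
    BehaviorAlert("Security Protocol Bypass", "Compliance Violation", 0.95, "Urgent Review"),
    BehaviorAlert("Unresolved Billing Dispute", "Compliance", 0.98, "Claims Review"),
    BehaviorAlert("Threat-Based Harassment", "Agent Safety", 0.96, "Escalated"),
]


def needs_review(alerts, min_confidence=0.95):
    """Return alerts at or above the (assumed) threshold, highest confidence first."""
    selected = [a for a in alerts if a.confidence >= min_confidence]
    return sorted(selected, key=lambda a: a.confidence, reverse=True)


if __name__ == "__main__":
    for alert in needs_review(ALERTS):
        print(f"{alert.confidence:.0%}  {alert.behavior} [{alert.category}] -> {alert.action}")
```

With the sample data, the filter keeps the billing dispute, harassment, and protocol bypass alerts, in that order. A real integration would pick its threshold per behavior rather than globally.
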
THE VELMA DIFFERENCE

Transcription captures words.
Velma captures meaning.

Words are just the surface. Velma hears the full picture.

Word-based transcription discards the true meaning of a conversation.
Velma leverages acoustic signals to understand conversations like a human.

THE INDUSTRY STANDARD

Transcription + LLM pipeline
Voice signals discarded
Tone, emotion, hesitation, stress, speaker dynamics, intent, sarcasm and many more
WHAT TRANSCRIPTION CAPTURES: 1 layer
Words: Captured. The literal transcript.
Intent and behavior: Lost. Misunderstands intent and vulnerability.
Tone and emotion: Lost. Loses anger, frustration, fear, joy, sarcasm.
Prosody: Lost. Ignores pauses or unique delivery.
Speaker dynamics: Lost. Overlooks interruptions and side comments.
Deception and stress cues: Lost. Misses hesitation and vocal anxiety.
Acoustic authenticity: Lost. Cannot catch deepfakes or spoofing.

VELMA BY MODULATE

Voice-native AI
Voice signals analyzed
Tone, emotion, intent, rhythm, context, accents, deepfakes, sarcasm, vocal biomarkers and more.
WHAT VELMA CAPTURES: 7 layers
Words: Captured. Best-in-class transcription accuracy.
Intent and behavior: Captured. Any behavior detectable in real time.
Tone and emotion: Captured. 20+ emotions from the acoustic signal.
Prosody: Captured. Pitch, rhythm, emphasis, pacing.
Speaker dynamics: Captured. Multi-speaker diarization and patterns.
Deception and stress cues: Captured. Vocal stress, lying, coercion signals.
Acoustic authenticity: Captured. #1 deepfake detection on Hugging Face.
BEHAVIORS

Define the risks that matter to your business. Velma hears them in the audio.

Tell Velma what matters — edit any behavior or write your own, all in plain language.
Velma uses every audio signal to detect them accurately.

You can also upload SOPs, compliance docs, or playbooks to specify exactly what Velma should catch.

Velma vs. industry standard

Audio-native capabilities from a better architecture.

Native input
Velma API: Audio
Industry Standard: Text

Architecture
Velma API: Voice-native Ensemble Listening Model (ELM)
Industry Standard: Transcription + LLM pipeline

What's in the model?
Velma API: 100+ specialized sub-models, each optimized for a specific signal or task
Industry Standard: A transcript without audio signals + text-based LLM

Emotion detection
Velma API: Understands emotion from audio, not word choice. 20+ emotions.
Industry Standard: None built-in. Requires a separate SER model.

Speech signals
Velma API: Tone, emotion, prosody, rhythm, vocal stress.
Industry Standard: None built-in.

Non-speech signals
Velma API: Laughing, crying, shouting, hesitation, pitch, pacing
Industry Standard: Invisible

Deepfake detection
Velma API: 98.9% accuracy, #1 on Hugging Face, same API call
Industry Standard: Not a feature. Separate model + pipeline stage.

Custom behaviors
Velma API: Describe in plain English. Velma uses audio + text together for higher accuracy.
Industry Standard: Possible via prompt engineering. Accuracy limited to what words alone can reveal.

Out-of-box behaviors
Velma API: 50 by default, 100 more as templates: fraud, churn, compliance, and escalation
Industry Standard: None. Each requires prompt engineering + ongoing maintenance.

Speaker diarization
Velma API: Industry-leading, handles overlap and noise
Industry Standard: Varies; overlap is a common failure

Integration complexity
Velma API: Drop-in. Send audio, receive structured JSON. A few lines of code.
Industry Standard: Manage STT + LLM separately, plus custom logic to enrich context.

Cost
Velma API: Starting at $0.75/hr
Industry Standard: $2.50–$10/hr
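
To put the listed hourly rates in context, a back-of-the-envelope comparison for a given audio volume might look like this. It is a rough sketch only: the 1,000-hour volume is an arbitrary example, $0.75/hr is the stated starting price rather than a quote, and actual pricing may differ by plan or volume.

```python
VELMA_STARTING_RATE = 0.75           # USD per audio hour ("starting at", from the comparison above)
PIPELINE_RATE_RANGE = (2.50, 10.00)  # USD per audio hour, from the comparison above


def audio_cost(hours, rate_per_hour):
    """Cost in USD for a number of audio hours at a flat hourly rate."""
    return hours * rate_per_hour


if __name__ == "__main__":
    hours = 1_000  # hypothetical volume
    velma = audio_cost(hours, VELMA_STARTING_RATE)
    low, high = (audio_cost(hours, rate) for rate in PIPELINE_RATE_RANGE)
    print(f"Velma (starting rate): ${velma:,.2f}")
    print(f"Transcription + LLM:   ${low:,.2f} to ${high:,.2f}")
```

At 1,000 hours the listed rates work out to $750 at Velma's starting price against $2,500 to $10,000 for the pipeline range.
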
Build with Velma

Build on top of audio understanding, not transcription

Smarter voice agents

AI agents that understand voice signals for better responses.

AI voice guardrails

Monitor what your LLM-powered voice agent is saying — and how callers are reacting to it.

Emotion-driven apps

Personalize every interaction in real time — route, respond, and adapt based on how the caller actually feels.

Conversation analytics

Replace your STT/ASR layer with better conversational insights.

Live coaching tools

Real-time agent assist that surfaces what to say next, based on how the call is going.

Anything you can imagine

Ask Velma to find anything in a conversation, and it does. The only limit on what you build is what you can describe.
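
As one illustration of the emotion-driven routing described above, an application could map a detected emotion to a call-handling decision. The sketch below assumes a hypothetical result carrying an `emotion` label and a `score` between 0 and 1. The field names, queue names, and threshold are invented for illustration and are not part of Velma's documented output.

```python
# Emotion labels are taken from examples named earlier on this page;
# queue names and the threshold are hypothetical.
ESCALATE_EMOTIONS = {"anger", "fear"}
SOFTEN_EMOTIONS = {"frustration"}


def route_call(emotion: str, score: float, threshold: float = 0.8) -> str:
    """Pick a queue from an (assumed) emotion label and confidence score."""
    if score < threshold:
        return "default_queue"  # low confidence: do not act on the signal
    if emotion in ESCALATE_EMOTIONS:
        return "senior_agent_queue"
    if emotion in SOFTEN_EMOTIONS:
        return "retention_queue"
    return "default_queue"


if __name__ == "__main__":
    print(route_call("anger", 0.92))
    print(route_call("frustration", 0.85))
    print(route_call("joy", 0.99))
```

Keeping the mapping in plain data makes it easy to adjust which emotions trigger which path without touching the routing logic.
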
Where Velma fits

A drop-in layer for your voice stack

Audio in
Telephony / SIP
Voice agents
Recordings
Browser / Mobile
Understanding layer
Velma API
REST + WebSocket
Your application
Real-time alerts
Agent assist
Dashboards
Data warehouse
Drop Velma into any voice pipeline. The underlying model handles the rest.

Velma is the #1 model for Conversation Understanding

Conversation Understanding Benchmark — Accuracy vs. Cost
Evaluates a model's ability to identify conversation types, topics, speaker roles and key behaviors. Methodology ↗
[Chart: accuracy score (axis ticks 1 to 10) plotted against inference cost (axis ticks $0.01 to $1.50), with a "highest accuracy, lowest cost" region marked. Models shown: velma-2-fast, velma-2, grok-4.1-fast-non-reasoning, grok-4.1-fast-reasoning, gemini-2-flash-lite, deepseek-v3.1, gemini-2-flash, deepseek-v3.2, gemini-3-flash-min, deepseek-r1, gemini-3-flash-med, gemini-2.5-pro, gemini-3-pro, grok-3, nova-3-intelligence, scribe-v2, grok-4-heavy, gpt-5-mini, gpt-5.2-pro, gpt-5.2.]
Get started in minutes

Drop-in by design — three steps, one API

1. Send audio
Point Velma at a file or a live stream — or connect the platform you already use (Five9, Genesys, Teams, Twilio, SIP). One endpoint, no pipeline to assemble.
2. Velma analyzes
A single voice-native model does all the work — no separate transcription, LLM, or enrichment services to wire together and keep in sync.
3. Output, where and how you like it
Structured JSON: stream it live, drop it in your warehouse, or trigger alerts. You decide where it goes.
It really is this short — streaming, start to finish:

# 1 · open a connection   2 · stream audio   3 · read results
ws = connect("wss://modulate-developer-apis.com/api/velma-2-streaming?api_key=…")
ws.send(config)            # what to detect — or just use the default package
ws.send(audio_chunk)       # stream your audio
for event in ws:           # clips, behaviors, topics, summary…
    handle(event)
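
The three-step flow above can be fleshed out a little further. This is a sketch under stated assumptions, not the documented client: it assumes the config is sent as JSON text, audio as binary frames, and each result as one JSON message, none of which is confirmed here. `run_session`, `SocketLike`, and `FakeSocket` are hypothetical names, and the fake connection exists only so the flow can be run offline.

```python
import json
from typing import Callable, Iterable, Protocol, Union


class SocketLike(Protocol):
    """Minimal interface this sketch needs from a WebSocket connection."""
    def send(self, message: Union[str, bytes]) -> None: ...
    def __iter__(self): ...


def run_session(ws: SocketLike, config: dict, chunks: Iterable[bytes],
                handle: Callable[[dict], None]) -> int:
    """Send config, stream audio, then read events. Returns the event count."""
    ws.send(json.dumps(config))  # assumption: config is sent as JSON text
    for chunk in chunks:
        ws.send(chunk)           # assumption: audio is sent as binary frames
    count = 0
    for raw in ws:               # assumption: each event is one JSON message
        handle(json.loads(raw))
        count += 1
    return count


class FakeSocket:
    """Stand-in for a real connection so the flow can be exercised offline."""
    def __init__(self, replies):
        self.sent = []
        self._replies = list(replies)

    def send(self, message):
        self.sent.append(message)

    def __iter__(self):
        return iter(self._replies)


if __name__ == "__main__":
    fake = FakeSocket(['{"type": "summary", "text": "demo"}'])
    events = []
    n = run_session(fake, {"behaviors": "default"}, [b"\x00\x01", b"\x02\x03"], events.append)
    print(n, events, len(fake.sent))
```

With a real connection, any object that offers `send` and message iteration could be passed as `ws`, for example the connection returned by `websockets.sync.client.connect` in the third-party `websockets` package. Like the snippet above, this sketch sends all audio before reading results; a live stream would more likely read events concurrently with sending.
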

Start building with Velma.

Grab an API key or try the playground to see Velma understand a real conversation.
More from Modulate

Explore Modulate's other leading voice models

Audio-native APIs built for real-time performance — designed to drop right into your stack.

Deepfake Detection

Synthetic voice detection, batch and streaming. #1 on Hugging Face leaderboards.

Transcription

Real-time and batch transcription with speaker diarization. Lowest cost, lowest error rate.

PII/PHI Redaction

Auto-redact sensitive content from both transcripts and audio. Compliance-ready.

Music Detection

Detect music vs. speech in any audio stream. Real-time and batch.