MEET VELMA

Audio-native AI that identifies and escalates your risks

AUDIO IN

Live streams

Recordings

Third party integrations

▼

Velma

Understand the full conversation

Audio-native Ensemble Listening Model (ELM)

▼

API OUTPUT

Just now

Executive Impersonation Scam · Fraud · Confidence 93%

Caller claiming to be CEO solicited payment credentials using urgent payment language.

Alerted Agent

4 min ago

Identity Verification Fraud · Fraud · Confidence 91%

DOB, address, and PIN mismatched across three verification prompts.

Manual Verification

11 min ago

Security Protocol Bypass · Compliance Violation · Confidence 95%

Agent skipped two verification steps and granted full account access to unverified caller.

Urgent Review

19 min ago

Unresolved Billing Dispute · Compliance · Confidence 98%

Customer referenced unresolved billing dispute from prior call; fix never applied to account.

Claims Review

22 min ago

Threat-Based Harassment · Agent Safety · Confidence 96%

Caller repeated physical threat toward support agent after refund denial; warning issued, threats continued.

Escalated

Conversation analysis

LIVE

Summary

Customer called about a refund on a recent charge and pushed for a better price; agent offered alternatives and a partial discount.

Speaker topics

Customer

Pricing questions

Competitor comparison

Sales Rep

Discount options

Alternative plans

Retention offer

Emotions by speaker

Customer

Frustrated

Disappointed

Anxious

Hopeful

Sales Rep

Calm

Confident

Interested

Concerned

6 Key behaviors detected

Churn risk

Churn Risk

Customer compared pricing with a competitor and referenced canceling.

Retention opportunity

Coaching

Agent offered an alternative plan before escalation — coachable save.

+ 4 more behaviors detected

Deepfake

Synthetic

Accents

American

PII / PHI

2 tags

Sentiment

Negative

Full Transcript

Diarized

Summary

Caller claimed a billing error and pressed the agent to redirect payment to a new account without completing verification.

Speaker topics

Caller

Billing error claim

Payment redirect

Urgency

Agent

Identity verification

Account lookup

Policy

Emotions by speaker

Caller

Stressed

Anxious

Frustrated

Afraid

Confused

Agent

Concerned

Calm

Confident

8 Key behaviors detected

Payment fraud attempt

Fraud Risk

Urgent payment language used to redirect funds to an unverified account.

Verification bypass

Compliance

Caller pushed to skip two required identity checks.

+ 6 more behaviors detected

Deepfake

Synthetic

Accents

Eastern European

+1

PII / PHI

4 tags

Sentiment

Negative

Full Transcript

Diarized

Summary

Prospect confirmed they were ready to move forward; rep kept introducing new features after the verbal commitment.

Speaker topics

Prospect

Ready to sign

Onboarding timeline

Budget

Sales Rep

Add-on features

Upsell

Contract terms

Emotions by speaker

Prospect

Interested

Confused

Bored

Surprised

Tired

Sales Rep

Excited

Confident

Proud

Hopeful

5 Key behaviors detected

Post-commitment overselling

Coaching

Rep introduced two extra features after the prospect already agreed.

Buyer hesitation

Escalation

Confusion cues rose as scope expanded past the original ask.

+ 3 more behaviors detected

Deepfake

Authentic

Accents

British

PII / PHI

1 tag

Sentiment

Mixed

Full Transcript

Diarized

THE VELMA DIFFERENCE

Transcription captures words.
Velma captures meaning.

Words are just the surface. Velma hears the full picture.

Word-based transcription discards the true meaning of a conversation.
Velma leverages acoustic signals to understand conversations like a human.

THE INDUSTRY STANDARD

Transcription + LLM pipeline

Voice signals discarded

Tone, emotion, hesitation, stress, speaker dynamics, intent, sarcasm, and more

WHAT TRANSCRIPTION CAPTURES

1 layer

Words

Captured

The literal transcript

Intent and behavior

Lost

Misunderstands intent and vulnerability

Tone and emotion

Lost

Loses anger, frustration, fear, joy, sarcasm

Prosody

Lost

Ignores pauses or unique delivery

Speaker dynamics

Lost

Overlooks interruptions and side comments

Deception and stress cues

Lost

Misses hesitation and vocal anxiety

Acoustic authenticity

Lost

Cannot catch deepfakes or spoofing

VELMA BY MODULATE

Voice-native AI

Voice signals analyzed

Tone, emotion, intent, rhythm, context, accents, deepfakes, sarcasm, vocal biomarkers and more.

WHAT VELMA CAPTURES

7 layers

Words

Captured

Best-in-class transcription accuracy

Intent and behavior

Captured

Any behavior detectable in real time

Tone and emotion

Captured

20+ emotions from the acoustic signal

Prosody

Captured

Pitch, rhythm, emphasis, pacing

Speaker dynamics

Captured

Multi-speaker diarization and patterns

Deception and stress cues

Captured

Vocal stress, lying, coercion signals

Acoustic authenticity

Captured

#1 deepfake detection on Hugging Face

API OUTPUT

What every Velma API call returns

One call, one model — structured output, no pipeline to assemble and no fine-tuning.

Emotions detected and analyzed

Velma detects emotions to better understand meaning in a conversation.

Speaker identification

Identifies each speaker's role — even with overlapping speakers and noisy audio.

150+ behaviors with reasoning

Fraud, churn, compliance violations, harassment, escalation, and dozens more — detected the moment they happen.

Diarized transcript

Every word, every speaker, every timestamp — clean enough to drop straight into your pipeline.

Custom behaviors you define

Describe what matters to you, in natural language. Velma uses audio signals to surface it.

Conversation topics & sentiment

Surfaces what the conversation is about and how each speaker feels about it — per topic, per speaker.

Conversation summary

A clean, concise summary of every audio file — with higher accuracy than transcription + LLMs.

Accent detection

Recognizes regional and global accents so analysis stays accurate across diverse populations.

Deepfake detection

#1 ranked on Hugging Face. Flags synthetic and cloned voices in real time at a 1.1% equal error rate.

PII / PHI tags

Sensitive details tagged inline, ready to redact.
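
The 1.1% equal error rate quoted for deepfake detection is the operating point where the false-acceptance rate (synthetic audio passed as authentic) equals the false-rejection rate (authentic audio flagged as synthetic). Below is a minimal sketch of how such a figure is typically computed from detector scores. The scores and labels are made-up illustration data, not Modulate results, and `equal_error_rate` is a hypothetical helper, not part of the Velma API.

```python
from typing import Sequence, Tuple


def equal_error_rate(
    scores: Sequence[float], is_synthetic: Sequence[bool]
) -> Tuple[float, float]:
    """Sweep thresholds and return (eer, threshold).

    A clip is flagged as synthetic when score >= threshold.
    FAR = share of synthetic clips that slip through (score < threshold).
    FRR = share of authentic clips wrongly flagged (score >= threshold).
    """
    synthetic = [s for s, y in zip(scores, is_synthetic) if y]
    authentic = [s for s, y in zip(scores, is_synthetic) if not y]
    if not synthetic or not authentic:
        raise ValueError("need at least one clip of each class")

    best = None  # (gap between FAR and FRR, eer estimate, threshold)
    for threshold in sorted(set(scores)) + [float("inf")]:
        far = sum(s < threshold for s in synthetic) / len(synthetic)
        frr = sum(s >= threshold for s in authentic) / len(authentic)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, threshold)
    return best[1], best[2]


# Illustration data only: higher score means "more likely synthetic".
demo_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2, 0.1, 0.4]
demo_labels = [True, True, True, True, False, False, False, False]

eer, threshold = equal_error_rate(demo_scores, demo_labels)
print(f"EER = {eer:.1%} at threshold {threshold}")  # EER = 25.0% at threshold 0.6
```

At the EER threshold both error rates are equal, so a 1.1% EER corresponds to 98.9% of clips in each class being classified correctly, which lines up with the accuracy figure quoted in the comparison table.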

BEHAVIORS

Define the risks that matter to your business. Velma hears them in the audio.

Tell Velma what matters — edit any behavior or write your own, all in plain language.
Velma uses every audio signal to detect them accurately.

Detect when an agent skips requir

Saved: Unauthorized Data Disclosure

You can also upload SOPs, compliance docs, or playbooks to specify exactly what Velma should catch.

Velma vs. industry standard

Audio-native capabilities from a better architecture.

Velma API

Industry Standard

Native input

Audio

Text

Architecture

Voice-native Ensemble Listening Model (ELM)

Transcription + LLM pipeline

What's in the model?

100+ specialized sub-models, each optimized for a specific signal or task

A transcript without audio signals + text-based LLM

Emotion detection

Understands emotion from audio, not word choice. 20+ emotions.

None built-in. Requires a separate SER model.

Speech signals

Tone, emotion, prosody, rhythm, vocal stress.

None built-in.

Non-speech signals

Laughing, shouting, crying, hesitation, pitch, pacing

Invisible

Deepfake detection

98.9% accuracy, #1 on Hugging Face, same API call

Not a feature. Separate model + pipeline stage.

Custom behaviors

Describe in plain English. Velma uses audio + text together for higher accuracy.

Possible via prompt engineering. Accuracy limited to what words alone can reveal.

Out-of-box behaviors

50 by default, 100 more as templates — fraud, churn, compliance & escalation

None. Each requires prompt engineering + ongoing maintenance.

Speaker diarization

Industry-leading, handles overlap and noise

Varies; overlap is a common failure

Integration complexity

Drop-in. Send audio, receive structured JSON. A few lines of code.

Manage STT + LLM separately, plus custom logic to enrich context.

Cost

Starting at $0.75/hr

$2.50–$10/hr
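
To make the cost row concrete, the sketch below works out what a given volume of audio would cost at the listed hourly rates. The 10,000-hour volume is an arbitrary illustration and `audio_cost` is a hypothetical helper. The Velma figure is a "starting at" price, so real bills may differ.

```python
from decimal import Decimal

# Rates copied from the comparison table above (USD per hour of audio).
VELMA_RATE = Decimal("0.75")           # "Starting at $0.75/hr"
PIPELINE_RATE_LOW = Decimal("2.50")    # low end of "$2.50–$10/hr"
PIPELINE_RATE_HIGH = Decimal("10.00")  # high end of "$2.50–$10/hr"


def audio_cost(hours: int, rate_per_hour: Decimal) -> Decimal:
    """Cost in USD for `hours` of audio, rounded to cents."""
    return (Decimal(hours) * rate_per_hour).quantize(Decimal("0.01"))


hours = 10_000  # arbitrary example volume
velma = audio_cost(hours, VELMA_RATE)
low = audio_cost(hours, PIPELINE_RATE_LOW)
high = audio_cost(hours, PIPELINE_RATE_HIGH)

print(f"Velma:                 ${velma}")
print(f"Transcription + LLM:   ${low} to ${high}")
```

At these list rates the pipeline range works out to roughly 3.3x to 13.3x the Velma starting price, independent of volume.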

Build with Velma

Build on top of audio understanding, not transcription

Smarter voice agents

AI agents that understand voice signals for better responses.

AI voice guardrails

Monitor what your LLM-powered voice agent is saying — and how callers are reacting to it.

Emotion-driven apps

Personalize every interaction in real time — route, respond, and adapt based on how the caller actually feels.

Conversation analytics

Replace your STT/ASR layer with better conversational insights.

Live coaching tools

Real-time agent assist that surfaces what to say next, based on how the call is going.

Anything you can imagine

Ask Velma to find anything in a conversation, and it does. The only limit on what you build is what you can describe.

Where Velma fits

A drop-in layer for your voice stack

Audio in

Telephony / SIP

Voice agents

Recordings

Browser / Mobile

Understanding layer

Velma API

REST + WebSocket

Your application

Real-time alerts

Agent assist

Dashboards

Data warehouse

Drop Velma into any voice pipeline. The underlying model handles the rest.

Velma is the #1 model
for Conversation Understanding

Conversation Understanding Benchmark — Accuracy vs. Cost

Evaluates a model's ability to identify conversation types, topics, speaker roles and key behaviors. Methodology ↗

[Chart: accuracy score vs. inference cost, with the highest-accuracy, lowest-cost corner highlighted. Models plotted: velma-2-fast, velma-2, grok-4.1-fast-non-reasoning, grok-4.1-fast-reasoning, gemini-2-flash-lite, deepseek-v3.1, gemini-2-flash, deepseek-v3.2, gemini-3-flash-min, deepseek-r1, gemini-3-flash-med, gemini-2.5-pro, gemini-3-pro, grok-3, nova-3-intelligence, scribe-v2, grok-4-heavy, gpt-5-mini, gpt-5.2-pro, gpt-5.2.]

Get started in minutes

Drop-in by design — three steps, one API

1

Send audio

Point Velma at a file or a live stream — or connect the platform you already use (Five9, Genesys, Teams, Twilio, SIP). One endpoint, no pipeline to assemble.

2

Velma analyzes

A single voice-native model does all the work — no separate transcription, LLM, or enrichment services to wire together and keep in sync.

3

Output, where and how you like it

Structured JSON output: stream it live, drop it in your warehouse, or trigger alerts. You decide where it goes.

It really is this short — streaming, start to finish:

# 1 · open a connection   2 · stream audio   3 · read results
ws = connect("wss://platform.modulate.ai/api/velma-2-streaming?api_key=…")
ws.send(config)          # what to detect, or just use the default package
ws.send(audio_chunk)     # stream your audio
for event in ws:         # clips, behaviors, topics, summary…
    handle(event)
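
The snippet above leaves `handle(event)` undefined. Here is one way it might look, assuming each event arrives as a JSON string carrying a type, a category, a label, and a 0 to 1 confidence score, like the sample alert cards earlier on the page. Every field name, the `ROUTES` table, and the 0.90 threshold are hypothetical placeholders rather than the documented Velma schema, so check the API docs for the real event shape.

```python
import json
from typing import Callable, Dict

ALERT_THRESHOLD = 0.90  # hypothetical cut-off; tune to your own risk tolerance

# Hypothetical category -> action table, loosely mirroring the sample alert cards.
ROUTES: Dict[str, str] = {
    "Fraud": "alert_agent",
    "Compliance Violation": "urgent_review",
    "Agent Safety": "escalate",
}


def route(event: dict) -> str:
    """Pick an action for one already-parsed event (assumed schema)."""
    if event.get("type") != "behavior":
        return "store"  # transcripts, topics, summaries: just persist them
    if event.get("confidence", 0.0) < ALERT_THRESHOLD:
        return "store"  # low-confidence detections are kept but not alerted on
    return ROUTES.get(event.get("category", ""), "manual_review")


def log_action(action: str, event: dict) -> None:
    """Default sink: print the decision. Swap in a pager, queue, or DB write."""
    print(f"{action}: {event.get('label', event.get('type'))}")


def handle(raw_event: str, sink: Callable[[str, dict], None] = log_action) -> str:
    """Parse one JSON event, decide where it goes, and hand it to the sink."""
    event = json.loads(raw_event)
    action = route(event)
    sink(action, event)
    return action


# Example with a made-up event shaped like the first alert card.
handle(json.dumps({
    "type": "behavior",
    "category": "Fraud",
    "label": "Executive Impersonation Scam",
    "confidence": 0.93,
}))
```

Keeping `route` separate from the sink means the routing rules can be unit-tested without a live connection, and the same function works for both streamed events and batch results.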

Start building with Velma.

Grab an API key or try the playground to see Velma understand a real conversation.

Get free API access

API DOCS

More from Modulate

Explore Modulate's other leading voice models

Audio-native APIs built for real-time performance — designed to drop right into your stack.

Deepfake Detection

Synthetic voice detection, batch and streaming. #1 on Hugging Face leaderboards.

See how it works

Transcription

Real-time and batch transcription with speaker diarization. Lowest cost, lowest error rate.

See how it works

PII/PHI Redaction

Auto-redact sensitive content from both transcripts and audio. Compliance-ready.

See how it works

Music Detection

Detect music vs. speech in any audio stream. Real-time and batch.

See how it works


Velma Triage API

Understand the true meaning of every conversation
