New Velma API

MEET VELMA

Audio-native AI that identifies and escalates your risks

AUDIO IN

Live streams

Recordings

Third party integrations

▼

Velma

Understand the full conversation

Audio-native Ensemble Listening Model (ELM)

▼

API OUTPUT

Just now

Executive Impersonation Scam · Fraud · Confidence 93%

Caller claiming to be CEO solicited payment credentials using urgent payment language.

Alerted Agent

4 min ago

Identity Verification Fraud · Fraud · Confidence 91%

DOB, address, and PIN mismatched across three verification prompts.

Manual Verification

11 min ago

Security Protocol Bypass · Compliance Violation · Confidence 95%

Agent skipped two verification steps and granted full account access to unverified caller.

Urgent Review

19 min ago

Unresolved Billing Dispute · Compliance · Confidence 98%

Customer referenced unresolved billing dispute from prior call; fix never applied to account.

Claims Review

22 min ago

Threat-Based Harassment · Agent Safety · Confidence 96%

Caller repeated physical threat toward support agent after refund denial; warning issued, threats continued.

Escalated
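Each item in the feed above pairs a detected behavior with a category, a confidence score, and a follow-up action. A minimal sketch of how a consumer might route such events is shown here. The event shape, the field names (`behavior`, `category`, `confidence`), the action names, and the 0.9 threshold are all assumptions made for illustration; they are not taken from the Velma API documentation.

```python
from dataclasses import dataclass


@dataclass
class BehaviorEvent:
    # Hypothetical event shape; field names are assumptions, not the Velma schema.
    behavior: str
    category: str
    confidence: float  # 0.0 to 1.0


# Hypothetical category-to-action mapping, loosely modeled on the feed above.
ACTIONS = {
    "Fraud": "alert_agent",
    "Compliance Violation": "urgent_review",
    "Agent Safety": "escalate",
}


def route(event: BehaviorEvent, threshold: float = 0.9) -> str:
    """Pick an action; low-confidence or unmapped events are only logged."""
    if event.confidence < threshold:
        return "log_only"
    return ACTIONS.get(event.category, "log_only")
```

Keeping the threshold and the mapping in plain data makes it easy to tune how aggressively each category is escalated without touching the routing logic.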

THE VELMA DIFFERENCE

Transcription captures words.
Velma captures meaning.

Words are just the surface. Velma hears the full picture.

Word-based transcription discards the true meaning of a conversation.
Velma leverages acoustic signals to understand conversations like a human.

THE INDUSTRY STANDARD

Transcription + LLM pipeline

Voice signals discarded

Tone, emotion, hesitation, stress, speaker dynamics, intent, sarcasm and many more

WHAT TRANSCRIPTION CAPTURES

1 layer

| Layer | Status | Detail |
| --- | --- | --- |
| Words | Captured | The literal transcript |
| Intent and behavior | Lost | Misunderstands intent and vulnerability |
| Tone and emotion | Lost | Loses anger, frustration, fear, joy, sarcasm |
| Prosody | Lost | Ignores pauses or unique delivery |
| Speaker dynamics | Lost | Overlooks interruptions and side comments |
| Deception and stress cues | Lost | Misses hesitation and vocal anxiety |
| Acoustic authenticity | Lost | Cannot catch deepfakes or spoofing |

VELMA BY MODULATE

Voice-native AI

Voice signals analyzed

Tone, emotion, intent, rhythm, context, accents, deepfakes, sarcasm, vocal biomarkers and more.

WHAT VELMA CAPTURES

7 layers

| Layer | Status | Detail |
| --- | --- | --- |
| Words | Captured | Best-in-class transcription accuracy |
| Intent and behavior | Captured | Any behavior detectable in real time |
| Tone and emotion | Captured | 20+ emotions from the acoustic signal |
| Prosody | Captured | Pitch, rhythm, emphasis, pacing |
| Speaker dynamics | Captured | Multi-speaker diarization and patterns |
| Deception and stress cues | Captured | Vocal stress, lying, coercion signals |
| Acoustic authenticity | Captured | #1 deepfake detection on Hugging Face |

BEHAVIORS

Define the risks that matter to your business. Velma hears them in the audio.

Tell Velma what matters — edit any behavior or write your own, all in plain language.
Velma uses every audio signal to detect them accurately.

Detect when an agent skips requir

Saved: Unauthorized Data Disclosure

You can also upload SOPs, compliance docs, or playbooks to specify exactly what Velma should catch.
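Because behaviors are written in plain language, a configuration can be little more than a list of names and descriptions. The sketch below shows one way to assemble such a payload. The JSON layout (`behaviors`, `name`, `description`) and the example description are assumptions for illustration only; the actual configuration schema is defined in the API docs.

```python
import json


def build_config(behaviors: dict[str, str]) -> str:
    """Serialize plain-language behavior definitions into a JSON string.

    The payload layout is hypothetical, not the documented Velma schema.
    """
    for name, description in behaviors.items():
        if not description.strip():
            raise ValueError(f"behavior {name!r} needs a plain-language description")
    payload = {
        "behaviors": [
            {"name": name, "description": description}
            for name, description in behaviors.items()
        ]
    }
    return json.dumps(payload)


# Illustrative description written for this example, not taken from the page.
config = build_config({
    "Unauthorized Data Disclosure":
        "Agent shares account details before the caller has been verified.",
})
```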

Velma vs. industry standard

Audio-native capabilities from a better architecture.

| Capability | Velma API | Industry Standard |
| --- | --- | --- |
| Native input | Audio | Text |
| Architecture | Voice-native Ensemble Listening Model (ELM) | Transcription + LLM pipeline |
| What's in the model? | 100+ specialized sub-models, each optimized for a specific signal or task | A transcript without audio signals + text-based LLM |
| Emotion detection | Understands emotion from the audio, not from word choice. 20+ emotions. | None built-in. Requires a separate SER model. |
| Speech signals | Tone, emotion, prosody, rhythm, vocal stress. | None built-in. |
| Non-speech signals | Laughing, shouting, crying, hesitation, pitch, pacing | Invisible |
| Deepfake detection | 98.9% accuracy, #1 on Hugging Face, same API call | Not a feature. Separate model + pipeline stage. |
| Custom behaviors | Describe in plain English. Velma uses audio + text together for higher accuracy. | Possible via prompt engineering. Accuracy limited to what words alone can reveal. |
| Out-of-box behaviors | 50 by default, 100 more as templates: fraud, churn, compliance & escalation | None. Each requires prompt engineering + ongoing maintenance. |
| Speaker diarization | Industry-leading, handles overlap and noise | Varies; overlap is a common failure |
| Integration complexity | Drop-in. Send audio, receive structured JSON. A few lines of code. | Manage STT + LLM separately, plus custom logic to enrich context. |
| Cost | Starting at $0.75/hr | $2.50–$10/hr |
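To put the cost figures in concrete terms, the short calculation below applies the quoted hourly rates to a monthly audio volume. The rates come from the comparison above; the 10,000-hour volume is a hypothetical figure chosen for illustration, and the Velma rate is a "starting at" price, so real totals may differ.

```python
VELMA_RATE = 0.75                     # USD per audio hour ("starting at" price)
PIPELINE_RATE_RANGE = (2.50, 10.00)   # USD per audio hour (quoted range)


def monthly_cost(hours: float, rate: float) -> float:
    """Cost in USD for a given number of audio hours at an hourly rate."""
    return round(hours * rate, 2)


hours = 10_000  # hypothetical monthly volume
velma = monthly_cost(hours, VELMA_RATE)
low, high = (monthly_cost(hours, rate) for rate in PIPELINE_RATE_RANGE)

print(f"Velma:    ${velma:,.2f}")
print(f"Pipeline: ${low:,.2f} to ${high:,.2f}")
```

At that volume the quoted rates work out to $7,500 for Velma against $25,000 to $100,000 for a transcription + LLM pipeline.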

Build with Velma

Build on top of audio understanding, not transcription

Smarter voice agents

AI agents that understand voice signals for better responses.

AI voice guardrails

Monitor what your LLM-powered voice agent is saying — and how callers are reacting to it.

Emotion-driven apps

Personalize every interaction in real time — route, respond, and adapt based on how the caller actually feels.

Conversation analytics

Replace your STT/ASR layer with better conversational insights.

Live coaching tools

Real-time agent assist that surfaces what to say next, based on how the call is going.

Anything you can imagine

Ask Velma to find anything in a conversation, and it does. The only limit on what you build is what you can describe.

Where Velma fits

A drop-in layer for your voice stack

Audio in

Telephony / SIP

Voice agents

Recordings

Browser / Mobile

Understanding layer

Velma API

REST + WebSocket

Your application

Real-time alerts

Agent assist

Dashboards

Data warehouse

Drop Velma into any voice pipeline. The underlying model handles the rest.
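On the application side, the same structured output can feed several destinations at once, such as alerts, dashboards, and a data warehouse. The sketch below shows a small fan-out dispatcher for that pattern. The class, the sink names, and the message fields are hypothetical and chosen for illustration; the sketch only assumes that each message arrives as a JSON string.

```python
import json
from typing import Callable

Sink = Callable[[dict], None]


class FanOut:
    """Deliver each decoded event to every registered sink."""

    def __init__(self) -> None:
        self._sinks: dict[str, Sink] = {}

    def register(self, name: str, sink: Sink) -> None:
        self._sinks[name] = sink

    def dispatch(self, raw: str) -> list[str]:
        """Decode one JSON message and return the names of the sinks that received it."""
        event = json.loads(raw)
        delivered = []
        for name, sink in self._sinks.items():
            sink(event)
            delivered.append(name)
        return delivered


# Usage: plain lists stand in for a real alerting system and warehouse loader.
alerts: list[dict] = []
warehouse_rows: list[dict] = []
hub = FanOut()
hub.register("alerts", alerts.append)
hub.register("warehouse", warehouse_rows.append)
```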

Velma is the #1 model
for Conversation Understanding

Conversation Understanding Benchmark — Accuracy vs. Cost

Evaluates a model's ability to identify conversation types, topics, speaker roles and key behaviors. Methodology ↗

[Chart: accuracy score vs. inference cost, with the annotation "Highest accuracy, lowest cost". Models plotted: velma-2-fast, velma-2, grok-4.1-fast-non-reasoning, grok-4.1-fast-reasoning, gemini-2-flash-lite, deepseek-v3.1, gemini-2-flash, deepseek-v3.2, gemini-3-flash-min, deepseek-r1, gemini-3-flash-med, gemini-2.5-pro, gemini-3-pro, grok-3, nova-3-intelligence, scribe-v2, grok-4-heavy, gpt-5-mini, gpt-5.2-pro, gpt-5.2.]

Get started in minutes

Drop-in by design — three steps, one API

1

Send audio

Point Velma at a file or a live stream — or connect the platform you already use (Five9, Genesys, Teams, Twilio, SIP). One endpoint, no pipeline to assemble.

2

Velma analyzes

A single voice-native model does all the work — no separate transcription, LLM, or enrichment services to wire together and keep in sync.

3

Output, where and how you like it

Structured JSON: stream it live, drop it in your warehouse, or trigger alerts. You decide where it goes.

It really is this short — streaming, start to finish:

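# 1 · open a connection   2 · stream audio   3 · read results
# Assumption: any WebSocket client works; the `websockets` package is one option.
# `config`, `audio_chunk`, and `handle` are placeholders you supply.
from websockets.sync.client import connect

ws = connect("wss://modulate-developer-apis.com/api/velma-2-streaming?api_key=…")
ws.send(config)       # what to detect, or just use the default package
ws.send(audio_chunk)  # stream your audio
for event in ws:      # clips, behaviors, topics, summary…
    handle(event)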
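The "stream your audio" step usually means sending many small chunks in sequence. The helper below sketches one way to slice raw audio into fixed-duration chunks before sending each one. The audio format is an assumption: the page does not state what Velma expects, so 16 kHz, 16-bit mono PCM and a 100 ms chunk length are used purely for illustration.

```python
from typing import Iterator


def audio_chunks(
    pcm: bytes,
    sample_rate: int = 16_000,  # assumed sample rate in Hz
    sample_width: int = 2,      # assumed bytes per sample (16-bit mono)
    chunk_ms: int = 100,        # assumed chunk duration in milliseconds
) -> Iterator[bytes]:
    """Yield consecutive slices of raw PCM, each covering about chunk_ms of audio."""
    chunk_size = sample_rate * sample_width * chunk_ms // 1000
    if chunk_size <= 0:
        raise ValueError("chunk size must be positive")
    for start in range(0, len(pcm), chunk_size):
        yield pcm[start:start + chunk_size]


# Usage: one second of silence at the assumed format is 32,000 bytes.
one_second = bytes(16_000 * 2)
chunks = list(audio_chunks(one_second))
```

Each yielded chunk could then be passed to `ws.send(...)` in a loop; the final chunk may be shorter than the rest when the audio length is not a multiple of the chunk size.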

Start building with Velma.

Grab an API key or try the playground to see Velma understand a real conversation.

Get free API access

API DOCS

More from Modulate

Explore Modulate's other leading voice models

Audio-native APIs built for real-time performance — designed to drop right into your stack.

Deepfake Detection

Synthetic voice detection, batch and streaming. #1 on Hugging Face leaderboards.


Transcription

Real-time and batch transcription with speaker diarization. Lowest cost, lowest error rate.


PII/PHI Redaction

Auto-redact sensitive content from both transcripts and audio. Compliance-ready.


Music Detection

Detect music vs. speech in any audio stream. Real-time and batch.

