Velma API by Modulate — A drop-in understanding layer for your voice stack

Velma API

A drop-in understanding layer for your voice stack.

Transcription discards the voice signals that carry the true intent and meaning of a conversation. Velma is the only voice-native AI model built to use audio signals to understand conversations in all their depth.

The Velma API turns any voice conversation into a structured representation that you define: tell Velma what to surface, and it does. What you build on top is limited only by your imagination.

Get free API access → API DOCS

MEET VELMA

The future of voice AI is unlocked through audio signals

AUDIO IN

Live audio
Recorded audio

▼

Velma

Audio-native Ensemble Listening Model (ELM)

▼

API OUTPUT

Audio Signals

The acoustic primitives behind every Velma output.

Emotions
Diarization
Topic groups
Summary
Key behaviors
Accents
Authenticity
Speaker dynamics

→ see all default endpoints

You define what to find

Describe a behavior in plain English. Velma combines audio signals with transcription to surface it — more accurately than text alone.

// example

"Flag callers who sound frustrated
before the agent acknowledges it"

→ see custom behaviors example

// DEFAULT ENDPOINTS

What Velma returns by default

Every Velma API call returns these by default — no configuration, no fine-tuning.

Emotions detected and analyzed

Velma detects emotions to better understand meaning in a conversation.

Speaker identification & diarization

Industry-leading accuracy at knowing who said what — even with overlapping speakers and noisy audio.

Topic categorization

Automatically classifies what each conversation is about — billing, retention, support, sales, complaints, and more.

Conversation summary

A clean, concise summary of every audio file — with higher accuracy than transcription + LLMs.

150+ key behaviors & events

Fraud, churn, compliance violations, harassment, escalation, and dozens more — detected the moment they happen.

Accent detection

Recognizes regional and global accents so analysis stays accurate across diverse populations.

Deepfake detection

#1 ranked on Hugging Face. Detects synthetic and cloned voices with 98.9% accuracy in real time.

Speaker dynamics

Detects who's leading the conversation, who's interrupting, turn-taking patterns, and dominance — the social shape of every call.
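
To make the idea of speaker dynamics concrete, here is a minimal sketch that derives talk-time share and interruption counts from diarized segments. The segment shape (`speaker`, `start_ms`, `end_ms`) and the function name are assumptions for illustration only. This is not Velma's response schema or its method, which works from the audio itself rather than from timestamps alone.

```python
from collections import defaultdict


def speaker_dynamics(segments):
    """Summarize talk time and interruptions from diarized segments.

    `segments` is a hypothetical list of dicts with `speaker`,
    `start_ms`, and `end_ms` keys.
    """
    talk_ms = defaultdict(int)
    interruptions = defaultdict(int)
    ordered = sorted(segments, key=lambda s: s["start_ms"])

    for i, seg in enumerate(ordered):
        talk_ms[seg["speaker"]] += seg["end_ms"] - seg["start_ms"]
        if i == 0:
            continue
        prev = ordered[i - 1]
        # An interruption: a different speaker starts before the
        # previous speaker has finished.
        if seg["speaker"] != prev["speaker"] and seg["start_ms"] < prev["end_ms"]:
            interruptions[seg["speaker"]] += 1

    total = sum(talk_ms.values()) or 1
    return {
        speaker: {
            "talk_share": round(ms / total, 2),
            "interruptions": interruptions[speaker],
        }
        for speaker, ms in talk_ms.items()
    }


segments = [
    {"speaker": "agent", "start_ms": 0, "end_ms": 4000},
    {"speaker": "caller", "start_ms": 3500, "end_ms": 9500},
    {"speaker": "agent", "start_ms": 9500, "end_ms": 11500},
]
print(speaker_dynamics(segments))
```

In this sample the caller starts speaking 500 ms before the agent finishes, so the caller is credited with one interruption while both speakers hold an equal share of talk time.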

// next

Want more than the defaults?

Define your own custom behaviors — describe what to find, and Velma surfaces it.

// custom behaviors

Detect anything using natural language

Describe what matters to you.
Velma uses every audio signal — not just words — to surface it accurately.

You can also upload SOPs, compliance docs, or playbooks to specify exactly what Velma should catch.

Velma vs. competitors

Audio-native capabilities from a better architecture.

Most "voice AI" stacks are transcription pipelines with an LLM bolted on top. Velma is voice-native — built from the audio signal up.

Feature | Velma API | Competitors
Native input | Audio | Text
Architecture | Voice-native Ensemble Listening Model (ELM) | Transcription + LLM pipeline
Emotion detection | Understands emotion |
Non-speech sounds | Yes (laughing, crying, shouting, etc.) |
Custom behaviors | Better accuracy using voice signals | Lower accuracy without voice signals
Reliability for business | Proven for 150+ business-relevant behaviors | Requires dedicated prompt engineers
Accuracy | Best-in-class | Limited (text-centric)
Cost | Starting at $0.75/hr | $2.50–$10/hr
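
As a rough illustration of the listed rates, the sketch below estimates monthly spend for a given audio volume. The function name and the 10,000-hour volume are hypothetical, and because the Velma price is quoted as "starting at", the actual rate may be higher.

```python
def monthly_cost(audio_hours, rate_per_hour):
    """Return the cost in dollars for `audio_hours` of audio at a flat hourly rate."""
    return audio_hours * rate_per_hour


hours = 10_000  # hypothetical monthly audio volume

velma = monthly_cost(hours, 0.75)             # listed starting rate
competitor_low = monthly_cost(hours, 2.50)    # low end of the listed competitor range
competitor_high = monthly_cost(hours, 10.00)  # high end of the listed competitor range

print(f"Velma: ${velma:,.0f}")
print(f"Competitors: ${competitor_low:,.0f} to ${competitor_high:,.0f}")
```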

Build with Velma

Build on top of audio understanding, not transcription

Smarter voice agents

AI agents that understand voice signals for better responses.

AI voice guardrails

Monitor what your LLM-powered voice agent is saying — and how callers are reacting to it.

Emotion-driven apps

Personalize every interaction in real time — route, respond, and adapt based on how the caller actually feels.

Conversation analytics

Replace your STT/ASR layer with better conversational insights.

Live coaching tools

Real-time agent assist that surfaces what to say next, based on how the call is going.

Anything you can imagine

Ask Velma to find anything in a conversation, and it does. The only limit on what you build is what you can describe.
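
As one way to picture the emotion-driven routing use case above, here is a minimal sketch that maps a detected emotion to a call queue. The emotion labels, queue names, and confidence threshold are all hypothetical; the labels Velma actually returns are defined in the API docs, not here.

```python
def route_call(emotion, confidence, threshold=0.7):
    """Pick a queue from a detected emotion label.

    Labels, queue names, and the threshold are illustrative assumptions.
    """
    # Ignore low-confidence detections rather than rerouting on noise.
    if confidence < threshold:
        return "standard_queue"
    if emotion in {"frustrated", "angry"}:
        return "senior_agent_queue"
    if emotion == "confused":
        return "guided_support_queue"
    return "standard_queue"


print(route_call("frustrated", 0.91))
print(route_call("frustrated", 0.40))
```

Gating on confidence first keeps a weak signal from moving a caller between queues.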

Where Velma fits

A drop-in layer for your voice stack

Audio in

Telephony / SIP

Voice agents

Recordings

Browser / Mobile

→

Understanding layer

Velma API

REST + WebSocket

→

Your application

Real-time alerts

Agent assist

Dashboards

Data warehouse

Drop Velma into any voice pipeline. The underlying model handles the rest.
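
To show how the "Real-time alerts" box might be wired up, here is a minimal sketch that turns a raw stream message into an alert. It assumes the event fields used in the streaming sample on this page (`type`, `behavior`, `confidence`); the function name and the 0.8 threshold are hypothetical.

```python
import json


def to_alert(message, min_confidence=0.8):
    """Return an alert string for a behavior event, or None to ignore the message."""
    event = json.loads(message)
    # Only behavior events become alerts; utterances and other types are skipped.
    if event.get("type") != "behavior":
        return None
    if event.get("confidence", 0.0) < min_confidence:
        return None
    return f"ALERT {event['behavior']} (confidence {event['confidence']:.2f})"


print(to_alert('{"type": "behavior", "behavior": "escalation_risk", "confidence": 0.93}'))
print(to_alert('{"type": "utterance", "speaker": "agent", "text": "Hello"}'))
```

The returned string could be handed to whatever paging, chat, or dashboard system your application already uses.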

Get started in minutes

Understand a conversation in a few lines of code

velma_batch.py

import os
import requests

API_KEY = os.environ["MODULATE_API_KEY"]
ENDPOINT = "https://api.modulate.ai/v1/velma/understand"

# Upload a recorded call and request the signals to analyze
with open("call.mp3", "rb") as f:
    response = requests.post(
        ENDPOINT,
        headers={"X-API-Key": API_KEY},
        files={"audio": f},
        data={
            "diarization": True,
            "emotion": True,
            "intent": True,
            "deepfake": True,
        },
    )

# Fail loudly on HTTP errors instead of parsing an error body
response.raise_for_status()

result = response.json()
print(result["transcript"])
print(result["signals"])     # emotion, intent, prosody, dynamics...
print(result["behaviors"])   # any matched behaviors

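# Streaming: send live audio over a WebSocket and print events as they arrive
import asyncio
import json
import os

import websockets

API_KEY = os.environ["MODULATE_API_KEY"]
URL = f"wss://api.modulate.ai/v1/velma/stream?api_key={API_KEY}"

async def send_audio(ws, audio_source):
    # Stream audio chunks in
    async for chunk in audio_source:
        await ws.send(chunk)

async def receive_events(ws):
    # Receive signals as they happen
    async for message in ws:
        event = json.loads(message)

        if event["type"] == "utterance":
            print(event["speaker"], event["text"])
        elif event["type"] == "behavior":
            print("⚡", event["behavior"], event["confidence"])

async def stream(audio_source):
    async with websockets.connect(URL) as ws:
        # Run both directions concurrently so events print while audio is still being sent
        await asyncio.gather(send_audio(ws, audio_source), receive_events(ws))

# my_audio_source: any async iterator that yields audio chunks as bytes
asyncio.run(stream(my_audio_source))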

# Custom behaviors: define a behavior once, then request it on any audio
import os
import requests

API_KEY = os.environ["MODULATE_API_KEY"]

# Define a custom behavior in plain English.
# No fine-tuning, no training data: just describe it.
response = requests.post(
    "https://api.modulate.ai/v1/velma/behaviors",
    headers={"X-API-Key": API_KEY},
    json={
        "name": "escalation_risk",
        "description": (
            "Flag when a caller becomes frustrated and the agent "
            "fails to acknowledge it within 15 seconds."
        ),
    },
)
response.raise_for_status()
behavior = response.json()

# Now use it on any audio, batch or streaming.
# The context manager closes the file handle once the upload finishes.
with open("call.mp3", "rb") as f:
    response = requests.post(
        "https://api.modulate.ai/v1/velma/understand",
        headers={"X-API-Key": API_KEY},
        files={"audio": f},
        data={"behaviors": ["escalation_risk"]},
    )
response.raise_for_status()
result = response.json()

print(result["behaviors"])
# [{"name": "escalation_risk", "matched": True, "at_ms": 47200, ...}]
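
Once the `behaviors` list comes back, a small helper can turn it into something readable. The sketch below assumes only the fields shown in the sample output above (`name`, `matched`, `at_ms`); the helper name and the second sample entry are hypothetical, and the real response may carry additional fields.

```python
def matched_behaviors(behaviors):
    """Return (name, "mm:ss") pairs for the behaviors that matched."""
    hits = []
    for b in behaviors:
        if not b.get("matched"):
            continue
        # Convert the millisecond offset into minutes and seconds.
        minutes, seconds = divmod(b["at_ms"] // 1000, 60)
        hits.append((b["name"], f"{minutes:02d}:{seconds:02d}"))
    return hits


sample = [
    {"name": "escalation_risk", "matched": True, "at_ms": 47200},
    {"name": "refund_request", "matched": False},  # hypothetical second behavior
]
print(matched_behaviors(sample))
```

For the sample above, the escalation at 47,200 ms is reported as 00:47 into the call.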
        

Start building with Velma.

Grab an API key or try the playground to see Velma understand a real conversation.

Get free API access → API DOCS

Explore Modulate's other leading voice models

Audio-native APIs built for real-time performance — designed to drop right into your stack.

Velma Transcribe

Real-time and batch transcription with speaker diarization. Lowest cost, lowest error rate.

/api/speech-to-text →

Deepfake Detection

Synthetic voice detection, batch and streaming. #1 on Hugging Face leaderboards.

/api/deepfake-detection-model →

PII/PHI Redaction

Auto-redact sensitive content from both transcripts and audio. Compliance-ready.

/api/pii-phi-redaction →

Music Detection

Detect music vs. speech in any audio stream. Real-time and batch.

/api/music-detection →