Velma API by Modulate — A drop-in understanding layer for your voice stack

Velma API

A drop-in understanding layer for your voice stack.

Transcription discards the voice signals that carry a conversation's true intent and meaning. Velma is the only voice-native AI model that uses audio signals to understand conversations in their full depth.

The Velma API turns any voice conversation into a structured representation that you define: tell Velma what to surface, and it does. What you build on top is limited only by your imagination.


MEET VELMA
The future of voice AI is unlocked through audio signals
AUDIO IN: Live audio, Recorded audio
Velma: Audio-native Ensemble Listening Model (ELM)
API OUTPUT
1. Audio Signals
The acoustic primitives behind every Velma output.
  • Emotions
  • Diarization
  • Topic groups
  • Summary
  • Key behaviors
  • Accents
  • Authenticity
  • Speaker dynamics
2. You define what to find
Describe a behavior in plain English. Velma combines audio signals with transcription to surface it — more accurately than text alone.
// example
"Flag callers who sound frustrated before the agent acknowledges it"

// DEFAULT ENDPOINTS
What Velma returns by default

Every Velma API call returns these by default — no configuration, no fine-tuning.

Emotions detected and analyzed
Velma detects emotions to better understand meaning in a conversation.
Speaker identification & diarization
Industry-leading accuracy at knowing who said what — even with overlapping speakers and noisy audio.
Topic categorization
Automatically classifies what each conversation is about — billing, retention, support, sales, complaints, and more.
Conversation summary
A clean, concise summary of every audio file — with higher accuracy than transcription + LLMs.
150+ key behaviors & events
Fraud, churn, compliance violations, harassment, escalation, and dozens more — detected the moment they happen.
Accent detection
Recognizes regional and global accents so analysis stays accurate across diverse populations.
Deepfake detection
#1 ranked on Hugging Face. Detects synthetic and cloned voices with 98.9% accuracy in real time.
Speaker dynamics
Detects who's leading the conversation, who's interrupting, turn-taking patterns, and dominance — the social shape of every call.
// next
Want more than the defaults?
Define your own custom behaviors — describe what to find, and Velma surfaces it.

// custom behaviors

Detect anything using natural language

Describe what matters to you.
Velma uses every audio signal — not just words — to surface it accurately.

You can also upload SOPs, compliance docs, or playbooks to specify exactly what Velma should catch.
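
As a rough sketch of how the rules in a playbook might be prepared as custom behaviors, the snippet below turns plain-English rules into request bodies with the `name` and `description` fields used in the custom-behavior example further down this page. The helpers `slugify_name` and `build_behavior_payloads` are hypothetical, and the sketch assumes each rule is submitted as its own description; how document uploads actually work is not specified here.

```python
import re


def slugify_name(title: str) -> str:
    """Turn a human-readable rule title into a snake_case behavior name."""
    return re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")


def build_behavior_payloads(rules: dict[str, str]) -> list[dict[str, str]]:
    """Map {title: plain-English description} to request bodies.

    Assumption: each body carries only "name" and "description".
    """
    payloads = []
    for title, description in rules.items():
        # Collapse line breaks and repeated spaces left over from copy-pasted docs.
        text = " ".join(description.split())
        if not text:
            raise ValueError(f"rule {title!r} has an empty description")
        payloads.append({"name": slugify_name(title), "description": text})
    return payloads


# Hypothetical rule taken from a support playbook.
rules = {
    "Escalation risk": (
        "Flag when a caller becomes frustrated and the agent\n"
        "  fails to acknowledge it within 15 seconds."
    ),
}
payloads = build_behavior_payloads(rules)
print(payloads)
```

Keeping each description short and observable (who does what, and within what time) makes it easier to check later whether a match was reasonable.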

Velma vs. competitors

Audio-native capabilities from a better architecture.

Most "voice AI" stacks are transcription pipelines with an LLM bolted on top. Velma is voice-native — built from the audio signal up.

Feature                   Velma API                                      Competitors
Native input              Audio                                          Text
Architecture              Voice-native Ensemble Listening Model (ELM)    Transcription + LLM pipeline
Emotion detection         Yes                                            No
Non-speech sounds         Yes (laughing, crying, shouting, etc.)         No
Custom behaviors          Better accuracy using voice signals            Lower accuracy without voice signals
Reliability for business  Proven for 150+ business-relevant behaviors    Requires dedicated prompt engineers
Accuracy                  Best-in-class                                  Limited (text-centric)
Cost                      Starting at $0.75/hr                           $2.50–$10/hr
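
To put the pricing figures in concrete terms, here is a small worked calculation. It uses only the listed rates; the monthly volume of 10,000 audio hours is a made-up figure for illustration, and because the Velma price is a "starting at" rate, the Velma number should be read as a lower bound.

```python
VELMA_STARTING_RATE = 0.75   # USD per audio hour ("starting at" price)
COMPETITOR_LOW_RATE = 2.50   # USD per audio hour, low end of the quoted range
COMPETITOR_HIGH_RATE = 10.00  # USD per audio hour, high end of the quoted range


def cost(hours: float, rate_per_hour: float) -> float:
    """Total cost in USD for a number of audio hours at a flat hourly rate."""
    return round(hours * rate_per_hour, 2)


def compare_costs(hours: float) -> dict[str, float]:
    """Cost at each listed rate for the same audio volume."""
    return {
        "velma_from": cost(hours, VELMA_STARTING_RATE),
        "competitors_low": cost(hours, COMPETITOR_LOW_RATE),
        "competitors_high": cost(hours, COMPETITOR_HIGH_RATE),
    }


monthly_hours = 10_000  # hypothetical volume, not a figure from this page
costs = compare_costs(monthly_hours)
for label, usd in costs.items():
    print(f"{label:>16}: ${usd:,.2f}")
```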

Build with Velma

Build on top of audio understanding, not transcription

Smarter voice agents

AI agents that understand voice signals for better responses.

AI voice guardrails

Monitor what your LLM-powered voice agent is saying — and how callers are reacting to it.

Emotion-driven apps

Personalize every interaction in real time — route, respond, and adapt based on how the caller actually feels.

Conversation analytics

Replace your STT/ASR layer with better conversational insights.

Live coaching tools

Real-time agent assist that surfaces what to say next, based on how the call is going.

Anything you can imagine

Ask Velma to find anything in a conversation, and it does. The only limit on what you build is what you can describe.
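
For the emotion-driven routing idea above, a minimal sketch of the application-side logic might look like the following. The emotion labels, the confidence scale, the queue names, and the 0.7 threshold are all assumptions made for illustration; this page does not specify the shape of Velma's emotion output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmotionReading:
    label: str         # hypothetical label such as "frustrated"
    confidence: float  # assumed to range from 0.0 to 1.0


# Hypothetical mapping from detected emotion to a call queue.
ROUTES = {
    "frustrated": "senior_agent",
    "angry": "senior_agent",
    "confused": "guided_flow",
}
DEFAULT_ROUTE = "standard_queue"


def choose_route(reading: EmotionReading, min_confidence: float = 0.7) -> str:
    """Pick a queue for the caller, ignoring low-confidence readings."""
    if reading.confidence < min_confidence:
        return DEFAULT_ROUTE
    return ROUTES.get(reading.label, DEFAULT_ROUTE)


print(choose_route(EmotionReading("frustrated", 0.92)))
print(choose_route(EmotionReading("frustrated", 0.40)))
```

Falling back to the default queue on weak or unknown readings keeps a single noisy signal from rerouting a call.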


Where Velma fits

A drop-in layer for your voice stack

Audio in: Telephony / SIP, Voice agents, Recordings, Browser / Mobile
Understanding layer: Velma API (REST + WebSocket)
Your application: Real-time alerts, Agent assist, Dashboards, Data warehouse

Drop Velma into any voice pipeline. The underlying model handles the rest.
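
On the application side of this layout, one possible pattern is a small router that fans incoming events out to alerts, dashboards, or a warehouse sink. The sketch below assumes JSON messages with a `type` field of `utterance` or `behavior`, as in the streaming example on this page; the `EventRouter` class and the list-based sinks are hypothetical stand-ins for real integrations.

```python
import json
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class EventRouter:
    """Dispatch decoded events to every handler registered for their type."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def on(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def dispatch(self, raw_message: str) -> int:
        """Decode one JSON message and return how many handlers received it."""
        event = json.loads(raw_message)
        handlers = self._handlers.get(event.get("type"), [])
        for handler in handlers:
            handler(event)
        return len(handlers)


# In-memory sinks standing in for an alerting system and a data warehouse.
alerts: list[dict] = []
warehouse_rows: list[dict] = []

router = EventRouter()
router.on("behavior", alerts.append)
router.on("behavior", warehouse_rows.append)
router.on("utterance", warehouse_rows.append)

router.dispatch('{"type": "utterance", "speaker": "agent", "text": "Hello"}')
router.dispatch('{"type": "behavior", "behavior": "escalation_risk", "confidence": 0.91}')
print(len(alerts), len(warehouse_rows))
```

Unknown event types are dropped silently here, so new event kinds on the wire would not break existing consumers.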


Get started in minutes

Understand a conversation in a few lines of code

velma_batch.py
import os
import requests

API_KEY = os.environ["MODULATE_API_KEY"]
ENDPOINT = "https://api.modulate.ai/v1/velma/understand"

with open("call.mp3", "rb") as f:
    response = requests.post(
        ENDPOINT,
        headers={"X-API-Key": API_KEY},
        files={"audio": f},
        data={
            "diarization": True,
            "emotion": True,
            "intent": True,
            "deepfake": True,
        },
    )

response.raise_for_status()  # surface HTTP errors before parsing the body
result = response.json()
print(result["transcript"])
print(result["signals"])     # emotion, intent, prosody, dynamics...
print(result["behaviors"])   # any matched behaviors
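
# --- Streaming example (WebSocket) ---
import asyncio
import json
import os

import websockets

API_KEY = os.environ["MODULATE_API_KEY"]
URL = f"wss://api.modulate.ai/v1/velma/stream?api_key={API_KEY}"

async def send_audio(ws, audio_source):
    # Stream audio chunks in
    async for chunk in audio_source:
        await ws.send(chunk)

async def receive_events(ws):
    # Receive signals as they happen
    async for message in ws:
        event = json.loads(message)

        if event["type"] == "utterance":
            print(event["speaker"], event["text"])
        elif event["type"] == "behavior":
            print("⚡", event["behavior"], event["confidence"])

async def stream(audio_source):
    async with websockets.connect(URL) as ws:
        # Send and receive concurrently, so events are not held back
        # until the whole audio source has been sent.
        await asyncio.gather(send_audio(ws, audio_source), receive_events(ws))

# my_audio_source: your own async iterator yielding audio chunks (bytes)
asyncio.run(stream(my_audio_source))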

# --- Custom behavior example ---
import os

import requests

API_KEY = os.environ["MODULATE_API_KEY"]

# Define a custom behavior in plain English.
# No fine-tuning, no training data — just describe it.
behavior = requests.post(
    "https://api.modulate.ai/v1/velma/behaviors",
    headers={"X-API-Key": API_KEY},
    json={
        "name": "escalation_risk",
        "description": (
            "Flag when a caller becomes frustrated and the agent "
            "fails to acknowledge it within 15 seconds."
        ),
    },
).json()

# Now use it on any audio — batch or streaming.
with open("call.mp3", "rb") as f:  # context manager closes the file after upload
    result = requests.post(
        "https://api.modulate.ai/v1/velma/understand",
        headers={"X-API-Key": API_KEY},
        files={"audio": f},
        data={"behaviors": ["escalation_risk"]},
    ).json()

print(result["behaviors"])
# [{"name": "escalation_risk", "matched": True, "at_ms": 47200, ...}]
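
Once a result like the one above comes back, a small helper can pull out the matched behaviors with readable timestamps. This sketch assumes the `name`, `matched`, and `at_ms` fields shown in the sample output comment; the helper names and the second behavior in the sample data are hypothetical.

```python
def format_offset(at_ms: int) -> str:
    """Convert a millisecond offset into an MM:SS string."""
    minutes, seconds = divmod(at_ms // 1000, 60)
    return f"{minutes:02d}:{seconds:02d}"


def matched_behaviors(result: dict) -> list[tuple[str, str]]:
    """Return (name, MM:SS) pairs for behaviors flagged as matched."""
    return [
        (b["name"], format_offset(b["at_ms"]))
        for b in result.get("behaviors", [])
        if b.get("matched")  # unmatched entries are skipped before at_ms is read
    ]


# Sample data shaped like the output comment above; values are illustrative.
sample_result = {
    "behaviors": [
        {"name": "escalation_risk", "matched": True, "at_ms": 47200},
        {"name": "fraud_attempt", "matched": False},
    ]
}
print(matched_behaviors(sample_result))
```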

Start building with Velma.

Grab an API key or try the playground to see Velma understand a real conversation.