A drop-in understanding layer for your voice stack.
Transcription discards the vocal signals that carry the true intent and meaning of a conversation. Velma is the only voice-native AI model built to use those audio signals to understand conversations in all their depth.
The Velma API turns any voice conversation into a structured representation that you define: tell Velma what to surface, and it does. What you build on top is limited only by your imagination.
Every Velma API call returns these by default — no configuration, no fine-tuning.
Describe what matters to you.
Velma uses every audio signal — not just words — to surface it accurately.
Most "voice AI" stacks are transcription pipelines with an LLM bolted on top. Velma is voice-native — built from the audio signal up.
AI agents that understand voice signals for better responses.
Monitor what your LLM-powered voice agent is saying — and how callers are reacting to it.
Personalize every interaction in real time — route, respond, and adapt based on how the caller actually feels.
Replace your STT/ASR layer with better conversational insights.
Real-time agent assist that surfaces what to say next, based on how the call is going.
Ask Velma to find anything in a conversation, and it does. The only limit on what you build is what you can describe.
Drop Velma into any voice pipeline. The underlying model handles the rest.
import os

import requests

API_KEY = os.environ["MODULATE_API_KEY"]
ENDPOINT = "https://api.modulate.ai/v1/velma/understand"

with open("call.mp3", "rb") as f:
    response = requests.post(
        ENDPOINT,
        headers={"X-API-Key": API_KEY},
        files={"audio": f},
        data={
            "diarization": True,
            "emotion": True,
            "intent": True,
            "deepfake": True,
        },
    )

result = response.json()
print(result["transcript"])
print(result["signals"])    # emotion, intent, prosody, dynamics...
print(result["behaviors"])  # any matched behaviors
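import asyncio
import json
import os

import websockets

API_KEY = os.environ["MODULATE_API_KEY"]
URL = f"wss://api.modulate.ai/v1/velma/stream?api_key={API_KEY}"


async def send_audio(ws, audio_source):
    # Stream audio chunks in
    async for chunk in audio_source:
        await ws.send(chunk)


async def receive_events(ws):
    # Receive signals as they happen
    async for message in ws:
        event = json.loads(message)
        if event["type"] == "utterance":
            print(event["speaker"], event["text"])
        elif event["type"] == "behavior":
            print("⚡", event["behavior"], event["confidence"])


async def stream(audio_source):
    async with websockets.connect(URL) as ws:
        # Send and receive concurrently so events print while audio is still streaming
        await asyncio.gather(send_audio(ws, audio_source), receive_events(ws))


# my_audio_source: any async iterable that yields chunks of audio bytes
asyncio.run(stream(my_audio_source))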
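The streaming snippet above leaves `my_audio_source` undefined. One way to supply it, sketched here under assumptions, is an async generator that reads a local file in fixed-size chunks. The helper name `file_chunks`, the default chunk size, and the optional pacing delay are illustrative choices and do not come from the Velma documentation; the audio format and chunking the stream endpoint expects should be checked against the API reference.

```python
import asyncio
from typing import AsyncIterator


async def file_chunks(
    path: str, chunk_size: int = 3200, delay_s: float = 0.0
) -> AsyncIterator[bytes]:
    """Yield the file at `path` as consecutive byte chunks (hypothetical helper)."""
    # Note: f.read() is a blocking call; acceptable for a sketch with local files.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk
            # Hand control back to the event loop (and optionally pace the upload)
            # so a concurrent receiver gets a chance to run between chunks.
            await asyncio.sleep(delay_s)
```

With a helper like this, the last line of the streaming snippet could pass `file_chunks(...)` with a local audio path in place of `my_audio_source`.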
import os

import requests

API_KEY = os.environ["MODULATE_API_KEY"]

# Define a custom behavior in plain English.
# No fine-tuning, no training data: just describe it.
behavior = requests.post(
    "https://api.modulate.ai/v1/velma/behaviors",
    headers={"X-API-Key": API_KEY},
    json={
        "name": "escalation_risk",
        "description": (
            "Flag when a caller becomes frustrated and the agent "
            "fails to acknowledge it within 15 seconds."
        ),
    },
).json()

# Now use it on any audio, batch or streaming.
with open("call.mp3", "rb") as f:
    result = requests.post(
        "https://api.modulate.ai/v1/velma/understand",
        headers={"X-API-Key": API_KEY},
        files={"audio": f},
        data={"behaviors": ["escalation_risk"]},
    ).json()

print(result["behaviors"])
# [{"name": "escalation_risk", "matched": True, "at_ms": 47200, ...}]
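The comment at the end of the snippet above shows the shape of a behavior match. As a rough sketch of how an application might act on it, the helper below keeps only matched behaviors and converts `at_ms` to seconds. The function name `matched_behaviors` is hypothetical, the second entry in `sample` is made up for illustration, and the code assumes only the `name`, `matched`, and `at_ms` fields shown in that comment; any other fields in a real response are ignored.

```python
from typing import Any


def matched_behaviors(result: dict[str, Any]) -> list[tuple[str, float]]:
    """Return (name, seconds into the call) for each matched behavior."""
    matches = []
    for item in result.get("behaviors", []):
        if item.get("matched"):
            # at_ms is assumed to be a millisecond offset from the start of the audio.
            matches.append((item["name"], item.get("at_ms", 0) / 1000))
    return matches


# Sample data shaped like the example output above (second entry is invented).
sample = {
    "behaviors": [
        {"name": "escalation_risk", "matched": True, "at_ms": 47200},
        {"name": "upsell_opportunity", "matched": False},
    ]
}

for name, seconds in matched_behaviors(sample):
    print(f"{name} matched at {seconds:.1f}s")
# prints: escalation_risk matched at 47.2s
```

Returning plain tuples keeps the sketch independent of the response schema; a real integration would likely route these matches to alerting or agent-assist logic.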
Grab an API key or try the playground to see Velma understand a real conversation.
Audio-native APIs built for real-time performance — designed to drop right into your stack.
Real-time and batch transcription with speaker diarization. Lowest cost, lowest error rate.
Synthetic voice detection, batch and streaming. #1 on Hugging Face leaderboards.
Auto-redact sensitive content from both transcripts and audio. Compliance-ready.
Detect music vs. speech in any audio stream. Real-time and batch.