A New Standard for Voice AI

In the last few years, there’s been a surge of interest in voice-based AI – whether to understand us human beings or to interact with us directly. But organizations adopting this newest wave of AI face a challenge, because understanding voice is hard. We’ve spent years processing and analyzing real-world speech to deliver insights into user behaviors. Now, we’re excited to announce early access to our underlying voice intelligence models, so you can see just how powerful and flexible our tech can be! Read on to find out how to get involved.
The Challenge of Effective Speech Analysis
We know speech analysis is not a matter of mere transcription – people inject emotion into the way they speak, and that emotion carries deep significance. Sarcasm, friendly banter, and other nuanced speech patterns require a level of contextual understanding that even the best AIs have struggled to reach.
But even when it is a matter of mere transcription, that problem is hard enough on its own! Sure, plenty of companies have built transcription models that handle nice, clean audio recordings made by someone trying to be understood – for instance, someone enunciating crisply for their home assistant, or intentionally altering their speech patterns so an AI agent gets what they’re trying to say. But accurately understanding speech the way we humans actually talk to each other – filled with sharp emotional turns, mumbled comments, background noise, and multiple speakers, often shouted into a half-decent microphone struggling to pick up the full range of frequencies – is another story entirely.
From the beginning, Modulate’s goal has been to crack the code here. We don’t just want to make AI tools; we want to make tools that actually understand the ways real people socialize, conduct business, and learn about the world. And we’ve had tremendous success in doing so – helping top games including Call of Duty and GTA Online recognize the difference between friendly banter and harmful intent, and working with global B2C brands to recognize frustrated callers and to spot and prevent attempted fraud.
We’re extremely proud of the products we’ve built to unlock this value, including ToxMod and VoiceVault. And we’ve recently been thinking – what if we could give everyone the tools to do the same?
Introducing Modulate's Voice Intelligence API
Under the hood of ToxMod and VoiceVault are unique, custom-built models for transcription, emotion modeling, deepfake detection, and much more. And the more we’ve learned, the more we’ve realized that these models exceed what’s on the market today in crucial ways.
Now, we’re not just saying that as a brag about our machine learning team (though they are incredible!). Our data is actually critical to our success. Thanks to our work in both gaming and enterprise, we’ve been able to analyze hundreds of millions of hours of real, conversational audio, showcasing the full range of how people speak to each other both professionally and socially.
Take transcription as one example. Most modern transcription models are trained either on overly pristine datasets built from studio recordings and similar environments, or on whatever their creators can scrape from platforms like YouTube and Spotify – content that doesn’t reflect real-world conversation so much as a certain type of performance.
Top AI companies have been able to make great strides with these datasets, but they still tend to struggle with noisy conversations and variable audio quality. On these kinds of messy datasets, Modulate’s transcription substantially outperforms – for instance, our Word Error Rate (WER) is roughly 40% lower than that of OpenAI’s latest whisper-large-v3 model, with roughly 15x faster inference to boot.
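For readers less familiar with the metric, here’s a minimal sketch of how WER is typically computed: the word-level edit distance (substitutions, deletions, and insertions) between a model’s output and a reference transcript, divided by the number of reference words – lower is better. The example strings and the numbers in the final comment are purely illustrative, not Modulate benchmarks.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()

    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Illustrative only: one dropped word out of six gives a WER of ~0.167.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))

# A "40% lower WER" is a relative improvement – e.g. dropping from
# 0.20 to 0.12 on the same test set (hypothetical numbers).
```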
This is why we’re so excited – not just about the potential of ToxMod and VoiceVault themselves, but because we believe our underlying models can massively improve AI systems across the board, helping agents and classifiers of all kinds understand real human beings, in real conversations, like never before.
Try It Out Yourself
If this gets you excited, we’d love to hear from you! We’re in the process of opening up APIs to our underlying models. To join the waitlist – and to tell us how you hope to use next-level transcription, emotion analysis, deepfake detection, voice-based age estimation, and more – please fill out the quick form here.