Conversation Benchmark

Benchmarking the leading AI models on real-world, noisy, complex audio conversations. 

Why the Conversation Understanding Benchmark

The most popular audio benchmarks are designed to evaluate narrow models such as Speech-to-Text (STT), Text-to-Speech (TTS), or synthetic-voice detection. Other common audio benchmarks, like Big Bench Audio, SpeechR, and MMAR, test an AI system's ability to respond accurately to audio questions, not its ability to monitor conversations between third parties and unpack their deeper details.

The Conversation Understanding benchmark takes a different approach. Rather than testing a model's grasp of a single narrow data element, or its ability to participate in a conversation, it is the first benchmark of its kind specifically designed to evaluate a model's complete understanding of an audio conversation, regardless of whether the model itself participates in that conversation.

Overview 

Benchmark Objective - Given a voice conversation, a model is tested on its ability to accurately identify the conversation type, the number of speakers and their roles, and the emotions and key behaviors each speaker is exhibiting.
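As a rough illustration, the structured representation a model is asked to produce could look something like the sketch below. The field names and example values are illustrative, not the benchmark's exact schema.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative sketch only: field names and example labels are
# placeholders, not the benchmark's actual schema.

@dataclass
class Speaker:
    role: str                                               # e.g. "customer"
    emotions: List[str] = field(default_factory=list)       # e.g. ["frustrated"]
    key_behaviors: List[str] = field(default_factory=list)  # e.g. ["threatens to cancel"]

@dataclass
class ConversationUnderstanding:
    conversation_type: str        # e.g. "billing dispute call"
    num_speakers: int
    speakers: List[Speaker] = field(default_factory=list)
```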

Dataset

The dataset contains over 100 conversations, representative in audio quality, content, and behavioral dynamics of the real conversations we have seen over five years of work with Fortune 500 companies. For each conversation, a structured conversation template is first generated with the key details needed to keep the conversation realistic: the conversation type, the number of speakers, the speaker roles, and the key behaviors in the conversation. We then use generative AI to create a transcript of a conversation between 5 and 60 minutes long, ensuring the transcript matches the criteria laid out in the template. Finally, synthetic voices are used to turn the transcript into an audio file.

The synthetic voices reading the transcript are instructed to vary their emotions and cadence, add interruptions, and vary the audio quality to reflect the real-world audio environments appropriate to that conversation. The end result is a set of audio files, each containing a recorded conversation in a simulated real-world noisy environment, paired with a structured ground-truth representation of its conversation type, number of speakers, speaker roles, and key behaviors.
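The sketch below illustrates that three-stage flow (template, then transcript, then audio). The helper functions are stand-ins for the generative text and synthetic-voice systems, and the label lists are examples rather than the benchmark's actual ones.

```python
import json
import random
from pathlib import Path

# Illustrative three-stage pipeline: template -> transcript -> audio.
# The label lists and helper functions below are placeholders; the real
# pipeline calls a generative text model and a synthetic-voice system.

CONVERSATION_TYPES = ["billing dispute", "fraud report", "tech support"]
ROLES = ["customer", "support agent", "supervisor"]
BEHAVIORS = ["interrupts frequently", "escalates", "expresses frustration"]

def generate_template(seed: int) -> dict:
    """Stage 1: draw the key details that define a realistic conversation.
    The template doubles as the ground truth used later for scoring."""
    rng = random.Random(seed)
    num_speakers = rng.randint(2, 3)
    return {
        "conversation_type": rng.choice(CONVERSATION_TYPES),
        "num_speakers": num_speakers,
        "speaker_roles": rng.sample(ROLES, num_speakers),
        "key_behaviors": rng.sample(BEHAVIORS, 2),
        "target_minutes": rng.randint(5, 60),
    }

def generate_transcript(template: dict) -> str:
    """Stage 2: a generative model writes a transcript matching the template.
    Stubbed here."""
    raise NotImplementedError("call a text-generation model of your choice")

def synthesize_audio(transcript: str, out_path: Path) -> Path:
    """Stage 3: synthetic voices read the transcript with varied emotion,
    cadence, interruptions, and audio quality. Stubbed here."""
    raise NotImplementedError("call a speech-synthesis system of your choice")

if __name__ == "__main__":
    template = generate_template(seed=0)
    print(json.dumps(template, indent=2))
```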

Running the test

Each model is given (1) a list of the possible conversation types, roles, emotions, and behaviors, (2) the generated audio, and (3) the schema for the structured conversation output it must build, reporting its conclusions on the conversation type, number of speakers, speaker roles, and key behaviors in the conversation. Each conversation is then run through the model, which produces a structured output.
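A minimal harness for a single conversation might look like the sketch below, where `call_model` stands in for whichever API serves the model under test, and the label lists and schema are illustrative rather than the benchmark's actual ones.

```python
import json
from pathlib import Path

# Illustrative harness for one benchmark conversation. `call_model` is a
# stand-in for the model-under-test's API (audio in, structured JSON out).

ALLOWED_LABELS = {
    "conversation_types": ["billing dispute", "fraud report", "tech support"],
    "roles": ["customer", "support agent", "supervisor"],
    "emotions": ["calm", "frustrated", "angry"],
    "behaviors": ["interrupts frequently", "escalates", "expresses frustration"],
}

OUTPUT_SCHEMA = {
    "conversation_type": "string",
    "num_speakers": "integer",
    "speakers": [{"role": "string", "emotions": ["string"], "behaviors": ["string"]}],
}

def call_model(audio_path: Path, instructions: str) -> dict:
    """Stand-in for the model-under-test's API."""
    raise NotImplementedError

def run_one(audio_path: Path) -> dict:
    instructions = (
        "Listen to the conversation and report its type, the number of "
        "speakers, each speaker's role, and the emotions and key behaviors "
        f"they exhibit. Allowed labels: {json.dumps(ALLOWED_LABELS)}. "
        f"Respond as JSON matching this schema: {json.dumps(OUTPUT_SCHEMA)}."
    )
    return call_model(audio_path, instructions)
```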

Scoring accuracy

Results are scored by comparing each model's generated structured representation to the ground-truth representation the conversation was built from. Points are awarded when the output correctly contains information from the ground-truth structured conversation template, and points are deducted for missing data, incorrect data, and extraneous data. The final score is the average accuracy across the 100+ conversations; a higher score means the model did a better job of understanding the conversations.
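The exact weighting is internal to the benchmark, but the scoring logic can be pictured roughly as in the sketch below: reward matches against the ground-truth template, penalize misses and extraneous items, and average across conversations. Field names and weights here are assumptions for illustration.

```python
# Illustrative scoring sketch: reward matches, penalize misses and extras,
# then average per-conversation accuracy across the full test set.

def _as_list(value):
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

def score_field(predicted: set, truth: set) -> float:
    """+1 per correct item, -1 per missed item, -1 per extraneous item,
    normalized so a perfect answer scores 1.0 and the floor is 0.0."""
    if not truth and not predicted:
        return 1.0
    correct = len(predicted & truth)
    missed = len(truth - predicted)
    extra = len(predicted - truth)
    raw = correct - missed - extra
    return max(0.0, raw / max(len(truth), 1))

def score_conversation(pred: dict, truth: dict) -> float:
    fields = ["conversation_type", "speaker_roles", "key_behaviors"]
    per_field = [
        score_field(set(_as_list(pred.get(f))), set(_as_list(truth.get(f))))
        for f in fields
    ]
    # Number of speakers is scored as an exact match.
    per_field.append(1.0 if pred.get("num_speakers") == truth.get("num_speakers") else 0.0)
    return sum(per_field) / len(per_field)

def benchmark_accuracy(results: list) -> float:
    """Average accuracy over (prediction, ground_truth) pairs for all 100+ conversations."""
    return sum(score_conversation(p, t) for p, t in results) / len(results)
```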

Benchmark test cost

The cost reported in the benchmark is cost per conversation, calculated as the total actual cost of running the test divided by the number of conversations tested (100+).
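In code, the metric reduces to a single division:

```python
def cost_per_conversation(total_cost_usd: float, num_conversations: int) -> float:
    """Total actual spend for the benchmark run divided by the number of
    conversations tested (100+)."""
    return total_cost_usd / num_conversations
```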

Benchmark Testing Notes: Different models have varying native capabilities, so we designed the benchmark to ensure fair and accurate comparisons:

  • Multi-modal models (e.g., Gemini 3, Grok 4, Velma) received the raw audio input directly, along with the benchmark task instructions.
  • Non-multi-modal models were paired with the strongest complementary model for their specific limitation:
    • For transcription-limited models like DeepSeek, Gemini 2, and older models: We used Velma-2 (the highest-accuracy transcription model) to generate the best possible transcript, giving the LLM the strongest foundation for accuracy scoring.
    • For transcription-only models like ElevenLabs and Deepgram: We transcribed the audio with their native systems, then fed the output to Grok-4-heavy (a top-performing audio-capable model) to complete the full conversation understanding evaluation, so these models could still receive a benchmark accuracy score; a sketch of this pairing appears after this list. In these cases, the cost was calculated from the total cost of the model combination.
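The pairing for transcription-only systems can be sketched as below. The function names are placeholders, not real vendor APIs, and the combined cost of both stages is what gets reported for these configurations.

```python
from pathlib import Path

# Illustrative pairing of a transcription-only system with a strong
# downstream model so it can still receive a benchmark score. Both
# functions are stand-ins for the respective vendor APIs.

def transcribe_with_vendor(audio_path: Path) -> tuple:
    """Stand-in for the vendor's own speech-to-text system.
    Returns (transcript, cost_usd)."""
    raise NotImplementedError

def understand_with_downstream_model(transcript: str, instructions: str) -> tuple:
    """Stand-in for the downstream model that completes the conversation
    understanding task from the transcript. Returns (structured_output, cost_usd)."""
    raise NotImplementedError

def run_paired(audio_path: Path, instructions: str) -> tuple:
    transcript, stt_cost = transcribe_with_vendor(audio_path)
    output, llm_cost = understand_with_downstream_model(transcript, instructions)
    return output, stt_cost + llm_cost  # cost reported for the combination
```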

Additional Notes

  • Data privacy — For data privacy reasons, we were not able to use customer conversations, so we’ve simulated them as closely as possible.
  • Velma (Modulate’s proprietary Ensemble Listening Model / ELM) uses its own in-house transcription models — no external dependencies.