In our effort to continuously build better and better voice skins, Modulate's ML team conducts many experiments in parallel to optimize different components of our neural networks. Evaluating the effects that these changes have is difficult, especially due to the unsupervised nature of adversarial machine learning. There are no sure signals, such as a single objective loss value, to provide easy quantitative comparison of the outputs of different models. While some heuristics can help, the ultimate arbiter of model performance is human evaluation.
Evaluating audio is tricky. Not only is audio quality highly subjective, as with any medium, but audio is also inherently serial. You can listen attentively to only one audio clip at a time, and you need to listen at 1x speed if you're looking for defects, distortion, etc.
Early on in Modulate's life, it became apparent that anything we could do to increase our efficiency in comparing audio outputs of different models would be paid back many times over. We therefore set out to build a QA tool to streamline creating, evaluating, and understanding audio outputs from our experiments.
From the outset, we knew that the QA tool should be usable by anyone on the team, without complicated technical configuration. This made a web-based tool an obvious choice: coordinating Python scripts can be a pain even among engineers using uniform operating systems and dev environments, so trying to enforce that for the entire team would be a losing battle. This also necessitated some kind of server hosting (for us, AWS) to keep the site up 24/7.
We also knew that we would need to build tools to easily pipeline a model from completed training to evaluation. At 8-12 experiments per engineer, relying on a complicated set of manual instructions to perform basic QA would quickly become a bottleneck. We were especially concerned with avoiding a buildup of experiments "waiting for evaluation". A seamless model pipeline, alongside an easy-to-use website, would leave the person-time needed to listen to the audio as the only remaining blocker.
REQUIREMENTS WE DISCOVERED ALONG THE WAY
During initial use of the QA tool, we found ourselves constantly wanting more information about our results. For example, we always have at least two individuals evaluate each model - and we had several occurrences where two models seemed equal at first glance, but each evaluator ultimately thought there was a clear, but different, winner. Dividing out results by evaluator would isolate those cases - do all annotators agree, or is there variance - but still require a debrief to understand why the annotators disagreed.
Since this debrief takes up valuable in-person time, we looked for a way to quickly understand these kinds of discrepancies, and found that adding in multiple dimensions for evaluation helped significantly. Some evaluators are very sensitive to high frequency noise, and will choose the lower noise model as the winner in all cases; others care less about noise and more about intelligibility, such that any distortion which decreased the evaluator's understanding caused that model to lose.
In addition, listening to a clip repeatedly can improve an evaluator's understanding of the clip's contents, which can change their opinion - often by causing them to understand the clip more easily and so mark down intelligibility issues less harshly. This is particularly a problem since the two model outputs (based on the same input clip) are listened to serially, so the second clip can get an automatic advantage. A quick fix was to randomize the order of each model's output onscreen, to average out that advantage over time.
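As a rough illustration of the order randomization described above, here is a minimal sketch. The `Comparison` structure, the `make_comparison` helper, and the clip and model names are all hypothetical, made up for this example rather than taken from our actual tool.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Comparison:
    """One head-to-head trial as shown to an evaluator (hypothetical structure)."""
    input_clip: str
    first: str   # model whose output appears in the first on-screen slot
    second: str  # model whose output appears in the second on-screen slot


def make_comparison(input_clip, experiment, baseline, rng=random):
    """Assign the two models to on-screen slots with a fair shuffle."""
    pair = [experiment, baseline]
    rng.shuffle(pair)
    return Comparison(input_clip, pair[0], pair[1])


# Seeded generator so the sketch is reproducible; a real tool would not need a seed.
rng = random.Random(0)
trials = [
    make_comparison(f"clip_{i}.wav", "experiment", "baseline", rng)
    for i in range(1000)
]

# Over many trials, each model should land in the first slot about half the time.
first_share = sum(t.first == "experiment" for t in trials) / len(trials)
print(f"experiment shown first in {first_share:.1%} of trials")
```

Because the slot assignment is stored with each trial, a rating can always be mapped back to the model that produced the clip, whichever slot it appeared in.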
We also found some instances where, even though one model was better than another in the average case, it would have rare but particularly bad failure modes which prevented us from using it in production. In other cases, even bad models could sometimes outshine their competitors on certain clips. All such instances are interesting, so we added a note-recording functionality to the tool, which would save pairs of clips along with a brief description from the annotator on why those clips were saved. When comparing models, we would both examine the evaluation statistics, and read through all of the notes to find outlier behavior.
THE QA TOOL
As you can see above, our QA tool displays a head-to-head comparison between a new experiment and a baseline model, to help determine which model produces better audio outputs.
The main action takes place in the top left, where the two model output waveforms are displayed in random order, to be listened to by the evaluator. The input clip is also included, to help judge how well phonetic content, tone, etc. were conveyed by the outputs. The waveforms can be clicked on to start playing from a particular point in the clip - this is particularly useful, as we often want to hear a word or phrase several times in a row to get a good understanding of a specific error or distortion.
The top right shows a spectrogram view of the audio, which is useful for getting a quick view of the level of noise in the clip, and the coherence of harmonics.
The bottom left lists the various evaluation criteria we use to compare models. We've added to the dimensions we use over time to measure different aspects that are sometimes independent from overall preference.
- Champion - Best Overall: The original evaluation measure, shows overall preference between clips.
- Intelligibility: How easily the phonetic content of the audio can be understood.
- Energy: The degree to which the emotion of the output clip matches that present in the input clip.
- Precision: The amount of background noise in the output clip, compared to the input clip (typically input clips have little noise, so this becomes a "quantity of noise" measure).
- Input Fidelity: How well the output matches the characteristics of the input clip, beyond just Energy.
- Voice Profile Fidelity: How well the output clip matches the target voice.
- Heavy Weight Champion: A bonus option for cases where the evaluator hears overwhelmingly better performance from one model. It notes that this model should likely be preferred even if many other clips showed no clear difference.
Each dimension defaults to "Skip", indicating that the dimension wasn't important in evaluating the two clips against each other. "Both" is slightly different: it indicates that this aspect was considered, but both clips were approximately even. "Both" is most useful for overall preference, as a record that the clip was considered thoroughly but neither model won. This helps distinguish the case where two models only rarely differ in performance over many evaluations from the case where two models frequently differ, but few clips were evaluated.
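To make the difference between "Skip" and "Both" concrete, here is a small sketch of how votes on one dimension might be summarized. The `summarize` function, the vote labels, and the example vote lists are assumptions made for illustration, not our actual storage format.

```python
from collections import Counter

SKIP, BOTH = "skip", "both"  # hypothetical labels for the two non-winner options


def summarize(votes, model_a="experiment", model_b="baseline"):
    """Summarize votes on one dimension, keeping "skip" and "both" separate."""
    counts = Counter(votes)
    considered = len(votes) - counts[SKIP]       # "both" counts as considered
    decided = counts[model_a] + counts[model_b]  # votes with a clear winner
    return {
        "total": len(votes),
        "considered": considered,
        "ties": counts[BOTH],
        "wins_a": counts[model_a],
        "wins_b": counts[model_b],
        # Share of considered clips where one model clearly won.
        "decided_fraction": decided / considered if considered else None,
    }


# Two illustrative situations with the same win counts (one win each):
rarely_different = [BOTH] * 18 + ["experiment", "baseline"]  # many clips, mostly even
few_clips = ["experiment", "baseline"]                       # only two clips rated

print(summarize(rarely_different))
print(summarize(few_clips))
```

Both vote lists contain one win per model, but the first has 20 considered clips with a decided fraction of 0.1, while the second has only 2 considered clips. Without "Both" votes, the two situations would look identical.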
Finally, the Notes section allows a user to write a note on one of the clips, to indicate something that was not captured in the evaluation criteria. This is often used to indicate the nature of a specific heavyweight preference, or to save clips that sound exceptionally good to give us a high water mark for a specific voice or input clip.
THE RESULTS PAGE
Just as important as evaluating clips is understanding the results of the evaluation. The results page is built to provide the evaluation data in as many easily accessible ways as possible, as evaluations tend to be done in groups when an entire class of experiments finishes.
On the top left is a histogram summarizing all of the results of the evaluation across all clips and evaluators. This is further broken up into individual histograms per-evaluator to discover differences in evaluator preference. Finally, "Load clips for specific annotator" will show all pairs of clips for a single evaluator, along with their ratings, to help understand what aspects of a clip a specific evaluator might have been hearing.
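The overall and per-evaluator breakdowns can be sketched in a few lines. The record layout, evaluator names, and ratings below are invented placeholders for illustration, not real evaluation data.

```python
from collections import Counter, defaultdict

# Hypothetical rating records: (evaluator, clip, winner on overall preference).
ratings = [
    ("alice", "clip_1", "experiment"),
    ("alice", "clip_2", "experiment"),
    ("alice", "clip_3", "both"),
    ("bob", "clip_1", "baseline"),
    ("bob", "clip_2", "baseline"),
    ("bob", "clip_3", "experiment"),
]


def tally(ratings):
    """Count winners across all ratings and separately for each evaluator."""
    overall = Counter()
    per_evaluator = defaultdict(Counter)
    for evaluator, _clip, winner in ratings:
        overall[winner] += 1
        per_evaluator[evaluator][winner] += 1
    return overall, dict(per_evaluator)


def text_histogram(counts):
    """Render counts as a simple text bar chart, one label per line."""
    return "\n".join(
        f"{label:<12}{'#' * n} ({n})" for label, n in sorted(counts.items())
    )


overall, per_evaluator = tally(ratings)
print("all evaluators")
print(text_histogram(overall))
for name, counts in sorted(per_evaluator.items()):
    print(f"\n{name}")
    print(text_histogram(counts))
```

In this toy data the combined histogram looks close to even, while the per-evaluator view shows that each evaluator has a clear but different favorite, which is exactly the kind of disagreement the split is meant to expose.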
On the right, the saved notes across all experiments are presented in a dropdown, alongside the models which the note corresponds to. For any note, the audio from that evaluation page can also be loaded for context.
Lastly, there is a button to load the TensorBoard plots of the models being evaluated. This is helpful for developing heuristics around model performance with respect to the many training statistics we measure (losses, dead ReLUs, gradient magnitudes, etc.).
This isn't the final incarnation of Modulate's QA Tool. The next direction we're moving in is providing more powerful statistical tools to correlate metrics from model training with the model’s final audio performance. We're very interested in giving users of the QA Tool direct access to scalar values from TensorBoard inside Jupyter notebooks or other web-accessible frameworks. We think that having this capability directly next to the QA evaluation data will prove invaluable - by directly comparing automated metrics and human listener results, we can build better metrics to understand new experiments at a glance, and make QA more efficient overall.
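As one example of what such a comparison could look like, the sketch below computes a Spearman rank correlation between a training scalar and a human win rate across several experiments. This is only one possible statistic, not a description of what we have built. The scalar values are assumed to have already been exported into plain lists, and every number here is a made-up placeholder, not a real result.

```python
from statistics import mean


def ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Placeholder values, one entry per hypothetical experiment.
final_loss = [0.9, 0.7, 0.8, 0.5, 0.6]      # a scalar logged during training
human_win_rate = [0.2, 0.5, 0.3, 0.8, 0.6]  # share of clips won against baseline

rho = spearman(final_loss, human_win_rate)
print(f"Spearman correlation: {rho:.2f}")
```

A rank correlation is used here because it only asks whether the metric orders experiments the same way listeners do, without assuming a linear relationship between the two.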