Real-Time Audio Logging

March 9, 2020

When deploying a neural audio system into production, logging audio inputs and outputs is critical.  Even with a rigorous QA process, your neural network will likely encounter new audio environments, be run in unexpected contexts, and be heard by users with different expectations and biases.  This can lead the network to behave in unexpected or undesirable ways, where you'd like to understand what happened and potentially mitigate it.  And even when the network itself is behaving correctly, audio systems are complicated, hardware-dependent, and performance-sensitive - so determining whether the network was at fault can be non-trivial.

Therefore, it's desirable to log the neural network inputs and outputs, at least within internal test builds, to reconstruct and understand errors in production.  However, the act of logging can itself cause performance issues - so a logging setup that slots transparently into the audio chain, without consuming too much compute, performing blocking operations, or introducing unnecessary latency, is key.

This post gives a brief overview of the audio logging system Modulate has built.  We use these loggers to capture the inputs and outputs surrounding our voice skins, saving them locally to a directory as a series of WAV files.  Customers who choose to share these WAVs back with us make it easy to understand how their microphone setup, audio environment, and input speech affect the output of our voice skins.


The logging system consists of two components: a base WavLogger which handles reading from an audio stream, writing to a file, and managing log files and WAV headers; and a threaded wrapper which provides an easy interface for putting the file writing on a lower priority thread.
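Before diving into the methods, it helps to see the state involved.  Here's a minimal sketch of what the WavLogger's state might look like, with member names chosen to match the snippets below - this is a simplified outline of ours, not Modulate's exact class (see their GitHub repository for the real one):

```cpp
#include <atomic>
#include <fstream>
#include <mutex>
#include <vector>

// Simplified outline of the WavLogger's state.  Members are kept public here
// for brevity; names are assumptions based on the snippets in this post.
class WavLogger {
public:
  WavLogger(size_t buffer_size_, int sample_rate_)
      : buffer(buffer_size_, 0.0f), buffer_size(buffer_size_),
        sample_rate(sample_rate_) {}

  // Audio-thread side: copy samples into the ring buffer, never block.
  bool add_audio_nonblocking(const float* audio, size_t num_samples);

  // Writer-thread side: drain the ring buffer into the current WAV file.
  void write_outstanding_samples_to_file();

  std::vector<float> buffer;   // fixed-size circular buffer
  size_t buffer_size;
  int sample_rate;
  std::atomic<int> head{0};    // advanced by the audio thread
  std::atomic<int> tail{0};    // advanced by the writer thread
  std::ofstream f;             // current log file
  std::mutex writer_mutex;     // guards the writer-side work
};
```

The head and tail counters are monotonically increasing sample indices (reduced modulo buffer_size on access), which is what makes the overflow-reset logic later in the post necessary.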


The WavLogger is built around a circular buffer that is optimized for fast, non-blocking writes from the audio thread.  Since memory allocation can block, the buffer can't be resized as new audio comes in; if the circular buffer is in danger of overwriting itself, the logger simply drops the incoming audio frame.  While this behavior can produce an unfaithful log, we prefer a gap in the log over risking disruption to the rest of the audio pipeline.

bool WavLogger::add_audio_nonblocking(const float* audio, size_t num_samples) {
  int tail_lower_bound = tail.load();
  int head_value = head.load();

  // If the buffer can't fit the new samples, just continue and the log will skip
  if((head_value + (int)num_samples) > (tail_lower_bound + (int)buffer_size))
    return false;

  for(int i = 0; i < (int)num_samples; i++) {
    int index = (head_value + i) % buffer_size;
    buffer[index] = audio[i];
  }

  // Publish the new head so the writer thread can pick up the samples
  head.store(head_value + (int)num_samples);
  return true;
}

Importantly, add_audio_nonblocking is not threadsafe, and is designed to be called only from the audio thread.  On the other hand, the write_outstanding_samples_to_file method, which reads from the circular buffer and writes the result to a WAV file, is threadsafe and should be called from a lower-priority thread.  This method locks a mutex, writes the contents of the circular buffer up to the head pointer to a file, then checks the size of the file and optionally closes it to open a new one.  There is some final logic to reset the head and tail pointers if they've grown very large, to avoid overflow on 32-bit ints.

void WavLogger::write_outstanding_samples_to_file() {
  std::lock_guard lock(writer_mutex);
  int tail_value = tail.load();
  int head_lower_bound = head.load();

  // Convert float samples to 16-bit PCM and append them to the current file
  const int volume = (1<<15) - 1;
  for(; tail_value < head_lower_bound; tail_value++) {
    int index = tail_value % buffer_size;
    write_word(f, (int)(buffer[index] * volume), 2);
  }

  // Start new log file if needed (helper and member names illustrative)
  size_t file_length = f.tellp();
  if(file_length > max_file_length)
    start_new_wav_file();  // closes the current file and opens a fresh one

  // Avoid overflow on 32-bit ints - only relevant if we record for over 6 hours...
  if(tail_value > (1<<30)) {
    // reduction_amount is a multiple of buffer_size, so indices stay consistent
    int reduction_amount = ((1<<30) / (int)buffer_size) * (int)buffer_size;
    tail_value -= reduction_amount;
    head.fetch_sub(reduction_amount);
  }

  tail.store(tail_value);
}

The WavLogger class also contains logic for creating WAV file headers, closing WAV files, and generating new log filenames.  For more information, see our GitHub repository.


The ThreadedWavLogger is a thin wrapper around WavLogger which manages its own WAV file writing thread, while still exposing the add_audio_nonblocking method to the audio thread.  It also includes explicit logic for changing sample rates, which is a relatively rare occurrence in most audio contexts, but nonetheless needs to be handled gracefully when it comes up.  

void ThreadedWavLogger::log_task() {
  const float buffer_fraction = 0.25f;
  while(!should_finish_logging) {
    // Pick up any sample rate change; the WAV header stores the sample rate,
    // so a new log file is started when it changes (helper name illustrative)
    const int latest_sample_rate_value = latest_sample_rate.load();
    if(latest_sample_rate_value != wav_logger_ptr->sample_rate) {
      wav_logger_ptr->sample_rate = latest_sample_rate_value;
      wav_logger_ptr->start_new_wav_file();
    }

    wav_logger_ptr->write_outstanding_samples_to_file();

    // Sleep for a fraction of the buffer's duration so the writer keeps pace
    const float delay_seconds = ((float)wav_logger_ptr->buffer_size / (float)wav_logger_ptr->sample_rate) * buffer_fraction;
    std::this_thread::sleep_for(std::chrono::milliseconds((int)(delay_seconds * 1000)));
  }
}

void ThreadedWavLogger::start_logging_thread() {
  should_finish_logging = false;
  thread_running = true;
  logging_thread = std::thread([this]{ log_task(); });
}

void ThreadedWavLogger::stop_logging_thread() {
  if(thread_running) {
    should_finish_logging = true;
    logging_thread.join();  // wait for the final drain to complete
    thread_running = false;
  }
}

It should be noted that the log could become unfaithful if the audio thread changes sample rate and then begins writing new audio in between the sample rate check and outstanding samples write in log_task.  As sample rate changes happen infrequently, this has not caused us any issues in practice, and means that we can keep the coordination logic thin to maintain performance.



At Modulate, we use a simple, real-time friendly WAV file logger to monitor the inputs and outputs to our neural networks.  We've described the main points of a straightforward implementation, with the full code for this WAV logger available on GitHub.  We typically deploy two of these loggers, one calling add_audio_nonblocking on the input to the voice skin network, and the other calling it on the output from the voice skin network.  In this way, we get insight into anomalous behavior from our networks without sacrificing performance.
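To tie the pieces together, here is a condensed, self-contained sketch of the same producer/consumer scheme - illustrative only, with names of our own choosing (MiniLogger, sink) rather than Modulate's, and with a plain vector standing in for the WAV file:

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Condensed illustration of the scheme described in this post - not
// Modulate's actual code.  The sink vector stands in for the WAV file
// that the writer thread would append to.
struct MiniLogger {
  std::vector<float> buffer;   // fixed-size circular buffer
  std::atomic<int> head{0};    // advanced by the audio thread
  std::atomic<int> tail{0};    // advanced by the writer thread
  std::vector<float> sink;     // stand-in for the log file

  explicit MiniLogger(size_t n) : buffer(n, 0.0f) {}

  // Audio-thread side: drop the frame rather than block or overwrite.
  bool add_audio_nonblocking(const float* audio, size_t num_samples) {
    int h = head.load(), t = tail.load();
    if(h + (int)num_samples > t + (int)buffer.size())
      return false;
    for(size_t i = 0; i < num_samples; i++)
      buffer[(h + (int)i) % buffer.size()] = audio[i];
    head.store(h + (int)num_samples);
    return true;
  }

  // Writer-thread side: drain everything up to head into the sink.
  void write_outstanding_samples() {
    int h = head.load();
    for(int t = tail.load(); t < h; t++)
      sink.push_back(buffer[t % buffer.size()]);
    tail.store(h);
  }
};
```

A full-frame overrun simply returns false and the frame is skipped, while draining frees the space back up - the same trade-off, preferring a gap in the log over blocking the audio thread, that the real WavLogger makes.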