Real-Time Neural Network Inferencing for Audio Processing

Jatin Chowdhury
Feb 16, 2021


Over the past few years, the world of audio effects has seen increasing interest in using neural networks to process audio in real time. As a brief introduction, here are a handful of academic papers, commercial products, and open-source projects that employ neural networks to create interesting audio effects:

Steinmetz and Reiss’ Randomized Overdrive Neural Networks (left), and Bloemer’s Smart Guitar Amp (right)

While considerable literature already exists for training neural networks of this sort, I’ve found a relative lack of information when it comes to actually implementing these networks to run as part of a real-time system. What I’m hoping to do here is explain some of the difficulties with more “traditional” approaches to this implementation problem, and introduce my RTNeural library, which will hopefully solve some of these problems!

Deep Learning Libraries

Most neural networks these days are trained using large deep-learning libraries, such as PyTorch or TensorFlow. These libraries have several benefits: most of them have GPU support, which is critical for quickly performing the computations used by neural networks, and they support Python, a convenient language for training networks.

Unfortunately, most audio processing is restricted to running on CPUs, and needs to be implemented in languages with better performance, typically C++. With that in mind, each library also provides a C++ API that can be used to implement neural networks directly in C++. However, the performance of these libraries for audio processing tasks is not great. For example, Steinmetz’s RONN plugin uses libtorch, the PyTorch C++ API, but unfortunately, I can only run a few convolutional layers before maxing out my rather modest CPU.

Performance Concerns

While I don’t know for certain what causes this poor performance, I have a few guesses:

  • In general, the neural networks used for real-time audio processing are pretty small compared to the larger neural networks these libraries are optimized to run. I’ve noticed that libtorch takes about as long to process a large convolutional layer as it does to process a small one, which makes sense, as these libraries need to be able to scale to run very large networks with good performance.
  • Similarly, the networks these libraries are optimized for typically operate at a much lower data rate. A standard sample rate for audio systems is 48,000 samples per second, and for real-time audio processing, the network needs to output data at that same rate. By contrast, networks used for tasks like object detection and natural language processing can operate at a much lower data rate, and libraries like libtorch are probably optimized accordingly.
  • Finally, there are certain operations that we try to avoid in real-time audio programming, like memory allocation, or large-scale load/store operations. While I don’t know for certain, I wouldn’t be surprised if the developers of libtorch broke some of the “audio programming rules”, since their library is meant for more general-purpose use.
Real-time object detection with neural networks
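To put the data-rate point in numbers, here is the per-sample time budget at a 48 kHz sample rate (simple arithmetic, not a measurement; the 512-sample buffer size is just a typical example):

```python
SAMPLE_RATE = 48000  # samples per second

# Time available to compute one output sample, in microseconds
budget_us = 1e6 / SAMPLE_RATE
print(f"Per-sample budget: {budget_us:.1f} us")  # ~20.8 us

# With a typical audio buffer of 512 samples, the whole network
# must finish processing a block in roughly:
block_ms = 512 / SAMPLE_RATE * 1e3
print(f"Per-block budget (512 samples): {block_ms:.2f} ms")  # ~10.67 ms
```

Every layer of the network has to fit inside that budget, alongside whatever else the audio callback is doing.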

Rolling Your Own

Given the performance issues with the C++ APIs provided by the large deep learning libraries, it might make sense to roll your own inferencing engine for real-time computation. However, this approach comes with its own set of difficulties.

Reinventing the Wheel

When I first wanted to roll my own inferencing engine, I thought that I could pull from the code of others who had done the same. At the time, the only other hand-rolled inferencing engine I could find was written by Eero-Pekka Damskagg and Lauri Juvela from Aalto University. Unfortunately, they had only implemented convolutional layers, while I needed some recurrent layers for my network. Eventually, I realized that since everyone pretty much only implements the neural network layers that they need for their own project, it can be very difficult to re-use code written by others for your own project, leading to a lot of reinventing the wheel.
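To give a sense of how small these hand-rolled pieces tend to be, here is a minimal causal 1D convolution, the basic building block of a convolutional inferencing engine, sketched in NumPy. The function name and layout are my own for illustration, not taken from anyone's engine:

```python
import numpy as np

def causal_conv1d(x, kernel):
    """Causal 1D convolution: y[n] = sum_k kernel[k] * x[n - k].

    Samples before the start of the signal are treated as zero, so
    the output has the same length as the input and never looks ahead.
    """
    y = np.zeros_like(x)
    for n in range(len(x)):
        for k in range(len(kernel)):
            if n - k >= 0:
                y[n] += kernel[k] * x[n - k]
    return y

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, 0.25])
print(causal_conv1d(x, w))  # matches np.convolve(x, w)[:len(x)]
```

Each layer is only a handful of lines, which is exactly why everyone ends up writing only the layers they personally need.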

Recreating Non-Intuitive Implementations

Since you probably don’t want to implement your own neural network training code as well, a common approach is to train a neural network with TensorFlow or PyTorch, and then “export” the neural network weights to be used by an inferencing engine. However, the way that the large libraries format their weights is not always very intuitive.

For example, when trying to implement my own Gated Recurrent Unit, I found that TensorFlow stores the biases for their GRU in a 2D vector, rather than a 1D vector as I had expected. Eventually, I discovered that some of the biases are added before the activation functions are applied, while others are added after; a discovery which cost me close to a week of manic frustration. After talking to some other implementers, I found that they had encountered this problem as well, meaning that the reinventing-the-wheel problem doesn’t just apply to the specific functions that need to be implemented, but also to understanding the quirks of the library implementations.
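To make the quirk concrete, here is a NumPy sketch of a single GRU step using the 2D bias layout, as I understand the Keras `reset_after=True` convention: the bias is stored as two rows (input bias and recurrent bias), and the recurrent bias for the candidate state gets multiplied by the reset gate rather than added afterwards. The function name and argument layout are my own:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x, h, W, U, b):
    """One GRU step with Keras-style weights (reset_after=True).

    W: (in_size, 3*units)  input kernel, gates ordered [z, r, h]
    U: (units, 3*units)    recurrent kernel
    b: (2, 3*units)        row 0: input bias, row 1: recurrent bias
    """
    units = h.shape[0]
    wx = x @ W + b[0]   # input contribution, plus input bias
    uh = h @ U + b[1]   # recurrent contribution, plus recurrent bias
    z = sigmoid(wx[:units] + uh[:units])                # update gate
    r = sigmoid(wx[units:2*units] + uh[units:2*units])  # reset gate
    # The quirk: the recurrent bias lives inside uh, so the reset
    # gate scales it here instead of it being added afterwards.
    h_cand = np.tanh(wx[2*units:] + r * uh[2*units:])
    return z * h + (1.0 - z) * h_cand

# With all-zero weights, both gates sit at 0.5, so the state halves:
print(gru_step(np.ones(3), np.ones(4),
               np.zeros((3, 12)), np.zeros((4, 12)), np.zeros((2, 12))))
```

If you implement the bias as a single 1D vector added after the matrix products, the reset-gate path comes out wrong, and the network's output quietly drifts away from what TensorFlow computes.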

Using Vector Computation Libraries

Since neural networks contain a lot of computations that can be vectorized, it makes sense to use a third-party library that implements vectorized computations to speed up your inferencing engine. Some common choices are linear algebra libraries like Eigen, or SIMD libraries like xsimd. However, I ran into trouble when trying to use these libraries with my hand-rolled inferencing engine. For one of my projects, I was trying to run my neural network on a low-level embedded device, and had trouble getting these libraries to compile on my device. Eventually, I had to resort to re-writing my hand-rolled implementation using only the C++ STL.
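The computation being vectorized is mostly matrix-vector products. The sketch below shows the same dense layer written two ways: as explicit scalar loops (roughly what a plain-STL implementation looks like) and as a single vectorized product (the kind of operation Eigen or xsimd accelerates). Both produce identical results; only the speed differs:

```python
import numpy as np

def dense_loop(x, W, b):
    """Dense layer as explicit scalar loops."""
    y = np.zeros(W.shape[1])
    for j in range(W.shape[1]):
        acc = b[j]
        for i in range(W.shape[0]):
            acc += x[i] * W[i, j]
        y[j] = acc
    return y

def dense_vec(x, W, b):
    """Same layer as one vectorized matrix-vector product."""
    return x @ W + b

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
W = rng.standard_normal((8, 4))
b = rng.standard_normal(4)
assert np.allclose(dense_loop(x, W, b), dense_vec(x, W, b))
```

Which form wins on a given machine depends on the layer size and the compiler, which is part of why backend performance is so hard to predict.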

Visualizing vector multiplication with SIMD instructions

Further, I’ve found that the performance of these libraries can vary a lot depending on the size of the network. This further complicates life for folks trying to roll their own implementation. How can you know which library will be fastest for your network unless you implement your network with a few different libraries, and test it out?

Introducing RTNeural

My solution to the problems listed above has been to take my hand-rolled implementation and attempt to make it as flexible and reusable as possible. I’ve done this with my RTNeural library. Basically, the goal is to provide fast implementations of commonly used neural network layers that can be used in real-time code. While I mainly use this library for audio programming, I’m sure it can be useful for other tasks as well.

While I won’t explain here how to use RTNeural, there is some documentation in the project README, as well as an example project that demonstrates how to use RTNeural within a real-time audio plugin. Instead, I’d like to focus on how RTNeural solves some of the problems listed above.

Choosing a Computation Backend

In order to fix the problem of choosing a vector computation library, I’ve chosen to provide three implementations of each layer: one using Eigen, one using xsimd, and one using only the C++ STL. You can choose your backend by setting a CMake variable when compiling the library, or by providing a preprocessor definition. By default, RTNeural uses the Eigen backend, but I’ve found it best to measure the performance of my model with all three backends, and choose my backend based on those results.
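For reference, selecting a backend at configure time looks roughly like this; the option name below is my reading of the project’s build options at the time of writing, so check the RTNeural README for the current names:

```shell
# Configure with the xsimd backend instead of the default (Eigen)
cmake -Bbuild -DRTNEURAL_XSIMD=ON
cmake --build build
```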

Performance Considerations

While I believe there is still room for improvement with the performance of RTNeural, I’ve tried to use the following rules to ensure optimal performance even for very small networks:

  • No memory allocation except when constructing/destroying a layer.
  • Store all the layer weights in a way that they can be immediately used for inferencing.
  • Make each inferencing function as minimal as possible.

I’ve also taken some time to measure the performance of RTNeural compared to libtorch, the results of which are shown in the plots below. I’m measuring speed in terms of a “real-time” factor: how many seconds of audio the layer can process per second of compute, at a 48 kHz sample rate. Note that for RTNeural, I’m using whichever backend gives the best performance for that layer size. The code for this performance comparison, as well as up-to-date performance results, are available on GitHub.
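Measuring a real-time factor boils down to timing how long a model takes to process one second's worth of samples. Here is a hedged sketch of that measurement with a stand-in NumPy "model" (a single dense layer); it is not the benchmark code from the repository:

```python
import time
import numpy as np

SAMPLE_RATE = 48000

# Stand-in for a real model: one 16-unit dense layer with a tanh
rng = np.random.default_rng(0)
W = rng.standard_normal((1, 16))
b = rng.standard_normal(16)

def process_sample(x):
    return np.tanh(x * W[0] + b)

start = time.perf_counter()
for _ in range(SAMPLE_RATE):  # one second of audio, sample by sample
    process_sample(0.1)
elapsed = time.perf_counter() - start

# Real-time factor: values above 1 mean faster than real time
rt_factor = 1.0 / elapsed
print(f"real-time factor: {rt_factor:.1f}x")
```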

What’s Next?

In the coming months, I’m hoping to spend less time working on RTNeural directly, and more time using the library to create some cool audio tools. That said, there are many things in RTNeural that I’m hoping to improve, and that I’m hoping the community can help with as well! Here are a few:

  • Weight Exporters: I currently have a Python script that can be used to export neural network weights from TensorFlow into a JSON format that can be loaded by RTNeural. In the future, I’m hoping to make this script much more robust, and to create a similar script for exporting PyTorch networks.
  • Performance Improvements: I would like RTNeural to be even faster than it already is, particularly the Conv1D layer, and some of the activation layers.
  • More Layers: While I’ve implemented most of the layers that I use commonly, I would like to support a larger set of layers. If you see a layer you use often that’s missing from the library, please feel free to let me know, or contribute your own implementation.
  • “Smart” Backend Selection: I think it would be pretty cool if RTNeural could automatically select the optimal backend for each layer of your network based on the size of the layer, and the processor architecture on which the network is being run.
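The weight-exporter idea above reduces to serializing each layer’s arrays into a form an inferencing engine can parse. Here is a minimal sketch; the JSON schema below is made up for illustration and is not RTNeural’s actual format, which is defined by the library and its export scripts:

```python
import json
import numpy as np

# Hypothetical trained weights for one dense layer, as a framework
# would hand them to us after training.
weights = np.array([[0.1, -0.2], [0.3, 0.4]])
bias = np.array([0.0, 0.5])

model_json = {
    "layers": [
        {
            "type": "dense",
            "shape": list(weights.shape),
            "weights": weights.tolist(),  # JSON has no ndarray type
            "bias": bias.tolist(),
        }
    ]
}

with open("model.json", "w") as f:
    json.dump(model_json, f, indent=2)

# Round-trip check: the loaded weights match the originals exactly
with open("model.json") as f:
    loaded = json.load(f)
assert np.allclose(loaded["layers"][0]["weights"], weights)
```

The robustness work is mostly in handling every layer type, activation, and weight-layout quirk (like the GRU biases above) consistently on both sides of the file.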

Anyway, I hope that this article has been informative, and that RTNeural can be a useful library for many others in the future. Onward!

Update: An academic paper describing the design and implementation of RTNeural has now been published on arXiv.