Over the past couple of years, I’ve dedicated a significant amount of research time to the implementation of neural networks for real-time audio applications, specifically real-time audio effects that can be used on desktop computers, mobile devices, and embedded devices. Along the way I’ve released a couple of audio plugins containing neural networks, and developed a neural network inferencing engine in C++, with an emphasis on real-time performance.
I wanted to take some time to think a little more deeply about performance for real-time neural networks, and share some recent analysis aimed at predicting the real-time performance of a neural network before training it.
The most fundamental way to quantify performance for any system that is intended to run in real-time is whether it is fast enough to meet the real-time “deadline”. For example, in a typical audio system, a neural network might be asked to process or produce samples at a rate of 48000 samples per second. With that in mind, the real-time deadline for a neural network that processes individual samples would be roughly 0.02 milliseconds (1/48000 of a second, or about 20.8 microseconds). If a neural network requires more time than that to process a single sample, then it is not fast enough to run in real-time.
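As a quick check of the arithmetic above, here is a minimal sketch that computes the per-sample deadline for a given sample rate. The function name and the optional `block_size` parameter are my own additions for illustration; the text above only considers single-sample processing.

```python
def sample_deadline_ms(sample_rate_hz: float, block_size: int = 1) -> float:
    """Time budget in milliseconds for processing `block_size` samples."""
    return 1000.0 * block_size / sample_rate_hz


# Per-sample budget at 48 kHz.
print(f"{sample_deadline_ms(48000):.4f} ms")  # prints "0.0208 ms"
```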
Therefore, a useful metric when quantifying the performance of a real-time neural network is the “real-time factor”: the time available to process some amount of audio, divided by the time the network actually needs to process it. For example, if a neural network running at a 48 kHz sample rate requires (on average) 0.01 milliseconds to process a sample, then we would say that it has a real-time factor of roughly 2x. We can use the real-time factor as a “score” to compare the performance of different real-time systems, where a higher score is better, and a score less than 1 means that the system cannot run in real-time. It is important to note that the real-time score depends on the system sample rate, and may vary depending on the CPU being used for testing.
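To make the definition concrete, here is a small sketch of how the real-time factor could be computed and measured. The helper names and the idea of timing a `process_sample` callable are assumptions for illustration, not the measurement code used for this post (which was done in C++).

```python
import time


def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Ratio of the audio duration to the time spent processing it."""
    return audio_seconds / processing_seconds


def measure_real_time_factor(process_sample, num_samples: int, sample_rate_hz: float) -> float:
    """Time `num_samples` calls to `process_sample` and return the real-time factor."""
    start = time.perf_counter()
    for _ in range(num_samples):
        process_sample(0.0)
    elapsed = time.perf_counter() - start
    return real_time_factor(num_samples / sample_rate_hz, elapsed)


# The example from the text: 0.01 ms per sample at 48 kHz.
factor = real_time_factor(1.0 / 48000.0, 0.01e-3)
print(f"{factor:.2f}x")  # prints "2.08x"
```

In practice the measurement should run over a long enough signal that timer resolution and warm-up effects do not dominate the result.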
What’s A Good Score?
In real-time audio systems, the performance constraints are highly dependent on the context in which the system is deployed. In the context of audio plugins, most users expect to be able to run many plugins simultaneously, meaning that any neural network inferencing must happen fast enough to allow for other processes to happen without going past the real-time deadline. For plugins, I usually try to aim for a real-time score greater than 10x, at a 48 kHz sample rate, meaning that the user could run 10 instances of the plugin on a single thread without going past the real-time deadline.
On the other hand, when running a real-time neural network on an embedded device, the neural net may be one of only a handful of processes running on the device, meaning that a lower score may be acceptable. That said, most embedded devices have less powerful CPUs than the average desktop computer, so a network which runs at 10x real-time on a modern desktop might achieve a significantly lower score on an embedded device.
Now let’s think about how we could predict the performance of a neural network. In signal processing literature, one common way to compare the performance of various algorithms is to count the operations used by each algorithm and compare. However, there are a few difficulties with this form of comparison. For instance, it is generally accepted that some operations (like multiplication/division) are more expensive than other operations (like addition/subtraction), but it is often unclear exactly how much more expensive. Is a multiply operation 10x more expensive than addition? 20x? 100x? Further, this approach fails to consider that when the algorithm is implemented in code, some of the multiplies or additions may be combined into “vectorized” operations using SIMD instructions.
With that in mind, I’ve been working on trying to predict neural network performance using the following approach:
- Implement the relevant neural network layers (using SIMD instructions wherever possible).
- Count the operations used by the neural network as a function of the network hyper-parameters.
- Measure the network performance for a variety of hyper-parameter choices.
- Use a regression to estimate how long each operation will take.
Example: Dense Network
Dense networks are often used in real-time audio processing for memoryless systems, for example in State Trajectory Networks.
First let’s think about our network hyper-parameters. A dense network has a given number of inputs (N) and outputs (O), as well as a number of hidden layers (L), each with a given size. For now let’s only think about networks where each hidden layer has the same size, which we’ll call the “hidden size” (W). Finally, the network performance will also depend on how many values can fit into a SIMD register on the given platform (V). For most CPUs that support either SSE or NEON SIMD instructions, the largest SIMD registers are 128 bits wide, which is large enough for 4 single-precision floating point numbers, or 2 double-precision floating point numbers. With these hyper-parameters, we can count the operations needed for one forward pass of the network:
- SIMD Multiplies + SIMD HSums + Scalar Adds: W * (N + L * W + O) / V
- SIMD Adds: ((L + 1) * W + O) / V
- SIMD Activations: (L + 1) * W / V
Note that a “SIMD HSum” refers to the operation of summing all the elements in a SIMD register.
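The counts above can be written as a small function of the hyper-parameters. This is a direct transcription of the three formulas, with parameter names of my own choosing. One observation, stated cautiously: as written, the formulas appear to count L + 1 layers of width W (an input layer plus L layers mapping W to W), followed by the output layer.

```python
from dataclasses import dataclass


@dataclass
class DenseOpCounts:
    multiplies: float   # SIMD multiplies + SIMD hsums + scalar adds
    adds: float         # SIMD adds
    activations: float  # SIMD activations


def dense_op_counts(n_in: int, n_out: int, n_layers: int, width: int,
                    simd_width: int = 4) -> DenseOpCounts:
    """Operation counts for one forward pass, following the formulas in the text."""
    multiplies = width * (n_in + n_layers * width + n_out) / simd_width
    adds = ((n_layers + 1) * width + n_out) / simd_width
    activations = (n_layers + 1) * width / simd_width
    return DenseOpCounts(multiplies, adds, activations)


# Example: N = 2, O = 2, L = 2, W = 8, V = 4 (single-precision, 128-bit registers).
counts = dense_op_counts(n_in=2, n_out=2, n_layers=2, width=8)
print(counts)  # multiplies=40.0, adds=6.5, activations=6.0
```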
Next, I constructed a handful of dense networks with 2 inputs, 2 outputs, and ReLU activation functions, using single-precision SSE SIMD instructions, and timed how long it took for each network to process 100 seconds of audio at 48 kHz on my desktop computer (running a 3.2 GHz 6-Core Intel Core i7). After running a regression on the data, I ended up with the following estimates for how much of the processing time was spent in each operation:
- SIMD Multiplies + SIMD HSums + Scalar Adds: 5.45500463e-03 seconds
- SIMD Adds: 1.47277176e-25 seconds
- SIMD ReLU Activations: 7.23480113e-26 seconds
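For readers who want to try this kind of regression themselves, here is a minimal sketch using ordinary least squares via the normal equations. The regression method used for the numbers above is not stated, so this is only one plausible choice. The timings below are synthetic, generated from made-up per-operation costs purely to show that the fit recovers them; they are not measurements.

```python
def op_counts(n_in, n_out, n_layers, width, simd_width=4):
    """Feature row: [multiplies, adds, activations] from the formulas in the text."""
    return [
        width * (n_in + n_layers * width + n_out) / simd_width,
        ((n_layers + 1) * width + n_out) / simd_width,
        (n_layers + 1) * width / simd_width,
    ]


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def least_squares(features, targets):
    """Fit targets ~ features @ coeffs (no intercept) via the normal equations."""
    k = len(features[0])
    ata = [[sum(row[i] * row[j] for row in features) for j in range(k)] for i in range(k)]
    atb = [sum(row[i] * t for row, t in zip(features, targets)) for i in range(k)]
    return solve_linear(ata, atb)


# Synthetic experiment: made-up costs, NOT the measured values from this post.
TRUE_COSTS = [5e-3, 2e-3, 1e-3]
configs = [(layers, width) for layers in (1, 2, 3) for width in (4, 8, 16, 32)]
features = [op_counts(2, 2, layers, width) for layers, width in configs]
timings = [sum(c * f for c, f in zip(TRUE_COSTS, row)) for row in features]

estimated = least_squares(features, timings)
print(estimated)
```

Note that with a fixed number of outputs, the add and activation counts differ only by a constant, so with noisy real measurements the regression may struggle to separate those two costs.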
Finally, we can use this information to predict the real-time factor for a dense/ReLU network of any size. The plot below and to the left shows how the real-time factor scales with the network size, along with thresholds for real-time factors of 1x, 10x, and 100x. The same is shown for equivalent networks using tanh activation functions on the right.
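As a rough sketch of how such a prediction could be computed, the snippet below combines the operation counts with the estimated costs listed above. It rests on an assumption the text does not state explicitly: that each cost is the time, in seconds, contributed per unit of operation count when processing the 100-second test signal. The function names and the width search are illustrative only.

```python
# Estimated costs from the list above (seconds), under the stated assumption.
COST_MULTIPLY = 5.45500463e-03
COST_ADD = 1.47277176e-25
COST_RELU = 7.23480113e-26
TEST_AUDIO_SECONDS = 100.0  # length of the test signal used for the timings


def predicted_real_time_factor(n_in, n_out, n_layers, width, simd_width=4):
    """Predicted real-time factor at 48 kHz for a dense/ReLU network."""
    multiplies = width * (n_in + n_layers * width + n_out) / simd_width
    adds = ((n_layers + 1) * width + n_out) / simd_width
    activations = (n_layers + 1) * width / simd_width
    processing_seconds = (COST_MULTIPLY * multiplies
                          + COST_ADD * adds
                          + COST_RELU * activations)
    return TEST_AUDIO_SECONDS / processing_seconds


def largest_width_meeting(target_factor, n_in=2, n_out=2, n_layers=2,
                          simd_width=4, max_width=1024):
    """Largest hidden size (a multiple of simd_width) predicted to meet the target."""
    best = None
    for width in range(simd_width, max_width + 1, simd_width):
        if predicted_real_time_factor(n_in, n_out, n_layers, width, simd_width) >= target_factor:
            best = width
    return best


rtf = predicted_real_time_factor(n_in=2, n_out=2, n_layers=2, width=8)
widest = largest_width_meeting(target_factor=10.0)
print(f"predicted factor: {rtf:.1f}x, widest network at 10x: {widest}")
```

These predictions are specific to the CPU and implementation that produced the cost estimates, so the costs would need to be re-measured for any other platform.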
Example: Recurrent Networks
As another example, let’s look at recurrent neural networks, which have been popular for modelling stateful nonlinear systems in audio signal processing, typically using one or more input samples and a single output sample at each time step.
I repeated the process described above for dense networks, yielding the analogous plots below. It’s interesting to note that for the recurrent networks, the performance stays relatively constant as the input size increases. It’s also worth noting that the GRU networks can achieve much better performance than the LSTM networks at smaller network sizes.
Why Is This Useful?
Now that we’ve seen one potential way for predicting the performance of a real-time neural network, it’s worth asking how this information can be useful for us in training and implementing these types of networks. I think the main significance of this analysis is in selecting network architectures for training and implementation.
For example, maybe you’re trying to train a recurrent network, and you notice that an LSTM model achieves slightly better accuracy compared to a GRU model of the same size. Depending on how important the real-time performance is relative to the accuracy improvement, you might consider using the GRU model even though it is less accurate.
On a larger scale, performance prediction could be used for automated hyper-parameter tuning. Typically hyper-parameter tuning accounts for things like accuracy and training time, but by being able to quantify and predict network performance, it is possible to tune your network hyper-parameters to automatically obtain the right balance of accuracy and performance for your use case.
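One simple way to fold predicted performance into model selection is to treat the real-time factor as a hard constraint and pick the most accurate model that satisfies it. The sketch below illustrates that idea only; the candidate names, error values, and predicted factors are made-up placeholders, not results from this post.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    validation_error: float     # lower is better (placeholder values below)
    predicted_rt_factor: float  # from a performance model (placeholder values below)


def pick_candidate(candidates, min_rt_factor):
    """Return the most accurate candidate meeting the real-time target, or None."""
    feasible = [c for c in candidates if c.predicted_rt_factor >= min_rt_factor]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c.validation_error)


# Hypothetical candidates for illustration only.
candidates = [
    Candidate("lstm_32", validation_error=0.010, predicted_rt_factor=6.0),
    Candidate("gru_32", validation_error=0.012, predicted_rt_factor=14.0),
    Candidate("gru_16", validation_error=0.020, predicted_rt_factor=40.0),
]

best = pick_candidate(candidates, min_rt_factor=10.0)
print(best.name)  # prints "gru_32"
```

A softer alternative would be to fold the predicted factor into the tuning objective as a penalty term, which lets the tuner trade accuracy against speed rather than enforcing a strict cutoff.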
I hope sharing this information and the process for analyzing and predicting real-time neural network performance is useful. Feel free to contact me if you have any questions about the code or data related to the examples shown here!