If you’re using neural networks to process signals in the time domain, you’ve probably run into the following problem: I trained my neural network on data at one sample rate, but now I want to use that network to process data at a different sample rate.
For some types of neural networks, this problem isn’t really a problem at all. If your network is “memoryless,” like most networks made up of only fully-connected layers, then the neural network is already sample-rate agnostic. However, for “stateful” neural networks, including convolutional and recurrent networks, processing a signal at a different sample rate than the training data will lead to a wildly different result.
As an example, I took an LSTM network that was trained at a 96 kHz sample rate and passed a 100 Hz sine wave through it, first at 96 kHz and then at 192 kHz (twice the training rate). As shown below, the output at the higher sample rate does not match the expected output. In this case, the LSTM network was designed to process audio signals, and the difference between the output at the training sample rate and the output at the higher sample rate would be clearly noticeable to a listener.
One potential solution would be to “resample” the signal to the training sample rate before processing it through the neural network, and then resample it back to the original sample rate afterwards. For most “offline” processing this solution works fine. However, for real-time applications, resampling the signal can add significant computational overhead and may also add latency, particularly since the signal may need to be resampled by a non-integer factor (for example, resampling from 44.1 kHz to 48 kHz in audio signals).
Another idea would be to train multiple neural networks for several commonly used sample rates, and use the network with the training sample rate closest to the target sample rate. However, since training neural networks is a stochastic process, there is no guarantee that each network will produce the same result. Also, this solution would not work well in situations where the neural network could be used at sample rates well outside the range of the training sample rates.
What I’d like to present here is a better solution for “adapting” recurrent neural networks to process signals at any sample rate, which applies to networks using LSTMs and GRUs.
Let’s start by examining the signal flow of a simple recurrent neural network:
The important thing to notice is how the network handles the state of the recurrent layer: that state is delayed by one sample (z^-1) before it is fed back. Recurrent neural networks are sample-rate dependent because the delay time for the network state depends on the sample rate. For example, if the recurrent network is run at a 48 kHz sample rate, the recurrent state is delayed by roughly 0.02 milliseconds between processing steps. However, at 96 kHz, the delay time is reduced to roughly 0.01 milliseconds.
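The delay times quoted above follow directly from the sample period. As a quick check, here is a small sketch; the helper name `state_delay_ms` is made up for this example and is not part of any library:

```python
def state_delay_ms(sample_rate_hz, delay_samples=1.0):
    """Time spanned by a delay of `delay_samples` samples, in milliseconds."""
    return 1000.0 * delay_samples / sample_rate_hz


# The same one-sample delay covers less time as the sample rate goes up.
for fs in (48_000, 96_000, 192_000):
    print(f"{fs} Hz: {state_delay_ms(fs):.4f} ms")

# Delaying by two samples at 96 kHz spans the same time as one sample at 48 kHz.
same_time = state_delay_ms(96_000, delay_samples=2.0) == state_delay_ms(48_000)
```

The last line hints at where this is going: keeping the delay time constant means changing the delay length in samples.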
The solution that I’m proposing is as follows: What if we delay the recurrent state by a different number of samples depending on the sample rate at which the network is being used?
Case: Integer Multiple Sample Rate
For example, if the neural network is being run at a sample rate that is 2x the training sample rate, then we just need to delay the recurrent state by two samples instead of one. As we can see below, this solution works pretty well when the target sample rate is an integer multiple of the training sample rate.
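To make the integer-delay idea concrete, here is a minimal sketch that uses a toy one-unit tanh recurrent cell in place of a real LSTM or GRU. The function name `run_rnn` and the weights `w_in` and `w_rec` are arbitrary values chosen for illustration, not taken from any trained network:

```python
import math


def run_rnn(x, delay_samples=1, w_in=0.5, w_rec=-0.8):
    """Toy one-unit tanh RNN whose state feedback is delayed by `delay_samples`.

    The weights are arbitrary illustrative values (an assumption of this sketch).
    """
    state = [0.0] * delay_samples  # circular buffer of past states
    out = []
    for n, x_n in enumerate(x):
        idx = n % delay_samples
        h_delayed = state[idx]  # state written `delay_samples` steps ago
        h = math.tanh(w_in * x_n + w_rec * h_delayed)
        state[idx] = h  # overwrite the oldest entry with the new state
        out.append(h)
    return out


fs_train = 96_000
n_train = 960
# 100 Hz sine at the training rate and at twice the training rate.
x_train = [math.sin(2.0 * math.pi * 100.0 * n / fs_train) for n in range(n_train)]
x_2x = [math.sin(2.0 * math.pi * 100.0 * n / (2 * fs_train)) for n in range(2 * n_train)]

y_train = run_rnn(x_train, delay_samples=1)
y_2x = run_rnn(x_2x, delay_samples=2)

# Every second sample of the 2x output lines up with the training-rate output.
max_err = max(abs(a - b) for a, b in zip(y_2x[::2], y_train))
print(max_err)
```

With a two-sample delay, the even and odd samples form two independent recurrences, each of which sees the input at the training rate. That is why `max_err` should be zero up to floating-point rounding in this toy setup.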
Case: Non-Integer Multiple Sample Rate
What if the target sample rate is not an integer multiple of the training sample rate? In that case, we need to use a fractional delay instead of an integer-sample delay. While there are several types of delay-line interpolation that can be used to create this fractional delay, for now we’ll stick with the simplest method, linear interpolation.
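One way to sketch a linearly interpolated state delay, again with a toy one-unit tanh cell rather than a real LSTM or GRU. The name `run_rnn_fractional` and the weights are assumptions made for this example. The delay length is taken as the ratio of the target sample rate to the training sample rate:

```python
import math


def run_rnn_fractional(x, delay_samples, w_in=0.5, w_rec=-0.8):
    """Toy one-unit tanh RNN with a linearly interpolated state delay.

    `delay_samples` may be fractional but must be at least 1.
    The weights are arbitrary illustrative values (an assumption of this sketch).
    """
    if delay_samples < 1.0:
        raise ValueError("state delay must be at least one sample")
    whole = int(math.floor(delay_samples))
    frac = delay_samples - whole
    # history[k] holds the state from k + 1 steps ago.
    history = [0.0] * (whole + 1)
    out = []
    for x_n in x:
        near = history[whole - 1]  # state from `whole` steps ago
        far = history[whole]       # state from `whole + 1` steps ago
        h_delayed = (1.0 - frac) * near + frac * far
        h = math.tanh(w_in * x_n + w_rec * h_delayed)
        history.insert(0, h)  # newest state goes to the front
        history.pop()         # drop the oldest state
        out.append(h)
    return out


# Hypothetical case: trained at 44.1 kHz, run at 48 kHz.
fs_train, fs_target = 44_100, 48_000
delay = fs_target / fs_train  # roughly 1.088 samples
x = [math.sin(2.0 * math.pi * 100.0 * n / fs_target) for n in range(480)]
y = run_rnn_fractional(x, delay)
```

When the fractional part is zero, the interpolation weight on the older state vanishes and this reduces to the plain integer-sample delay.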
While it may be possible to improve the output quality by using a higher-order interpolation method, the linear interpolation approach works pretty well in the test example that we’ve been using here.
The limitation of the solution proposed here is that it only works when the target sample rate is larger than the training sample rate, since implementing a fractional delay for a delay time less than 1 sample would introduce a delay-free loop. However, a good workaround in this case would be to upsample the signal by an integer factor until it is at or above the training sample rate.
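The workaround amounts to a small piece of arithmetic: pick the smallest integer upsampling factor that brings the running sample rate up to at least the training rate, then derive the state delay from the new rate. A sketch, where the helper name `plan_rates` is hypothetical:

```python
import math


def plan_rates(fs_train, fs_target):
    """Return (upsample_factor, state_delay_samples) for a given pair of rates.

    The factor is the smallest integer that lifts `fs_target` to at least
    `fs_train`; the delay is the ratio of the resulting rate to `fs_train`.
    """
    if fs_target >= fs_train:
        factor = 1  # no upsampling needed
    else:
        factor = math.ceil(fs_train / fs_target)
    fs_run = fs_target * factor
    return factor, fs_run / fs_train


# Hypothetical case: a network trained at 96 kHz, used on 44.1 kHz audio.
factor, delay = plan_rates(96_000, 44_100)
print(factor, delay)
```

The resulting delay is always at least one sample, so the fractional-delay approach applies without creating a delay-free loop. The upsampling and downsampling themselves still have to be done by a separate resampler, which this sketch does not cover.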
Here we’ve described one possible solution for constructing sample-rate agnostic neural networks. At the moment I’ve implemented this scheme in a couple of real-time audio effects, but I hope that these ideas will be useful outside of the audio domain as well. The code for generating the plots and other figures in this article can be found on GitHub.