Faster Tape Emulation with SIMD

Published in CodeX · 9 min read · Aug 19, 2021

For the past few years, I’ve been working semi-regularly on an open-source tape emulation audio plugin called CHOWTapeModel. In that time, I’ve received a lot of great user feedback including ideas for new DSP functionality, reports of bugs that I may not have found on my own, and even one wonderful user who offered his services to help re-design the plugin’s GUI (thanks Margus!).

However, one piece of feedback that I’ve received time and again is that the plugin is “computationally expensive”, i.e. it uses up a lot of computing resources. On some level, this result is to be expected, as simulating the physics behind the tape magnetization process is certainly not a trivial task (see my 2019 DAFx paper on the topic). But speaking as a user who may want to use 10–20 instances of the plugin in one of my sessions (on top of all my other plugins), I can see how the compute power required by CHOWTapeModel may be reason enough to switch to some other computationally cheaper plugin to achieve the desired sound.

As I’ve recently started working towards the release of version 2.9.0 of CHOWTapeModel, I took it upon myself to search for ways that I could improve the plugin’s performance. This search has culminated in a recent commit that has improved the performance (in most processing modes) by 30–60%. I wanted to write about the thought process behind this improvement to help me understand it a little bit better myself, and also to share these ideas with anyone else who may find them interesting.

Warning: What follows will be a rather technical discussion, assuming a little bit of background knowledge of C++ and computer architecture.

How Fast Are Plugins Really?

Before getting into the improvements themselves, it’s important to come up with a robust way to measure the plugin’s performance. While this question leads to a larger conversation about measuring software performance in general, I think that for audio processing, it’s important to think about “real-time” performance: i.e. how long does it take my plugin to process one second of audio? If my plugin needs 10 seconds (on average), then it clearly can’t run in real-time. If my plugin needs 0.9 seconds on average, it can probably run in real-time, but you likely can’t run more than one instance of the plugin on a single thread. Plus, if the average time is 0.9 seconds, the “worst-case” time may be greater than 1 second, which could result in audible glitches.

For my plugins, I try to aim for a maximum processing time of ~0.1 seconds, which means that 10 instances of the plugin can run on a single thread in real-time. Since my laptop can use up to 8 threads at a time, I could theoretically have a DAW session with 80 tracks each running one instance of the plugin before running out of processing power (assuming I don’t want to use any other plugins). In a real session, I probably wouldn’t need 80 tracks, but I would want to use other plugins, plus the DAW requires a bit of computing power to do its own operations.
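To make the “seconds of processing per second of audio” idea concrete, here is a minimal C++ sketch of that kind of measurement. It is not the plugin’s actual benchmarking tool: processBlock() is a hypothetical stand-in for a real plugin’s processing callback, and measureProcessingTime() and maxInstancesPerThread() are names made up for this illustration.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a plugin's processing callback (here: a simple gain).
inline void processBlock (float* buffer, std::size_t numSamples)
{
    for (std::size_t n = 0; n < numSamples; ++n)
        buffer[n] *= 0.5f;
}

// Returns the wall-clock time (in seconds) needed to process
// `secondsOfAudio` seconds of audio at the given sample rate.
inline double measureProcessingTime (double sampleRate, double secondsOfAudio, std::size_t blockSize)
{
    std::vector<float> audio (static_cast<std::size_t> (sampleRate * secondsOfAudio), 1.0f);

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t pos = 0; pos < audio.size(); pos += blockSize)
    {
        // The last block may be shorter than blockSize.
        const auto samplesThisBlock = std::min (blockSize, audio.size() - pos);
        processBlock (audio.data() + pos, samplesThisBlock);
    }
    const auto end = std::chrono::steady_clock::now();

    return std::chrono::duration<double> (end - start).count();
}

// How many instances fit on one thread in real-time, given the time
// needed to process one second of audio (ignoring all other overhead).
inline int maxInstancesPerThread (double secondsPerSecondOfAudio)
{
    return static_cast<int> (1.0 / secondsPerSecondOfAudio);
}
```

For example, maxInstancesPerThread (0.1) gives the 10 instances per thread mentioned above. A single run like this only captures an average, so a real benchmark would also want to repeat the measurement and keep an eye on the worst case.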

How fast is CHOWTapeModel?

For CHOWTapeModel, the performance depends a lot on the “Hysteresis Mode” used by the plugin. Before making the most recent improvements, here’s how long it took CHOWTapeModel to process 1 second of audio in each mode, at 44.1 kHz sample rate and 8x oversampling (with all other settings at their default values):

RK2: 0.073 seconds
RK4: 0.132 seconds
NR4: 0.172 seconds
NR8: 0.306 seconds
STN: 0.066 seconds

There are a few things to note here. First, I haven’t included the “V1” hysteresis mode, since it uses the “RK4” processing internally. Second, for more information on how I’m generating these performance measurements, see the benchmarking tool source code. Finally, all the performance measurements were done using a MacBook with an Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz.

Prior Work

I should mention here that I have spent significant time working on optimizing the tape emulation algorithm. In fact, looking back at some forum posts from December 2019, I remember a time when I could barely run one instance of the plugin before running out of compute power. Along the way, I’ve learned a lot of optimization techniques, including how to cache values so they don’t need to be recomputed, branchless programming, and how to read the assembly code generated by different C++ compilers. However, there was one important optimization technique I hadn’t yet been able to apply to CHOWTapeModel: SIMD parallelization.

What is SIMD?

SIMD is an acronym for “single instruction, multiple data”. The basic idea is as follows. If you give the computer two numbers and tell it to multiply them (i.e. a * b = c), it typically goes through the following steps:

store the first number (a) in a register
store the second number (b) in a register
multiply the contents of the two registers, and store the result in a third register (c)

The generated assembly for multiplying two floating-point numbers

Now let’s say instead, you have two sets of 4 numbers each, and you want the computer to multiply the numbers in each set. The obvious approach is to repeat the above steps 4 times, which will naturally take 4 times as long to complete.

The generated assembly for multiplying 4 floating-point numbers

Alternatively, SIMD instructions allow the computer to store multiple values in a single register, enabling the following approach:

store the four numbers from the first set in a SIMD register
store the four numbers from the second set in a SIMD register
multiply the contents of the two SIMD registers and store the result in a third SIMD register

The generated assembly for multiplying four floating-point numbers with SIMD registers

This enables the computer to multiply sets of 4 numbers with the same number of CPU instructions as it would take to multiply individual numbers! This process is often known as “vectorization” or “writing vectorized code”.
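In C++, the two approaches can be sketched roughly as follows. This is only an illustration, assuming an x86 target with SSE (other targets fall back to the scalar loop); the function names and the EXAMPLE_HAS_SSE macro are made up for this example.

```cpp
#include <array>
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
 #include <xmmintrin.h> // SSE intrinsics
 #define EXAMPLE_HAS_SSE 1
#else
 #define EXAMPLE_HAS_SSE 0
#endif

// Scalar version: four separate multiplies, one per element.
inline std::array<float, 4> multiplyScalar (const std::array<float, 4>& a,
                                            const std::array<float, 4>& b)
{
    std::array<float, 4> c {};
    for (std::size_t i = 0; i < 4; ++i)
        c[i] = a[i] * b[i];
    return c;
}

// SIMD version: each set of four floats lives in one 128-bit register.
inline std::array<float, 4> multiplySIMD (const std::array<float, 4>& a,
                                          const std::array<float, 4>& b)
{
#if EXAMPLE_HAS_SSE
    const __m128 va = _mm_loadu_ps (a.data()); // load the first set of 4 floats
    const __m128 vb = _mm_loadu_ps (b.data()); // load the second set of 4 floats
    const __m128 vc = _mm_mul_ps (va, vb);     // one multiply covers all 4 lanes

    std::array<float, 4> c {};
    _mm_storeu_ps (c.data(), vc);              // copy the 4 results back out
    return c;
#else
    return multiplyScalar (a, b);              // fallback for targets without SSE
#endif
}
```

Note that an optimizing compiler will often auto-vectorize a loop as simple as multiplyScalar() on its own; writing the SIMD operations explicitly matters more for code the compiler cannot vectorize for you.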

SIMD Instruction Sets

There are three commonly used SIMD instruction sets of which I’m currently aware. Almost all Intel CPUs support the SSE instruction set, which supports 128-bit SIMD registers: large enough to store 4 single-precision floating-point numbers (typically called “floats”), or 2 double-precision floating-point numbers (“doubles”). The AVX instruction set supports 256-bit SIMD registers, wide enough for 8 floats or 4 doubles, but is not supported by some older Intel CPUs. Finally, the NEON instruction set for ARM CPUs supports 128-bit SIMD registers holding 4 floats or 2 doubles, but double-precision support is only available in more recent versions (more on this later…).

Limits of SIMD Parallelization

While SIMD optimization is incredibly powerful, there are limits to where it can be applied. In general, for two operations to be done in parallel, they must be completely independent. For example, let’s say I have two multiply operations, a * b = c, and c * d = e. I can’t do these operations in parallel since the second operation depends on the outcome of the first!

With that in mind, there are typically two situations in audio processing where SIMD optimization can be done. One is if you have a “feed-forward” process, such that the outputs are only dependent on current and previous inputs. This case is typically useful for optimizing things like convolution, delay line interpolation, or neural networks. The other situation is if you have multiple parallel audio streams, for example, a synthesizer with 4 parallel voices, or a multi-channel audio stream. It’s this second situation that will be useful for optimizing CHOWTapeModel.

A fun SIMD visualization

Optimizing CHOWTapeModel

Since CHOWTapeModel is configured for stereo processing, the fundamental strategy behind this optimization is to process the stereo channels in parallel through the “hysteresis” section of the plugin.

Why not parallelize the whole plugin?

There are a few reasons why it doesn’t make sense to try to parallelize the whole plugin. First, in several sections of the plugin, the processing between the two channels is not independent, meaning that parallelization is not possible. Second, parallel processing requires “interleaving” the stereo channels into SIMD registers before processing, and then de-interleaving the stereo channels afterwards. This operation incurs a little bit of overhead, so continually interleaving/de-interleaving in between the different sections of the plugin could actually make the plugin a little bit slower!

The hysteresis processor is a good candidate for parallelization since the stereo channels are processed independently, and the processor is quite computationally expensive. Since the processing is so expensive, the benefits of SIMD parallelization far outweigh the overhead of interleaving/de-interleaving the channels.

Optimizing…

The actual work for implementing this optimization required a few steps:

Refactoring
Converting operations to use SIMD
Finding replacements for special functions
Measuring performance

I don’t want to discuss the refactoring step in too much detail (those interested are encouraged to read the commit), but a lot of the leg-work centered around converting class member functions into “free” functions, that could be more easily configured to take either floating-point registers or SIMD registers. I wanted to keep backwards compatibility with a non-vectorized implementation that I could use for testing and comparison.

For converting floating-point operations to SIMD operations, replacing double with juce::dsp::SIMDRegister<double> worked in most cases as a drop-in replacement, but there were a few spots that required more attention, particularly conditional statements. For example, let’s say that I have a number x, and I want to double it if x > 6. With floating-point numbers, I could do this operation with the ternary operator as follows:

x = x > 6.0 ? x * 2.0 : x;

With SIMD registers, this type of thing is a little bit more difficult:

using Float = juce::dsp::SIMDRegister<double>;
x = (((Float) 2.0 * x) & Float::greaterThan (x, (Float) 6.0)) + (x & Float::lessThanOrEqual (x, (Float) 6.0));
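For readers who haven’t seen this pattern before, the same mask-and-blend idea can be sketched outside of JUCE with raw SSE2 intrinsics. This is an illustration only, not the plugin’s code: it assumes an x86 target with SSE2 (with a scalar fallback otherwise), uses a single comparison mask and its complement rather than two comparisons, and the function name is made up.

```cpp
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64)
 #include <emmintrin.h> // SSE2 intrinsics (double precision)
 #define EXAMPLE_HAS_SSE2 1
#else
 #define EXAMPLE_HAS_SSE2 0
#endif

// Doubles each of the two values in `values` that is greater than 6,
// and leaves the others unchanged, without any branching.
inline void doubleIfGreaterThanSix (double* values)
{
#if EXAMPLE_HAS_SSE2
    const __m128d x = _mm_loadu_pd (values);

    // Each lane of the mask is all one-bits where x > 6, and all zero-bits otherwise.
    const __m128d mask = _mm_cmpgt_pd (x, _mm_set1_pd (6.0));

    // Compute the "true" branch for every lane, whether it is needed or not.
    const __m128d doubled = _mm_mul_pd (x, _mm_set1_pd (2.0));

    // Blend: (doubled & mask) | (x & ~mask)
    const __m128d result = _mm_or_pd (_mm_and_pd (mask, doubled),
                                      _mm_andnot_pd (mask, x));

    _mm_storeu_pd (values, result);
#else
    // Scalar fallback: the ordinary ternary operator.
    for (std::size_t i = 0; i < 2; ++i)
        values[i] = values[i] > 6.0 ? values[i] * 2.0 : values[i];
#endif
}
```

The key point is that both “branches” are computed for every lane, and the comparison mask picks which result each lane keeps.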

Finally, the hysteresis processing was using a couple of special math functions, namely std::isnan() and std::tanh(). In order to get these functions working, I needed to bring in a third-party library, xsimd, that contains SIMD implementations of these operations.

After going through all of the above steps, I was able to compile the improved version of CHOWTapeModel, and test it out on my computer. Phew!

What about STN mode?

There was one hysteresis mode that I had to handle as a bit of a special case. Since the neural inferencing engine used by the STN mode already uses SIMD optimizations under the hood, I was not able to optimize it any further. As a result, the performance when using that mode should remain unchanged.

One more snag…

Unfortunately, there was one more stumbling block that took some time to get past. Remember earlier, I had mentioned that some versions of the ARM NEON instruction set did not contain double-precision floating point registers? The JUCE SIMDRegister implementation was developed with that in mind, and does not currently contain a vectorized implementation for double-precision SIMD registers for ARM NEON. The result is that on devices with ARM CPUs, including the iPad, and the new Mac M1 computers, the hysteresis processing would actually run much slower than the non-vectorized implementation.

Eventually, I took it upon myself to make a fork of the JUCE DSP module, and implement the double-precision SIMD registers for ARM NEON myself. Hopefully, the JUCE folks will add this support more officially in future versions of the module.

All Done!

And that’s it! We’ve finished vectorizing the hysteresis processor for CHOWTapeModel. Now it’s time to see how much better the plugin actually performs:

RK2: 0.053 seconds (38% improvement)
RK4: 0.088 seconds (50% improvement)
NR4: 0.111 seconds (55% improvement)
NR8: 0.188 seconds (63% improvement)
STN: 0.066 seconds (no change)

Overall that’s a pretty significant improvement! In particular, the “NR8” mode was pretty much unusable for me before this change, so I look forward to using it a bit more now that it’s more than 1.5x faster.

Finally…

Now that this optimization is complete, there are only a few more things I want to get to before releasing CHOWTapeModel version 2.9.0. In the meantime, if you can’t wait, feel free to try out the plugin’s nightly builds, though be warned that they may be unstable.

I know this article has gone pretty deep into the guts of writing low-level DSP code, but I hope you’ve found it interesting, and thanks for making it all the way to the end! Onwards…