# Fighting with FFTs

A few weeks ago I was looking through the code of PFFFT, a C library for computing Fast Fourier Transforms, when I decided to embark on a little project. PFFFT supports the use of SSE and NEON SIMD intrinsics to achieve better performance on CPUs with SIMD support. However, since the PFFFT library was originally written, AVX intrinsics have become increasingly common on Intel and AMD CPUs. My “little” project idea was to modify PFFFT to use AVX intrinsics on compatible CPUs.

Having done similar projects in the past, I figured this project wouldn’t be too hard. SSE registers require 16-byte alignment and can operate on 4 “packed” 32-bit floating-point numbers at a time, while AVX registers are twice as big, requiring 32-byte alignment and operating on 8 floats at a time. In my past projects, moving from an SSE implementation to an AVX implementation was a simple matter of increasing the required alignment and converting from batches of 4 floats to batches of 8.

However, I would soon discover this project to be significantly more involved.

# PFFFT Background

PFFFT supports two types of Fourier transforms: complex and real/half-complex. Complex transforms take an input series of complex numbers, and can transform them into the Fourier domain and back. However, when taking the Fourier transform for a series of real numbers, the resulting information in the Fourier domain consists of a series of “half-complex” numbers, where the second half of the numbers are equal to the reversed complex conjugate of the first half (aside from the first and middle value which are special cases). The half-complex numbers can then be transformed “backwards” into a series of real numbers.

# Complex/Complex Transforms

After setting up some pre-computations and basic math operations, I was ready to try porting the complex/complex transforms. The first task was to figure out how to “interleave” the scalar data into AVX registers, which turned out to be relatively simple after getting to know the *_mm256_permute2f128_ps* and *_mm256_permutevar8x32_ps** *intrinsics. Then I ran into my first real challenge: *pffft_cplx_finalize**.*

The FFT is a sort of “recursive” algorithm, and like all recursive algorithms, it needs a “base case” where the recursion stops. The PFFFT implementation stops at the base case of a 4-point FFT, which is implemented in this “finalize” method, and matches naturally with the 4-wide SIMD registers that are available with SSE and NEON SIMD intrinsics. Porting the FFT algorithm to AVX would require changing the base case to an 8-point FFT. While I had learned about the FFT algorithm in my undergraduate studies, I’d forgotten many of the details. Fortunately, I found some wonderful notes from a math class at UC Davis which explained the 8-point FFT in extreme detail.

While there are several FFT algorithms used by various implementations, the first and most well-known is the Cooley-Tukey algorithm, which is the algorithm used by PFFFT as well as the algorithm explained by the UC Davis notes.

## Bit Reversed Order

One thing that many notice about visualizations of the Cooley-Tukey FFT is the “butterfly” shape that naturally comes about from the structure of the algorithm (as in the image above). However, this “butterfly” pattern also results in a strange ordering of the input values, specifically for the 8-point FFT: [0, 4, 2, 6, 1, 5, 3, 7].

Possibly the simplest way to follow this pattern is to think of it as a bit-reversal pattern. Start by writing out the numbers (0, 1, 2, 3, …) in binary, and then reversing the bits. Converting the reversed binary numbers back to base-10 reveals the correct pattern.

## Omega (ω) Terms

Another thing I found confusing when comparing the PFFFT implementation with the UC Davis notes was some “missing” multiplier terms. The UC Davis notes refer to these terms as ω^k, where ω=exp(2*π*j/N). The “butterfly” diagram above shows that both the 4-point and 8-point FFTs contain several of these ω-terms, however the PFFFT 4-point implementation didn’t seem to contain any multiply operations, just additions and subtractions.

As it turns out, since k is always an integer, when N=4, the ω^k always works out either ±1 or ±j, allowing the multiply operations to be replaced with simple sign-changes. For N=8, ω^k will be one of [±1, ±j, ±s ±j*s], where s = sin(π/4) = 1/sqrt(2). So unlike the 4-point FFT, the 8-point FFT does require a couple of (complex) multiply operations, but with some clever algebraic manipulation most of the work can still be done with additions/subtractions.

## Backwards

For the backwards FFT, I had to work out the forward algorithm in reverse, again with help from the UC Davis notes. In all, I spent roughly 2–3 evenings porting the complex/complex transforms, but the toughest part was yet to come.

# Real/Half-Complex Transform

First, the easy part: PFFFT’s real/half-complex FFT implementation contains a couple of other methods for re-ordering data: *reversed_copy* and *unreversed_copy*. I was able to implement these methods similar to the interleaving operations needed for the complex/complex transform.

Now for the hard part: *pffft_real_finalize* was a bit more complicated than its complex counterpart. I was able to figure out pretty quickly that it wasn’t computing the same 4-point FFT used in the complex transform, but it wasn’t obvious to me exactly what *was *being computed. After several evenings attempting to find patterns and reverse-engineer the PFFFT implementation, I decided to re-read the PFFFT documentation.

PFFFT started as a port of a FORTRAN library called FFTPACK. The author of PFFFT auto-translated the FFTPACK code into C, then manually modified the translated code, and added SSE and NEON optimizations. The author only left one note about the optimization process:

Adapting fftpack to deal with the SIMD 4-element vectors instead of scalar single precision numbers was more complex than I originally thought, especially with the real transforms, and I ended up writing more code than I planned.

So it seems that the author of PFFFT ran into some of the same challenges! In search of more answers, I read through some documentation for FFTPACK, which led me to a pair of papers written by Clive Temperton: one paper from 1982 discussing complex transforms, and the other paper from 1983 discussing real/half-complex transforms. The 1983 paper explains the real/half-complex FFT algorithm used by FFTPACK and in turn by PFFFT, specifically breaking down small transforms for 2 ≤ N ≤ 6. Since I needed the special case N=8, I still needed to derive the transform for myself.

After another evening of writing out the transform equations, first by hand and then in Mathematica, I had the real/half-complex forward transform working. The backwards FFT then required the inverse of the same operations. The simplest way to do this required constructing a “transition matrix” based on the operations used in the forward transform, and then transposing the matrix. Note that these matrices are 16x16, since the transform is being computed over 8 complex numbers, and each complex number is split up into 2 real numbers.

# Results

I wanted to see the difference between the SSE and AVX implementations of the FFT algorithm, so I ran a little benchmark on my desktop PC (with an AMD Ryzen 7 CPU). For FFT sizes ranging from 2⁷ to 2¹⁹, the AVX complex transforms performed (on average) 39% faster, and the AVX real/half-complex transforms performed 36% faster. I think I might be able to improve the AVX implementation a bit more with some additional effort.

At the moment, my FFT implementation is available on GitHub. I do plan to make my testing and benchmarking code publicly available as well, along with some of my Mathematica notebooks, but I need to do a bit of clean-up first. If you have some ideas for further optimizations, feel free to jump in to the code!

# End Notes

## Choosing AVX vs. SSE

I had to make some choices to balance when the FFT library uses the AVX vs. SSE implementations. The current strategy is to check if the host CPU supports AVX instructions as well as if the AVX algorithm supports the requested FFT size. If both are true, then the AVX implementation will be used, otherwise the SSE implementation will be used as a fallback.

## R.I.P. Clive Temperton

The author of the papers underlying my FFT implementation (via PFFFT and FFTPACK) was Clive Temperton, who passed away in 2022. According to the Beverly Grammar School’s “Notable Alumni” page:

Clive was a world-renowned meteorologist/computational scientist who developed the systems we currently use for weather forecasting. He worked for the Met-Office and the European Centre for medium-range weather forecasting, as well as for the Canadian government and in collaboration with NASA.

In his paper on real/half-complex transforms, Temperton publishes measured times from running his algorithm on the CRAY-1 supercomputer. According to his results, a 256-point real transform on the CRAY-1 could be computed in 37.6 microseconds. On my desktop PC, the same transform can be computed in ~6 microseconds.

In case you would like to try Temperton’s algorithm for yourself on period-accurate hardware, the Living Computers Museum is auctioning off their CRAY-2 supercomputer (estimated price: $250,000–350,000).