Fast FIR Filtering

FIR Filtering Background

For an impulse response of length N, the output of a time-domain FIR filter can be computed as the convolution sum:

y[n] = h[0]·x[n] + h[1]·x[n−1] + … + h[N−1]·x[n−(N−1)]

Inner Product

In order to obtain a faster FIR filtering algorithm, let’s see if we can use the C++ standard library function std::inner_product to avoid using extra loops. First we must define two arrays: h[] to store the filter kernel, and z[] to store the filter state (x[n], x[n−1], x[n−2], etc. from the equation above). Since it would be inefficient to shift the entire state buffer by a sample at every time step, we can use a "state pointer", zPtr, to iterate through the state array, and then "wrap" back to the beginning when it reaches the end of the array. Finally, by paying attention to where the state pointer needs to "wrap", we can implement an FIR filtering algorithm that replaces any extra loops with two calls to std::inner_product. The code example below demonstrates the algorithm.

struct InnerProdFIR
{
    int zPtr = 0;               // state pointer
    float z[FILTER_ORDER] = {}; // filter state (zero-initialised)
    float h[FILTER_ORDER] = {}; // filter kernel

    void process (float* buffer, int numSamples)
    {
        float y = 0.0f;
        for (int n = 0; n < numSamples; ++n)
        {
            z[zPtr] = buffer[n]; // insert input into state

            // compute two inner products between kernel and wrapped state buffer
            y = std::inner_product (z + zPtr, z + FILTER_ORDER, h, 0.0f);
            y = std::inner_product (z, z + zPtr, h + (FILTER_ORDER - zPtr), y);

            zPtr = (zPtr == 0 ? FILTER_ORDER - 1 : zPtr - 1); // iterate state pointer in reverse
            buffer[n] = y;
        }
    }
};
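As a quick sanity check (my own addition, not part of the original post), feeding the filter a unit impulse should reproduce the kernel itself. The self-contained sketch below repeats the struct with FILTER_ORDER fixed at 4 just for the check:

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>

constexpr int FILTER_ORDER = 4; // small order chosen just for this check

struct InnerProdFIR
{
    int zPtr = 0;               // state pointer
    float z[FILTER_ORDER] = {}; // filter state
    float h[FILTER_ORDER] = {}; // filter kernel

    void process (float* buffer, int numSamples)
    {
        for (int n = 0; n < numSamples; ++n)
        {
            z[zPtr] = buffer[n];
            float y = std::inner_product (z + zPtr, z + FILTER_ORDER, h, 0.0f);
            y = std::inner_product (z, z + zPtr, h + (FILTER_ORDER - zPtr), y);
            zPtr = (zPtr == 0 ? FILTER_ORDER - 1 : zPtr - 1);
            buffer[n] = y;
        }
    }
};

// returns true if the filter's impulse response reproduces the kernel
inline bool impulseMatchesKernel()
{
    InnerProdFIR fir;
    const float kernel[FILTER_ORDER] = { 0.5f, 0.25f, 0.125f, 0.0625f };
    std::copy (kernel, kernel + FILTER_ORDER, fir.h);

    float buffer[FILTER_ORDER] = { 1.0f, 0.0f, 0.0f, 0.0f }; // unit impulse
    fir.process (buffer, FILTER_ORDER);

    for (int i = 0; i < FILTER_ORDER; ++i)
        if (std::fabs (buffer[i] - kernel[i]) > 1.0e-6f)
            return false;
    return true;
}
```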

Double-Buffered Inner Product

While the above implementation looks nice, I started wondering if the two calls to std::inner_product could be condensed into a single call. It turns out that they can, by replacing the filter state buffer with a double-buffer, thereby ensuring that a contiguous segment of the state data is always available for computing the inner product. This implementation could look something like this:

struct InnerProdNoWrapFIR
{
    int zPtr = 0;                   // state pointer
    float z[2 * FILTER_ORDER] = {}; // filter state (zero-initialised)
    float h[FILTER_ORDER] = {};     // filter kernel

    void process (float* buffer, int numSamples)
    {
        float y = 0.0f;
        for (int n = 0; n < numSamples; ++n)
        {
            // insert input into double-buffered state
            z[zPtr] = buffer[n];
            z[zPtr + FILTER_ORDER] = buffer[n];

            // compute inner product over kernel and double-buffered state
            y = std::inner_product (z + zPtr, z + zPtr + FILTER_ORDER, h, 0.0f);

            zPtr = (zPtr == 0 ? FILTER_ORDER - 1 : zPtr - 1); // iterate state pointer in reverse
            buffer[n] = y;
        }
    }
};
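It's worth verifying that the double-buffering trick really does compute the same output as direct convolution. The self-contained check below (my own sketch, not from the original project) compares the double-buffered filter against a naive convolution loop on a short test signal:

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

constexpr int FILTER_ORDER = 4;

struct InnerProdNoWrapFIR
{
    int zPtr = 0;
    float z[2 * FILTER_ORDER] = {}; // double-buffered state
    float h[FILTER_ORDER] = {};

    void process (float* buffer, int numSamples)
    {
        for (int n = 0; n < numSamples; ++n)
        {
            z[zPtr] = buffer[n];
            z[zPtr + FILTER_ORDER] = buffer[n];
            float y = std::inner_product (z + zPtr, z + zPtr + FILTER_ORDER, h, 0.0f);
            zPtr = (zPtr == 0 ? FILTER_ORDER - 1 : zPtr - 1);
            buffer[n] = y;
        }
    }
};

// returns true if the double-buffered filter matches direct convolution
inline bool matchesDirectConvolution()
{
    const float h[FILTER_ORDER] = { 0.4f, 0.3f, 0.2f, 0.1f };
    std::vector<float> x = { 1.0f, -2.0f, 3.0f, 0.5f, -1.0f, 2.0f };

    // reference output via direct convolution
    std::vector<float> ref (x.size(), 0.0f);
    for (size_t n = 0; n < x.size(); ++n)
        for (size_t k = 0; k < (size_t) FILTER_ORDER && k <= n; ++k)
            ref[n] += h[k] * x[n - k];

    InnerProdNoWrapFIR fir;
    std::copy (h, h + FILTER_ORDER, fir.h);
    std::vector<float> y = x;
    fir.process (y.data(), (int) y.size());

    for (size_t n = 0; n < x.size(); ++n)
        if (std::fabs (y[n] - ref[n]) > 1.0e-5f)
            return false;
    return true;
}
```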

SIMD Optimization

Having now reduced the bulk of the FIR filtering algorithm to a single call to an “inner product” function, I started thinking about optimizing further with a custom “inner product” implementation built on SIMD instructions. For those who may be unfamiliar, SIMD stands for “Single Instruction, Multiple Data” and refers to a form of parallel computing in which the same operation is applied to several data elements simultaneously. Most modern CPUs support some form of SIMD instructions. Fortunately, the JUCE framework contains wrappers that expose the different SIMD instruction sets through a common API. Using the JUCE style, the “inner product” function can be implemented as follows:

// inner product using SIMD registers
inline float simdInnerProduct (float* in, float* kernel, int numSamples, float y = 0.0f)
{
    constexpr auto simdN = (int) dsp::SIMDRegister<float>::SIMDNumElements;

    // compute SIMD products
    int idx = 0;
    for (; idx <= numSamples - simdN; idx += simdN)
    {
        auto simdIn = loadUnaligned (in + idx); // helper that loads from an unaligned address
        auto simdKernel = dsp::SIMDRegister<float>::fromRawArray (kernel + idx); // kernel must be SIMD-aligned
        y += (simdIn * simdKernel).sum();
    }

    // compute leftover samples
    y = std::inner_product (in + idx, in + numSamples, kernel + idx, y);
    return y;
}
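For readers without JUCE at hand, the essential structure of simdInnerProduct — a main loop that consumes a fixed block of elements per iteration, plus a scalar cleanup loop for the remainder — can be sketched in portable C++. In the sketch below (blockInnerProduct is my own illustrative name), a real SIMD version would replace the inner loop with one vector multiply followed by a horizontal sum:

```cpp
#include <numeric>

constexpr int BLOCK = 4; // stands in for SIMDRegister<float>::SIMDNumElements

// Scalar sketch of the blocked loop structure used by the SIMD version:
// process BLOCK elements per iteration, then mop up the remainder.
inline float blockInnerProduct (const float* in, const float* kernel, int numSamples, float y = 0.0f)
{
    int idx = 0;
    for (; idx <= numSamples - BLOCK; idx += BLOCK)
    {
        // in a real SIMD implementation this partial sum is a single
        // vector multiply followed by a horizontal sum
        float partial = 0.0f;
        for (int j = 0; j < BLOCK; ++j)
            partial += in[idx + j] * kernel[idx + j];
        y += partial;
    }

    // leftover samples that don't fill a whole block
    y = std::inner_product (in + idx, in + numSamples, kernel + idx, y);
    return y;
}
```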

Benchmarking Results

In the above sections, we’ve seen some potential ways to improve the performance of FIR filtering algorithms. However, as Nassim Taleb often notes: “In theory, there is no difference between theory and practice. In practice, there is.” In programming, it is typical to validate performance improvements with benchmarks, to show that an algorithm is actually faster in practice.
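As a rough illustration of what such a benchmark can look like (this harness is my own sketch, not the benchmark used in the original project), one can time repeated calls to std::inner_product over a fixed buffer with std::chrono:

```cpp
#include <chrono>
#include <numeric>
#include <vector>

// Micro-benchmark sketch: time repeated inner products over a fixed buffer.
// Real benchmarks should also guard against the compiler optimising the
// work away, e.g. by accumulating into a volatile as done here.
inline double benchmarkInnerProductMs (int length, int iterations)
{
    std::vector<float> x (length, 0.5f), h (length, 0.25f);
    volatile float sink = 0.0f; // keep the work observable

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        sink = sink + std::inner_product (x.begin(), x.end(), h.begin(), 0.0f);
    auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::milli> (stop - start).count();
}
```

Running the same harness over each filter implementation and comparing the elapsed times gives the practical comparison the quote above asks for.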

Finally

Unfortunately, I have not yet been able to obtain reliable benchmarking data from any Linux machines. If you are able to run benchmarks on your machine, or if you have ideas for improving the performance of any of the algorithms above, I’d love to see community contributions! Feel free to check out this project on GitHub.
