Lookup Tables Are Bad Approximators
I remember the first time I thought seriously about the performance of an audio program. I was working on an audio effect, and I tried running it inside the Visual Studio CPU profiler. The profiler gave me a breakdown of which parts of my code the CPU was spending the most time in, and many of the functions that stood out were math operations like tanh or sin.
At the time, I was using implementations of these math operations that were provided by the C++ standard library (std::tanh, std::sin, etc.). I looked for some advice on Stack Overflow and was met with a variety of potential solutions, the most common of which was to replace the standard library implementations with a “lookup table” approximation.
In the time since, I’ve been told to use lookup tables as approximators many times, including by a reviewer of an academic paper I wrote, who almost rejected the paper because I wasn’t using them. Along with being a somewhat “inelegant” solution to the problem at hand (in my subjective view), lookup tables also aren’t as performant as many people believe. Hopefully this article will show why I’ve grown to dislike them.
What is a Lookup Table?
In concept, a lookup table approximator is very simple: to approximate some single-variable function f(x), create a large array that stores the result of f(x) for a range of input values. Then, when you need to compute the result of the function, find the table value for the nearest input and just use that value. From there, we can improve the lookup table by “interpolating” between nearby values: if we have the values f(1) and f(1.1) stored in the table, but the user wants the value f(1.05), we can approximate it more closely by taking the average of the two nearby table values. That would be an example of “linear” interpolation, but more complex interpolation schemes like cubic interpolation may also be used.
Lookup tables can also be used to approximate functions of multiple variables (e.g. f(x, y, z)); however, this requires a “multi-dimensional” lookup table, which will generally be more complicated to implement.
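As a concrete reference point, here’s a minimal C++ sketch of a one-dimensional lookup table supporting both the “basic” and linearly interpolating lookups described above (the class name and interface are my own illustrative choices, not from any particular library):

#include <algorithm>
#include <cmath>
#include <cstddef>

// A minimal 1-D lookup table: N entries covering the input range [min, max].
template <std::size_t N>
struct LookupTable
{
    float values[N];
    float minInput, maxInput;

    template <typename Func>
    LookupTable (Func f, float min, float max) : minInput (min), maxInput (max)
    {
        for (std::size_t i = 0; i < N; ++i)
            values[i] = f (min + (max - min) * (float) i / (float) (N - 1));
    }

    // "Basic" lookup: round to the nearest table index (input assumed in range).
    float lookup (float x) const
    {
        const auto idx = (std::size_t) std::lround ((x - minInput) / (maxInput - minInput) * (float) (N - 1));
        return values[idx];
    }

    // Linearly interpolating lookup: blend the two surrounding entries.
    float lookupLinear (float x) const
    {
        const auto pos = (x - minInput) / (maxInput - minInput) * (float) (N - 1);
        const auto idx = std::min ((std::size_t) pos, N - 2); // lower table index
        const auto frac = pos - (float) idx;                  // fractional position in [0, 1]
        return values[idx] + frac * (values[idx + 1] - values[idx]);
    }
};

A 500-entry sin table, for example, could then be built as LookupTable<500> table { [] (float x) { return std::sin (x); }, -pi, pi };.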
Approximating sin(x)
As a motivating example, let’s say that we have a program that uses sin(x) in a lot of places, and we want to try to optimize the program by reducing the amount of compute needed to calculate sin(x). The first thing we need to do is determine how accurate our computation needs to be… For the purpose of this example, I’ll say arbitrarily that the computation needs to be accurate to within an absolute error of 2 * 10^-5. Next we need to choose a range on which to approximate the function. Since sin(x) is periodic, we only need to approximate the function in the range [-π, π].
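As a quick sketch of how that periodicity can be exploited, any input can be folded into [-π, π] before the approximation runs. std::remainder rounds to the nearest multiple of its second argument, so its result always lands in [-π, π] (the function name here is my own):

#include <cmath>

// Fold any input into [-pi, pi], using the 2*pi periodicity of sin(x).
// std::remainder (x, 2*pi) subtracts the *nearest* integer multiple of 2*pi,
// so the result is always within [-pi, pi].
float wrapToPi (float x)
{
    constexpr float twoPi = 2.0f * 3.14159265f;
    return std::remainder (x, twoPi);
}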
If we implement a “basic” lookup table (i.e. no interpolation), we need a rather large lookup table to achieve this accuracy: 320,000 elements. With linear interpolation, we can get away with a smaller 500 element table.
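These sizes line up with simple worst-case error bounds (a back-of-the-envelope check, assuming the basic table truncates to the lower table index). With table spacing Δ = 2π/N, and max |sin'(x)| = max |sin''(x)| = 1:

basic: error ≤ Δ = 2π/N ≤ 2 * 10^-5, so N ≥ π * 10^5 ≈ 314,000
linear: error ≤ Δ²/8 = (2π/N)²/8 ≤ 2 * 10^-5, so N ≥ 2π / sqrt(1.6 * 10^-4) ≈ 497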
Let’s also compare our lookup tables with another approximation that does not use a lookup table. Colin from mooooo.ooo has a fantastic blog post describing the derivation of a polynomial approximation for sin(x), using Chebyshev polynomials. I derived an approximation using his method, and found that the required accuracy could be achieved with a 7th-order polynomial.
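To show the shape of such an approximation, here’s a minimal sketch of evaluating an odd 7th-order polynomial in Horner form. Note that the coefficients below are just the Taylor-series values, shown to illustrate the evaluation structure; the Chebyshev-fit coefficients from Colin’s method are different, and are what actually achieve the accuracy target over [-π, π]:

// Evaluate sin(x) ~= x * (a1 + x^2 * (a3 + x^2 * (a5 + x^2 * a7))), in Horner form.
// The coefficients shown are the Taylor-series values, for illustration only;
// a Chebyshev/minimax fit over [-pi, pi] uses slightly different values
// and achieves a much better worst-case error.
float sinPoly7 (float x)
{
    constexpr float a1 = 1.0f;
    constexpr float a3 = -1.0f / 6.0f;
    constexpr float a5 = 1.0f / 120.0f;
    constexpr float a7 = -1.0f / 5040.0f;

    const auto x2 = x * x;
    return x * (a1 + x2 * (a3 + x2 * (a5 + x2 * a7)));
}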
We can plot the absolute error of each approximation (relative to std::sin(x)). It turns out that the polynomial approximation is actually slightly more accurate than the lookup table approximators.
Scalar Performance
Now let’s see how these implementations perform! As a test, I timed how long it took for a simple test program to compute sin(x) for 10,000,000 numbers. I did one run where the input values increase linearly like a “ramp”, as might be the case when using sin(x) to generate an oscillator, and a second run where the input was drawn from a random number generator. I ran the same program for 4 implementations of sin(x): the standard library implementation, along with the polynomial approximation and both lookup tables described above.
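My actual test harness is a bit more involved, but a minimal sketch of this kind of timing loop might look like the following (the names here are my own, and the volatile accumulator is there to keep the compiler from optimizing the computation away):

#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

int main()
{
    constexpr std::size_t N = 10'000'000;

    // Fill the input buffer with random values in [-pi, pi].
    // (The "ramp" run would instead fill it with linearly increasing values.)
    std::vector<float> input (N);
    std::mt19937 rng { 1234 };
    std::uniform_real_distribution<float> dist { -3.14159265f, 3.14159265f };
    for (auto& x : input)
        x = dist (rng);

    // Time the function under test, accumulating into a volatile so the
    // compiler cannot discard the loop as dead code.
    volatile float sink = 0.0f;
    const auto start = std::chrono::steady_clock::now();
    for (const auto x : input)
        sink = sink + std::sin (x);
    const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;

    std::printf ("%f seconds (%f)\n", elapsed.count(), (float) sink);
}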
Here were the timing results on my desktop computer (AMD Ryzen 7):
RAMP:
std: 18.92 seconds
math_approx-7: 5.157 seconds (3.67x)
LUT-320k: 9.596 seconds (1.97x)
LUT_lin-500: 16.95 seconds (1.12x)
RANDOM:
std: 44.74 seconds
math_approx-7: 4.439 seconds (10.1x)
LUT-320k: 8.917 seconds (5.02x)
LUT_lin-500: 16.22 seconds (2.76x)
Clearly the polynomial approximation is the fastest! For both types of input, the polynomial approximation is roughly 2x faster than the basic lookup table, and 3-4x faster than the linear interpolating lookup table.
I also ran the timing comparisons on my M1 MacBook; those results can be found at the bottom of this article. While some of the differences are less drastic in that comparison, the general ordering remains the same.
(As an aside, if anyone knows why the standard library implementation is more than 2x(!) slower on random input, please let me know!)
SIMD Performance
However, scalar performance only tells us part of the story. In a lot of applications, we may have the opportunity to “vectorize” our code using SIMD. Let’s see how our approximations perform when computing multiple values in parallel!
Converting the polynomial approximation to SIMD is pretty much trivial… all of the scalar multiply and add operations are replaced with the corresponding SIMD operations. However, the lookup tables are a different story. SIMD loads and stores work on data that is located sequentially in memory, but if our SIMD register contains 4 values, we need to fetch 4 elements from our table, and those elements are not guaranteed to be located sequentially in memory… in fact, they will almost never be! So essentially, we need to split up the values in our SIMD register, look up the table value for each one individually, and then pack the results into a new SIMD register. At least in the case of the interpolating lookup table, we can compute the interpolation itself using SIMD operations, so that’s nice.
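To make that concrete, here’s a rough SSE sketch (the function names are mine, and range checking is omitted). The polynomial maps directly onto _mm_mul_ps / _mm_add_ps, computing 4 results at once, while SSE 2.0 has no “gather” instruction, so the table lookup has to fall back to four scalar loads:

#include <emmintrin.h> // SSE2 intrinsics
#include <cmath>

// Polynomial: each scalar multiply/add becomes one SIMD multiply/add.
// (Taylor coefficients again, for illustration only.)
__m128 sinPoly7 (__m128 x)
{
    const auto x2 = _mm_mul_ps (x, x);
    auto y = _mm_set1_ps (-1.0f / 5040.0f);
    y = _mm_add_ps (_mm_mul_ps (y, x2), _mm_set1_ps (1.0f / 120.0f));
    y = _mm_add_ps (_mm_mul_ps (y, x2), _mm_set1_ps (-1.0f / 6.0f));
    y = _mm_add_ps (_mm_mul_ps (y, x2), _mm_set1_ps (1.0f));
    return _mm_mul_ps (y, x);
}

// "Basic" lookup table: each of the 4 lanes must be extracted,
// looked up in scalar code, and packed back into a SIMD register.
__m128 sinLUT (__m128 x, const float* table, float minIn, float maxIn, int tableSize)
{
    alignas (16) float lanes[4];
    _mm_store_ps (lanes, x);
    for (auto& v : lanes)
    {
        const auto idx = (int) std::lround ((v - minIn) / (maxIn - minIn) * (float) (tableSize - 1));
        v = table[idx];
    }
    return _mm_load_ps (lanes);
}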
I re-ran the same timing comparison from above using SIMD (SSE 2.0 on Intel and 64-bit NEON on ARM). Since the C++ standard library doesn’t currently support SIMD math operations, I used the implementation of the sin function from the XSIMD library as a reference.
RAMP:
xsimd: 33.94 seconds
math_approx-7: 8.521 seconds (3.98x)
LUT-320k: 24.43 seconds (1.39x)
LUT_lin-500: 18.97 seconds (1.79x)
RANDOM:
xsimd: 38.34 seconds
math_approx-7: 8.705 seconds (4.4x)
LUT-320k: 25.76 seconds (1.49x)
LUT_lin-500: 19.17 seconds (2x)
This time the polynomial approximation appears to be more than 2x faster than the lookup table approximations.
Why?
At first glance these results might seem puzzling… how could a polynomial approximation that requires a whole bunch of multiply and add operations ever be faster than the basic lookup table that only needs to load a value from memory? Even with linear interpolation, the number of CPU instructions required by the lookup table is significantly smaller than the number needed for the polynomial approximation.
It turns out that accessing memory can be one of the slowest things that we ask modern computers to do! It just depends on where the memory is located.
Since the “basic” lookup table is so big (320,000 floats is over 1MB), it doesn’t fit in the fastest cache levels available on my computer, meaning that table accesses can be quite slow! The interpolating lookup table is small enough to fit into L1 cache, but even an L1 access takes several CPU cycles, meaning the polynomial approximation has plenty of time to get started on its computation while the interpolating lookup table is still waiting on memory.
I should also mention that my test program was pretty much the ideal scenario for a lookup table… Since the program was only computing sin(x) and nothing else, it is likely that the lookup tables stayed in cache the entire time (or at least as much of each lookup table as could fit in cache). In a real program, you’re probably going to be doing other things that require memory, which might evict the lookup table from cache. So it’s possible that the lookup table approximations could be even slower than these timing comparisons show, depending on how they’re being used in a real program.
Conclusion
Now despite what I’ve shown above, I’m not saying that you should never use lookup tables… I’m sure there are some instances in which a lookup table is in fact the most performant option. But I think it’s important to try out some alternatives to the lookup table as well, and measure what ends up being the fastest in your program. Remember that context matters, and that “idealized” benchmarks may not always reflect the complexity of real-world scenarios, especially as it relates to cache usage.
I’m going to link to an article from Nima Badizadegan’s “Speculative Branches” blog that explains some of the performance issues with lookup tables in greater detail, and provides some common-sense advice for when it may or may not make sense to use one.
I will say that one reason people rely on lookup tables as much as they do is that developing good polynomial approximations is hard. This is a much bigger topic that I might write about later on, but I’d recommend checking out a few open-source libraries that contain some useful polynomial (and other) approximations:
- fastapprox (by GitHub user “romeric”)
- fastmaths (by GitHub user “Tremus”)
- math_approx (by me)
Appendix: M1 MacBook Results
scalar_compare:
RAMP:
std: 33.27 seconds
math_approx-7: 22.82 seconds (1.46x)
LUT-320k: 25.30 seconds (1.32x)
LUT_lin-500: 29.75 seconds (1.12x)
RANDOM:
std: 38.70 seconds
math_approx-7: 5.393 seconds (7.18x)
LUT-320k: 7.496 seconds (5.16x)
LUT_lin-500: 12.32 seconds (3.14x)
----------------------------------------
simd_compare:
RAMP:
xsimd: 52.92 seconds
math_approx-7: 26.40 seconds (2x)
LUT-320k: 30.30 seconds (1.75x)
LUT_lin-500: 35.86 seconds (1.48x)
RANDOM:
xsimd: 41.99 seconds
math_approx-7: 9.329 seconds (4.5x)
LUT-320k: 21.64 seconds (1.94x)
LUT_lin-500: 18.91 seconds (2.22x)