Programming Performance Myths vs. Realities
When it comes to writing high-performance code, I’ve received a lot of advice over the years, and heard many qualified people express very strong opinions about what makes a computer program run fast or slow. For a young programmer coming into the field, it can be difficult to decide which advice and opinions to follow, so I’ve typically tried implementing my code in several different ways, and then measuring which way(s) perform the best.
I wanted to share some of the most commonly repeated performance tips that I’ve received over the years, and discuss which ones have held up under rigorous testing and which have not. Calling these myths vs. realities might be a little bit extreme (if a claim is true 90% of the time, is the 10% when it’s wrong enough to justify calling it a myth?), but hopefully the reader can look past the slight hyperbole.
Myth: “Virtual function calls are free/cheap”
This is a claim that I’ve heard most often from modern C++ developers, particularly those who tend to prefer a more object-oriented programming approach. The idea is that the overhead of a virtual function call is small enough that the programmer does not need to worry about it.
Fortunately, as with most performance questions, this claim is something that can be experimentally verified. Anton Ertl has published micro-benchmarks comparing different “dispatch techniques” across a number of different CPUs. In the general case, a virtual function call measures between 1.25x and 5x slower than a “regular” function call on most modern CPUs, or 1–2x slower than switch-based dispatch. So evidently virtual function calls are not “free”, but are they “cheap”?
The answer is that it depends. In one of the “worst” comparisons from Ertl’s benchmarks (the Ryzen 7 5800X), a virtual function call requires 24 cycles per dispatch, as opposed to 4.2 for a regular function call, for a difference of 19.8 cycles per dispatch. If the work being done per-dispatch is large (i.e. on the order of 1000s of cycles) then the overhead of the virtual function call is likely going to be negligible.
Function Inlining
For functions that are performing a small amount of work, there is another reason why virtual function calls can have a negative performance impact: the compiler may not be able to inline a virtual function call.
For some background, inlining is an optimization technique, in which the compiler replaces a function call with the instructions that would have made up the body of the function. Other optimizations may then take place, for example, moving an operation to occur outside, rather than inside, a loop.
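As a hypothetical sketch of that last point (my own example): a small helper hides some loop-invariant work inside its body, and the compiler can only hoist that work out of the loop after the call has been inlined.
#include <cmath>
#include <span>

float apply_gain (float x, float gain_db)
{
    // The dB-to-linear gain conversion is loop-invariant work,
    // but it's hidden inside the function body.
    return x * std::pow (10.0f, gain_db / 20.0f);
}

void process (std::span<float> data, float gain_db)
{
    for (auto& x : data)
        x = apply_gain (x, gain_db);
}

// Once apply_gain() has been inlined, the compiler can see that the std::pow()
// call does not depend on the loop variable, and may hoist it out of the loop,
// roughly as if we had written:
void process_hoisted (std::span<float> data, float gain_db)
{
    const auto gain = std::pow (10.0f, gain_db / 20.0f);
    for (auto& x : data)
        x *= gain;
}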
Now it is sometimes possible for some compilers to inline a virtual function call, as in the example below:
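Here is a minimal sketch of such a case (my own example, not taken from Ertl’s benchmarks): the dynamic type is visible at the call site, so the compiler can typically devirtualize, and then inline, the call.
struct Base
{
    virtual ~Base() = default;
    virtual int get() const { return 1; }
};

struct Derived final : Base
{
    int get() const override { return 2; }
};

int call_known()
{
    Derived d;      // the concrete type is known here...
    Base& b = d;
    return b.get(); // ...so this call can usually be devirtualized (and inlined)
}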
However, the compiler will quickly fall back to using indirect calls once the use-case becomes even a little bit more complex:
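Here is a minimal sketch of such a case (again my own example, reusing the Base and Derived types from above): the dynamic type of the argument is not visible at the call site, so the call generally remains an indirect call through the vtable.
// Using the Base type from the sketch above.
int call_unknown (const Base& b)
{
    // The dynamic type of `b` is unknown here, so the compiler
    // typically emits an indirect (vtable) call that it cannot inline.
    return b.get();
}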
Link-Time Optimisation
Link-Time Optimisation (LTO) might deserve its own myth-vs-reality section, but I wanted to mention it here since I’ve witnessed several misconceptions about what LTO can/can’t do with regards to virtual function calls.
As an example, consider a program that loads a set of dynamic libraries (DLLs). Each DLL provides a class which is derived from a common base class, and overrides several virtual functions. One proposal to improve the performance of such a system could be to compile the entire system to a single binary rather than loading external DLLs, with the rationale that link-time optimisation could help to improve performance, specifically by inlining function calls across translation units.
While there are many reasons why it might be preferable to compile the entire system to a single binary, LTO likely won’t make any measurable difference in the system’s performance, at least as it relates to the virtual function calls, since virtual function calls typically can’t be inlined.
“Compile-Time Polymorphism”
One alternative to traditional C++ polymorphism with virtual function calls is what I call “compile-time polymorphism” (first demonstrated to me by Eyal Amir). The typical use-case is when you have multiple types with a common “interface”, and you want to be able to loop through a bunch of them. For example, consider the following code, with two filter types, each of which is responsible for processing some “frame” of data:
struct Filter1
{
    void process (Frame& frame);
};

struct Filter2
{
    void process (Frame& frame);
};

struct Global_Filter
{
    Filter1 filter1;
    Filter2 filter2;

    void process (Frame& frame)
    {
        filter1.process (frame);
        filter2.process (frame);
    }
};
Now suppose we wanted to make Global_Filter more “generic”, so that it could be used with a “chain” of arbitrary filters. A common approach would be to use run-time polymorphism:
struct Filter_Base
{
    // Virtual destructor: derived objects are deleted through a base-class pointer below.
    virtual ~Filter_Base() = default;
    virtual void process (Frame& frame) {}
};

struct Filter1 : Filter_Base
{
    void process (Frame& frame) override;
};

struct Filter2 : Filter_Base
{
    void process (Frame& frame) override;
};

struct Global_Filter
{
    std::vector<std::unique_ptr<Filter_Base>> filters;

    void process (Frame& frame)
    {
        for (auto& filter : filters)
            filter->process (frame);
    }
};
// example usage
Global_Filter global_filter;
global_filter.filters.emplace_back (std::make_unique<Filter1>());
global_filter.filters.emplace_back (std::make_unique<Filter2>());
global_filter.process (frame);
While this implementation is functionally correct, the performance drawbacks that we’ve been discussing could make the processing code unacceptably slow. An alternative could be to use “compile-time polymorphism” as follows:
struct Filter1
{
    void process (Frame& frame);
};

struct Filter2
{
    void process (Frame& frame);
};

template <typename... Filters>
struct Global_Filter
{
    std::tuple<Filters...> filters;

    void process (Frame& frame)
    {
        for_each_in_tuple (filters,
                           [&frame] (auto& filter) { filter.process (frame); });
    }
};
// example usage
Global_Filter<Filter1, Filter2> global_filter;
global_filter.process (frame);
Now there are several drawbacks to this approach as well. For one, for_each_in_tuple() is not a standard function, so you’ll need to provide your own implementation (although several can be easily found online). More generally, the templated code will take longer to compile, and will generally be less flexible to work with.
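For reference, here’s one minimal sketch of for_each_in_tuple(), using std::apply and a C++17 fold expression (this is just one of many possible implementations):
#include <tuple>

template <typename Tuple, typename Func>
void for_each_in_tuple (Tuple& tuple, Func&& func)
{
    // std::apply unpacks the tuple elements, and the fold expression
    // calls func on each element in order.
    std::apply ([&func] (auto&... elements) { (func (elements), ...); }, tuple);
}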
That said, in terms of run-time performance, the compile-time polymorphic approach will perform better, and the difference will become more noticeable as the amount of work being done by each filter becomes smaller. I’ve used the approach outlined here to achieve better performance for critical tasks including real-time neural network inferencing and circuit simulation.
Reality: “Reading/writing local variables is faster than reading/writing member variables”
This statement seemed a little bit strange to me when I first heard it. Why would there be any difference between accessing different types of variables?
Consider the following code:
struct Filter
{
    float state {};
};

void process1 (Filter& filter, std::span<float> data)
{
    for (auto& x : data)
    {
        x = 0.5f * x + 0.5f * filter.state;
        filter.state = x;
    }
}

void process2 (Filter& filter, std::span<float> data)
{
    auto temp_state = filter.state;
    for (auto& x : data)
    {
        x = 0.5f * x + 0.5f * temp_state;
        temp_state = x;
    }
    filter.state = temp_state;
}
The only difference between process1() and process2() is that in process2() the filter state being read/written inside the loop is a temporary variable rather than a member variable. In my own measurements with this type of code, I’ve consistently seen the process2() variant of the code out-perform process1().
For a while, I was content with my measurements, and didn’t really feel the need to understand why the two implementations had different performance characteristics, but luckily Gustav Andersson gave a wonderful talk at the 2023 ADC conference where he discusses this example in more depth.
Myth: “Auto-vectorization can match/beat explicit vectorization”
For the purposes of this discussion, I’m thinking about vectorization in terms of SIMD (single-instruction, multiple-data) operations. While it is possible to write “plain” code in C/C++/Rust and hint to the compiler how it could be vectorized, this rarely works out well in practice, for two reasons.
First, I’ve found that the best performance for vectorized operations can typically be achieved by writing code that tells the compiler exactly what you want your code to be doing. This even includes writing “raw” SIMD intrinsics as opposed to using “wrappers” such as XSIMD or Eve (although those libraries are very useful). In a recent example, I was able to achieve a ~2x performance improvement when switching from an XSIMD implementation to a “raw” SIMD implementation.
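To give a flavour of what “telling the compiler exactly what you want” looks like, here’s a minimal sketch of a dot product written with raw SSE intrinsics (my own illustration, not the code from the XSIMD example above):
#include <cstddef>
#include <immintrin.h>

// Assumes size is a multiple of 4; a real implementation would handle the tail.
float dot_product_sse (const float* a, const float* b, std::size_t size)
{
    auto acc = _mm_setzero_ps();
    for (std::size_t i = 0; i < size; i += 4)
        acc = _mm_add_ps (acc, _mm_mul_ps (_mm_loadu_ps (a + i), _mm_loadu_ps (b + i)));

    // Horizontal sum of the four accumulator lanes.
    alignas (16) float lanes[4];
    _mm_store_ps (lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}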
Second, I’ve found that code which is intended to be auto-vectorized by the compiler ends up being very fragile, in that seemingly insignificant changes to the code or even updates to the compiler itself can wipe away many hours of hand-tuning an auto-vectorized algorithm. Matt Pharr of NVIDIA expressed this frustration quite eloquently in a 2018 blog post:
The problem with an auto-vectorizer is that as long as vectorization can fail (and it will), then if you’re a programmer who actually cares about what code the compiler generates for your program, you must come to deeply understand the auto-vectorizer. Then, when it fails to vectorize code you want to be vectorized, you can either poke it in the right ways or change your program in the right ways so that it works for you again. This is a horrible way to program; it’s all alchemy and guesswork and you need to become deeply specialized about the nuances of a single compiler’s implementation — something you wouldn’t otherwise need to care about one bit.
Myth: “Memory is free/cheap”
As with the myth regarding virtual function calls, the trouble with this one is that it is true… but only up to a point.
A useful example for thinking about this problem is the lookup table. Early on in my programming career, I was often encouraged to use lookup tables as a way to avoid computing “expensive” math functions. The idea here is that you’re trading off computational work for accessing some “pre-computed” value stored in memory.
Following that line of reasoning (and ignoring loss of precision), it might seem that nearly every math operation in your program could be replaced with a lookup table, thereby achieving some performance improvement. A quick experiment will demonstrate this hypothesis to be false. Simple operations like f(x) = 2 * x are almost always going to be faster to compute directly than to fetch from a lookup table, because fetching a value from memory has a cost. Notably, the cost of fetching memory depends greatly on where that memory is stored… in particular, accessing cache memory is typically 1–2 orders of magnitude faster than accessing main memory (RAM).
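To make the trade-off concrete, here’s a hypothetical sketch (my own, ignoring interpolation and precision) of the classic lookup-table approach for an “expensive” math function:
#include <array>
#include <cmath>
#include <cstddef>

// A 1024-entry table of sin(x) for x in [0, 2*pi), filled once at startup.
static const auto sine_table = []
{
    std::array<float, 1024> table {};
    for (std::size_t i = 0; i < table.size(); ++i)
        table[i] = std::sin (2.0f * 3.14159265f * static_cast<float> (i) / static_cast<float> (table.size()));
    return table;
}();

// Table version: a little arithmetic plus a memory access,
// whose cost depends on whether the table is currently in cache.
float sin_lookup (float phase) // phase in [0, 1)
{
    return sine_table[static_cast<std::size_t> (phase * 1024.0f) & 1023];
}

// Direct version: more arithmetic, but no memory traffic.
float sin_direct (float phase) // phase in [0, 1)
{
    return std::sin (2.0f * 3.14159265f * phase);
}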
The fact that accessing memory has limitations in both latency and bandwidth has important implications, particularly for systems that need to meet some real-time deadline. As pointed out by Sebastian Aaltonen, even if the host device for your program offers a lot of RAM, the amount of RAM that the program can access within the constraints of the deadline is severely limited.
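As a rough back-of-the-envelope sketch (the numbers here are hypothetical, purely to illustrate the point):
// Hypothetical figures for some real-time system:
constexpr double bandwidth_bytes_per_sec = 50.0e9; // ~50 GB/s sustainable memory bandwidth
constexpr double deadline_sec = 1.0e-3;            // 1 ms processing deadline

// Upper bound on the memory the program can touch per deadline:
constexpr double bytes_per_deadline = bandwidth_bytes_per_sec * deadline_sec; // = 50 MB
In other words, no matter how many gigabytes of RAM are installed, a 1 ms deadline on this hypothetical machine caps the amount of memory that can actually be streamed through at around 50 MB per block.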
So even if modern PCs have so much memory available that it may seem “free”, that doesn’t give you a pass to ignore how your program is using memory. In most cases the most performant implementation will also be the one that uses memory the most efficiently. For example, in some optimizations I was working on recently, reducing the program’s memory usage by ~3x coincidentally resulted in a 10% speed improvement as well!
Reality: “Platform specificity is good”
In my first job as a programmer, I was told that it was important for me to write “portable” code, which in that context meant that the code should work for most macOS and Windows CPUs from the previous 5–10 years. However, since that time I’ve written code for an incredibly wide range of platforms, including desktop and mobile CPUs, low-power embedded processors, and massive 64-core Linux servers. Writing code that works across all of these platforms is likely doable, but writing code that simultaneously achieves the best possible performance on every one of them seems like an impossible task.
Instead, I’ve found that it’s best to re-write performance critical code several times for different target platforms. Doing so allows each implementation to target specific optimizations that might not be available or might end up pessimizing performance on the other platforms. A common example of this in my programs is to implement separate DSP kernels for platforms that support AVX intrinsics as opposed to platforms that don’t.
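As a sketch of what that can look like in practice (my own hypothetical example, using compile-time selection; run-time CPU dispatch is another option):
#include <cstddef>
#include <span>

#if defined(__AVX__)
#include <immintrin.h>

// AVX build: process 8 floats per iteration (tail handling omitted for brevity).
void apply_gain (std::span<float> data, float gain)
{
    const auto gain_vec = _mm256_set1_ps (gain);
    for (std::size_t i = 0; i + 8 <= data.size(); i += 8)
        _mm256_storeu_ps (data.data() + i, _mm256_mul_ps (_mm256_loadu_ps (data.data() + i), gain_vec));
}
#else
// Generic fallback for platforms without AVX support.
void apply_gain (std::span<float> data, float gain)
{
    for (auto& x : data)
        x *= gain;
}
#endif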
That said, I think this type of optimisation is one of the few cases where the oft-quoted adage that “premature optimization is the root of all evil” really applies. Writing multiple implementations of the same code for different platforms makes the cost of changing that code later on much higher, since any changes need to be ported and tested across all of the implementations. With that in mind, I think it’s usually best either to start with a “generic” implementation, or to start by developing only on one platform. Then later in the development process, after all of the important decisions about functionality have been finalized, implementing and testing platform-specific optimizations can be done without too much risk.
To wrap up, I think it’s important to mention that all of the topics I’ve discussed above are complicated, with lots of nuance. There are certainly times when the advice I’ve given may not be applicable, so make sure to verify everything for yourself, with your own use-cases.
In any case, I hope this has been a fun read. Onward!