Introduction to Writing Efficient C/C++ Code
How can you make a C/C++ program faster? This question comes up all the time. Responses to such questions are frequently misguided, too smart by half, or just plain wrong. That is why I decided to write an article about a reasonably portable tried and true optimization techniques that are basic enough to be generic yet useful enough to yield significant performance gains.
As a general rule, in order of preference:
This is particularly important in code that the compiler can vectorize. Additionally:
Since Pentium MMX, vector integer operations have been supported in the x86 line of processors. These typically only support 32-bit integers, in the case off the original MMX instructions, 2 at a time. This means that using integers types that are not 32-bit in size will cause the code to not vectorize – casting a char to an int is typically expensive enough to undo the benefit of vectorization. The other aspect to consider when dealing with integers is pointer size. When iterating over an array in a loop, it is inefficient to use non-word-sized iterators. Pointers are word sized. Adding an offset that is non-word sized requires the iterator to be promoted, and then the new pointer applied. This adds significant overhead in loops and prevents vectorization.
Even if your SIMD vector unit can handle doubles, it can typically handle twice as many floats as doubles for the same operation in the same amount of time. There is a common misconception that most modern FPUs take no penalty from working with doubles instead of floats, because the internal computation is performed using double precision arithmetic anyway. That the FPU handles everything as doubles internally may or may not be true – but it is also irrelevant. There are at least 3 more important factors to consider:
Most modern FPUs are designed to handle SIMD/vector operations (often referred to as SSE on x86 or AltiVec on PowerPC). The SIMD registers are typically 128-bits wide, and allow for either 4 floats or 2 doubles to be packed in them. Regardless of whether floats are handled with double precision internally, a single vector operation will work on twice as many floats as it does doubles. In other words, it will still go at least twice as fast when performed on multiple data
When the data no longer fits on the CPU cache(s), a round trip to RAM is required to fetch it. This is typically an order of magnitude slower than fetching the data from cache. Floats are half the size of doubles. Thus, twice as many of them can be kept in the caches before they spill out. Thus, all other things being equal, if a double array doesn’t fit in the cache, it can be up to an order of magnitude slower to work with than a float array that does fit in the cache.
This is not such a huge issue any more, but Pentium III cannot vectorize doubles. SSE1 only supports floats. Pentium 4 and later x86 processors support SSE2 which can handle doubles, but this doesn’t negate point 1. above. Other processors’ vectorization capabilities (ARM Neon, PowerPC Altivec) may require additional considerations, so know your platform.
Use a good compiler for your target platform. Sadly, GCC isn’t great. It’s not even good. When it comes to tight vectorized loops and you don’t want to reach for the hand crafted assembler, I have seen performance boosts of between 4.5x (on P3 / Core2) and 8.5x (P4) when using Intel’s ICC compiler instead of GCC. IBM’s XL/C compiler for the PowerPC yields similar improvements on that platform, and Sun have their own optimizing compiler for the SPARC. They are almost certainly worth looking into. Learn your compiler’s features and switches. Enable reporting of warnings and remarks. Listen to what the compiler is telling you, and work with it. It is on your side. If it says that it couldn’t vectorize a loop, establish why, and if possible, re-write the loop so that it does vectorize. Performance benefits from this can be huge.
Vectorization is essential for getting most out of the modern processors. Unfortunately, even the best compilers often need a bit of help from the developer with assorted little tweaks to help them to vectorize the code, but they are generally not ugly, and are most of the time limited to one of the following:
static unsigned int i; static float ii; static float baz; for (i = 0, ii = 0; i < 16; i++) baz[i] *= ii; // vectorizes //baz[i] *= i; // doesn't vectorize
You may also feel clever here and decide to try to cut a corner with the old counting down hack, by rewriting the loop:
for (i = 16, ii = 16; i--;) baz[i] *= --ii;
In theory, this should yield code that is a few bytes smaller, and a little faster. Unfortunately, the benefits of this are questionable at best, and counter-productive more often than not. CPU Caches are typically optimized for forward rather than backward reads, and performance has been observed to actually reduce when using this trick on some otherwise superb compilers. At the very least benchmark your code/compiler combination to see if there is a worthwhile benefit to be had.
Consider your divisions carefully. Divisions are expensive. If you are dividing by the same value multiple times, get 1 / value and multiply by that instead.
Move as much code as you can outside a loop. If there are parts of your calculation that can simplify to a single expression, calculate it outside the loop.
Declare variables as static where possible. This especially goes for large arrays. Static variables are allocated on the heap, rather than on the stack, and only get allocated once. That means that if you keep calling a function, a dynamic array will keep getting malloc()-ed and free()-ed, and this is expensive. A static array will not. If you know the maximum size you’ll need for an array, static declare it once, and re-use it.
For example, where powers of 2 are involved: Integer Modulo:
x % 8 x & 0x7
Bitwise AND to truncate everything but the modulo bits. Integer Multiply:
x * 8 x << 3
(Big-endian) Left shift by n to multiply by 2^n Truncating floats: *((unsigned int*)&b) &= 0xFFFFFF00; Truncates float b to 16-bit precision (8-bit exponent)
If your function parameters are changing partially, evaluate them partially and cache the results for each part so you don’t have to re-calculate. For example, if your function is something like: Y = a * (bX^2 + c) + d; Arrange your loops so the outer-most one works on the inner-most evaluation (in this case X*X, followed by multiplication by b, followed by addition of c, followed my multiplication by a, followed by addition of d in the outermost loop). You can then cache the values of X*X (which is, incidentally, much faster than (pow(X,2)), b*X^2, b*X^2+c, etc, so when your outer parameters change, you don’t have to re-calculate the inner terms. How you design your caches is also important. This can cause more overhead than it saves, so you have to optimize it very carefully while keeping your underlying algorithm structure in mind. For a quick and dirty implementation you can use the STL map, along the lines of:
typedef map Cache1; Cache1 MyCache; MyCache[a] = MyFloatArrayPointer
This will enable you to establish the cache hit rates, but the chances are that the performance will be too slow for real world use. Write your own, custom caching storage/lookup algorithm specifically suited to your application. If you don’t need the extra precision, consider a float truncation technique, as described above. Floats can be slightly unstable with extreme compiler optimizations enabled. This may disrupt the cache keys and cause a miss when there should be a hit. Truncating a float will reduce the possibility of this happening.
Write the optimized code for your specific application yourself. Where performance isn’t an issue, 3rd party libraries are great for rapid development or proof of concept prototypes, but there is a price to pay in terms of performance when using generic code vs. bespoke code specifically optimized for a particular task.
Learn the compiler switches for your compiler. Test the accuracy of your resulting numbers. When you start cutting corners (e.g. “-ffast-math -mfpmath=sse,387” on GCC or “-fp-model fast=2 -rcd” on ICC) you may get more speed, but sometimes the precision on your floating point variables will reduce. This may or may not be acceptable for your application.
All the optimizing tweaks in the world won’t fix a bad algorithm. Optimize your design before your implementation. For example, if you are doing a calculation that is a modulo of a power of 2 on an int, use a bit-wise AND instead of a modulo operator. It is an order of magnitude faster. (e.g. instead of X %= 4, do X &= 0x3). It is impossible to summarise all possible optimization techniques, and they will differ from project to project and won’t all be appropriate all the time. But hopefully – this will get you started on the right path.