Choice of Compilers – Part 1: X86
In a previous article about writing efficient code, I briefly touched upon the subject of compilers. In this article, we will look at the major C++ compilers available for Linux, compare their performance and cost, and highlight any pitfalls encountered.
For this review, the following platforms will be used.
Platforms:
|Athlon XP||1.53 GHz||RedHat 9|
|Pentium III||1.49 GHz||RedHat 9|
|Pentium 4 Celeron||1.70 GHz||RedHat 9|
|Core2 x86-64||1.89 GHz||CentOS 5|
Compilers tested on CentOS 5 only:
|Pathscale PathCC||3.0, CentOS 5 only|
|Sun’s SunCC||12.0, CentOS 5 only, glibc too old on RedHat 9|
- A full x86-64 software stack is used on the Core2 system.
- Due to library version requirements, SunCC is only tested on the Core 2 system.
- Only one trial licence was available for PathCC, so it was only tested on the Core 2 platform. Because this is an x86-64 platform, the static binaries produced did not work properly on the IA32 platforms due to library dependencies.
- All of the above compilers were the latest stable versions available at the time of writing.
Benchmarks are largely meaningless. The only worthwhile benchmark is your specific application. If you are reading articles on this site, the chances are that you are a developer, and that you already know this, but I still feel I have to stress it. The results here are for my application. It was chosen because that is the heavy number crunching application I have worked on recently, and not for reasons of testing any specific set of features on any of the hardware or software mentioned.
Now that the boring disclaimer is over: the application in question fits curves to arbitrarily sampled data. It has 4 innermost loops for fitting 4 curve parameters, and it implements its own partial caching using a technique described in a previous article on this site. The code implements most of the optimizations also mentioned in that article, in order to improve vectorization and the speed of the application.
The loops for curve fitting act as a good test for vectorization capabilities of compilers, and the operations performed in those loops are a combination of trigonometric functions and multiplication / addition. The caching implementation also throws in a pointer chasing element. All the iterators are unsigned integers and all the curve parameters and sampled data are single precision floats.
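As a rough illustration of the style of loop involved (a hypothetical sketch, not code from the actual test application; the names and the model function are invented for this example):

```cpp
#include <cmath>
#include <vector>

// Hypothetical sketch of the kind of inner loop described above:
// single-precision floats, unsigned iterators, and a mix of
// trigonometric calls with multiplication and addition. The names
// and the sine model are illustrative only.
float sum_squared_error(const std::vector<float>& samples,
                        float amplitude, float frequency, float phase)
{
    float error = 0.0f;
    for (unsigned i = 0; i < samples.size(); ++i) {
        // sinf() mixed with multiply/add -- the pattern that
        // exercises a compiler's vectorizer.
        float model = amplitude * sinf(frequency * static_cast<float>(i) + phase);
        float diff  = samples[i] - model;
        error += diff * diff;
    }
    return error;
}
```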
The nature of the algorithm for curve fitting in the test program is iterative. Different compilers with different optimization parameters suffer from different rounding errors in floating point operations. These rounding errors cause minor differences in convergence, and affect the number of iterations required to converge the solution. This may give one compiler an unfair (and coincidental) advantage over another, so instead of total time taken to converge, the results will be given in terms of iterations/second. Each iteration does the same amount of work, so this is a more reliable performance metric.
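In code, the metric boils down to something like the following minimal sketch (using modern C++ timing for brevity; do_iteration is a stand-in for one pass of the actual fitting algorithm):

```cpp
#include <chrono>

// Minimal sketch of the iterations/second metric: run a fixed number
// of iterations, time them, and divide. do_iteration stands in for
// one pass of the fitting algorithm.
template <typename Func>
double iterations_per_second(Func do_iteration, unsigned iterations)
{
    auto start = std::chrono::steady_clock::now();
    for (unsigned i = 0; i < iterations; ++i)
        do_iteration();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return static_cast<double>(iterations) / elapsed.count();
}
```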
Selecting the correct compiler switches is important for achieving the best performance from your code. They can also break things and make the compiler produce results that are quite clearly wrong. The switches used in the tests were the ones found to produce the fastest code without wildly inaccurate results (some numeric instability is acceptable, as long as it is limited).
The compiler switches are split into two parts for each compiler: a common part, which is the same for a given compiler on all platforms, and a machine-specific part, which varies with the processor. The switches are heavily based on what the compiler developers deemed a good, fully optimized combination (taken from the compiler’s -fast shortcut switch or equivalent references in the documentation). Some options, however, were added, removed, or changed because they produced poor results in either accuracy or speed.
For GCC on the Core 2, the Nocona target architecture is used, since this version of the compiler has no Core 2 specific optimizations and Nocona is the latest Intel CPU it supports.
GCC:
|Common||-O3 -fno-strict-aliasing -ffast-math -foptimize-register-move -frerun-loop-opt -fexpensive-optimizations -fprefetch-loop-arrays -fomit-frame-pointer -funroll-loops -ftree-vectorize -ftree-vectorizer-verbose=5 -fpic -Wall|
|Athlon XP||-march=athlon-xp -mtune=athlon-xp -mmmx -msse -m3dnow -mfpmath=sse -m32 -malign-double|
|Pentium III||-march=pentium3 -mtune=pentium3 -mmmx -msse -mfpmath=sse -m32 -malign-double|
|Pentium 4||-march=pentium4 -mtune=pentium4 -mmmx -msse2 -mfpmath=sse -m32 -malign-double|
|Core 2||-march=nocona -mtune=nocona -mmmx -msse3 -mfpmath=sse -m64|
PGCC:
|Common||-O4 -Mfprelaxed -Msingle -Mfcon -Msafeptr -Mcache_align -Mflushz -Munroll=c:1 -Mnoframe -Mlre -Mipa=align,arg,const,f90ptr,shape,libc,globals,localarg,ptr,pure -Minfo=all -Mneginfo=all -fpic|
|Athlon XP||-tp=athlonxp -Mvect=sse|
|Pentium III||-tp=piii -Mvect=sse|
|Pentium 4||-tp=piv -Mvect=sse -Mscalarsse|
|Core 2||-tp=core2-64 -Mvect=sse -Mscalarsse|
ICC:
|Common||-O3 -ansi-alias -fp-model fast=2 -rcd -align -Zp16 -ipo -fomit-frame-pointer -funroll-loops -fpic -w1 -vec-report3|
|Athlon XP||-msse -xK -march=pentium3 -mcpu=pentium3 -mtune=pentium3 -cxxlib-icc|
|Pentium III||-msse -xK -march=pentium3 -mcpu=pentium3 -mtune=pentium3 -cxxlib-icc|
|Pentium 4||-msse2 -xW -march=pentium4 -mcpu=pentium4 -mtune=pentium4 -cxxlib-icc|
|Core 2||-msse3 -xP|
- The Athlon XP uses code targeted for the Pentium III because ICC doesn’t specifically support AMD processors. However, the Athlon XP and Pentium III share the same instruction set, so it’s close enough.
- ICC 9.1.051 doesn’t support Core 2 specific optimizations. The latest it seems to support is Core 1. ICC 10.x supports Core 2, but at the time of writing of this article it was not yet deemed to be of production quality, so the older, stable version was used.
PathCC:
|Common||-O3 -fno-strict-aliasing -finline-functions -ffast-math -ffast-stdlib -funsafe-math-optimizations -fstrength-reduce -align128 -fomit-frame-pointer -funroll-loops -fpic -gnu4 -Wall -LNO:vintr_verbose=ON|
|Core 2||-march=core -mcpu=core -mtune=core -mmmx -msse3 -m64 -mcmodel=small|
SunCC:
|Common||-xO5 -xalias_level=simple -fns=yes -fsimple=2 -nofstore -xbuiltin=%all -xdepend=yes -xlibmil -xlibmopt -xunroll=4 -xprefetch=auto -xprefetch_level=1 -xregs=frameptr -xipo=2 -xldscope=global -Yl,/usr/bin -Qpath /usr/bin -features=extensions|
|Core 2||-xarch=sse3 -xchip=opteron -xtarget=opteron -xvector=simd -m64 -xcache=native -xmodel=small|
- -Yl,/usr/bin and -Qpath /usr/bin are work-arounds for problems with the SunCC linker that make it fail to link some dynamic libraries. The system ld is used instead when these parameters are specified, which may adversely affect performance – but it was the only option available to get the test completed.
- RedHat 9 did not have sufficiently up to date libraries to use the compiler, and attempts to produce static binaries for testing failed. This is why this compiler was only tested on CentOS 5.
- The Opteron target architecture was used because no Intel x86-64 targets were documented. The Opteron target worked fine, though.
Let’s churn out some numbers, shall we? The percentage is the score normalized relative to GCC (GCC = 100%; higher is better).
|Athlon XP||GCC||1973 i/s||100%|
|Athlon XP||PGCC||1653 i/s||84%|
|Athlon XP||ICC||5192 i/s||263%|
|Pentium III||GCC||1812 i/s||100%|
|Pentium III||PGCC||1590 i/s||88%|
|Pentium III||ICC||6231 i/s||344%|
|Pentium 4||GCC||1169 i/s||100%|
|Pentium 4||PGCC||1130 i/s||97%|
|Pentium 4||ICC||8437 i/s||722%|
|Core 2||GCC||3270 i/s||100%|
|Core 2||PGCC||2715 i/s||83%|
|Core 2||ICC||16814 i/s||514%|
|Core 2||PathCC||3600 i/s||110%|
|Core 2||SunCC||3212 i/s||98%|
A quick note about PGCC’s performance – the test code uses some C math library functions (mainly sinf()). These are not implemented in PGCC in a vectorizable form, and I am told that this is why its performance suffers. There are reassurances that this is being actively worked on and that a fix will be available shortly, but it was not going to be available before my trial licence expired. Having said that, GCC doesn’t currently have sinf() in vectorizable form either, and it still came out ahead. Another thing worth pointing out is that PGI seem to focus more on building distributed applications on clusters. This review covers no such application, so PGCC may still be a good choice for your application – it just isn’t a good choice for mine.
So, what do the results seem to say? Looking at the directly comparable results, PGCC’s performance is consistently 12–17% behind GCC, except on the Pentium 4, where it almost catches up and trails by only 3%. Both GCC and PGCC seem to suffer when moving to the faster Pentium 4: their performance drops by around 30% despite a clock speed increase of 13%. This is a rather unimpressive result. ICC’s performance leaves GCC’s and PGCC’s in the realm of embarrassing. Not only does it outperform GCC and PGCC by 2.63x–7.22x and 3.14x–7.47x respectively, but its performance increases on the Pentium 4 instead of decreasing as it does with GCC and PGCC. And not only does it increase, it increases by 20% relative to the clock speed, despite the Pentium 4 Celeron having a much smaller Level 2 cache than the Pentium III in this test (128KB vs. 512KB). This is a truly impressive result. Not only does it show that ICC is a very good compiler, it also shows that a lot of the perception of the Pentium 4’s poor performance comes from poorly performing compilers rather than from badly designed hardware.
On the Core 2, PGCC closes the gap from being 7.5x slower than ICC on the Pentium 4 to a marginally less embarrassing 6.2x slower. GCC remains safely ahead of PGCC in performance, but its lag behind ICC has increased (5.14x slower) compared to the Athlon XP and Pentium III results (2.63x and 3.44x respectively).
Sun’s compiler seems to be somewhat lacking in performance; its results are roughly on par with GCC’s. Pathscale’s compiler, however, manages to beat GCC and take 2nd place behind ICC. This is quite a respectable achievement, considering that parts of the PathCC compiler suite are based on GCC.
Looking at the logged compiler remarks, it is clear that ICC gains a massive advantage from knowing how to vectorize loops. Since the code in question uses only single precision floating point numbers, full vectorization performance is achieved on all of the tested processors, and just about all inner loops in the code vectorize. Even though Portland explain that PGCC currently lacks vectorizable versions of more complex math functions (e.g. sinf()), the compiler also failed to vectorize much simpler loop operations such as addition and multiplication, choosing instead to unroll the loops. The response from PGI support is that unrolling loops with small iteration counts is deemed faster than vectorizing them. This seems to be wrong, since both GCC and ICC vectorized these simple loops and outperformed PGCC.
Diagnostic output about loop vectorization did not seem to be available from SunCC, as the only similar switches seemed to be related to OpenMP parallelization which was not used in this test.
For those that don’t know what vectorization is all about – you may have heard of SIMD: Single Instruction, Multiple Data. The idea is that instead of processing scalars (single values) one at a time, you can process whole vectors (i.e. arrays) of scalars simultaneously, provided the operation you are performing on them is the same. So, instead of processing four 32-bit floats one at a time, you pack them into a vector (if you have them in an array, or they are appropriately aligned in memory) and process the vector in parallel. x86 processors have had this capability since the Pentium MMX for integer operations and since the Pentium III for floating point operations. Provided that the compiler knows how to convert loop operations on arrays into vector operations, the speed-up can be massive.
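To make the idea concrete, here is a hand-written sketch of the same transformation using SSE intrinsics. It assumes an x86 processor with SSE, 16-byte aligned arrays, and a count divisible by 4; a real vectorizing compiler inserts alignment checks and a scalar remainder loop to handle the general case.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Scalar loop: one float per iteration.
void scale_add_scalar(float* out, const float* a, const float* b,
                      float k, unsigned n)
{
    for (unsigned i = 0; i < n; ++i)
        out[i] = a[i] * k + b[i];
}

// The same computation with SSE intrinsics: four floats per iteration.
// This is roughly what a vectorizing compiler emits for the loop above.
// Assumes 16-byte aligned arrays and n divisible by 4.
void scale_add_sse(float* out, const float* a, const float* b,
                   float k, unsigned n)
{
    __m128 vk = _mm_set1_ps(k);              // broadcast k to all 4 lanes
    for (unsigned i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);      // load 4 floats at once
        __m128 vb = _mm_load_ps(b + i);
        __m128 vr = _mm_add_ps(_mm_mul_ps(va, vk), vb);
        _mm_store_ps(out + i, vr);           // store 4 results at once
    }
}
```

Both functions compute the same results; the SSE version simply does a quarter of the loop iterations.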
Not all of the compilers reviewed here are free. Some are completely free (GCC, SunCC), some have free components due to being based on GPL-ed code (PathCC), some are free for non-commercial/non-academic use (ICC), and some are fully commercial (PGCC).
For the non-free compilers, the costs listed on the vendor’s web sites are:
|ICC||$449 (free for non-commercial/non-academic use)|
A few other issues have been encountered with the compilers during testing. They were resolved, but I still felt they should be mentioned.
PGCC’s licensing engine proved to be problematic and got in the way of using the compiler even when the instructions were followed correctly. The problem turned out to be caused by a broken component on the PGI servers for generating the trial licence keys. This led to a day or so of frustration, but was resolved in the end. While I understand that companies have to look after their intellectual property, Intel’s licensing engine worked much more simply and without any problems. Whereas PGI require a key to be generated on their web site and downloaded to a local file with the correct permissions, Intel simply provide a hex string key that can be pasted in at install time when setting up ICC. Intel’s method worked flawlessly. GCC and SunCC don’t have any licence control mechanisms, so this type of issue doesn’t apply to them.
SunCC required -features=extensions to correctly compile the test application, which took a bit of figuring out and a query on the support forum.
Intel, Sun and PGI all provide an online forum where any support issues can be raised and advice sought. In all cases support has been quite fast and at least informative, even if the issues couldn’t be resolved. The nature of the forums allows help to be provided both by the support staff and more experienced end users. For GCC there is extensive community support and mailing lists.
The choice of compiler can make a huge difference to the performance of your code. If speed is a consideration (and in some applications it may not be), then the chances are that you cannot afford to ignore the benefit that can be gained by switching to a different compiler. On the x86 platform, ICC is completely unchallenged for the title of producing the fastest code, and it is free for non-commercial use. It is also worth mentioning that ICC produces the fastest code on AMD’s x86 processors, too.
On the most current x86-64 platform tested (Core 2), ICC produces code that runs in the region of 5x faster than the competing compilers. If it is a cost-based choice between buying 5x faster (or 5x more) hardware and buying a better compiler, upgrading the compiler wins by an order of magnitude.