Grok The Compiler

The compilers optimizer is widely touted in the literature by proponents of good design to do a really good job anyway, so why bother. -- BlackHat

The compiler's optimizer is really good and if you are not better do not bother, please do not bother. If however (like me:) you are really good then do bother, because you can make frequently a large difference. If (Like me :) you are REALLY good you will know that what you do to optimize code is usually fundamentally evil and should not be done any more than is needed. One of the ways that you know tuning code is evil is it is fun. -- YellowHat

So why do we have to Grok the compiler?

Well short of breaking out and writing assembler, you have to understand why the compiler didn't just write the fast code in the first place. What you have to do is write your high level code such that the compilers optimizer is capable of optimizing it. Doing this is actually harder than writing the assembler but it is more maintainable and portable. If the optimizations were made on the basis of GrokTheLanguage and not GrokTheMachineArchitecture then the optimization ports as well.

So What is it about compilers that must be Grok'ed? Unfortunately, that information is stored in context-indexed memory and most of it I can't remember it until I need it.

A Short List of major known compiler characteristics is shown below.

-- AlanChristiansen

I generally discourage this type of optimization. Sure, it's "portable" - but the optimizations don't generally port across. If portability is a concern, then the "general compiler" will optimize best that code which most accurately expresses its intent. Many compilers know how to combine base and offset into a single pointer variable; by hand-converting to a pointer variable you can say goodbye to some other optimizations which may have been hiding. Other optimizations make the code fly on one compiler and drag on the next. GrokTheCompiler is a dangerous game if you really do care about portability, and irrelevant if you don't.

Optimizations of this nature are generally late in the day, when you have exhausted all your algorithmic ideas. The speed gain is a small factor. Much better to buy a better compiler, or downcode the inner loop (i.e. use a more suitable compiler, or use a more suitable language).

-- EddieEdwards

I both agree and disagree with you. I always discourage any optimization of the code if reasonable algorithm optimization can get you there instead. However the optimizations I propose are "portable" in the sense that all compliant compilers compile the tricks I am referring to. It is reasonably portable in that I have not required the sorts of extremities of compliance used in BlitzPlusPlus.

The optimizations do port, in that

are problems that a better compiler usually cannot fix. Taking advantage of these optimizations requires wholistic knowledge that only a human can provide. For example

If the programmer writes:

    for(it = c.begin(); it!=c.end(); ++it) {....}

then the compiler usually cannot know that the operations in {....} do not change the result of the c.end() function from one iteration to the next. If the programmer knows that it does not change, the programmer can copy it into a local variable. This optimization ports.

The fact that the programmer must 'KNOW that it does not change' is the reason I always discourage optimization of code. The obvious risk is that the maintainer may not know, and the patch in version 3.7.13 may change things so that it is not true.

The above paragraph sounds like a good reason to have an assertion in the loop body to, err, assert that c.end() doesn't change during the loop. UseAssertions.

Yes algorithm gains of factors can be in the order of 10^9 and are big, but changes in the coding of just a few routines have speeded up applications I have worked on by factors of between 10 to 20. These factors are smaller than the algorithmic ones.

Yes I agree that writing the inner loop in assembler in the final optimization It is in fact easier to do that writing really fast C++ but then, where I work I really would be the only one who knew what it meant. I would also have to write sparc and pentium versions, I would also have to learn about optimal instruction scheduling, which is a task I currently leave to the compiler. Although, perhaps there, is a peephole based optimizing assembler that shuffles assembler instructions for you. In my experience, my best C code gets within less than a factor of two of my best assembler code unless there is severe bit twiddling involved (eg DES), that uses rol and all sorts of other cute Pentium machine instructions.

GrokLoops is an optimization that I think compilers could do, but I have never found one that does.

Your warnings about how some optimizations effect one compiler and architecture one way and another architecture another is a timely reminder of where and why I started the page the way I did. If you are not better than the compiler, do not optimize code. If you are writing portable fast code you have to better than two compilers on two different machines at once. If, as was the case once with me, the project is 3 months' work it may be worth 1 day to write the 10 or so lines of code that execute 95% of the time blindingly fast. I use two metrics when I do this. I compile to assembler. I do this to check what it thought I meant it to do. Then I time it to make sure it still scheduled the instructions reasonably.

-- AlanChristiansen

What annoys me more than most of the annoying things I have to deal with day to day, is the fact that a lot of these optimizations should be possible in the compiler, but most commercially-available compilers are lacking in this area.

In the ComputerGamesIndustry, there is a real need for excellent optimizing compilers. Not only for IA32 but also (mainly?) for the MIPS chips and the various vendor-specific vector units.

With compiler technology where it is, we're moving back from mostly C/C++ and a little assembler, towards a lot of assembler again.

So I guess my objections to GrokTheCompiler are based more in theory than in practice. Where compilers are inadequate it is probably better to "work around" this by "getting to know" your compiler, than it is to downcode everything.

Maybe, rather than GrokTheCompiler, a largish games company should WriteTheCompiler?.

-- EddieEdwards

Grok was good compiler. Thog make good programs with Grok. -- Thog

WARNING WARNING DANGER DANGER

Be aware that many things that you may think a good compiler should do may not be legal to do and still be ANSI compliant.

Please read GrokAliasing for one of the major restrictions on what a C compiler can do. Aliasing (lack of) is I believe one of the big reasons Fortran is fast.

If you still think compilers should do better please give examples of C source code and assembler. If it gets too big to put here email me. Note optimization of this kind happens in the inner loop ONLY so a neat little cut down example should be possible.

-- AlanChristiansen

I've been down this road before. I give you an example of some code which the compiler clearly should optimize, then you give me the winner of an obfuscated C contest which breaks the optimization. I'd rather have a dialect of C++ which wasn't 100% ANSI compliant but did the obvious optimizations anyway.

Now, if there were obvious optimizations which had REASONABLE ways to break them, that would be different ... but there aren't. All the ways to break obvious optimizations involve pointer aliasing.

Example: one guy I talked to told me that of course the compiler couldn't change this to a downcounter:

 for (ii = 0; ii < d; ii++)
 {
   *dptr++ = *sptr++;
 }

because what if dptr points near d? This would be a bug if I wrote it, and if someone who worked for me wrote it to deliberately change d, they'd be fired ASAP ;) -- EddieEdwards

It's a bug regardless. In C++ (and probably C) pointer arithmetic is only well-defined within a single object or array. So if 'd' is a normal auto integer whose address is never taken, and we nevertheless change 'd' through 'dptr', we already have undefined behaviour and the compiler can do anything it wants.

The real problem here is to ensure that dptr and sptr don't point to overlapping memory, because that is both allowed and sensible (The intent may be to zero out the array.) (However, the optimizer can produce 2 versions of the loop, a fast backwards one for when there is no overlap, and a slow forwards one for use when there is. Which raises issues of bloat and code cache size, of course.) -- DaveHarris

pardon my ignorance, but why is the backwards version faster?

Standard debug C build of this code might become:

mov ii,0 b eol loop: ld tmp,sptr st tmp,dptr inc sptr inc dptr inc ii eol: cmp ii,d blt loop

A faster implementation (usually!) and the one an assembler coder would "naturally" write is:

mov ii,d beq skip loop: ld tmp,sptr st tmp,dptr inc sptr inc dptr dec ii bne loop skip:

Converting an up-counter to a down-counter is faster, because the compare is implicit on a CPU with flags.

So, here the C++ aficionados can help me.

In working, practical C++ code, how many times do people actually use aliasing in a compiler-defined way? This boils down to the following: two pointers must be taken to an array; they must be then sent to a function which does not know the pointers point into the same arrays. And the function must not be memcmp, memcpy, strcmp or any standard C library function (since these are always optimized and always handle aliasing correctly). Of course, this can happen in a widely separated fashion - e.g. a pointer is taken at the start of day, and another taken later on, so even the author of the program might "forget" that aliasing happens.

Is this really a problem for modern designs, or just a problem for legacy code?

Also, do people take pointers and keep them (e.g. pointers into an array belonging to a class - I guess the char*[] operator on string counts) or do they only take pointers temporarily and then lose them? Seems to me that holding onto internal pointers is extremely dangerous.

The ultimate question is: would a hypothetical C++ compiler with no constraints regarding legacy code really need to lose these optimizations to aliasing?

I'm reading TheDesignAndEvolutionOfCpp right now, and Stroustrup mentions that CRAY wanted some additional alias-forbidding keywords to enable their vector optimizations (I'm interested in the same area). Anyone here know how/if CRAY resolved these difficulties?

-- EddieEdwards

CRAY (and other supercomputer vendors) often resolved the problem by using FORTRAN for serious performance-oriented code.

Stroustrup mentions that. It's worth considering, provided you can solve the linkage problems. FortranLanguage is, at least, easy to learn.

Having recently come from a high performance SGI shop (remembering that SGI bought Cray), I have done a high performance optimization workshop run by SGI and they teach GrokTheCompiler combined with GrokThePlatform? as gospel.

Note that the SGI compiler literally has thousands of command line options for tweeking its multiple optimization stages (it does both language optimizations as well as object code optimizations). Yes you can turn on "non-ANSI-compliant" optimizations, yes you can also include various #pragma's to permit the compiler to make smarter optimization decisions.

Of course the #1 source of speedups (provided you are already using a good algorithm) is eliminating cache misses and TLB flushes. You can do all the assembly tweeks you like, but you'll loose all your gains if you add just one extra cache miss. And if you should end up with an extra TLB flush? Well you're really slowing things down.

Targets for optimization include:

UnrollLoop? (allows more EfficientCacheUse?)
ReverseLoopOrder? (fundamental in EliminateCacheAliasing?)
AdjustArraySize? (can EliminateFalseCacheAliasing?)

Of course always ProfileFirst! -- AndraeMuys

Occasionally it becomes necessary to GrokTheHardware? (or GrokTheCpu?) as it relates to the compiler's optimizations. The classic example is DuffsDevice.

In one memorable scenario (a comms program), I had to keep a pre-assembled patch that I would load in a debugger and overlay on the compiler's output. I had a routine that contained essentially junk code that had a known compiled size where we could search for the byte pattern and paste in the (assembled) fix. Heady stuff.

Interfacing to hardware is probably not a fair measuring stick for a compiler, but I keep running into variations on this theme.

-- GarryHamilton

CategoryOptimization CategoryCompilers