My profiling tool tells me:
When I first profile a slow program, I usually get some methods whose method time is outrageously high. This is where people do stupid things. As the stupidity is limited to one method, I can rewrite that code and fix it. Then I look at methods whose method time is high, but only because they are called so many times. By looking up the stack from there, I can sometimes find a place where people are calling other methods far too often instead of caching results. Once I have eliminated those places, I get to the stage people call UniformlySlowCode - the code is slow, but there are no blatant hot spots.
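To illustrate the caching fix, here is a minimal Java sketch (the class and method names, PriceReport and expensiveLookup, are invented for the example, not taken from any particular code base). The caller was hitting the expensive lookup once per row, so its call count dominated the profile; remembering each result in a map removes the hot spot without touching the lookup itself.

 import java.util.HashMap;
 import java.util.Map;

 class PriceReport {
     private final Map<String, Double> priceCache = new HashMap<>();

     // Before: every row called expensiveLookup(sku) directly, so the
     // lookup's call count (and total method time) dominated the profile.
     double priceFor(String sku) {
         // After: compute each distinct price once and reuse it.
         return priceCache.computeIfAbsent(sku, this::expensiveLookup);
     }

     // Hypothetical stand-in for the genuinely costly call.
     private double expensiveLookup(String sku) {
         return sku.length() * 1.0; // placeholder work
     }
 }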
At this stage, I look at the cumulative times. Some methods are expected to have high cumulative times, e.g. main. Some methods should not, and it is only by knowing the code intimately and having a suspicious nature that you can detect these. After you look at the code and think about it, you sometimes realize that the person who wrote it was an idiot and did it all wrong, but they were such a smart idiot that to fix it you are going to have to change large chunks of the program. Fixing it requires architectural change.
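The suspicious shape I mean looks something like the following sketch (hypothetical names throughout, assuming Java): the method's own time is tiny, so it never shows up as a hot spot, but its cumulative time is huge because it quietly rebuilds something expensive on every call. The obvious fix is to build the index once and share it, but in real code that object is usually created deep inside other classes, which is why the fix becomes an architectural change rather than a one-method rewrite.

 import java.util.HashSet;
 import java.util.List;
 import java.util.Set;

 class OrderSearch {
     // Very little self time here, so it never looks like a hot spot...
     boolean contains(List<String> orders, String id) {
         // ...but the cumulative time is enormous, because a fresh
         // index is rebuilt for every single query.
         OrderIndex index = new OrderIndex(orders); // should be built once and shared
         return index.lookup(id);
     }
 }

 class OrderIndex {
     private final Set<String> ids;
     OrderIndex(List<String> orders) { ids = new HashSet<>(orders); }
     boolean lookup(String id) { return ids.contains(id); }
 }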
At this stage you mercilessly apply OnceAndOnlyOnce, reusing other parts of the code wherever possible (as always). You then profile again, and look for methods with high method times. Chances are, some new method has appeared as a hot spot, and you can optimize that. That's good, because optimizing a single method is much easier than architectural change.
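As a sketch of what applying OnceAndOnlyOnce buys you here (all names invented): when two report classes each carry their own copy of a formatting loop, the cost is smeared across both in the profile; folding the duplicates into one shared method concentrates the cost in a single place, and if the next profile flags that method it is an easy, local optimization.

 class ReportFormatting {
     // The one shared copy; if the next profile flags it, there is
     // exactly one method to optimize.
     static String formatRow(String[] cells) {
         StringBuilder row = new StringBuilder();
         for (String cell : cells) {
             row.append(cell).append('\t');
         }
         return row.toString();
     }
 }

 class DailyReport {
     String render(String[] cells) { return ReportFormatting.formatRow(cells); }
 }

 class WeeklyReport {
     String render(String[] cells) { return ReportFormatting.formatRow(cells); }
 }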
As the architectural changes proceed you still have a uniform distribution of execution time, but you move from uniformly slow to uniformly fast.
I think that, overall, optimizing later is a good technique, but by using the right architecture in the first place you can decrease the number of architectural changes and increase the number of simple method optimizations needed to achieve fast code.
-- JohnFarrell