Uniformly Slow Code

Our code is too slow but ProfileFirst doesn't show any hotspots.

If execution speed is not a problem, our processor time may still be uniformly distributed, but it's not UniformlySlowCode. What makes it UniformlySlowCode is not the absence of bottle-necks alone, it must also be too slow.

Uniformly distributed execution speed is usually a result of repeatedly applying ProfileFirst and OptimizeLater. This approach produces diminishing returns with each iteration. If our code is too slow, and it looks like we'll lose to diminishing returns before reach our performance target, then we are approaching UniformlySlowCode. Approaching UniformlySlowCode, like approaching the speed of light, is more difficult to the closer we get. Once we have determined that we are approaching UniformlySlowCode we need to think of an alternative strategy.

It's often difficult to see the bigger picture when optimizing. We might think that spending an afternoon shaving off a few cycles here or there is well worth our time. It might be (but only if our performance goal is also within a few cycles.)

Although we usually find UniformlySlowCode at the end of an optimizing phase, it is not necessarily optimal code for either the machine or the language at that stage. Sub-optimal, but evenly distributed, performance can result from: bad compilers, a bad mismatch between program and architecture, or an inefficient coding style for a particular language. In such cases there is little you can do after the fact that doesn't involve a substantial amount of ReWriting.

Up front experimentation in performance of the compiler, the platform, and the language, allows us to write coding guidelines for the project that will allow us to get further before we hit UniformlySlowCode.

Example: In C++, passing objects to all functions by value, and returning by value can worsen UniformlySlowCode due to the constant construction and destruction of temporary objects.

(In C++1x, move semantics mean that passing and returning by value can actually be the faster option: http://cpp-next.com/archive/2009/08/want-speed-pass-by-value/)

Example: QuakeGame by IdSoftware, compiled using Acorn C on the ARM600. Quake uses float for everything. The ARM CPU has no FPU. All floating point is emulated by the OS via the compiler.

Example: A CRUD-database application that makes one RPC call across the internet per data row and must load multiple data rows per screen and the RPC mechanism is so terrible it takes 3 seconds per call. It's atrociously bad but the profiler can't find the problem. (The RPC used mass generated code so each call listed independently in the profiler.)

Example: A LispLanguage compiler, written in LISP and pseudo-compiled. The runtime library is slow. The compiled code is slow. CLOS (LISP's object system) is slow.

Why was the last example deleted?

I didn't delete it, but I can speculate. I think it was deleted because the thing described (a LispLanguage compiler written in Lisp) actually exists and the performance is quite reasonable. In fact, the Lisp compiler in cmucl fits the description, and compiles Lisp code much more quickly than a C++ compiler written in C (GnuCpp) compiles C++ code. It's a bogus and inflammatory example. See LispStrawMen.

Of course, you cannot compare compile times without remarking on optimization settings. GCC, or any C/C++ compiler, is pretty fast (in execution time) on -O0 settings (no optimization) - it's -O3 that makes things take forever. (Having a big pile of templates is another way to slow things down, of course). The example given above concerned pseudo-compilation...I'm assuming of the compiler itself, not of the compiled code. (Unfortunately, the above example does confuse the two - a slow compiler can and often does produce fast code whereas a fast compiler can and often does produce slow code. Given the expense of many optimization strategies; that is what is often expected).
At any rate, 'tis probably would have been better had the original deleter remarked why. (Or better yet if the post had remained but a comment challenging it were issued, allowing the community to discuss the merits of the claim above.)
If it's intended as a critique of Lisp, it's clearly a straw man. It is evidently not true that all compiled Lisp implementations are "slow". The example is basically saying "imagine if there was a really slow Lisp implementation. Whew, that hypothetical implementation sure is slow!" It applies equally to a slow implementation of CeePlusPlus. OTOH, if it means literally what it says - "a slow implementation of Lisp would be slow" - then it would serve as an adequate example of a program that is uniformly slow because of the environment in which it runs, rather than any fault of the code itself.

Alternatively you could try changing the architecture. This may be simpler and cheaper. In the Quake example, play it on a different machine, In the Lisp example, write a better runtime.

A lot depends on one's objectives. OptimizeLater is, generally, an effective strategy. Like all strategies it will occasionally not succeed. If the problem really is a mismatch between program and architecture then IterativeDevelopment with an early, real, deliverable will show this fastest before you go too far down the wrong path. -- TomAyerst

Apple's port of Smalltalk-80, running on a 512K "Fat Mac" in 1985, was unacceptably slow. SqueakSmalltalk, which is essentially the same port of Smalltalk-80, runs on modern PowerPC Macs quite quickly.... the major change is that the CPU speed is now over 100 times faster (and the CPU architecture is different, too.)

Please give an example of an optimization to be applied "throughout the program". Discuss in terms of ExactlyOnce. Thanks. -- RonJeffries

LazyEvaluation?

I'd expect lazy evaluation to occur locally, although I suppose you could have to go through a lot of CreationParameter? methods and do the inits. It'd hardly be global, though, I'd think. And that would just move the delay to the beginning of the operation rather than the middle, not reduce the time, n'est pas?

Ok not very global, I should have given context. I agree that this is not the best example, just a quick one off the top of my head. I was recalling a project where l.e. would have been a great optimization, if only there weren't all ready 200k lines relying on a rather convoluted memory management scheme that involved a bit of g.c. - this made it nearly impossible to make the local changes work. As I said, not the best example as this project had so many things wrong with it...

InAllMyYearsIveNever seen a program that was UniformlySlowCode. I have optimized later programs in which the hotspot moved, to a few locations as the first ones now went so fast that they vanished, but UniformlySlowCode, never seen such a thing.

I have found a program that when built the simplest way that would work, had to almost completely ripped to shreds in order to make it fit around a new algorithmic optimization that had serious consequences for lots of other things. In this sense the optimization was throughout the program.

I have run into UniformlySlowCode using the method you yourself advocate. The problem is that solving one hotspot often results in multiple other hotspots, whose relative intensities are smaller. As you solve these hotspots, their relative effects are felt elsewhere, again distributed amongst other modules. After several iterations of playing hotspot whack-a-mole, you get this sort of cosmic background hotspot, in other words, a uniformly slow program. To go any faster, you either need to boost the underlying virtual machine's performance, the underlying processor's performance, or adopt a totally different language altogether for those modules which need improvements in performance, or (often preferably) rewrite the entire application to utilize a new central metaphor and different overall logic. -- Samuel A. Falvo II

It occurs to me it might be amusing to see this visually... render the codebase in some suitable scale, colour-coded according to how hot the profiler says it is under test. Accumulate these renderings over time for an animation. Watch entropy redistribute that heat...

You can often go far by finding another metric for hotspots... another 'view' of the cosmos, one might say. E.g. rather than focusing on how much time is spent in a code section, perhaps consider how much data flows through a code section (which may be relevant if specializations can often be applied that might be much faster), or latencies introduced by a code section (very relevant for realtime efforts). Often a different metric will find you 'hotspots' that can speed up the entire program (or a big chunk of it) relative to the previous metric.

You mention real time. Yeah, the quantity of data being bandied about is always a suspect in real time stuff, particularly in the embedded world where we have very limited resources. I had a client last year who was trying to use 6800 microprocessor technology (such as it is) to solve 21st Century control system problems. They were passing relatively large amounts of information back and forth amongst software and hardware components. When I prompted them about changing the mechanism for data exchange I got the usual BS about fixed protocol, installed product base, man-years of effort, yada yada yada. Nonetheless, I was able to rub their noses in the fact that sending all that useless supporting data around when the actual control process didn't need it was a waste of valuable time and memory. Hmph. Might as well talk to the wall.

I'm looking for specifics, related to OnceAndOnlyOnce, because I suspect that if the optimization goes throughout the program, it's because something is spread all over the program that could (and therefore should) have been centralized.

How bad were the ripple effects? It has been 3 years and I still have not worked out all the ramifications and how to make the rest of the application live with the structural consequences of that one optimization. (When I do, the algorithm will go REALLY fast.)

I'll jump into the fray and interject something here. I'm not speaking from experience, but it seems a good example of this would be a runtime system (like a JavaVirtualMachine) that employed a simple, conservative, stop-the-world MarkAndSweep GarbageCollector. The rest of the system doesn't have to know about GC at all for such a scheme to work. But modern, state-of-the-art GC involves generational, compacting, concurrent collection using a "write barrier", and implementing the write barrier involves code all over the VM. It may just be a single method call, but it has to be put in lots of places. Such optimizations do exist, but they're rare. -- GlennVanderburg

I've seen UniformlySlowCode just in the past year. But the problem wasn't an OptimizeLater approach. The problem was a shockingly naive architecture, terribly mismatched to both the problem and the implementation technologies, combined with a big bang development process that didn't even try to run a single test until the system was almost "complete" and changing the architecture was an incredibly expensive option.

OptimizeLater and DoTheSimplestThingThatCouldPossiblyWork are not synonyms for stupidity. -- GlennVanderburg

In answer to an example of an optimization to be applied throughout the program:

I think system-wide optimizations are primarily techniques that you wish you knew when you started and now want to apply. Some C/C++ examples:

use of inline/#define functions
pass parameters by reference instead of value
move functionality into database stored procedures
change error handling to using exceptions
change to a different API set (for example, move from ADO to ODBC)
change from one third party module to another, from third party code to in-house code, or vice-versa
move to a different OS

Essentially, these are improvements in the way you do things, not an improvement in how you implemented one specific function. I don't think this is a violation of OnceAndOnlyOnce, just an aknowledgement that we learn new things as time goes on and there may be value in applying what we have learned to the existing system. -- WayneMack

On modern hardware, the "always pass by reference" guidance is a performance killer. Modern hardware can pass vector types in registers (assuming your business case requires substantial use of them, as video games do), so you want to pass those by value. Passing by reference on the other hand requires the "object" to be flushed to memory so that an address can get pushed into a register. This is very costly, especially since you probably need that vector in a register right away in that function you just called.

Error handling cannot be said to be made considerably faster with the use of exceptions, given that the use of exceptions introduces additional non-trivial overhead to all of the code. I.e. it makes it UniformlySlowCode.

And yet the manual 'if' checking all over the place is demonstrably and significantly cheaper than exceptions?

A well-written exception handler will most definitely give you improvements over traditional error checking. Ignoring the fact that it makes it impossible to miss an error (if you must fail, fail loudly and quickly), it can make the code faster, especially when the error condition triggers the unwinding of large chunks of stack; instead of re-checking your error flag each time you tear down a stack frame (and having to resort either to thread-unsafe global errors or in-band error values) the compiler is free, for example, to stick a single goto in the raising context to the appropriate catch block. Compilers are optimized for the non-exceptional code path, and it pays to keep in mind that your compiler writer is probably Smarter Than You Are—and if he isn’t, I would suggest (plead, even; we need the best we can get) that you get a job writing compilers.

My big example of UniformlySlowCode is extensive use of indirection. Factories and virtual functions, when used throughout the codebase, tend to introduce a lot of extra waiting. Sure, they solve a problem, but a big movement these days (in the game industry, where "winning" in the economic sense is often contingent on better utilization of available resources than your competition) is to try to change the problem so that such a solution is not required. --TomPlunket

I don't doubt the existence of UniformlySlowCode but I don't see an immediate solution which invalidates the OptimizeLater approach.

Let's assume, for the sake of argument, that the system architect is reasonably astute and has come up with an architecture which they expect will solve the problem. While it may be possible, from past experience or advice from wiser souls, to determine which architectural decisions would lead to known performance problems (they key word being "known") we can assume that these features will not appear in a reasonable design. So we end up with an architecture which to the best of the architects knowledge will solve the problem and has no known performance issues.

At this point we go ahead and develop the code. It may run fine, it may be really slow (as stated above the difference between uniformly fast and uniformly slow is a point of view). It won't be known until the system runs (or at least a prototype runs - if we are not confident in the solution). This is the time to make decisions about optimization. The worst case is that the architecture chosen is yet another example to add to the catalogue of "architectures that perform poorly" and hopefully others will learn from the experience.

So what am I trying to say? I think the point is that an architect should generally be providing the best solution that they know how to provide. If the solution ends up being inadequate then it will be optimized and the architect will have learned something. I don't see how the architecture could be optimized earlier if it was the best that the architect (and the resource available to her) could provide. -- PeterSumskas

Here's the problem with your assumption: The customer may not listen. I've seen so many cases when architectures were designed and proven to be a poor match and the customer still insisted on using these poorly suited architectures.

Every single high transaction volume project I've been involved in has run into the same damn problem: The database. The customer wants too much for what the databases can do. Or, they stubbornly refuse to switch to a database platform that will do the required job. This even includes object databases where they're appropriate.

The problem with OptimizeLater is that it assumes that the customer will actually want to optimize. In many cases, this won't happen because the customer will lose face having forced such an inappropriate architecture to be used. However, this does not stop this pattern from repeating itself. -- JeffPanici

We ended up with UniformlySlowCode using the OptimizeLater strategy. Well, almost uniformly slow, anyway. The reason? It became later, and we optimized out the hot-spots. Bang, now it all ran at about the same speed, and since we had gotten the needed performance optimizations, we quit. Dunno what we would have done if that hadn't been enough. Probably gotten more creative... -- JeffBay

My first professional programming assignment (in 1981) was to fix a business application written in Rexon Business Basic that was uniformly slow (and uniformly very buggy, including wiping out its own data files). All files were accessed by record number, requiring dozens of sort and search routines to ensure that the records were in "business key" order within each file. Business basic doesn't even support parameterized subroutines, so OnceAndOnlyOnce would have been a difficult option. (But attractive if I had elected to keep all those silly sort routines.)

So I went through the program and changed all files to use indexed sequential keyed access, as supported by the language and operating system. The system really should have been written to use this in the first place. The original author, when asked, said "oh, I didn't know it could do that." (Remarkable ignorance; RTFM!!!!!!!)

In a sense, I optimized by delegating file key management to the language/operating system, which was a way to achieve OnceAndOnlyOnce.

-- JeffGrigg

I also did a lot of work in dBASE-II, which was well known for being uniformly slow (and very buggy, as far as development languages go).

But hey; who needs to go fast in a GUI business application?

Seriously: A "slow" dBASE application was always faster than manual paperwork. -- JeffGrigg

I've seen code which appeared to be UniformlySlowCode. The profiler's top entries were 10%, 10%, 9%, a few more such numbers. Then I looked at the problem a little differently, refactoring to make sure a few suspicious things were OnceAndOnlyOnce. Once they were, and I profiled them, it became clear that I had been wrong; many modules used some slow bits, but only the slow bits had to be fixed, not all of the modules. (This is similar to the idea at the top of this page regarding the use of slow floating-point math throughout a program.) Having fixed the newly-obvious important parts, the speed was greatly enhanced. Trying to optimize everywhere would have got me nowhere by comparison. -- KyleCordes

I've tuned code that was 10%, or even 1%, of the total CPU usage. A 1% performance improvement that takes me an hour is a productive use of my time. The RefactorLowHangingFruit approach works well for performance issues. Besides, that kind of tuning gives me insight into Java performance characteristics.

"I've seen so many cases when architectures were designed and proven to be a poor match and the customer still insisted on using these poorly suited architectures."

Unless the architecture has an InstallBase? in the millions and a virtual MonoPoly? on the form factor and price range. To specialize an example above: You try getting QuakeGame to run on GameBoyAdvance?, which has an ARM CPU clocked at 16.8 MHz. -- DamianYerrick?

Two cases seem to be intermingled here:

Where the code has some architectural (or other) inefficiency that is widely-distributed throughout the code; such that it doesn't show up on a profile. Use of OS-traps for floating point; compiling C++ with inlining disabled, etc. In some cases, these can be rectified easily; in other cases a redesign is called for.
Where the chosen platform (CPU, memory, network topology, etc.) is simply inadequate to the task. Playing Quake on a GameBoyAdvance?, for one example. Trying to run a weather simulation for tomorrow's forecast on a PalmPilot for another.

The way you deal with one case differs greatly than the way you deal with the other case.

Please give an example of an optimization to be applied "throughout the program".

Quite often I hear people suggest re-writing a program in some other "faster" language. I assume this is an artifact of college classes that teach programming in some language or another. In class, students see only programs written entirely in one language. So they assume that if one section of a program *needs* to be re-written in some other language, then the entire program must be re-written in that other language.

Such people might benefit from learning about AlternateHardAndSoftLayers.

Please give an example of an optimization to be applied "throughout the program".

Switching from AvoidConstCompletely to ConstCorrectness (or back again) tends to require changes throughout the code (because ConstIsaVirus).

I've seen several programs that use something like

  #ifdef NDEBUG
  #define CONST const
  #else
  #define CONST
  #endif

to make it relatively easy to switch back and forth from AvoidConstCompletely to ConstCorrectness, and then *measure* (in the profiler) to see if the extra hassle makes any significant difference. (If it doesn't make any significant difference, we might as well ShipWithAssertionsOn).

Ugh. Show this guy #define const - Joshua
Better still, use -Dconst= on the command-line. It works better, and requires precisely zero source files to be altered. -- Samuel A. Falvo II

Folk, this page is rapidly approaching CategoryRant status. Here we are in 2011, surrounded by enough inexpensive compute power that software performance is almost a non-issue. Even in the realm of EmbeddedSystems we are a lot less concerned about memory space, CPU speed, and throughput than we used to be.

The modern day solution to a system that is uniformly slow is to throw some more compute hardware at it. Geez Louise, guys! I mean! My cell phone has more power than all the development stations I used to create Class II and Class III medical instruments over the last 35 years combined! Why are we still wasting time talking about this?

This is not to say that proper analysis and optimization of code isn't still a valid topic -- it most certainly is. Optimization is a necessary and proper part of our job as professionals. However, it is an ever-diminishing aspect of what we do. Modern design techniques, software development tools, and supporting hardware/OS platforms make the need for squeezing every last CPU cycle out of the system a thing of the past. Let's keep this in mind as we discuss this stuff, okay? Good golly.

-- MartySchrader

[It certainly just got a lot closer....

But I can tell you that there are still times when throwing more hardware at a problem isn't the best solution (or a possible one). Two recent examples come to mind. One is where the customer simply didn't want to purchase new hardware. It wasn't a particularly bright move on their part; but they were set in stone, so you had to do the best you could within the parameters they gave you. The other, a team of six people spent months to bring about a 1% increase in performance. It was more than paid for by the reduction in hardware costs.]

Hmm. If the client simply won't pay for more compute power even after you "run the numbers" for them, then that's what you have. Bummer.

If the other product ran faster with a decrease in hardware cost then it wasn't really an optimization project, was it? It was cost reduction with a side order of run faster. Was the speed increase a byproduct of the new hardware?

[I think you misunderstood. The decrease in hardware costs was caused by the program running faster. Because we optimized the program, we were able to run it on cheaper hardware and still meet the performance/capacity needs.]

Please explain. Are you saying that on the project where you increased the performance by 1% you could also decrease the cost of the hardware? Really? How was the performance increase measured? What hardware was replaced and with what new hardware to decrease the cost and yet still meet the overall performance requirements? The 1% number seems pretty slim to justify decreased hardware capability.

[Yes, really. Some parts of the service run on hardware where the CPU utilization is what was purchased, not the actual hardware itself. For those parts, the performance was measured in CPU usage. A 1% increase in performance is approximately a 1% decrease in the hardware costs. Other parts of the service run on large server farms. For those parts, the performance was measured in terms of throughput per server. A 1% increase in performance is approximately a 1% decrease in the number of servers needed to meet the same number of requests. Here, the savings were realized by repurposing the servers for other needs rather than buying new ones, and not having to buy new servers when the request volume increased. I'll also add that these savings are ongoing, and will continue to be so until enough "code rot" (for lack of a better term) occurs to nullify the performance work.]

Efficiency of our programs also impacts power consumption, battery life, heat dissipation. Mobile computing will only become more popular and pervasive. A cell phone of today might be a supercomputer of years ago, but we should still eliminate inefficiencies from our programs.

Mobile devices are fixed in hardware. So there is no way to ThrowSomeHardwareAtIt?.

The other thing with mobile apps now days is update-pressure. Most of the apps are updated to new versions every month. And forcing-to-update is not uncommon. UniformlySlow? applications when extended ( FeatureCreep ) will become even slower. To the point where they become unusable. I do believe that the mid 2013 facebook app runs smooth on current hardware. But it does not on my 2 year old ZTE Blade. But a 3rd party facebook client, Atrium runs just fine. ( showing that there is a possible solution ) The same is true for most of the apps from Hangouts to Viber . I did not install new software. Just went with the mandatory updates along the way.

In my opinion Performance is an issue to get ever more important, And for me, if i can not justify my UniformlySlow? code by math, then it is on the top of my bug list.

-- ThomasMartinKlein

My laptop started out chugging away quite happily running Windows XP and several applications. After some years' worth of Microsoft updates, it ended up taking over an hour to simply finish booting. I do not exaggerate: it really took that long to settle down enough to start using it.

See: PhasesOfOptimizingLater, TddInLamp? (such as RubyOnRails or Django. ay-yi-yi how could leading-edge test runs GET ANY SLOWER???)

JuneEleven

CategoryOptimization, CategoryCodingIssues