Simple Profiler

If you are ever stuck in a language which doesn't have a ProfilerTool, you can always build your own.

Here's a quick and dirty design document for a simple CodeProfiler? when you need to RollYourOwn.

You need two functions: entryPoint(string[, string]), and exitPoint(string[, string]). You can make these global functions, module level functions, or methods of a singleton SimpleProfiler class.
Your language needs functions to record the current time; the more accurate the better, milliseconds is good.
In whichever functions you want to profile, place entryPoint("functionName") at the beginning of the function, just after the function signature and before any other code. Put a matching exitPoint("functionName") at each return statement and at the end of the function if it has no return value.
It is important that the return statements have no side effects, and that they do not call any other functions. If your code currently has something like 'return otherFunc(param);', you need to replace it with 'temp = otherFunc(param); exitPoint("myFunc"); return temp;'. This can get really messy, but hey, it's a simple profiler. Sometimes 'simple' is better than 'none'.

The entryPoint/exitPoint functions basically record the call stack for the functions you're profiling. entryPoint(functionName) adds a record holding functionName and the current time to a global/static stack object. It also counts the number of times functionName has been called. exitPoint(functionName) uses functionName to verify that the calls are properly bracketed. It also calculates the elapsed time and adds it to the total elapsed time for functionName.

From this setup, you can determine: Which of the monitored functions are being called most often, which of the functions are using the most time, and the average time spent in the function per call.

From this information, you can profile your code starting at the top-most function and going down into the sub-functions that are the bottlenecks. When you reach a function that you want to optimize, you need to profile it to pinpoint the trouble spot. You have two options: Refactor the function into smaller functions and profile each of those; or use bracketed calls to entryPoint(functionName, sectionName) and exitPoint(functionName, sectionName) to break the function into sections, each of which is profiled separately. Sometimes it's best to do a combination of the two.

Extensions

You can actually do some pretty sophisticated analysis with just these two functions. However, the most useful statistics tend to be: Time spent in function only, time spent in function plus sub-functions, number of calls to function, average in-function time per call, average in-plus-sub-function time per call. It would be useful to have the SimpleProfiler automatically capable of outputting sorted lists of these statistics in a report.

Disadvantages

The big disadvantage is that the profiling code itself will slow down the program, sometimes skewing the results for small-but-quick functions that happen to be called a lot. If the timing code inside entryPoint and exitPoint is properly bracketed, this skewing can be reduced quite a lot.
It's a pain to add all these function calls to your code... and then remove them later!
It tends to make your code a lot uglier.
It tends to break elegant one-liners into step-by-step multi-liners. Sometimes this is a good thing.

Workarounds

If the language supports conditional compilation, you can leave the profiling code in without slowing down your released code. Otherwise you'll have to comment out the calls (which can be easy with Perl).
Also, AspectOrientedProgramming could allow you to automate this as well as only turn it on when you want to.
Well-factored code tends to suffer less from the cosmetic changes.

Hrm. If a SimpleProfiler is this straightforward, you hardly even have to write code, you can do it with simple log4j setup. Because log4j messages can be prepended with a timestamp (either absolute or relative) and include where it was call from, it would be easy to just put in calls to do this for you. Step by step (refer to the log4j website for details -- http://jakarta.apache.org/log4j/docs/index.html)

Create a Category appropriately named, perhaps profiler
Configure an Appender for that Category to output timestamp and calling function
Associate the Category and Appender at the DEBUG level
Code calls to Category.log(Priority.DEBUG) at entry points and exit points
Run you program
Analyze the output (the hard part)

This has the advantage that when you want to disable the profiling, just turn off log4j with appropriate configuration settings.

--StevenNewton

The original SimpleProfiler description above was written for times when you're working with a language that doesn't support this kind of thing (there are many). The SimpleProfiler can work pretty much anywhere.

Is it me, or could you just make a perl script to call a method with the line number of every line, and print that to a file, then read back the file to find out which lines get the most use?

For added accuracy you could have the method output time and line? Wouldn't this be better then simply timing the method as a whole? --DavidSalamon

Just putting in my two cents (and paraphrasing Brian Kernighan) the best profiler is no profiler. All you need is a "pause" key. Don't confuse measurement with diagnosis. If you want to find how fast or slow it is, then do that. On the other hand, if you want to find out what's making it slow, just pause it a few times and look around. If something dumb is consuming some fraction of time (say 30%) then there is a 30% chance that you will see it on any given pause. Fix it, collect your 30%, and keep going. Typically the problem is a needless function call that a profiler will never see.

OK, sorry to be so blunt. The problem with this profiler, and nearly every other profiler, is that it concentrates on functions, as opposed to function calls. If function X is active on the call stack 30% of the time, then you could save up to 30% of execution time by drastically optimizing the function. This might not be easy to do. On the other hand, if a CALL to function X is active on the call stack 30% of the time, then eliminating that call would also save you 30%, and that might be easy to do. If your profiler could keep track of the line numbers from which the calls occurred, it might be able to tell you what percent of time each statement was active. That would work, but it is much simpler just to hit the "pause" key, and you don't have to bugger the code at all. You don't need a jillion samples either. In fact, the first time you see a statement appear on the call stack in more than one sample, it is a possible candidate for optimization.

--MikeDunlavey

This is a good bottom-up approach. One additional caution, though: sometimes the bottleneck is an OS call that takes up a lot of time, but which also ignores/disables your pause key. Be sure to look at the statement(s) immediately leading up to the pause point as well.

Agreed. Actually, it means that you cannot "see into" the OS call, but the layers above you in the call stack still contain good information. Example: you could see that 3 levels up, you could be calling down to this level multiple times when once would do the job. A related problem is, in MS Windows, it masks out the pause key until a window message is complete, or some other gross time point. I've got another approach for this, but I didn't want to blather on.

Small point: The word bottleneck gives me a twinge, because it doesn't have a good definition, that I know of. Hotspot is another that is commonly used, but I think it has a good definition, namely as a region of code where the PC is often found. By that definition, hotspots practically never contain function calls, because if they did, the PC would not be there. If you take an inefficient chunk of hotspot code, and put a function call into it, it is still inefficient, but it is no longer a hotspot. In my experience, most inefficient code fragments do contain calls, so a tool that looks for hotspots will not find them. I made up the term slug for any chunk of inefficient code, whether or not it contains calls. Maybe there is a better word. Then, the method I outlined below could be called "dissecting slugs"! :-) -- DougKing

Corollary gripe: compilers make a big deal of their optimizations. Optimizations are basically pointless, except in hotspots. And, except when you really do need that extra 5% speed improvement (i.e. hard real-time).

All compilers ever do is optimize... they start with code that would be very slow to interpret and turn it into code that the processor can handle far more efficiently. They reduce runtime cost at the expense of build-time and (sometimes) load-time costs. They SHOULD make a big deal of their optimizations. Other than correctness and ease-of-use, it's the only selling point for a compiler. And the more useful, the more powerful the optimization, the better... a compiler that can perform algorithm transformations to reduce O(N^2) algorithms to O(N * lg(N)) algorithms without violating programmer intent is fundamentally better than one that can only reduce a cost(N)=(8*N^2+9*N+10) algorithm to a cost(N)=(7*N^2+6*N+5).

Whether a particular 5% speed improvement is worthwhile depends only on the cost of getting the 5% improvement compared to the lifetime runtime cost of the produced code. If you're paying per CPU hour for a supercomputer, and expect to spend a million CPU hours running the program, that 5% improvement translates into real money. If you're in a race with a competitor to acquire and patent protein-folding results, and the cost of being a hundred hours ahead or behind can amount to millions, that 5% improvement translates into real money. Similarly, if you're paying per developer hour, and the developer is compiling the code, the time spent compiling translates into real money. If compiling and optimization effect (money saved later > money spent now), then it was worth it.

You mention hard real-time... that, too, measures as money. The curve, however, is not flat (payment per CPU hour); rather, the cost of taking a particular time-slice is very high, and skyrockets if it violates the real-time parameters.

Also, bottleneck is well (i.e. formally) defined. A process capacity bottleneck exists when one process within a chain of processes slows the whole chain (i.e. causes a wait, or isn't fast enough to (on average) keep up with the work load). Sometimes speeding up the process works to repair a bottleneck, sometimes duplicating the process helps, and sometimes they're impossible to overcome.

Hot spot seems not so well defined... not formally defined, at least. There is no 'point' at which you can clearly say a spot is 'hot', so the definition is vague. I'd say, however, that if you simply counted the number of times a PC touches a line of Assembler over the course of some runtime stage, you'd learn where the hotspots are. You called it "where the PC is often found", and the best way to formally calculate that would be to count where the PC is found.

This is a fun exchange. I admit the world of communicating asynchronous processes places a different set of demands, so if multiple processes end up blocking on a common resource, that certainly is a bottleneck. Back to the single-thread world. To me, hotspot is quite well defined, because if you sample the PC at random time points, and a large percentage of the samples find the PC in a small region of code, that region is a hotspot. I remember when this was actually done to test memory chips by making them hot in particular spots. Back to optimization. Certainly there is a type of application where hotspots are really important, because all the really hairy work happens there, and you want every cycle sqeezed out of that code. I've worked on lots of software, and very little of it had that character. Mostly, it consisted from hundreds to thousands of routines put together by several programmers, with staff turnover, and when some aspect of it was slow enough to be alarming, I would go in, do the pause technique, and find the typical call stack depth of 10 or more. (Maybe this is crummy code, but there's plenty of it out there.) At the bottom of the call stack, where the PC was, it was in malloc(), or free(), or write(), or some math library routine, none of which I or anyone else had the slightest desire or ability to optimize. Further up the call stack, however, we found lots of nonsense, like lazily calling the same lower level function multiple times when once would do. (Typical comment: "That's a really useful function!") These mid-level functions were repaired, getting enormous speedups. Back to compiler optimization: Suppose for simplicity a normal instruction takes 1 cycle, and a routine A takes 100 cycles, except for one instruction, which is a call to a subroutine that takes 10e6 cycles to complete, what do you gain by squeezing cycles out of the caller's code? This especially bugs me with Fortran compilers because they intensively scramble the code to save cycles, making it hard to debug, even though the PC may practically never be there. In any case, I agree, when I'm down to writing a really tight hotspot myself, I do appreciate a compiler that gives me good, clean, tight assembler code. As for changing my algorithm around, my personal preference is for tools that aren't smarter than me :) --MikeDunlavey

You said: "To me, hotspot is quite well defined, because if you sample the PC at random time points, and a large percentage of the samples find the PC in a small region of code, that region is a hotspot." Aye, that seems a fine approach to find hotspots, but to claim it is well defined by this approach requires that both "large percentage" and "small region of code" be well defined. Definitions are funny like that. It's not too hard to get quantitative about that, like for example defining the hot-spot-ness of a set of addresses as the fraction of time the PC is found in them divided by the number of addresses in the set. You can choose an arbitrary operational definition. I'd suggest, with the one you offered, multiplying by the number of codespace addresses in the process (so the total equation == 1.0 for any code region if the PC touches every address the exactly once). I'd also consider using 'contiguous region' rather than 'set' to avoid combinatorial explosion regarding calculations of possible hotspots. This would allow you to discern hotter spots from colder ones, certainly... but there'd still be no particular number you can definitively call a "hot spot" without harming conceptual correctness (e.g. if you choose 10, then is 9.9999 not a hotspot?)

Now, measuring hotspots by the temperature of the memory chip? heh. That's certainly a literal approach.

Anyhow, if all you're using a compiler for is the built-in type checker (cost of error is also very high, after all) with the dubious additional benefit of reducing text code into machine code (because you don't care about speed, especially at the cost of platform flexibility), then what you really want is dedicated correctness testing software (e.g. for unit tests, finding common errors, typechecking, etc.). Well, certainly I don't want my compiler to make the caller's code so slow that it does irritate me. Actually I think languages serve a much more serious purpose, that I can't get into here. That's silly... any 'good' compiler wouldn't be making the code slow! (But I won't speak of the evil compilers. "Muahaha, I found a N^5 algorithm to replace your N^2 algorithm!!!" Maybe people who dream up Intercal would write such a compiler.) I think you should redirect if you're interested in language discussion. I certainly am... language theory is among my areas of interest.

You ask: "what do you gain by squeezing cycles out of the caller's code?" Answer: Not much. But it's still worthwhile if the gain is greater than the cost of the optimization. Fundamental rules of economy still apply. The total cost of calling the function, of course, is the cost of it + cost of every subroutine it calls. Routine A costs 100+10e6 cycles to call. The compiler would be smarter to focus on percentage gains in the subroutine that costs 10e6 cycles to complete.

During the development cycle, where there is much debugging, optimization should be reduced (unless, of course, you are running correctness tests in release code because the compiler can't be trusted to optimize well... or because you often suffer timing based errors). Even discarding the extra programmer-time costs of figuring out what the heck the compiler did to your code, if you're only going to run code a few times before rebuilding it, it is unlikely that the cost of optimization will be greater than the runtime gain.

"my personal preference is for tools that aren't smarter than me" -- heh. In fifty years, I expect that compilers will be expert systems with savant level intelligence in optimizing code... and at testing its correctness. Heck, they might not need human developers... we'll just describe what we want (as a set of constraints), and they'll program it -- constraints based programming taken to the highest level a human can formulate. Now, as a person working in the field as a programmer, I fret for job security under such circumstances... but I certainly recognize that this would be, for the most part, a very good thing. Had to run out earlier. This is probably getting a little too conversational and far afield from the title of this page, so I'll be brief. I put all these ideas in a book some time ago, which didn't sell much but now seems to be hard to get. It is Dunlavey, "Building Better Applications: A Theory of Efficient Software Development", ISBN 0442017405 , which I only say to indicate how long I could go on on these topics :) In a nutshell, I got my degree in A.I. so I take a jaundiced view of such things, but there are deep lessons in it. Probably different people see different lessons in it.

It doesn't take a strong AI to do an excellent job at optimization. Anyhow, I agree that we're too far afield at this point.

I'm the world's worst promoter, which doesn't stop me from trying. So if you click on my name I've got links to related stuff of interest, and my email address. Nevertheless, back to the point of this page, something drives me to beat the drum about the pause technique, because maybe it's megalomania, but it seems as computers get bigger and faster, software gets bigger and slower, and if you add up the cost, it's gazillions. No computer clock or compiler optimizer or profiler can keep up with the waste that creeps in as software gets larger.

It is only natural that software grows. As it becomes easier to implement features, people will want more features. The only wasted features are the unused ones.

The waste that creeps in is not due to features, it is due to layers.

From what I've seen, the problem has to do with the psychology of programmers (and I'm as guilty as anyone). When a useful/powerful function/class/utility is written, the programmer who writes it is proud of it (rightly so) and expects that it will be used with a certain reverence, i.e. sparingly. The user of the function/class/utility (who may be the same person) sees it as getting so much done with only a line of code, that they use it a lot (unsparingly), thereby writing powerful code and honoring the builder. Usually this dynamic is subtle, but if you pause the code, you'll see it.

In big software this happens at multiple layers of abstraction, because you have classes/layers built upon other classes/layers, several layers deep. (As programmers are taught to do. I call this "galloping generality".) The result is at each layer there is inefficiency (due to excess calls) that multiplies with the inefficiencies of the other layers. That is what I've observed: the inefficiency is exponential in the typical call stack depth.

The good news is it is easily fixed (identified, really), via the pause technique. And, after you do this a few times, you start to recognize the encroachment of excess generality, and it changes your coding style to be more plain and direct. Unfortunately, this sort of pragmatism flies in the face of what programmers have been taught. In a situation where using a plain old array could save them 20% over using a "collection class" with a "custom iterator", many modern programmers would have difficulty making that choice.

A layer is only unnecessary if it provides no necessary features... and a layer is only useless if it provides no useful features. A layer that provides no useful features may as well not exist... and, indeed, can be compiled out of existence by a compiler capable of recognizing this. Of course, recognizing unnecessary features is not an easy endeavor.

Whether resource expenditure due to use of a feature at a particular layer can be 'fixed' after being found will depend mostly on whether that feature is necessary or of great enough utility that its elimination is acceptable... and on whether a human can identify and eliminate some redundant work that the automated optimizer could not recognize as redundant. The former is mostly an issue of features, and the latter an issue of the 'insufficiently smart' optimizer. Even if optimization was perfect, the issue of wasted features is present... e.g. if you don't need or care about a complete log of events, then a layer providing such a log is a largely useless and consumes unnecessary resources even if it runs very fast.

Even if you don't want a log, someone might... which is the real cause of code bloat. As a user, you ask for 10 features and receive 100... and spend resources (space and time) supporting 90 features of which you'd much rather be rid. As software and state-of-the-art advances, the number of features you'll be supporting but not using will grow... often linearly. Fortunately for the user, resources grow exponentially, and will continue to do so (under Moore's law) for another two to four decades. Unfortunately, this problem will NOT easily go away. Even if you are given the ability selectively install features of a product to avoid those you won't use, you'll be spending an amount of time reading about and rejecting such features that grows linearly with time... impractical wrgt. scaling laws (life span isn't increasing linearly). The best solution I can think of is to selectively install or uninstall, at or between runtimes, features as needed.

This is great stuff to talk about. My opinions come from doing it a lot and seeing how reality differs from expectation. I tried to make a realistic distilled example in my DDJ article in 11/93. In that example, execution time was reduced by a factor of 40, and code size by a factor of 4. Unfortunately, numbers like this are above most peoples' belief tolerance. They think I'm saying percent, but I'm not. I'm saying times. The example is also in my book which is hard to get now, but if you email me I will send you a zipped copy. --MikeDunlavey

40x is very impressive, and I can see why people fail to suspend disbelief. I work a lot in I/O bound programs, myself, so that sort of gain isn't usually plausible... what I can gain from improving speed is slightly enhanced battery duration. Please send me a copy of your book, that I may see this example.

What about using only one profiler function that merely logs the line number and timestamp (and maybe the time since the last timestamp)? Then, you sprinkle calls around in the sections that are too slow (initially, the top-level function is the section that is too slow). Add calls until either a) you find something that is too slow that you can optimize (including removing unnecessary code/calls) or b) the profiler calls are surrounding a single function call that is too slow. For a), you optimize it, and check it again. For b), you go to that function and start sprinkling the calls around there.

This way is more methodical than the pause key method, and can also help you find unnecessary calls.

Personally, I think I would try the pause key method first, then this single profiler function method. --DougKing I've tried to introduce the pause method to some colleagues, and the initial reaction when they see the call stack is Gack! - it looks like gibberish, especially in C# where each line is a mile long. The method works fine, but that initial reaction is a real stumbling block. --MikeDunlavey

Well, OK. Maybe I will blather on a little bit. For a long time, MS didn't bother having a working pause key in their IDEs. (Well, they did, but it would only stop when it took a notion to.) However, they did allow you to have disabled breakpoints that you could enable on the fly. So I wrote an empty function, i.e. FOO(){} and set a disabled breakpoint in it. Then I sprinkled the code with calls to FOO. While the app was doing its slowness, I enabled the breakpoint, causing the app to stop in FOO. Then I could look at the call stack and other stuff. Clumsy, I know, but it worked. --MikeDunlavey

More blather, sorry. In MS IDEs you can set a watchpoint, meaning the program will stop whenever you change a given memory location. This causes the IDE to go into emulation mode, about 100 times slower than normal. However, in this mode, I think the pause key actually works at the instruction level.

JuneZeroSeven