As Fast As Cee

An apparent holy grail of performance, to be as fast as CeeLanguage, as often cited by users, developers, and advocates of higher-level languages (e.g. LispLanguage, FunctionalProgrammingLanguages, JavaLanguage). Quite often, a failure to hit this target is pinned upon the lack of a SufficientlySmartCompiler.

This page covers a lot of ground, but it mostly boils down to this:

Claim that in some domains a FunctionalProgrammingLanguage can be as fast or faster
- Demand for concrete references, refereed journal papers and explicit examples
Statement that being as fast as C is not always necessary or appropriate, and that other considerations can take precedence.
- Demand for concrete references, refereed journal papers and explicit examples

One major issue with this page is that it fails to note that the core C language is nearly a lowest-common-denominator sort of programming. Some high level languages can achieve c-like performance when you program *like* c. For example, several Lisp implementations, OCAML (somewhat less true, it has some good tricks of its own), and some simply can't. On the other hand, these high level languages can easily do things that there is no natural way to do in C. If you spent the time and 100k-lines of C code to provide some of these features to your C environment, it isn't clear that it would be faster than another languages native implementation. See, for example CLOS, or Haskell's patterns. So, as usual, it really depends on what you want to do. It is also worth noting that for *really* cpu intensive tasks, C is often not good enough. Hence the continued use of Fortran, and inner loops written in ASM, etc. Especially where vectorization is concerned, if your c compiler isn't sufficiently smart, you may still be able to blow it out of the water with a dozen lines of ASM.

I think this page can be summarised as follows:

C is really, really fast.
For some contexts, other languages can be easier to program in.
Advocates of these languages often try to sell performance junkies on said languages.
Some people claim that for some problems and some programs, occasionally other lanuages are AsFastAsCee.
Sometimes the algorithms are sufficiently complex that writing them in C and then running the resulting blindingly fast program would take longer than writing it in Lisp, OCaml, Haskell or Perl, and then having a coffee while it runs.
Most people seem to have an axe to grind, and this page is not really very informative.
Other languages may not be AsFastAsCee, but that may not really be the point. If for you it is the point, then probably C is the language you're happiest with.
On the other hand, why aren't you programming in AssemblyLanguage?
- Because C is usually within some percentage of the speed of AssemblyLanguage, which is often close enough, and in any case, for most programs, writing just the innermost loop in AssemblyLanguage (or C, for that matter) gives most of the possible speed advantage.
- In the real world, people do in fact write inner loops in C (and sometimes but increasingly rarely in assembly), regardless of the language the rest of the system is written in.

Of course, AsFastAsCee is a very simplistic and almost-useless benchmark. Some languages are faster than C in some domains; FortranLanguage is faster than C for numeric applications (many optimizations performed by Fortran compilers are disallowed by C, though C99 narrows the gap). Many statically-typed FunctionalLanguages can do things faster than C as well, especially in the absence of SideEffects. Proponents of other languages will cite these as examples; OTOH outside the specific problem domains these languages suffer compared to C in either performance, usability, or both.

Please cite some concrete examples of functional languages that are faster than C. The only one I've seen that's anywhere near as close - within 2x - is ObjectiveCaml, and only if you resort to using imperative features and manually remove bounds checking and so on.

DemandForEvidence

Many Lisps often sit within 2x of C also (which is not the same as being faster!) so ObjectiveCaml is hardly unique in that.

False. You're assuming O'Caml is merely within 2x. Actually it is sometimes faster, and typically isn't worse by more than about 20%, which is quite unlike the typical case for Lisps. This is only true on artificial benchmarks, ones that make use of imperative and unsafe features of OCaml. 2x is more typical, but it can be more for floating point intensive programs.

The book On Lisp (http://www.paulgraham.com/onlisp.html) features some Lisp code that it claims is faster than C on page 24. However, it fails to name implementations of either language, and a comparable C version, so.

Functional languages can hardly beat C in terms of pure speed for very optimized functions. However, for common, unoptimized code, which makes the bulk of programs, the speed is comparable. One should think of C being to functional programming languages what assembly language is to C. Therefore, a more interesting point of comparison, both in terms of speed and functionality, would be C++. Things that modern functional programming languages like ObjectiveCaml take advantage of are : static strong typing which allows some compiler optimization, absence of memory fragmentation and improvement of locality in cache memory by the garbage collector, and highly optimized use of the call stack.

On the other hand, the speed of C comes at a cost. It's not very abstract, giving the programmer plenty of room to tune every last data type or memory allocation to the specific needs of an algorithm. This flexibility can make the program run faster at the cost of programmer time in the best case. It can also make no appreciable impact on performance at the expense of programmer time, or it can even make the program slower.

More abstract languages offer heavily-optimized, widely used data structures and algorithms right out of the box. Few C programmers can whip up a hash table implementation as fast as Perl's, for instance.

Timing tests please? One-off fast hash tables for a particular purpose aren't hard. Completely general purpose ultra-high-speed hash tables are very hard.
DemandForEvidence

There are two ways to view this one, from the C side, one-offs can benefit from knowing the distribution of your numbers, sometimes mod over an array is optimal. From the Perl side, few could write better hash tables than perl has for the same functionality that Perl's hashes offer. Otherwise, we'd have a patch to perl.

Or will have it eventually, but not necessarily yet, which is why I want to know a good source of info about it. I have lots of background with hash tables, so naturally I would be curious about details, rather than simply expecting that Wall would have done a perfect job already.

I'd rather avoid the premature optimization and steal perl's hash tables. I can always tune the performance after I can measure it.

Yes, but the speed issue had already been raised.

Higher-level languages may permit the implementation of smarter, more efficient algorithms in less programmer time, which should also inform this debate. The difficulty of implementing faster but highly complex algorithms can discourage their implementation in C.

Moore's Law has changed the game significantly in terms of language optimization.

In other words, speed is often not important in the first place, due to faster CPUs, so even if something is not AsFastAsCee, it often doesn't matter anymore, even though it mattered a lot years ago.

Many programs on servers and end-user PCs are I/O bound, so languages that can use smarter I/O mechanisms have a big advantage.

What do you mean by "smarter I/O mechanisms"?
Asynchronous I/O, for example. Instead of a blocking fread(buf, size, count, f) something like fread(buf, size, count, f, call_me_when_done). This can be implemented in C/C++, but the lack of closures makes it hard to use the asynchronous model because you need to use callbacks, opaque pointers and ad-hoc context structures for each call instance.

And of course, being AsFastAsCee is a RedHerring for many applications anyway. A CrudScreen never needed to be AsFastAsCee, nor do many other types of programs. The C bigots who loudly insisted that performance is the only thing that matters ought to be ashamed; the bigots of other languages ought to be ashamed when they fall into this rhetorical trap.

Hmmm. I wrote two pages of Perl that processed several GigaBytes? of input from a pair of major datasets, compared them using a variety of RegularExpressions, compiled a set of comparative totals and statistics from the data, and which executed in about 35 seconds. I'm not sure I could have written that, such that it ran any faster, in CeeLanguage. Granted, it was a Sun E-10k running SolarisOs, but by any standards, grinding through two sets of 13+GB (that's 26+GB total) files and making sense of it in under a minute on anything is good exercise for the eyebrows. And it's an interpreter. [Still shaking my head]

I grant that Perl is, itself, written in CeeLanguage, but I have to say that it has leverage that goes beyond that. One might say "more than the sum of its parts." -- GarryHamilton

Perl has a highly optimized regular-expression parser. It was made for doing processing of the type you describe. You probably could've written code in C that would have run just as fast, but it would've taken you 10 times as long to write. -- TimKing

There exists an implementation of perl-compatible regular expressions in Common Lisp that is actually faster than the CeeLanguage implementation embedded in perl. That this is possible is due to the fact that a running Common Lisp program has access to the compiler, so it can compile the regexp into machine code, something c code cannot do. See http://weitz.de/cl-ppcre/#performance But see TickCee

Perl also has highly optimized I/O buffering. Again, made for doing this type of processing.

It sounds highly likely that your processing was limited by I/O bandwidth; not CPU speed. An interpreted language like Perl won't do too badly if most of its time is spent waiting for the disk.

You know, I still don't get this remark, "limited by I/O bandwidth".
- The classic categorization is to say that any given program's total execution speed is either limited by CPU or by I/O. This categorization has nothing to do with whether it is fast or slow in the first place. So he is saying that, of the two factors, it sounds like your script was I/O bound, not CPU bound, so you shouldn't regard it as indicating something about computational speed, but rather as indicating something about I/O speed.
  - The execution time of a straightforward implementation in C will usually be the sum of the CPU time + the I/O time. When the I/O time is much larger than the CPU time, it is "I/O bound". However, there is the possibility of using complex, but time-efficient parallel I/O buffering -- then, the execution time is the slower of the CPU time or the I/O time. In perl (or other interpreters), this I/O buffering could be done behind the scenes for you, and the straightforward perl implementation may actually be faster than a similar straightforward implementation in C.
  - Perl certainly could, but does it? Also it would need to figure out that parallel I/O was needed in the first place -- which Perl constructs would allow the interpreter to be aware of this? (Answers for other languages also of interest; presumably the comparable thing in C/C++ is something built around e.g. Unix select)
  - Conjecture: Parallel I/O could be the default (on an OS with thread support). When you open a file/subprocess for input, perl could create a thread that tries to keep a CircularBuffer full, while the program pulls data out. For output it could do something similar, or it could simply buffer data, and write it out in large chunks (to avoid partial sector / cluster writes), and trust that the disk cache will provide the parallelism.
  - Another conjecture: you never heard about fadvise(). Simply use it and the OS worry what to read when. It is most skilled at those things.
- Perhaps it would do well to illustrate what "bandwidth" means. Suppose you had to add two numbers. If they are in the registers, the CPU has immediate access to them, and the add happens immediately. If the numbers are in RAM, they have to be transferred to the registers--which takes more time--and then added. If those numbers were on the hard disk, the disk has to be spun to the correct sector, the head has to be moved to the correct spot, the memory then has to be moved to the registers, and then added; the movement of the disk itself happens hundreds of times more slowly than a memory addressing. If those numbers were on a different computer, you have to create a packet, which is sent across the line, sent to a database, which reads a disk, which then puts the result in a packet, which is sent back, decoded, put into the registers, and then added; this takes significantly more time than even a disk read. Now, imagine writing code to add those two numbers, and you had to choose between C and a higher-level language. Sure, C's "add" might be faster than, say, Python's...but if you're fetching those numbers off of a network, or even off of a disk, the time it takes to do *that* is going to completely dwarf the time it's going to take to get those numbers in the first place! Thus, if you want to optimize your script, you'd do far better keeping your data in memory (or at least on a Solid State hard drive--which is literally just RAM), or keep the data local, than you would be trying to optimize that "add" function. --Alpheus
Interesting discussion. My puzzlement was that someone would characterize something that was blazing fast as "limited" by anything. I understand that my script got a lot of help from the OS, and that it probably never had to wait around for the next gallon of data. Still, all other things being equal, tossing 26 billion cards into the air and stacking them in neat piles before they hit for floor is impressive even if the resistance due to friction with the air has been removed.
- You and I are impressed by such things only because we're old farts who remember the slow days. Soon the young kids will be yawning at rates of terabytes per millisecond. These things are relative. It's impressive to us that what you did ran so fast, but in isolation it doesn't reflect on much at all. And what it does reflect on says more about the hardware (both the I/O and the CPU) than the software. To figure out something about the software we'd need to do comparisons with other software...etc.

HorsesForCourses. If your application dictates the speed of C (or C++ or Fortran or whatever), use that. If not, and you're more productive in Perl or Python or Smalltalk (or even Java), use that instead. Rather simple, really.

And don't forget the strategy to AlternateHardAndSoftLayers. Typically only a tiny fraction of an application, the innermost process bound loops, really need to be AsFastAsCee.

When people talk about relative speeds, they are usually talking about efficiency of use of cpu, not about IO, and C often is faster, compared with e.g. Perl, Python, Ruby, Pascal, Ada, etc etc etc.

DemandForEvidence

Lost me there when you included Pascal. Although I can often get the speed of Pascal by including all of my C source files into one big source file and declaring everything static, it's by no means certain: pointer references and 8 bit values are sometimes optimised poorly.

Most of the latency involved with I/O is spent - doing the I/O. That's an OS function. A language design can have an effect, of course - depending on what I/O primitives are available. In most case, stdio is plenty fast.

That said, the C standard library provides little support for "advanced" I/O features (no pending on multiple file descriptors, no networking support, no asynchronous or nonblocking I/O, etc.) - but those are very OS-specific, and most languages are less flexible than C in this regard.

They're OS-specific because the C standard library doesn't provide an interface to them; not vice-versa.

And there are many useful de facto standard libraries for these things - Berkeley Sockets, POSIX, Win32, etc... that are widely available. You will notice that most programs that need to do high-performance I/O (web servers, RDBMS's, etc.) are implemented in C/C++.

The C stdio library does burn some cpu, and compared with a more specialized facility, can be noticeably slower. Long ago I needed to read very large hexadecimal files, and speed was important. The original code used scanf. A hand-coded solution was vastly faster. Sometimes use of IO library functions is actually cpu-bound rather than IO-bound; this is, of course, not typical.

This particular example--where scanf is slower than the handwritten code--is probably not a result of scanf being an unspecialised facility; more likely, it's a result of most of the information that would be determined at compile time via a sane API (say, type information, parameter locations, etc.) being delegated to a fairly complex system of runtime string processing, stack frame manipulation, and possibly even callback handling, depending on the implementation. Then again, one could argue that the stunted compiler-in-a-can that is scanf is simply what you get (deserve) for using unspecialised facilities designed for pre-ANSI C.

If you are doing something in Java, try JNI if there are performance problems in some function. I've always found that it takes less time to RefactorTheCode? in Java, rather than to RecodeInCee?.

C is not faster than SML when doing NumericalComputing?.

Reference?
DemandForEvidence

For doing IO you should always consider using AsynchronousProgramming, so that your application can continue to do work rather than waiting for I/O.

Here's one approach: Read the file in Java. Put it in a queue using a thread, then use another thread to process it. You will be surprised that if the file is big enough, the queue will use all the memory and the program will stop. So, IO can be so fast in Java that the processing can get behind. The same is true in C++.

Java never claimed it would be faster than C, but now it is on several benchmarks, with the remarkable exception of Swing.

URL?
DemandForEvidence

CeePlusPlus can be just as fast as C. Yes, you could write C++ code that's slower and bigger than C code, but you also could write C++ code that's as small and fast, if that's your goal, if that's what's important to you. Additionally, this C++ code can be cleaner and more maintainable.

I recently wrote a boot ROM for an embedded system, wrote it in C++. Even threw in an exception for good luck. (C++ exceptions add no performance overhead unless you actually throw one.) I had no performance or code size problems. That's because the low-level code, the part doing all the real work, compiled just as well as it would have were it written in C.

And if the C code you're comparing to is not small and fast (instead preferring other values), you could probably outperform it with a number of other languages.

-- TimKing

See also OptimizeLater.

As always, the real solution is in the design and coding, not the language. A program in <high-level-LanguageOfChoice> could be as fast as it needs to be, to within a constant multiple, if the design is correct. If a program is primarily I/O intensive, then fast I/O (which is usually a factor of the system calls, not the language or the libraries) will be the key, whereas computation-intensive code will depend on how you design the code.

Note, however, that correct design includes the necessary post-optimization; if the HotSpots? are re-written in CeeLanguage or even AssemblyLanguage, then the remainder can be written at whatever LevelOfAbstraction is appropriate. AlternateHardAndSoftLayers

In any case, performance is not the only criterion that a program should be judged by, even if it is the one most obvious to casual users. Correctness, usability, and maintainability count for as much, if not more, in the long run. -- JayOsako

As a long time C/C++ supporter, let me provide my observations.

The speed of the language rarely affects the speed of software.
I/O speed, network hops, disk accesses, etc. are usually far more critical to run time speed.
Physical memory size, i.e., avoiding disk cache, is usually more important.
Particularly in desk top PCs, any performance improvements made in code today could be more easily obtained by using next year's hardware. The increase in CPU speed will improve performance more than any code level tuning.

I like C/C++ a lot, but I really couldn't justify C/C++ as a project language based on speed considerations, except, perhaps in some very limited environments. -- WayneMack

Anecdotal opinions like this would be much more helpful when the application area is indicated, because these sorts of observations and conclusions can vary sharply depending on what you are doing.

To provide a general answer one must rely on general observations. I base my observations on many years of embedded design work with C, assembly, and microcode sometimes with CPUs running less than 1 MHz where I was actually tweaking code for performance because the hardware was barely sufficient for the job. I also designed the hardware and knew the trade-offs made when designing the hardware. I have done C++ work with 2 and 3 tier client server designs primarily with MS Windows starting with Windows 3.1 with occasional ports to Unix environments. More recently, I have been the project manager for a 3 tier design incorporating Visual Basic, C++, and SQL scripts with some COBOL and Korn shell thrown in. My team has also done some preliminary investigations of first going to a web architecture with ASP and more recently with Java.

In the set or performance problems I have seen, the problem has never been due to excess machine instructions generated by an "inefficient" language. Most often, performance problems are related to the handling or transfer of "bulk" data. The majority of language constructs (i.e. math operations and branching) are common across languages, often the syntax itself is similar, and the underlying assembly language to support these operations is the same. Often the only differentiator is memory use, both in quantity and efficiency of memory management, and this usually only affects speed if the application is running near the limits of physical memory. This observation seems born out as with little effort, one can find a benchmark for any language showing that it is "fastest."

In summary, I would state that when selecting (one or more) languages for a new project, language "speed" ranks very low in importance. For most operations, the differences in languages are minor and often masked by processor speed and other design criteria.

-- WayneMack

Does this apply when designing a language too, i.e. is speed something that a language designer/compiler writer should worry about? Basically, I want to know if I have to worry about really fucking up my language speed-wise through language design decisions?

The research I've done on ContinuationImplementation, KeywordParameterPassing, ImplementingMultipleDispatch seems to indicate that most features, no matter how advanced, have fairly small performance penalties. Continuations can be had for a 5-10% slowdown across the board, keywords are just an array lookup, MultiMethods can be done in quasi-constant time, though the constant is fairly large (an array lookup for each active selector in the method). I'm wondering if this squares with your experience, and if there're any particular "gotchas" when implementing a compiler, things that could make it run multiple factors slower.

But it seems like speed considerations do come into play with a lot of language implementors, mostly concerning the cache. Dylan introduced sealing so that the appropriate method could be selected at compile time, eliminating the run-time dispatch entirely. Dispatch-DAGs for multimethods are preferred because dispatch tables require a lot of non-local memory accesses, potentially blowing the cache. One of the main reasons ParrotCode uses a register architecture is that stack machine have a higher memory load, and hence cause more cache misses.

Any advice for would-be compiler writers? -- JonathanTang

Purely my opinion, but focus on the big picture of programmer productivity. Do not ignore compiler speed, but don't give it undue priority either. Correctness is highly important; a single miscompiled statement will probably waste far more of my time than even the slowest compiler. Make sure compiler error messages are clear; I've wasted countless hours tracking down silly errors that were not clearly reported. Make sure the compiler has a way to escape endless loops; I have a preprocessor from a major vendor that never returns if it is given certain typos. Lastly, define a good benchmark. Don't load the benchmark up with worst cases, make sure it reflects typical use; otherwise you will spend time tweaking unimportant cases.

I remember the days of start the build and go have lunch. For the size of projects I am currently involved in, compiler speed is not a significant differentiator, it is usually the speed and ease of use of other parts of an IDE that cause me irritation in development. -- WayneMack

Combining the speed of C and a simpler language

One reason C is so fast is that many man-years of effort have gone into writing efficient optimizing compilers for it; some new language would have to duplicate at least a significant fraction of that effort.

Speed is not always the issue

Incidentally, Fortran is known to regularly wipe the floor with C in numerical computation; see comment above.

Of course, most programming tasks do not need the speed of C; for those, a higher-level language is usually a better choice. For the subset that do need high performance (and for which the designers are willing to sacrifice a high-level language), C is a reasonable choice. ForthLanguage is another; however there are many more C developers and C implementations than Forth implementation available.

Some claim that it is possible to tell the difference (in performance) between an application written in CeeLanguage (or ForthLanguage or AssemblyLanguage, or some other low-level language) and one written in a HighLevelLanguage. It was proposed to implement a similar low-level language (but one easier to program in than C, which contains lots of pitfalls). Such a thing is certainly possible (and has been done; for example ModulaTwo, etc.) but a considerable amount of effort has gone into writing efficient optimizing compilers for C; some new language would have to duplicate at least a significant fraction of that effort.

It was then re-emphasized that many programming tasks do not need the speed of C; for those, a higher-level language is usually a better choice. For the subset that do need high performance (and for which the designers are willing to sacrifice a high-level language), C is a reasonable choice. ForthLanguage is another; however there are many more C developers and C implementations than Forth implementation available.

Regarding optimizing compilers: DEC tried to grant all of its languages the same optimization attention by turning all of their compilers into front ends that produced code for a virtual machine. This code was then optimized and compiled to a specific physical machine. In this way all of their C optimizations were available for all of their languages. Microsoft has chosen the same strategy with .NET.

HP has done the same for many years, and very successfully; their RISC code optimizing back end has been world class for something like the last decade.

The GNU compiler suite does something similar, but more ad hoc.

[The comparison to DotNet seems to be questionable. CommonLanguageRuntime code is targeted towards a VirtualMachine that is executed directly, not translated to the instruction set of the target processor (though JustInTime compilation is done). What is being spoken of above is an IntermediateLanguage (or an AbstractMachine), which is intended for subsequent translation into something else. While the AbstractMachine can be implemented; that's not its intent.]

.NET's "JustInTime" compiler actually compiles the CLR code when the application is installed or run the first time. The virtual machine (or abstract machine if you want to call it that) doesn't execute the code; the physical translation persists. This is very much like DEC's GEM line of compilers, with the final compilation step delayed until installation.

And now, a humorous interlude.

 Some languages ain't fast
 Some languages ain't slow
 Some languages are just half-fast

One domain where core language speed still matters a lot is video games. C/C++ is used in almost all video games. (One company, NaughtyDog?, uses a Lisp dialect for their PS and PS2 games, but it seems they may be switching back to C for their PS3 games.) See LispInJakAndDaxter.

If speed is of over-riding importance, why don't they use AssemblyLanguage?

Because C is usually within 20% of the speed of AssemblyLanguage, which is often close enough, and in any case, for most programs, writing just the innermost loop in AssemblyLanguage (or C, for that matter) gives 90% of the possible speed advantage. All of which is well-known, thoroughly studied, carefully instrumented, widely published, etc. Don't ask a rhetorical question if you don't actually know the answer.
- Actually, I do actually know the answer, and it is as you say.
- One point is that people who loudly proclaim that they use C because they want to wring the last possible ounce of speed out of their machine have nevertheless made a compromise.
- Another point is that other people may choose a different compromise.
- The conclusion is that speed is not of over-riding importance, and if people who bleat about "AsFastAsCee" accept that, then most poeple would be happier to listen to them.
- There seem to be some terminology issues here, resulting in minor confusion. Speed can be #1 priority, but that does not imply a polar position where all other issues are disregarded, because of things like diminishing returns. It is not a contradiction to have speed as #1 priority but then to settle for only 90% of the maximum theoretical. It's extremely rare for it to be possible to hit 100% no matter what one attempts.
- I mean, if speed is of over-riding importance, why don't I switch to using the world's fastest supercomputer? Because there are real world constraints, like budget, no matter how important speed is. Writing an entire app in AssemblyLanguage, in this day and age, usually violates other constraints, such as a need for efficient development and for maintainability of the resulting code, etc.
  - Which goes to the heart of the whole debate, which is misframed. The debate should not be about whether an entire system should be written in C or in AssemblyLanguage. Ideally they should not. It should always be merely a question of whether inner loops should be written in C (often they should be) or in assembly (sometimes necessary, but much less so than was once the case, depending on application area).
  - Except for one other thing, which is that C has become a default lingua franca -- Java has partially displaced it in that role, but certainly nothing like 100%. The question is therefore often "ok, fine, what should we use for the non-speed critical parts of the system -- keeping in mind that it should be compiled to native machine code, since the speed budget is moderately tight, if not critical, for all parts of our system". And the answer to that is problematic. A large plurality of programmers hate each of the most obvious examples (C, C++, compiled Java, compiled Lisp), other languages have lurking issues (like that it can be nontrivial to link with C), etc.

I think the summary at the top really says it all. You are right, and everyone is arguing past each other.

In all that argument, everybody seems to have missed that there is a glaring reason for using C rather than AssemblyLanguage for video games - portability. While speed is paramount in video game development (the calculation for each frame is expected to take a maximum of 16.7ms for a 60fps game), video games are designed in general to be portable over a number of architectures, and while all of the major consoles in this generation might have PowerPC processors, that wasn't always the case.

Given the pipelining and multiple execution units of modern x86 processors I'd say that there are few instances where AssemblyLanguage is warranted. Here are some:

writing an interrupt handler which performs initial work when only a few registers have been pushed and/or which allows return from the interrupt from that point
getting access to special machine instructions not necessary for standard C. Examples are the x86 movs, stos, scas instruction families and the rdtsc instruction used for measuring elapsed times.
replacing inefficient lib functions such as str*
minimizing the number of call/rets by using inline assembler (see OpenWatcomC #pragma aux) to minimize cache reloading
debugging and probing deeper by switching to mixed assembly/source code

As to writing the innermost loop in assembler I've seen the Wind River PowerPC compiler condense a relatively small block of three nested whiles into the assembler equivalent of one while. It was impressive, and doing something with the innermost loop there would have made the end code much slower. - OlofForshell

DecemberZeroFive

CategoryComparisons, CategoryPerformance