Faster Than Cee

Primary discussion is in 'AsFastAsCee'. But, if 'AsFastAsCee' is a powerful benchmark for performance, then 'FasterThanCee' is the true holy grail: any computer language that can boast 'FasterThanCee' for a wide spectra of problems has much greater potential to penetrate the wide array of programming markets that are (despite many very reasonable arguments contrariwise) still obsessed with PrematureOptimization and 3% coefficient differences.

"FasterThanCee", referring to the C programming language (CeeLanguage), is primarily a claim of optimization (as opposed to, say, productivity). Although the C language isn't well designed for high-level or program-wide optimizations, it is already low-level and many thousands of man-hours have been placed into the modern C compilers, such that it is quite difficult to beat a C program for speed... especially hand-optimized C.

FasterThanCee is inevitable

C and similar languages (Fortran, for example) today have a performance advantage for many programs due to runtime models that closely match the hardware architecture. However, it is inevitable that CeeLanguage will lose this advantage. And it's already happening:

C is designed for consistent memory semantics. Memory and cache consistency has been reduced over time for purpose of performance, but C's default semantics don't take full advantage of this.
C is designed for a single core processor, which is becoming rare. It is not well designed to leverage multiple processors, especially not safely. Multi-thread shared memory is a possibility, but requires synchronization. Lock-based synchronization works, but is subject to deadlock, performance issues, or safety issues (pick at least one). SoftwareTransactionalMemory has been implemented atop C before, but C's programming model (which uses fine-grained state manipulation) makes transactions more expensive than they could otherwise be.
C is designed to operate upon then produce "static" data, and lacks support for creating or tracking dataflows in a high-performance manner (i.e. with automatic caching and DataDeltaIsolation, ability to pipe data to where it needs to be). As hardware becomes more parallel and distributed, so will become the data-maintenance and distribution issues. A C-language solution will either sacrifice performance by executing a full query, or sacrifice reliability via complex and explicit management of caches and partial updates.
C is designed for a uniform, computable address, memory model with ambient authority. This has some performance advantages on the small-scale benchmarks but many problems on the large scale. For security and safety, C programs must be separated by process barriers. The resulting IPC introduces serialization, copying, and buffering costs that don't need to exist. Further, the model simply doesn't scale. Data-sets can be large and distributed. The processing upon a data-set will often benefit from load-balancing, distribution, and putting code near the data.
Programmable devices (GPUs, FPGAs, RAID cards for example) and services (SQL databases, for example) with slave memory and CPU are more common than they once were, and still qualify as part of the computer architecture. Support for these devices is critical in achieving high performance. C is not well designed to support construction, safety analysis, optimization, etc. of code - not even indirectly, through libraries.

These changes in the architecture eat steadily away at the grasp C and brethren (C++, Fortran, etc.) have on performance. Unfortunately, language benchmarks will often fail to demonstrate performance scalability issues, or scalability issues of any kind (reliability, for example). It's ArgumentByLabToy - intra-language benchmarks are kept small because they're economical to write, easier to compare for rough equivalence, and reduce variability due to design choice.

The places where C offers a "small-scale program" advantage (uniform memory model, explicit GC, etc.) will forever continue to offer enhanced perception of C/C++'s performance fitness. Higher-performing languages that lose some small-scale performance advantages will have a challenge on their hands for proving viable performance. They'll need to sell themselves on other properties then prove 'adequate' performance, and forever suffer the reputation of being a bit slow. That C/C++ may have failed to achieve adequate performance for the same problem, it is likely that nobody will never know.

Makes me wonder how many languages already qualify as FasterThanCee on the large scale.

Despite the wall presented by the powerful C optimizers and the wizards who come in and further hand-optimize code to show how 'fast' C really is, there are some languages that have proven to be FasterThanCee, sometimes much faster, in limited domains - especially in domains that benefit a great deal from significant code-specialization (VHDL, simulators) and immutable or interned values and garbage collection (e.g. backtracking searches, planning in weak AI, inference systems). Hand-written C code must be written generally to the problem domain (a set of models), whereas code-specialization techniques essentially write the code for the specific model being examined at the time (and will rewrite it as often as necessary for new problems). And, for immutable and/or interned values with garbage collection, as are often useful for large, backtracking searches, C is simply a major hassle to use... even wizards of the language will often avoid building the massive structures necessary to make them work efficiently. As a consequence, C programs will often lack optimizations made easy by other languages.

There are also some languages that are, generally, AsFastAsCee, or approximately so, plus or minus based on the problem. This is most quantifiably evidenced by the ComputerLanguageBenchmarksGame. Examples include OcamlLanguage (a mostly functional language with type-inference), DelphiLanguage (Borland's Object Pascal), and MercuryLanguage (a functional & logic language based on PrologLanguage but purer and better designed for optimization).

In my opinion, nothing is FasterThanCee yet. My benchmark: chess programs. None of the top programs are implemented in anything but CeeLanguage, CeePlusPlus, or AssemblyLanguage. Chess programs are in a class which must run AsFastAsPossible?, so they eschew dynamic memory allocation (excluding all FunctionalProgrammingLanguages, sadly, and most modern programming convenience libraries) and require direct compilation (excluding interpreted languages like Java and Python) and the best optimization (gcc or intel compilers). Chess programs tend to implement domain-specific versions of data structures provided by dynamic languages (strings, bit vectors, hash tables, lists, multithreading primitives) in order to squeeze out the last iota of performance, and only low-level languages allow them to do so.

[You'd probably be well served by CUDA or OpenCl. In any case, while tighter algorithms might make a difference of one or two moves lookahead in a given period of time, even a scripted Chess program can be very competitive. It's the algorithms, strategies, and heuristics that matter most, and a scripted program might offer greater potential to improve those.]
There is research in this area, but it turns out that chess evaluation and searching is very heavy on conditionals and hits the huge transposition table frequently, and so far has not been a good fit to SIMD-optimized processors working in the limited memory on the graphics coprocessors. Although scripting does allow you to express heuristics nicely, those heuristics are run at every node in the search tree, and the state of the art for chess programs is in the millions of nodes per second. So scripting is right out.
[The quality of a search is not measured in nodes per second.]
Well, that just shows you know little about chess programming. :) Search speed has by far the highest correlation with program strength. Next comes sophistication in search algorithms (extensions, pruning, etc.) and then sophistication in evaluation tuning.
[Human brains aren't counted in your statistics. Biased sample, I think. Search speed is only as important as it is because, despite the sophistication of our algorithms, they suck. When a chess beginner plays, their whole brain is active as they `search` for moves. When an expert plays, one little tiny region in the brain is active and they have the move they will play almost immediately. Our chess programs still play like beginners... just very, very fast beginners with prodigious memory. Perhaps, if we were not so focused on making it faster, we could make a program that is better.]
Yep, humans think different! Except I would say that computers play real chess, and it is we that play like beginners because we can't search accurately enough. Hard to argue when the top hundred chess playing entities are all machines. (But this is a philosophical point better discussed in another forum, I think.)
[I don't think I would call what machines do `playing`. Let's give them a wattage limit that is only 10x what a human uses and see who wins. Sort of a `weight class` for chess. Would certainly make a more interesting competition for algorithms and hardware.]

I don't think Cee is a fast language because a language isn't fast or slow per se. Its implementation (compiler) makes it fast or slow. That being said, what about a chess game or Go game that uses a relational database with statistical voting (this move won the game how many times, versus that other move)? I wonder if chess games are reinventing databases by using hash tables, lists, etc. In order to build the best Go or Chess game, I believe the computer just has to have a large database of all the possible moves and vote which moves won the game the most, and vote which moves won the game the least. The top chess players' moves in competitions can be recorded and stored in a database. It takes up more hard drive space to have thousands (millions?) of move combinations pre stored though. I believe some Go games use relational database backends.

There are multiple databases in use by competitive game programs, each customized to the domain:

Opening databases: indexed by hashed position (which can handle move transpositions) or stored as a gigantic directed graph if storage space is being minimized (ROMs for a dedicated computer). Typically, multiple moves are suggested per position, with probabilities reflecting current opening theory and the preferred style of the program, with some randomization for variety. Lookup speed is not a factor.
Endgame tablebases: entire classes of endgames can be encoded in gigantic tables. State of the art is up to six piece endgames in terabyte sized tables, indexed by cleverly encoding the piece positions into a direct index. Constructing the tables and minimizing their size has been an active research topic since KenThompson began researching them decades ago. Lookup speed (i.e. microsecond) is important for using them during search.
Transposition tables: an optimization technique where the result of a lengthy evaluation or partial search is stored in a hash table, for later use during the search (memoization). Required for any kind of intelligent endgame play. Often there are two tables, one for full positions and a smaller one for just the pawn structure, since it changes less frequently yet is a large portion of the evaluation ("pawns are the soul of chess"). Lookup speed is paramount, since it is performed at every node in a search. The thread-safe implementations are in fact a current research topic in computer science.

Of these, only the openings have any use for ad-hoc relational searches, and then only when used by humans to do their own opening preparation (i.e. what has a particular GM been playing lately?). Chessbase has rolled their own, for no particular reason.

Your note about Go is interesting. It turns out that Monte Carlo Tree Search has made great strides lately. The hand-coded yet brittle Go evaluation functions have been superseded by integrating the results of thousands of randomly played games. The more randomly played games, the better the results, so speed and parallelization is again of the essence, and the best Go programs are all written in C.

They are not all written in C; there is a program called Moyo Go Studio which was written in DelphiLanguage, but it wasn't a Go game it was just a Go analysis tool which used databases as the backend. I don't know much about it since I haven't used it, but I read about it. A lot of time has been spent making Cee compilers produce tight, fast code - many compiled languages could be just as fast as cee if people spent the time on those compilers (i.e. modula or oberon: why would cee be faster than modula or oberon?). You could make a slow Cee language interpreter (or you could make it fast if you spent more time).

Moyo Go is not a playing engine but a GameOfGo-player's study tool, so speed is not so important.
But otherwise I agree: Pascal, Delphi, Modula, Oberon... those are all procedural Algol-style languages which *ought* to compile as efficiently as each other. In this case, Cee wins due to its popularity. The popular Cee compilers are the ones that have been optimized (and likewise modern processors have been optimized to run their output). In a different world, Oberon would have become the systems programming LanguageOfChoice, and its compilers would have been most heavily optimized. --IanOsgood

Also, are the Go programs and Chess programs really written in plain C? Or is it C with preprocessor macros, or is it actually C++? Why wouldn't C++ be used instead of plain C? Plain C is so primitive. PlainCeeProgrammersAreLuddites; even if better technology is available, some guy will still use plain C just to boost his ego. Why, for example, would someone use plain C in a game, when games often have lots of strings in them.. wouldn't you use a mixture of C++ (with managed strings as classes) in addition to using some more plain C for other stuff?

In my recent research, the best chess engines are evenly divided between CeeLanguage and CeePlusPlus, divided between gcc/intel and Visual Studio compilers depending on target platform. Digging further, the CeePlusPlus engines use a sane subset where performance is guaranteed: no dynamic allocation, no virtual methods in objects, custom templates (not the default CeePlusPlus standard library templates, since their implementations often don't make performance or memory guarantees. For example, the default <string> library is often naïvely implemented.) The next most popular language for chess (about 5% of engines) is DelphiLanguage. The only strong engine in Delphi (Critter) later moved to CeePlusPlus because the Delphi compiler was inadequate for optimized 64-bit code. --IanOsgood

[The amount of game states might be larger than the number of atoms in the visible universe. But feel free to go for it. I understand that many begin-games and end-games are handled this way. I do agree we could take better advantage of data resources, at the very least to mine for strategies and heuristics.]

You could just store the most common moves in the database, like the top chess player moves in certain game positions, and for other game positions, you could resort to calculation on the fly method (brute force). I think maybe this is how some Go systems work like Moyo Go studio, but I have to admit ignorance here since I don't know enough about it. Apparently what Moyo Go did was record a database of many top Go players common moves in certain game positions. Microsoft apparently reverse engineered Moyo and used the technology in a product of their own.

AprilTwelve