How Dynamic Typing Is Implemented

Describe HowDynamicTypingIsImplemented by the most popular compilers here. Please give a technical description, and contrast it with static typing systems (v-pointers and v-tables in particular).

What is the extra overhead (if any) of dispatching a message to an object in a dynamically typed system? In C++, typically this is one extra layer of pointer indirection over a straight function call (looking up a function address in the v-table). How would a compiler for a dynamically typed language approximate this? -- Matt
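
For concreteness, here is a minimal C++ sketch of the two call paths being contrasted. The Shape/Circle half is ordinary C++ virtual dispatch (load the v-pointer, index the v-table, call); the DynClass/DynObject half is an invented, simplified stand-in for what a dynamically typed runtime has to do when no slot index is known at compile time. All names are illustrative, not any particular system's API.

  #include <cstdio>
  #include <string>
  #include <unordered_map>

  // Static typing, C++ style: the compiler emits "load v-pointer, index the
  // v-table, call" -- one extra indirection over a direct function call.
  struct Shape {
      virtual double area() const = 0;      // one fixed slot in the v-table
      virtual ~Shape() = default;
  };
  struct Circle : Shape {
      double r;
      explicit Circle(double r) : r(r) {}
      double area() const override { return 3.14159265 * r * r; }
  };

  // Dynamic typing, sketched: no slot index exists at compile time, so the
  // runtime maps a selector (a method name) to a function at every send.
  struct DynObject;
  using Method = double (*)(DynObject&);
  struct DynClass {
      std::unordered_map<std::string, Method> methods;   // "method dictionary"
  };
  struct DynObject {
      DynClass* cls;
      double r;
  };

  double send(DynObject& obj, const std::string& selector) {
      // A hash lookup replaces the v-table index; a real VM would also walk
      // superclasses and cache the result.
      return obj.cls->methods.at(selector)(obj);
  }

  int main() {
      Circle c(2.0);
      Shape& s = c;
      std::printf("static dispatch:  %f\n", s.area());    // v-table lookup

      DynClass circleClass;
      circleClass.methods["area"] =
          [](DynObject& o) { return 3.14159265 * o.r * o.r; };
      DynObject d{&circleClass, 2.0};
      std::printf("dynamic dispatch: %f\n", send(d, "area"));  // hash lookup
  }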


From http://www.cs.washington.edu/research/projects/cecil/www/vortex.html

"Vortex exploits dynamic profile information about the relative frequencies of different classes at each call site to optimize dynamically dispatched messages. If one or a few receiver classes are most common at a site, then class tests are inserted in-line before the message, with each test leading to a branch where the message has been statically bound to the appropriate receiver. Even in the absence of peaked frequencies, if only a few methods can be called by a message site, Vortex can insert a few subclass tests in-line to statically bind to each of the possible methods."

"Vortex performs static intraprocedural iterative class analysis, tracking the possible classes of expressions within a procedure. Static class information enables dynamically dispatched messages to be statically bound as regular procedure calls (if static class information proves that only one method can be called), and allows run-time class or subclass tests (such as Java's instanceof construct) to be constant-folded (if static class information proves the test is always true or false)."

Is it true that, in order for message dispatch in a dynamically typed system to be nearly as fast as in a statically typed system, the compiler needs to perform a great deal of static analysis?

The performance of message dispatching is not directly related to typing discipline. Some statically typed languages such as C++ have rules for message dispatch that make it possible to find the correct method by doing a single array lookup. In Smalltalk, such an implementation is not feasible. Smalltalk message lookup is typically implemented with a hash table. However, there are statically typed languages which do message lookup in a way similar to Smalltalk, e.g. ObjectiveCaml.
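
A minimal sketch of the Smalltalk-style scheme, written in C++ with invented names: each class carries a hash table (its method dictionary) from selector to method, and lookup walks the superclass chain. In C++ the compiler already knows the v-table index, so a single array access replaces all of this.

  #include <cstdio>
  #include <string>
  #include <unordered_map>

  struct Object;
  using Method = void (*)(Object&);

  struct Class {
      const Class* superclass;                            // nullptr at the root
      std::unordered_map<std::string, Method> methodDict; // "method dictionary"
  };

  struct Object { const Class* cls; };

  Method lookup(const Class* cls, const std::string& selector) {
      for (const Class* c = cls; c != nullptr; c = c->superclass) {
          auto it = c->methodDict.find(selector);
          if (it != c->methodDict.end()) return it->second;
      }
      return nullptr;              // a real system would send doesNotUnderstand:
  }

  int main() {
      Class base{nullptr, {}};
      base.methodDict["greet"] = [](Object&) { std::puts("hello"); };
      Class derived{&base, {}};                           // inherits greet
      Object o{&derived};
      if (Method m = lookup(o.cls, "greet")) m(o);        // probes two dictionaries
  }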

How much do dynamically typed languages benefit from aggressive static code analysis (relative to statically typed languages)?

Dynamically typed languages can benefit enormously from static code analysis. Whether this is "more" or "less" than the benefit for statically typed languages is up to the metric used.


Commercial Smalltalk environments (since ParcPlace spun off from Xerox), and increasingly Java as well, sidestep these issues through techniques that render the argument moot. While not necessarily the case here, such arguments are often pursued by technologists who are unfamiliar with mature Smalltalk implementations.

First, the question of "static" vs "dynamic" has an implied time quantization that the above analysis ignores. The time between method changes is much larger than the time between method invocations (except in very pathological cases). One common Smalltalk trick, therefore, is to perform some static analysis when a method is invoked the first time, save the results in a cache, and then invalidate that cache if the method is changed. From the perspective of the CodeGenerator in such an environment, the distinction between "static" and "dynamic" is mostly arbitrary.
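
A minimal sketch of the cache-and-invalidate trick, with invented names: the expensive lookup (or analysis) runs the first time a call site is reached, the result is cached, and redefining any method bumps a version counter that makes every cached result stale. A real system invalidates more selectively than this single global counter does.

  #include <cstdint>
  #include <cstdio>
  #include <string>
  #include <unordered_map>

  using Method = double (*)(double);

  std::unordered_map<std::string, Method> methodDict;   // the editable code
  std::uint64_t codeVersion = 0;                        // bumped on every change

  void defineMethod(const std::string& selector, Method m) {
      methodDict[selector] = m;
      ++codeVersion;                                    // makes every cache stale
  }

  struct CallSiteCache {                                // one per send site
      std::uint64_t version = ~std::uint64_t(0);        // deliberately stale at start
      Method target = nullptr;
  };

  double send(CallSiteCache& cache, const std::string& selector, double arg) {
      if (cache.version != codeVersion) {               // first call, or code changed
          cache.target = methodDict.at(selector);       // the full lookup / analysis
          cache.version = codeVersion;
      }
      return cache.target(arg);                         // fast path: a compare and a call
  }

  int main() {
      defineMethod("twice", [](double x) { return 2 * x; });
      static CallSiteCache site;                        // the cache lives with the call site
      std::printf("%f\n", send(site, "twice", 21));     // slow path fills the cache
      std::printf("%f\n", send(site, "twice", 10));     // fast path
      defineMethod("twice", [](double x) { return 2 * x + 1; });
      std::printf("%f\n", send(site, "twice", 10));     // code changed: slow path again
  }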

A second, related trick is to treat the output of the "compiler" (for example, the bytecode set) as an "intermediate language", and to treat the "interpreter" as a code generator. When viewed through this lens, there are a number of optimizations that the code generator can perform based on static and dynamic context.
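
One concrete flavor of this viewpoint, sketched below with an invented three-opcode bytecode set (a generic illustration of the technique sometimes called bytecode quickening, not any particular Smalltalk implementation): the first time a generic send executes, the interpreter does the full lookup and rewrites the instruction in place, so later executions skip the lookup.

  #include <cstdio>
  #include <cstring>
  #include <vector>

  enum Op { PUSH, SEND_GENERIC, SEND_QUICK, HALT };     // made-up bytecodes
  struct Insn { Op op; double num; const char* selector; int target; };

  double negate(double x) { return -x; }
  double square(double x) { return x * x; }
  using Fn = double (*)(double);
  Fn table[] = { negate, square };

  int lookup(const char* selector) {                    // stands in for a full dynamic lookup
      return std::strcmp(selector, "negate") == 0 ? 0 : 1;
  }

  double run(std::vector<Insn>& code) {
      double acc = 0;
      for (std::size_t pc = 0; pc < code.size(); ++pc) {
          Insn& i = code[pc];
          switch (i.op) {
          case PUSH: acc = i.num; break;
          case SEND_GENERIC:
              i.target = lookup(i.selector);            // slow path, taken once
              i.op = SEND_QUICK;                        // rewrite the bytecode in place
              [[fallthrough]];
          case SEND_QUICK:
              acc = table[i.target](acc); break;        // fast path from now on
          case HALT: return acc;
          }
      }
      return acc;
  }

  int main() {
      std::vector<Insn> code = { {PUSH, 3, nullptr, 0},
                                 {SEND_GENERIC, 0, "square", 0},
                                 {HALT, 0, nullptr, 0} };
      std::printf("%f\n", run(code));                   // quickens the send
      std::printf("%f\n", run(code));                   // runs the quick form
  }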

Given the above two viewpoints, here are some further specific tactics that dramatically improve Smalltalk performance. Please remember that because Smalltalk is a pure object environment with rich metastructure, the information needed to reliably undo any of these optimizations "on the fly" is always present. Also, I use the term "Compiler" to refer to the code that executes in the environment "above" the virtual machine interface (and typically emits bytecodes) and I use the term "CodeGenerator" to refer to the portion of the virtual machine that operates "below" the virtual machine interface (and typically translates bytecodes into executable code).

1. At code generation time, the VM can inspect the implementation graph and identify the number of implementors of a given method. In typical commercial environments, about 70-80% of methods have just one implementor. In these cases, the dynamic polymorphic dispatch mechanism can be short-circuited with a simple function call (which itself can often be in-lined).

2. Many message sends and bytecodes can be readily replaced, at code generation time, with special-cased platform-specific assembler code. For example, variable references through accessor methods are usually faster than "naked" instance variable accesses, because they are recognized and special-cased by the interpreter. This is a great example of the pitfalls of PrematureOptimization -- naive Smalltalkers (especially ex-C programmers) are sorely tempted to use naked variable references in an effort to "speed up" method performance. The actual impact is often the opposite (a number of factors influence the actual measured behavior).

3. Certain "cross-bytecode" optimizations can occur at code generation time. Since the code generator has access to all the bytecodes of the CompiledMethod? at code generation time, it can recognize patterns of related bytecodes (multiple pushes and pops, for example) and emit correspondingly optimized machine code.

4. Certain categories of method implementations can be in-lined, improving performance still further.

5. The call- and implementation-graphs can be traversed and marked dynamically, allowing significant optimization of the lookup process separately from the per-method optimizations just described. For example, all Smalltalk virtual machine implementations have maintained a method cache for each class that caches the results of the message lookup process between code changes. (A sketch of such a cache follows this note.)

-- TomStambaugh
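
As a concrete illustration of the method cache mentioned in point 5 above, here is a sketch with invented names (real VMs typically keep a small fixed-size probe table inside the interpreter rather than a std::map, but the idea is the same): the cache memoizes (class, selector) -> method and is flushed when code changes.

  #include <cstdio>
  #include <map>
  #include <string>
  #include <unordered_map>
  #include <utility>

  struct Class { int id; };                     // stand-in for a real class object
  using Method = const char* (*)();

  // Stands in for the full lookup (the superclass-chain walk through the
  // per-class method dictionaries, as sketched earlier on this page).
  std::unordered_map<std::string, Method> allMethods;
  Method fullLookup(Class*, const std::string& selector) {
      auto it = allMethods.find(selector);
      return it == allMethods.end() ? nullptr : it->second;
  }

  // The method cache: memoizes (class, selector) -> method.
  std::map<std::pair<Class*, std::string>, Method> methodCache;

  Method cachedLookup(Class* cls, const std::string& selector) {
      auto key = std::make_pair(cls, selector);
      auto it = methodCache.find(key);
      if (it != methodCache.end()) return it->second;   // hit: skip the full lookup
      Method m = fullLookup(cls, selector);
      methodCache[key] = m;                             // miss: remember the result
      return m;
  }

  void codeChanged() { methodCache.clear(); }           // flush on any (re)definition

  int main() {
      Class point{1};
      allMethods["printString"] = [] { return "a Point"; };
      std::printf("%s\n", cachedLookup(&point, "printString")());  // fills the cache
      std::printf("%s\n", cachedLookup(&point, "printString")());  // cache hit
      codeChanged();                                                // a method was edited
      std::printf("%s\n", cachedLookup(&point, "printString")());  // refilled
  }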

Variable references through accessor methods are usually faster than "naked" instance variable accesses.

I find it extremely hard to believe this, especially since if it really were true, then a compiler could trivially replace all "naked" instance variable accesses with accesses through (compiler-generated) accessor methods. Thus it seems that reference through accessor methods is at best as fast as reference through "naked" instance variables. -- StephanHouben

This can't be done by the compiler (at least in Smalltalk) because the bytecodes don't allow it (there are explicit bytecodes that load and store instance variables). It is the code generator that can (and generally attempts to) perform this optimization. In practice, the performance tradeoff is the inverse of your observation (and of the intuition of most of us): the performance through accessor methods is at WORST as fast as reference through "naked" instance variables, and is often much faster. I've lost track of the specifics of what causes the tradeoff. (I have a vague recollection that it has to do with how much of the receiver is referenced by the method invoking the containing CompiledMethod?: for a naked access, the context has to see the entire object; for an accessor, this is not the case. I might be mistaken about this rationale, though, so don't beat me up if so.) Please note that this is NOT the same as method temporaries -- those are MUCH faster (because they are kept in the context of the method). So a common idiom in Smalltalk is to copy any repeated instance variable access into a method temporary (using a read accessor) and then use the temporary.

I just spent an evening searching for references on why this is so. The Squeakers seem convinced that indeed an accessor is slower than direct reference. I'm rather skeptical myself. It's one bytecode according to the BlueBook to access an instance member directly, but unless you do aggressive static analysis, it will be much more than one bytecode to use an accessor. Even then, I don't think you can beat one bytecode. -- SunirShah

But a bytecode is a VM-level construct: the time taken to actually evaluate each bytecode is more or less arbitrary. For instance, say I multiply two numbers in a stack-based VM, something like

      ld.lit 3      ; push the literal 3
      ld.lit 4      ; push the literal 4
      mul           ; pop both, push the product
vs
      call func     ; one bytecode, but not necessarily one unit of work
In this (somewhat arbitrary) case a call is only one bytecode, whereas multiplying takes three. Now, if the multiplication took longer than the call, I'd be _very_ worried. Merely counting bytecodes is pointless. That said, I still have difficulty working out how a call can take longer than an access, depending on how Squeak handles methods anyway. My understanding of dynamically typed languages would imply that field names need to be looked up at runtime, but that overhead would seem to apply to method calls as well -- although I'm not familiar with Squeak, and maybe a degree of method/interface information is available that doesn't require name matching. -- OliverHunt?

