Garbage Collection

"GarbageCollection (GC), also known as automatic memory management, is the automatic recycling of heap memory. GarbageCollection is performed by a garbage collector which recycles memory that it can prove will never be used again. Systems and languages which use GarbageCollection can be described as garbage-collected."

The preceding text was appropriated from the glossary of the Memory Management Reference, http://www.memorymanagement.org/glossary/g.html#garbage.collection.

Other places to look include the rest of the Memory Management Reference quoted above and the book by Jones & Lins discussed below.


Benefits and costs

The benefits are mostly to the programmer: no manual deallocation, and whole classes of bugs (dangling pointers, double frees, many leaks) simply disappear, which is part of why hand-written memory management is described below as error-prone. On the other hand, GC costs CPU time and extra heap space, and many collectors introduce pauses; the Performance section below tries to quantify how large those costs really are.

Basic techniques

There are three main techniques for automatic memory management: reference counting, mark-and-sweep, and copying.

Each of these, as described, has serious drawbacks. All of them can be improved considerably at the cost of some complication, and state-of-the-art garbage collectors perform far better than the naive ones that were once all that existed. As a result, common beliefs about how dreadful the performance of garbage collection is tend to be too pessimistic.

Some people prefer not to consider ReferenceCounting a variety of GarbageCollection at all, on the grounds that it doesn't do any sort of reachability analysis and therefore cannot collect garbage containing circular references. (Combining it with a tracing GarbageCollection, on the other hand, means much less work for the collector, which only needs to clean up after circular references; those can in turn be avoided in the first place by using ImmutableObjects, so reference counting may be considered an optimisation.) Still, the reference count is a form of reachability analysis: while it isn't guaranteed to locate all unreachable objects, objects it does determine to be unreachable are guaranteed to be so, and can therefore be collected without the additional step of searching the heap for references.
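
To make the mechanism concrete, here is a minimal sketch of reference counting over a toy heap; the Obj class and helper functions are invented purely for illustration and stand in for what a real runtime keeps in its object headers:

  class Obj:
      """A toy heap object holding references to other toy objects."""
      def __init__(self):
          self.refcount = 0
          self.refs = []

  def add_ref(holder, target):
      holder.refs.append(target)
      target.refcount += 1

  def drop_ref(holder, target):
      holder.refs.remove(target)
      target.refcount -= 1
      if target.refcount == 0:
          # Provably unreachable: reclaim at once, dropping its own
          # references, which may cascade. No heap search is needed.
          for child in list(target.refs):
              drop_ref(target, child)
          reclaim(target)

  def reclaim(obj):
      pass  # a real allocator would return the memory to a free list

Note that a cycle (a refers to b, b refers back to a) never brings either count to zero, which is exactly the circular-reference limitation mentioned above.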

Some people prefer not to consider StopAndCopy a variety of GarbageCollection, on the grounds that it collects the non-garbage and leaves the garbage to rot in the unused semispace :-). All three approaches, and their many variations, are certainly varieties of AutomaticMemoryManagement?. For what it's worth, the book by Jones & Lins covers all three.
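
For comparison, a naive MarkAndSweep pass over the same toy objects makes the reachability analysis explicit. This is only a sketch under the invented Obj representation above, with the heap kept as a plain Python list:

  def mark(roots):
      """Trace everything reachable from the root set."""
      marked = set()
      stack = list(roots)
      while stack:
          obj = stack.pop()
          if id(obj) not in marked:
              marked.add(id(obj))
              stack.extend(obj.refs)
      return marked

  def sweep(heap, marked):
      """Keep the marked objects; everything else is garbage."""
      return [obj for obj in heap if id(obj) in marked]

Unlike reference counting, this collects cycles, but the naive version has to stop the mutator and visit the whole heap.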

Subtleties

Any of the schemes described above can be implemented quickly, but doing GarbageCollection well is not easy. Issues to be aware of include (to pick a few that come up elsewhere on this page):
  • stop-the-world pauses, and whether other threads must be locked out while the collector runs;
  • conservative versus exact identification of pointers (the Boehm-Demers-Weiser collector discussed below is conservative);
  • whether and when to compact the heap, which affects fragmentation and paging behaviour;
  • finalizers and weak references, and how much of the collector the language exposes to the programmer.

There's plenty more. Read the book by Jones and Lins for some of it...

Performance

Garbage collection has a reputation for being slow and memory-hungry. This reputation is not, these days, entirely deserved.

Back in 1993, Detlefs, Dosser and Zorn wrote a paper on Memory Allocation Costs in Large C and C++ Programs (http://www.win.tue.nl/~stephanh/CU-CS-665-93.pdf for pdf and ftp://ftp.cs.colorado.edu/pub/techreports/zorn/CU-CS-665-93.ps.Z for PostScript). They took some real-world programs in C and C++ and swapped in a variety of implementations of malloc/free, including (what is now a very old version of) the conservative collector of Boehm, Demers and Weiser.

They found that, on average, using the conservative GC added about 20% to program execution time and multiplied heap size by about 2.5. This last figure is rather misleading, though; it is nearer the mark to say that a program that used N kilobytes of storage with a conventional malloc used about 1.2*N+550 kilobytes with the BDW conservative GC. On the other hand, it should be noted that of the various mallocs they tested, the one they used as a baseline was the one with the most fragmentation. Comparing with the least-fragmenting one instead, the formula would be approximately 1.5*N+600.
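
To put the formula in concrete terms: a program using N = 10,000 KB under the baseline malloc would need roughly 1.2*10000+550 = 12,550 KB under the BDW collector, i.e. about 25% more space rather than the 2.5x the headline average suggests. (This is just an illustrative reading of the figures above, not a number from the paper.)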

These figures are upper bounds on how much a good GC need cost, for several reasons: the collector tested was conservative (so it cannot move objects and may retain memory through false pointers), it was an early version of the BDW collector, and the test programs had been written for manual memory management rather than with a collector in mind.

It is alleged that GC overhead in Smalltalk has been measured at 1%-3%. I'm not sure which version of Smalltalk, on what tasks, on what machine, or how it was measured.
Those were on ParcPlace VisualWorks Smalltalk, and widely published in the mid-eighties. GarbageCollection was one of the most hotly contested attributes of Smalltalk, and was thus highly optimized.

It is widely believed that copying GCs will more or less always be faster than mark-and-sweep ones. This is not true, or at least not as obviously true as sometimes thought.

Also of interest: Andrew W. Appel, "Garbage Collection Can Be Faster than Stack Allocation", Information Processing Letters 25(4), 1987, pages 275-279; citeseer.nj.nec.com/appel87garbage.html


I distinguish GC from "automatic memory management" (AMM) by the fact that GC has to look at the code to recover memory, whereas AMM always knows what memory has been allocated and when it can be released. Many GCs must lock other threads to work, or pause for a significant time to find the memory to release, whereas AMM does not. GC is normally done in big cycles, whereas AMM is done incrementally. My AMM does no reference counting and uses 4 different techniques to allocate and automatically recover memory in a multi-threaded, shared-memory environment.


I've been looking at the possibility of implementing a mark/sweep garbage collector (probably with a higher-than-normal execution cost) that does not have to stop the world to work. The clever part: the allocator allocates new objects already marked as reachable.
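
A hedged sketch of that idea (the names and structure are invented here, not from the original poster): if new objects are born "black" - already marked - then the sweep of the previous cycle's unmarked objects can proceed without stopping allocation. A real concurrent collector would also need a write barrier so the concurrent mark phase doesn't miss references the mutator moves around; that part is omitted.

  BLACK, WHITE = "black", "white"
  heap = []

  class Cell:
      def __init__(self):
          self.color = BLACK   # born marked: this cycle will never sweep it
          self.refs = []

  def allocate():
      cell = Cell()
      heap.append(cell)
      return cell

  def sweep():
      """Reclaim what stayed white, then whiten survivors for the next cycle."""
      global heap
      heap = [c for c in heap if c.color == BLACK]
      for c in heap:
          c.color = WHITE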


The main objection to GC is performance. However, when asked how they manage memory in non-GC languages, most (though not all) programmers answer as though manual memory management were free.

Hand-written memory management is error-prone. GC can actually improve performance if it compacts data, because compaction reduces page thrashing on virtual-memory machines. A lot depends on the patterns of use; stop-and-copy can be very quick if most objects don't survive.


What languages use GC?

Almost all general-purpose languages designed since about 1990. (In these days of fast processors, large memory, and mature GC technology, the old arguments against GC seem unconvincing to many people. The cost in performance is small, and the benefit in productivity and bug-resistance is large.) Anyway, a very partial list of languages with GC would include Smalltalk, Java, Python, Erlang, Eiffel, and Oz, several of which come up below.

(Contrast: LanguagesWithoutGarbageCollection)


Python's GC is done mostly by reference counting; then cyclic garbage is collected separately. The idea of the algorithm is this: keep all allocated container objects (i.e., objects capable of containing references to other objects) in a list. Then, to find cyclic garbage, go through the container objects decrementing the reference counts of other container objects they reference. Anything whose refcount reaches 0 is referenced only from other container objects. Now perform tracing only within the set of container objects, starting from the ones whose refcount is still positive. Anything we don't reach is referenced only from other container objects with no references from outside, and can therefore be recycled. Neil Schemenauer, who implemented the MarkAndSweep part, has written about it in more detail at http://arctrix.com/nas/python/gc.

This scheme requires three words of extra space for every container object, which is uncomfortable (but bearable; Python's container objects all already use quite a bit of space). It is observed to add about 4% to the runtime of Python programs.
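
A rough sketch of the algorithm described above, with a toy Container class standing in for CPython's real C-level object headers (the actual implementation lives in CPython's gc module; this is only an illustration):

  class Container:
      def __init__(self):
          self.refcount = 0   # stands in for the interpreter-maintained count
          self.refs = []      # references to other container objects

  def find_cyclic_garbage(containers):
      tracked = {id(c) for c in containers}
      # Work on a copy of the counts so the real ones are untouched.
      gc_refs = {id(c): c.refcount for c in containers}

      # Subtract every reference that comes from another tracked container.
      for c in containers:
          for other in c.refs:
              if id(other) in tracked:
                  gc_refs[id(other)] -= 1

      # Containers whose count is still positive are referenced from outside;
      # trace from them to find everything genuinely reachable.
      reachable = set()
      stack = [c for c in containers if gc_refs[id(c)] > 0]
      while stack:
          c = stack.pop()
          if id(c) not in reachable:
              reachable.add(id(c))
              stack.extend(o for o in c.refs if id(o) in tracked)

      # Whatever remains is referenced only from other unreachable containers.
      return [c for c in containers if id(c) not in reachable]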


Problems with DistributedSystems

I recently had a bad experience with GC. I was attempting to create a distributed object system, where one environment was in C++ and the other was in another language (VeraLanguage). The languages are probably not important here; any distributed system would see the same effects. The issue is that neither side could see all the references to all the objects.

The basic design was simple. Objects from one language were proxied on the other. The first time an object was passed across the interface, the proxy was created, and stored in a collection. Future calls would reuse the same proxy-object (this allows equality to be implemented in terms of identity). Ideally, I wanted to remove the object from the collection when there were no more references in that part of the system.

The GC system of Vera is not exposed to the programmer: there are no weak references, nor even finalizers. In fact, there is no way to determine when the reference in the proxy-container is the only one remaining. The only solution was to require manual memory management.
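
For contrast, here is roughly what the proxy cache looks like in a language that does expose weak references; this is a hypothetical sketch using Python's weakref module, not the original C++/Vera code:

  import weakref

  class Proxy:
      """Local stand-in for a remote object identified by remote_id."""
      def __init__(self, remote_id):
          self.remote_id = remote_id

  # Entries disappear automatically once nothing else holds the proxy,
  # so the cache itself never keeps a dead proxy alive.
  _proxies = weakref.WeakValueDictionary()

  def proxy_for(remote_id):
      proxy = _proxies.get(remote_id)
      if proxy is None:
          proxy = Proxy(remote_id)
          _proxies[remote_id] = proxy
      return proxy

Because the cache hands back the same object while anyone still holds it, equality can be implemented in terms of identity, exactly as the design above requires, and the entry evaporates when the last local reference goes away.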

Actually, message passing handles this issue very well. See ErlangLanguage, for example. -- ScottVokes

{How does MessagePassing solve this issue in ErlangLanguage? Keep in mind that communications was not the problem. Distributed GC was the issue.}

With a few more years' perspective... it isn't MessagePassing so much as Erlang's clean and relatively small type zoo that fixes things. In Erlang, everything is either a symbol (atom), a number, a "pid" (process ID), a function, etc., or a list or tuple of the above, or a variable bound to the above. Every one of those has clear semantics for serialization, so streaming around a copied data structure is not a lossy conversion. This becomes much trickier in the presence of subtyping, changing class hierarchies/definitions, etc., but Erlang dodges that issue entirely. As GC goes, each process (which is not a full OS process; the Erlang VM schedules them) has its own heap, which is (IIRC) collected by a generational copying GC.

{Memory management, though, is not the 'only' solution. If one uses two sets of object IDs - local and remote - then the communications system that maps from global IDs to local IDs becomes a gc-root to protect objects that are referenced only from global systems. If that much is achieved, then one doesn't require manual management of anything but the globalID->localID map, which may be maintained by explicit protocols for distributed GC. That said, partial failures and security issues can make distributed GC a serious challenge - i.e. it's theoretically impossible to have wholly correct distributed GC in the face of partial failures. One needs to use heuristic (mostly correct) distributed GC, and have some ability to recover (e.g. self-healing) after heuristic failures. Self-healing designs are considerably more resilient against a few hiccups in distributed GC, but generally require support for persistence and serialization (e.g. PersistentLanguage or MementoPattern) that is not offered by C++.}


Check out Eiffel for this problem. There is an option to pin an object, which prevents it from being moved in memory. Objects that live in the other system but are owned by the GC language are covered by the finalize approach. Objects in the GC language that are owned by the other language will be garbage collected unless they are marked as being owned from elsewhere; objects marked that way form part of the root set to be marked-and-swept or copied, depending on the GC approach.



Garbage collection in a DistributedSystem is still an unsolved problem and an area of active research. And for a distributed system, it may be a bad idea (as it requires GlobalConsensus as to the state of a system to implement). None of the languages in current use which feature (local) GarbageCollection feature distributed garbage collection, at least not that I am aware of.

The MozartProgrammingSystem includes a distributed VM (for OzLanguage) that has distributed GarbageCollection.

Java uses leases for remote objects accessed via RMI. Are leases not a form of distributed garbage collection?

They are, but whereas 'conventional' local garbage collection is generally merely conservative, distributed systems, in order to avoid the GlobalConsensus issues mentioned above, have to be both aggressive and conservative: aggressive in that they can and will collect objects which aren't actually garbage yet, and conservative in that the algorithms themselves won't detect all garbage, in the same way (except more so) that local GC won't guarantee a complete collection (SemanticGarbage? is the word, and the reference too, I think).

Among other things, there are issues with cycles, where A, B and C are distributed agents, and A passes a reference to B, which passes to C, which passes back to A.
  • It may or may not be appropriate for A to even detect that it is the same object it originally sent.
  • Given naive algorithms + mutation, we end up with the ReferenceCounting cycles issue, because there is no global consensus to (in effect, no outside observer who can) determine that it is a loop that nobody references.
The idea with a lease is that agents are responsible for periodically stating which objects they care about, and therefore any objects which haven't been spoken for can be deleted. This tends to perform much better than attempting to achieve the consensus required to find loops and determine reachability, but it'll tend to delete too much: in the case where an agent becomes unreachable for some time, a leasing algorithm will remove objects even though that agent still refers to them. In the case of a reachability algorithm, they will stick around as long as necessary. This may or may not be a good thing.
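
A bare-bones sketch of the lease idea (the names and the fixed lease duration are invented for illustration): each remote holder periodically renews its claim on an exported object, and the exporter drops anything whose leases have all lapsed, whether or not some unreachable agent still refers to it.

  import time

  LEASE_SECONDS = 30                      # illustrative duration

  _leases = {}                            # obj_id -> {holder_id: expiry time}

  def renew(obj_id, holder_id, now=None):
      now = time.time() if now is None else now
      _leases.setdefault(obj_id, {})[holder_id] = now + LEASE_SECONDS

  def collect_expired(exported, now=None):
      """Remove exported objects whose every lease has expired."""
      now = time.time() if now is None else now
      for obj_id in list(exported):
          holders = {h: t for h, t in _leases.get(obj_id, {}).items() if t > now}
          if holders:
              _leases[obj_id] = holders
          else:
              _leases.pop(obj_id, None)
              del exported[obj_id]        # the object becomes local garbage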

And there's the rub, to the effect that "sometimes you have to make hard decisions". It may end up being a choice between not losing data on one subsystem but bringing the entire system down, or losing that data but keeping the entire system functional and limping along. RobustnessVsCorrectness?.

-- WilliamUnderwood

Does it have to be that way? You could use the ProxyPattern to make object proxies, which are actually serialized (and possibly cached) references to objects on other computers. Depending on the language, this could be done painlessly and effortlessly, or it could require a lot of boilerplate; however, that isn't the issue at the moment. The destructor, which the GC should call, could commit any cached data and notify the computer which "owns" that object that it has unsubscribed from the object. Once there were no more subscribers to any given object, the manager object could remove that object from an internal list, and let the local GC take care of it. The only problem that I can see is that there might be distributed cyclic references; however, as in CPython, a secondary tracing mechanism could intermittently scan for them.

I'm not so sure that there's any benefit to having non-local objects participate in garbage collection in the first place [discounting leases, which someone else said aren't "really" GC], except maybe in niche scenarios like a Beowulf cluster or supercomputer, where a perfectly reliable network and 100% reliable nodes are a good assumption - in which case, problems involving GlobalConsensus disappear anyway. Let's assume that most objects are little more than data storage: these can and should be implemented as ImmutableObjects and stored and garbage collected locally, so they are not part of the issue. Many objects could be transformed into this sort of object with functional programming techniques, so those aren't part of the issue either. That leaves only objects with truly important global state, such as a database object or an open file, which can't be handled by local garbage collection. But the database object, being a persistent store, doesn't need garbage collection, because it's known in advance that it should live for the whole life of the application. And the file object needs a flush signal and a timeout to ensure the file is eventually unlocked (since it's open to other systems which, even if fully trusted, might still hang or lose power and so fail to close it); at that point, the file object can be safely reclaimed, and any further message-sends to that resource can result in 'request timed out'.


See the loosely related subjects DeterministicResourceManagement and ResourceAcquisitionIsInitialization.

See also: LifesJustTooShort.


CategoryGarbageCollection

