Hash Table

A HashTable is a data structure for storing key/value pairs; that is, an implementation of the DictionaryDataStructure. Hash tables can be startlingly efficient; insertion and retrieval take expected O(1) time (at least for some sorts of HashTable). The downside: that word "expected" (if you are unlucky, everything can become O(N) instead), and highish overhead cost (so that for very small tables, something much simpler may work better).

The basic idea behind hash tables is very simple. Imagine for a moment that all your keys are small integers: say, none larger than 1000. Then you can just build an array with 1000 elements, and insertion and retrieval just go via direct lookup, obviously O(1). Unfortunately, in practice your keys might be strings or 1000-digit numbers or something worse, so you define a hash function which maps each key to a small integer; then, to look up a given key, you feed it through the hash function and then do a direct table lookup as before.

The difference between a hash table and a plain lookup table, then, is that (1) you have to apply the hash function to the key before using it (adding some overhead cost), and then (2) multiple keys might be sent to the same hash value by the hash function (so you need a way to deal with that).

Let's think about #2. There are two main strategies. Open hashing, also called open addressing, says: when the table entry you need for a new key/value pair is already occupied, find another unused entry somehow and put it there. Closed hashing says: each entry in the table is a secondary data structure (usually a linked list, but there are other possibilities) containing the actual data, and this data structure can be extended without limit. [Caution: some people use the term "open hashing" to mean what I've called "closed hashing" here! The usage here is in accordance with that in TheArtOfComputerProgramming and IntroductionToAlgorithms, both of which are recommended references if you want to know more about hash tables.]

With open hashing, a new problem can arise: perhaps there isn't an unused entry anywhere in the table. The table can fill up. This isn't an issue with closed hashing, but that has a different problem as the number of items stored in the table increases: those secondary data structures can start to get large, so that insertion and lookup times are no longer O(1). The usual answer is the same in both cases: allow the table to grow. If you always double its size (or, more generally, always increase its size by at least a fixed constant factor) then the cost of doing this is O(1) amortized.

Depending on the strategy used for finding unused entries, an open-addressing hash table may become slow as the table gets close to being full. Typically, unsuccessful searches are particularly bad. So it is wise to grow the table before it is completely full.

There are other sorts of DictionaryDataStructures: notably, the many varieties of a BinaryTree. They may be preferable if you need a guarantee on the worst-case access time, or if you want operations such as "find all keys between a and b" to be efficient. When hash tables are possible at all, though, their typical performance is better than that of tree-based data structures.

A hash table is a data structure for storing key/value pairs. A dictionary data structure is often made using a hash table. Don't confuse the two. Usually, dictionaries are made using a hash table, but not always. Sometimes, a BinaryTree or other DataStructures may be used. The runtime-performance of a hash table is fast. Insert and remove operations are expected to be O(1). A hash table is almost the only data structure where adding and looking up takes order 1.


Also see: http://www.nist.gov/dads/HTML/hashtab.html


CategoryDataStructure


EditText of this page (last edited August 9, 2004) or FindPage with title or text search