Algorithms Dealing With Massive Data

AlgorithmsDealingWithMassiveData can be quite different from your usual QuickSort, which is fast only if it operates on RandomAccessMemory?.

See e.g. http://www.cs.duke.edu/~jsv/Papers/catalog/node39.html

by JeffVitter? (http://www.science.purdue.edu/jsv/), who studied under DonKnuth.


The most significant difference:

When everything fits in RAM, we may assume that reading or writing any random piece of data takes time O(1).

When it's too big to fit in RAM, reading or writing some random piece of data x items away from the current position of the tape head (or hard drive head) takes roughly O(x) time. (If the head can accelerate and decelerate, perhaps O(x^(1/2)) would be more accurate: under constant acceleration the distance covered grows with the square of the seek time, so the seek time grows with the square root of the distance.)

It takes longer to read the data, but it still scales linearly. Generally you divide the data up across multiple platters; I think your calculation assumes a single platter's scaling pattern.
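
To make the contrast with QuickSort concrete, here is a minimal sketch of an ExternalMergeSort in Python: sort chunks that fit in RAM, write each sorted chunk to its own temporary run file, then stream a k-way merge over the runs. The function name, the 100,000-line chunk size, and the line-oriented input format are assumptions for illustration, not taken from Vitter's papers.

 import heapq
 import os
 import tempfile
 from itertools import islice

 def external_sort(input_path, output_path, max_lines_in_ram=100_000):
     # Sort a huge line-oriented text file using only sequential disk I/O.
     # Assumes every line (including the last) ends with a newline.
     run_paths = []
     # Pass 1: read chunks that fit in RAM, sort each chunk in memory,
     # and write it back out as its own sorted "run" file.
     with open(input_path) as src:
         while True:
             chunk = list(islice(src, max_lines_in_ram))
             if not chunk:
                 break
             chunk.sort()
             fd, path = tempfile.mkstemp(text=True)
             with os.fdopen(fd, "w") as run:
                 run.writelines(chunk)
             run_paths.append(path)
     # Pass 2: k-way merge of the sorted runs. heapq.merge reads each run
     # sequentially and holds only one line per run in RAM at a time.
     runs = [open(p) for p in run_paths]
     try:
         with open(output_path, "w") as dst:
             dst.writelines(heapq.merge(*runs))
     finally:
         for f in runs:
             f.close()
         for p in run_paths:
             os.remove(p)

Both passes touch the disk only sequentially: run generation is one linear read plus one linear write, and the merge is k interleaved linear reads, which is exactly the access pattern the O(x) cost model above rewards.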


At first glance, it seems like this problem should be disappearing as time goes on. RAM is more plentiful than ever before - so these techniques should apply to fewer and fewer data sets as time goes on, right?

It's not that simple. One thing to consider is that as our ability to deal with larger data sets increases, the data sets we deal with always manage to grow in lockstep to keep up.

This problem also arises in new ways. It's not very common these days to read your data directly from a linear tape drive. However, RAM is not what it used to be: access to RandomAccessMemory? hasn't really been O(1) for a while now, thanks to the cache hierarchy. Cache performance is a big factor in algorithm selection these days.
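
A rough way to see that on a desktop machine (a sketch, assuming NumPy is installed; the array size and the measured ratio are arbitrary and machine-dependent): sum a large array in memory order, then sum the same values gathered through a shuffled index. The arithmetic is identical; only the access pattern changes.

 import time
 import numpy as np

 n = 20_000_000                   # ~160 MB of float64, much larger than any cache
 a = np.random.rand(n)
 perm = np.random.permutation(n)  # a random visiting order

 t0 = time.perf_counter()
 in_order = a.sum()               # streams through memory sequentially
 t1 = time.perf_counter()
 shuffled = a[perm].sum()         # same values, gathered in random order
 t2 = time.perf_counter()

 print("sequential:", round(t1 - t0, 3), "s")
 print("random    :", round(t2 - t1, 3), "s")

The shuffled version also pays for building the gathered copy, so the gap overstates the pure cache effect, but the direction is the point: sequential access wins, just as it does on a tape or a disk.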


See NoMoreDatabases

CategoryAlgorithm

