More About The Database

These pages are stored as separate files in a single linux directory. Writes to the directory are serialized using the mkdir locking idiom. Reads are not locked and could possibly retrieve a page that has been partially written. More seriously, writes are not guaranteed to finish since the web server can kill a cgi script at any time. Our approach has been to feed the file server with lots of memory so that file accesses are largely in cache and complete promptly.

The translator (read MoreAboutTheTranslator) accesses the directory as it formats each hyperlink. For a long page, like RecentChanges, this can be many searches of the directory. At the time of writing, the directory is half a megabyte long but page formatting still proceeds in a blink of an eye.

Pages are represented as a collection of key-value pairs...

I rebuild the dbm database weekly. This seems to be necessary to maintain proportional growth. I record the unix date, page count, total text size, and db file size in a log file ...

The data goes into excel easily. Use text-to-columns as I did to make these charts. I don't know that they mean much except that you can see the slow period every Christmas week and a subtle shift when HowToDeletePages was added late in 2000.


Very interesting. But, since only the top 10% of the Y axis is shown (in the best Time / USA Today tradition, to "enhance" the trend), I tend to discount the significance of the curve. Been deleting more small pages recently?

This is sometimes known as a "Gee Whiz" graph due to its tendency to dramatize the shape of the curve. (EdwardTufte would be very disappointed.)

Actually, come to think of it, we have been deleting a lot of small pages recently.

Well, the delete feature is relatively new. Perhaps its introduction a few months ago is entirely responsible for this "Gee Whiz" trend.

Could it be that recent pages try to cover too much in one space? Or that some discussions go on for a longer period of time and thus become large enough/fast enough, that refactoring has not/will not, happen?

These pages were once stored together in one file using the unix dbm hashed access method. Then the pages seem to take about 2K each, roughly half of which is overhead. Dbm documentation claimed any access can be completed with two disk reads or less, which seemed important at the time.


EditText of this page (last edited May 7, 2009) or FindPage with title or text search