Wiki Files

An idea came to me. If implemented, it might address the wiki locking issue, among other things.

Basically, why store wiki content in a Berkeley DB file, or a SQL database, when there is a perfectly good database already handy, with locking too. Namely, I'm talking about the file system. Basic run down is that the file name is the wikiname. You don't need locking on file reads, and you can easily lock the files individually on file writes.

So, what are the possible problems with this?

Speed could be an issue. But, on unixish systems, file systems are supposed to be fast. Look at how sendmail and inn work, for example.

I think I will implement it on my home wiki, although this isn't going to provide any answers to the speed issue. -- Joshua D. Boyd

A file for every WikiPage: There are WikiEngines which do this, such as UseModWiki.

Speed issues: A LoadTest? should be easy to write, so you can test this at home. It is also less vulnerable to the locking problem.

Load a directory with N files of average length L using names that average W characters. Launch jobs at rate R per second that read random files from N with some weighted distribution. Have each job stat K additional names that exist in the directory with probability P. Measure disk block caching behavior as N increases. Repeat for both linux and NT. Try a two level directory hierarchy where the second level directory names are formed by hashing into C codes. Compare caching behavior with single directory. Try weighting retrieval statistics to favor most recently added names. Is there any performance impact?

Here are the parameters for this wiki. Should anyone try this I will happily verify them.

N = 18000
L = 4000
W = 25
R = .1 to 10
K = 15
P = .75

I've been thinking of performing this study myself. I wish I knew how to observe page caching behavior on linux. Any suggestions? -- WardCunningham

Possible scaling problem: I think some file systems limit the number of files that you can have in a single directory.

If you search the web for this answer, you'll get ones that are apparently not too accurate. Your best bet is this:

 perl -e 'for (1..25000) { open(F, ">${_}foo"); }'

On my system (very default), it did 25,000 files no problem (less than a minute). I let it run ad infinitum, and got bored and killed it. It had gone for ~5minutes, and made 500,000 files. Directory access (randomly cat'ing a file) was instantaneous.

/bin/rm -rf'ing the directory took only a few seconds.

: If you are likely to be using something like ext2, then a directory is just a file, so rm -rf merely removes the entry for the child directory in the parent directory. The time it took was to search for symbolic/hard links. However, with certain file systems, your test may have permanently chewed up some disk space (that is until you reformat).
: That would amount to an extremely broken filesystem (or /bin/rm), wouldn't it? What systems are like this? (I want to remember to never, ever, ever use them.)
: The only one I know of off hand is NTFS, so you're probably safe if you're using /bin/rm. ;) On the other hand, I suspect the journaling file systems have a similar problem.

While fiddling with this, I chatted about it to rhizomatic.net's #Perl. Leolo had these two ideas:

1) An elegant way to hash the files:

 $file =~ /(.)(.)/;
 mkdir $1 unless -d $1;
 mkdir "$1/$2" unless -d "$1/$2";
 open FILE, "$1/$2/$file" or die $!;

: UseModWiki splits the files based on the first character.

2) Since the Wiki code already uses a tied hash for access, it would be easy to make a tie'able class that makes sure the above directory hashing is done OnceAndOnlyOnce.

: A simpler solution is to just write a function pageExists that caches the results in a hash so it only has to query for each page once. This hash can also incorporate the SisterSites listings.

There's quite a detailed analysis of unix-ish filesystem performance details and the issues of using it as a database in the venerable "News Need Not Be Slow", http://www.apocalypse.org/pub/u/paul/docs/c-news/c-news-toc.html. Basically it boils down to the time spent in the system calls to stat large/deep directory structures.

So, it appears that there is a performance hit to too many files in the same directory. This can be fixed by using deep directories. Deep directories can also kill performance, if you don't chdir to the directory. But then you could end up spending too much time chdiring around, since system calls should also be avoided when reasonably possible.

So, how does this effect wiki files? Well, there are a lot of wiki pages. You could try to split them into directories in some fashion so that the files are fairly evenly distributed. But, then each page would likely require a new chdir call. So, for small wikis, this could work, but there is a point at which it would no longer scale unless some method for hierarchy is established so that subsequent page calls would still be in the same dir. That said, this particular wiki already does some unpleasant, performance killing things. The fact that it is CGI comes to mind, as does the locking problem with the db. The locking might actually be seen as a performance problem, but switching to files would fix it, and still wouldn't be so bad compared to the exec calls being made.

But, that is beside the point. I'm more interested in my own project, where I am attempting to rework the linking system, because WikiNames don't meet my needs, and I already want it to be hierarchial, somehow. Besides, I tend to get a scummy feeling anytime I try to use BerkeleyDB, GDB, or any of the related systems. So, I try to always use either files (plain text or XML) or Postgres (mysql if that is the only option on servers I don't control.). -- Joshua D. Boyd