File System Alternatives

I am tired of hierarchical FileSystems. They grow into big messes over time (LimitsOfHierarchies). Let's discuss alternatives. Some have suggested using various kinds of databases so that one can query or view based on a wide variety of potentially orthogonal traits.

Description of a set-based relational-influenced file system:

http://www.geocities.com/tablizer/sets1.htm (NoteAboutGeocities)

There are some lucid and cogent descriptions about these issues in some of the documents to be found at http://www.namesys.com/, home of the Reiser JournalingFileSystem. This one is a good starting place: http://www.namesys.com/whitepaper.html

ErosOs, PlanNine and some others have an interesting way to handle persistence without using any sort of "files" or "databases", TransparentPersistence. It basically treats the whole hard drive as a big memory swap space. So applications do all their work in virtual memory, relying on the operating system to flush everything to disk periodically and at shutdown, and to restore the whole machine state on reboot. So there is no need for applications to "save" anything - they just create new objects in memory and those objects will live forever (or until deleted/garbage-collected). It's like a super-duper version of the Windows "Hibernate" feature.

ErosOs does not use GarbageCollection (at the OS level). It uses hierarchical ArenaBasedMemoryManagement?. The arenas are called "space banks".

Hmmm memory leaks would scare me with that file system :)

It is always possible to delete a space bank. Any resulting dangling references have well-defined error behaviour.
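
The space-bank idea can be pictured with a small sketch in Python. This is only an illustration of hierarchical arenas with well-defined dangling-reference errors; the names SpaceBank, Ref, and DeadBankError are hypothetical and are not taken from the ErosOs API.

```python
class DeadBankError(Exception):
    """Raised when a reference outlives the space bank that held its object."""


class SpaceBank:
    """Hypothetical arena: owns objects and may own child banks."""

    def __init__(self, parent=None):
        self.alive = True
        self.objects = {}
        self.children = []
        self._next_id = 0
        if parent is not None:
            parent.children.append(self)

    def allocate(self, value):
        if not self.alive:
            raise DeadBankError("cannot allocate from a destroyed bank")
        key = self._next_id
        self._next_id += 1
        self.objects[key] = value
        return Ref(self, key)

    def destroy(self):
        # Destroying a bank reclaims everything in it and in its sub-banks.
        for child in self.children:
            child.destroy()
        self.children.clear()
        self.objects.clear()
        self.alive = False


class Ref:
    """Handle to an object; dereferencing checks that the bank still exists."""

    def __init__(self, bank, key):
        self._bank = bank
        self._key = key

    def get(self):
        if not self._bank.alive:
            raise DeadBankError("object's space bank was destroyed")
        return self._bank.objects[self._key]


root = SpaceBank()
app = SpaceBank(parent=root)      # a leaky application gets its own bank
doc = app.allocate("draft")
keep = root.allocate("settings")
app.destroy()                     # reclaims everything the application leaked
```

After app.destroy(), keep.get() still returns its value, while doc.get() raises DeadBankError instead of reading freed storage.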

File systems are about more than just disk. Besides, caching technology already greatly reduces physical IO. File systems are often used to organize and categorize information. Also, what you are describing sounds kind of like a NavigationalDatabase.

{It is true that any alternative to a FileSystem must also be capable of organizing and categorizing information. However, if TransparentPersistence is achieved - as per ErosOs - then one is free to create "database" objects whose sole role is to organize and categorize information, and these objects will be persistent by default. TransparentPersistence thus paves the way for FileSystemAlternatives. Whether the "database object" organizes data in a hierarchical manner is an implementation detail (even Oracle organizes data hierarchically, internally - via tree-based indexing and block management); the API to this "database object" may be SQL strings or DataLog.}

How is a FileSystem different from a NavigationalDatabase?

Good question. A related question is if and why a file system should be different than a database (regardless of the database paradigm).

The answer is that a FileSystem is a NavigationalDatabase. And it should only be different once you've solved the AI problem.

Why is that?

A filesystem is merely a way for humans to organize objects, with the organization scheme made explicit in the naming of links to parent and child nodes. Navigating a filesystem is merely a way of sorting through either its syntactic or semantic content. 'tar' is an example of syntactic sorting. Semantic sorting requires a human level intelligence.

Useful semantic sorting does not require a human level intelligence -- viz Google. Besides, humans are part of the system; to the extent that semantic sorting does require intelligence, the human users provide it.

In both cases of sorting, human and machine, the explicit organizational scheme is essential to tolerable performance.

Are you saying that NavigationalDatabases are the fastest? I am not sure I am following you.

For human retrieval of objects, yes.

So non-navigational may be faster if silicon chips do it? Please explain.

{I agree with the skepticism, here. But Navigational does tend to offer higher performance (latency-wise) for StateOfTheArt DistributedSystems. Suppose data is scattered across a thousand machines, albeit with enough redundancy to support high availability. A navigational data query will know where to find that data. A non-navigational query, even with optimizations for quickly locating information by topic and other properties (i.e. distributed indexes and DHTs) will usually need to ask many machines.}

{Still, latency performance for the initial query might not be the most important to many queries. In a MultiCaster system or PublishSubscribeModel system like DataDistributionService, the "query" portion is performed rarely, and then you subscribe and track updates to cached data. Neither the publisher nor the consumer know exactly who is on the other side. But the navigational design does have a major advantage: ObjectCapabilityModel security can be achieved if navigation is the only way to reach an object.}

{I suspect interweaving achieves the best of flexibility, security, and performance: when it comes to organizing data and services, one makes 'relational' queries to 'objects', and those queries return data, and among that data can be references to objects (which in turn may be 'new' objects, or may be 'shared'). MultiCaster objects, DataBase objects, service registries, AbstractFactory objects, search services, etc. are all instances of this pattern. You need to know where to send the query, but then you don't concern yourself with what happens behind the scenes... you only need to know you get an appropriate object back.}

{Note that interweaving is different from simply trying to implement relational queries over the references between objects. The latter design basically guts both performance and security in order to achieve a very dubious level of flexibility. I call it 'dubious' because once the relationships between objects are exposed, they also tend to become constrained by external coupling; encapsulation is violated, as is the flexibility it offers. Interweaving doesn't violate security or encapsulation. The main 'disadvantage' of interweaving is the need to explicitly identify data sources (i.e. which database to use, which MultiCaster service to use, etc.) rather than having these be implicit.}

While something like "select * from filesystem where filename like '%.wiki'" has a certain appeal over the subtleties of unix's find command, pondering what an insert or update would look like might be hazardous to one's sanity.

[A realistic syntax would be much more concise -- "*/%.wiki", for example. (Which research filesystem was it that allowed you to cd to a directory specified by a pattern? I think it was the IntensionalFileSystem?, but I'm not sure.)]

What do you mean? Like any application, perhaps there would be a "controlled" interface to create and update stuff.
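
For the read side of the query quoted above, here is a minimal sketch in Python, assuming one simply snapshots file metadata into an in-memory SQLite table. The table name filesystem comes from the quoted query; the columns dirname, filename, and size and the helper index_tree are assumptions for illustration. It deliberately says nothing about what an insert or update would mean, which is the hard part raised above.

```python
import os
import sqlite3
import tempfile


def index_tree(root):
    """Load one row per file under `root` into an in-memory table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE filesystem (dirname TEXT, filename TEXT, size INTEGER)")
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            size = os.path.getsize(os.path.join(dirpath, name))
            db.execute("INSERT INTO filesystem VALUES (?, ?, ?)", (dirpath, name, size))
    return db


# Demo on a throwaway directory so the example is self-contained.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "notes"))
    for rel in ("FrontPage.wiki", "notes/Todo.wiki", "notes/readme.txt"):
        with open(os.path.join(root, rel), "w") as f:
            f.write("x")
    db = index_tree(root)
    rows = db.execute(
        "SELECT filename FROM filesystem WHERE filename LIKE '%.wiki' ORDER BY filename"
    ).fetchall()
    wiki_files = [r[0] for r in rows]
```

The table is a read-only snapshot: it goes stale as soon as a file changes, which is one reason a real design would need the "controlled" interface suggested above.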

By the way, Oracle built a filesystem on top of a relational database, called Oracle iFS. From the 9i datasheet, "Using familiar interfaces such as Windows, Web browsers, e-mail, and FTP, users see database-held documents and media as files and folders. Database access scales dramatically because more users gain faster access to their mission-critical information. In addition, you provide for valuable content to be secure and searchable from a central location."

I don't want to see the hard drive failure rates of systems running said ErosOs. While the features offered might be neat, the overhead must be a treat - serializing and deserializing all the essential as well as useless and transient memory contents that unless explicitly marked for saving are plain ignored on traditional systems. Also, each and every rarely used data object causing memory overhead due to references from other objects (possibly equally rarely used) - CPU vs. memory efficiency trying to keep that tidy anyone?

From PlanNine's experience, there was little or no problem with the disk IO. The reason for this should be obvious to anyone who's studied address spaces. If you make RAM just a cache for the hard disk, there is no serializing/deserializing involved.
The only problem PlanNine had with its scheme was when they restarted the system. This caused massive lags as the system retrieved a working set from CD and into the hard drive cache.
- KeyKos, on which ErosOs's OrthogonalPersistence support is based, had very fast start-up times. Perhaps PlanNine's implementation just needs more work. In general there's no reason why a system using OP should have start-up times any worse than a laptop coming out of hibernation.

Although a filesystem might have a role in creating a view of the structure of data on a storage device, it is also very strongly connected with the final physical representation of the data, which will end up being unidimensional.

The representation of data in virtual memory is also unidimensional. It is easy to represent any structure unidimensionally.

This mapping of filesystem to datastream should be as efficient as possible - bloating a system with features ignoring performance and overhead because "the hardware can take it" is by no means a professional approach. With this in mind, strict nested structures win over loosely linkable constructs anytime.

Implementations of OrthogonalPersistence do not ignore performance and overhead; their performance characteristics are different from filesystem-based designs, but as often more efficient than less.
In point of fact, orthogonal persistence isn't a "feature". It's a design principle which REDUCES the number of useless features (eg, "saving" and "retrieving") in the OS.

There's no reason not to put a database-like layer on top of the data to organize it and ease manipulation of data independently of the underlying structure - development via abstraction layers has saved many a coder many a headache. But these will inevitably be user-oriented features, and as such have no place in considerations of hardware-tied designs.

Putting a database-like layer on top is an AbstractionInversion.
- I assume this is another way of saying that it is more hardware-intensive. So are GUI's, but they are common anyhow.
{I'd like to contradict both of you: (a) Putting the database layer 'on top' of the persistence features is not an AbstractionInversion. One shouldn't force the persistent storage to be subject to the DML. Any enterprise database will need persistence, but not all persistent systems need a database. Indeed, putting a database layer below persistence would be the AbstractionInversion. (b) In opposition to the earlier statement: the design for persistence must consider the "user-oriented features" it is intended to achieve with high performance. For the programmer, such features include GarbageCollection, support for "large" blocks for multi-media and massive B-tree indexes, support for parallelism and concurrency - atomic reads and writes or SoftwareTransactionalMemory, support for distribution and redundancy, and so on.}

Apparently, WindowsLonghorn will feature such a system.

What, like the object-oriented filesystem we were promised for MicrosoftCairo?

Yes, only it won't get into Longhorn after all. Wait for the next Windows.

It won't get into the next Windows, either. WindowsXP, Windows Vista, Windows 7... Windows has a great deal of inertia to deal with.

As far as I know, they're still planning on WinFS for Longhorn. But it will not be included when Longhorn ships. It will be released as an update, and will be available for Windows XP as well. See http://www.tomshardware.com/storage/20030617/ for an introduction to the WinFS filesystem.

Hierarchical filesystems are just about the disk; the fact that all of the currently used ones have a thin user-interface layer of file and directory names over them does not change that. That, too, is an added feature. The advantage of organizing data in a filesystem is the virtually negligible overhead required to build on numeric file location pointers and such, and this system has proven itself to be functional, reliable, and easy enough for a secretary to comprehend.

An orthogonally persistent system is also built on numeric location pointers; that's not a difference between the two approaches.
Conventional filesystems are not functional, reliable, or easy to comprehend. I suspect we'll just have to agree to disagree on that.

I seriously doubt systems that require high performance run exclusively proprietary software to run;

What does that mean?

also, I don't think operating system producers are going to split up development into two branches with two different filesystems (which due to the paradigmal differences in the systems would be necessary) to suit the needs/wants of all of their customers as opposed to conceiving the performance-eating procedure as an add-on feature it is.

All useful operating systems already support multiple filesystems. As for performance, see below.

I don't see overhead as being a big problem with the Eros system. It's just extending the virtual memory swap space so that it takes over the whole disk, so there's no distinction between memory and disk. There's no serialization cost involved, because the data is already serialized in memory. There's no cost for rarely used objects, because they remain paged out until needed [excepting that rarely-used objects may share a page with heavily-used objects]. I suspect that even writing back the transient objects isn't that costly, because it takes the same amount of time to write a disk block whether it's full or empty.
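
A loose analogy on a conventional OS is a memory-mapped file: the bytes the program works on in memory are the same bytes that end up on disk, with no separate save or load format. The sketch below uses Python's mmap module; the file name store.bin and the helper bump_counter are assumptions for illustration, and this is not how ErosOs itself is implemented.

```python
import mmap
import os
import struct
import tempfile

PAGE = mmap.PAGESIZE

# Backing store: one zero-filled page in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "store.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * PAGE)


def bump_counter(path):
    """Map the file, increment a 64-bit counter in place, and unmap."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), PAGE) as mem:
            (count,) = struct.unpack_from("<Q", mem, 0)
            struct.pack_into("<Q", mem, 0, count + 1)
            mem.flush()  # ask the OS to write the dirty page back
            return count + 1


first = bump_counter(path)    # simulates one "session"
second = bump_counter(path)   # a later session sees the earlier state
```

Each call maps the file afresh, so the second call only sees the incremented value because the first one's in-memory write reached the backing store.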

The biggest cost, it seems, would be GarbageCollection. This normally has very poor locality properties - in order to trace the active objects, it'd have to pull in all live data, which potentially means pulling in the entire disk. That's where you'd get your massive thrashing and hardware abuse.

{Copying GarbageCollection can fix the locality-of-reference problem, even improving locality-of-reference across collections with simple copying heuristics. It also naturally eliminates fragmentation and ensures the disk can support large-block allocations (i.e. for multi-media files). That said, it would take profiling to improve locality-of-access as opposed to merely improving locality-of-reference. Region-based GC can solve the tracing problem by tracking inter-block references (this 1MB region contains a ref to that 1MB region) and thereby allowing one to selectively collect blocks based on what's already in memory. Basic designs that have vmem cooperating with GC have resulted in 40x throughput improvement and 200x pause improvement: (http://lambda-the-ultimate.org/node/2391). Copying and Regional GC can be combined very symbiotically, since regional GC makes for very clean copy-collection regions.}
ErosOs does not use GarbageCollection (see above). [Why repeat it? ErosOs is not the definition of OrthogonalPersistence.]

To fix that, you might be able to use a GenerationalGarbageCollection scheme with a write barrier, so only objects in the youngest generation (presumably all contained in memory) get traced. Combine that with separate tracing and collection phases, like the GC in ParrotVm?, and you could just mark a page as dead when it contains no live objects, and avoid writing it back to disk when you need to swap it out. Major collections are still problematic, but perhaps you could just start using ReferenceCounting when the object gets promoted to an older generation (where the cost of checking and updating reference counts won't bite you, and you'd need to go through the write barrier to create a cyclic structure), and remove blocks from disk when the reference count reaches zero.

The big flaw I see with OrthogonalPersistence is the data corruption aspect. Programmers are usually much more careful in making sure the data written to files is valid; the extra serialization step gives them another chance to check their work. [It also gives them another chance to muck it up.]

There are two aspects to this: corruption introduced by hardware failures, or software failures. The hardware issue is easy to fix, by using redundant disks, and needs to be fixed anyway, regardless of how data is stored (not least for security reasons). The software issue can be solved by using safe languages.
An "extra step" is still needed in an orthogonally persistent OS. However, it is not a serialization step; it is a conversion of the data to a documented format that can be transferred between software systems (including between versions of the same system), as opposed to an internal format designed for processing efficiency. Unlike a file format, this can directly be a tree or a graph, as opposed to a serialization of one, and can include references (capabilities in the case of an ObjectCapabilityOperatingSystem).

Look how often we have to reboot our computers because of a bad pointer - now imagine if we had to wipe the HD for that too.

Advocates of orthogonal persistence generally also advocate use of safe languages (i.e. languages without undefined behaviour).
Even if there were a bug in the OS or language implementation, it's unlikely that any corruption would affect more than one arena/space bank. So data corruption shouldn't be as much of a problem as with current C-based OS's. Versioning and snapshotting of previous states are also easier to support in an orthogonally persistent system.

-- JonathanTang, DavidSarahHopwood and others

File systems are here to stay

The problem is not the tool but the operator. File systems work just fine if you know how to organize your system. Moreover a wiki working locally could work wonders for your file organization. Give it a try and you'll see that a file system is not at all a bad idea when the user knows what he is doing.

Heck! File systems have been around since the beginning of computing (1943) and I fail to see how they could ever be replaced! Files and sub-directories have become the units in all operating systems. They come naturally in our frames of mind. The same way we have the paradigm building/rooms, countries/cities we have sub-directories/files. The concept of files is just intuitive and natural.

Those who grew up with GoogleSearches might disagree. The description of stuff you are looking for is often semantic ("Where's my green bag? It's got my keys in it!") rather than hierarchical ("Hmmm... I thought I left it in the bathroom. Or was it the bathroom in the hotel...?"). Bookmarks and file system shortcuts could be seen as saved queries. The major OS players are integrating relational databases into their future products. Windows is getting WinFS, and Apple just announced SpotLight?, indexing technology from BeOs combined with smart filters for automatically indexing the content of files. This is also partly the motivation behind DocumentCentric? research, where each application finds its own documents. This is not to say that container relationships don't have value for persistent data. Just that it shouldn't be the only or even the primary relationship.

RichardKulisz's OS design, BlueAbyss, satisfies all the OperatingSystemsDesignPrinciples, and #9 of those is OrthogonalPersistence. That means ditching the FileSystem. -- JonathanTang

Are you sure? I read OrthogonalPersistence and it seems Richard is proposing an automatic save of files at the end of each session but he never mentions ditching the file system per se.

: There is some truth in this, but it's incomplete: files become objects, links between directories take precedence over directories (which are just objects), and the 'saving' of files is really publishing a new version of an object to a version control system... which is the file system. This description isn't completely complete or accurate, but it's closer to the intent of BlueAbyss. --WilliamUnderwood

"The operating system just swaps memory out to disk when it shuts down." Swapping operates on the block level; it's very inefficient to do it on the file level. The reference to ErosOs seems to confirm this, as Eros is (mainly) what's being discussed on this page. -- jt

My personal observation is that most existing file systems are a mess. I suppose you could blame it on those who created the classification, but I think that is only a small part of the problem. The fact that you cannot fix bad classifications without changing a jillion existing path references is a symptom of the limits. Maybe there is some compromise solution, such as putting a non-tree classification system and auto-sequence numbering system on top of the existing file system or something.

{I agree with that observation. Worse, there is no agreed way to "fix" bad classifications. That is, all 'path' based classification is "bad" to someone. Data organized relationally or by a LogicProgramming mechanism of some sort seems more appropriate when dealing with data management. As mentioned above, though, the organizing principle can be largely (though, for performance in the face of concurrency, modularity/composition, and security concerns - not entirely) independent of the persistence mechanism. It seems quite reasonable that organization and volatility of data should be largely orthogonal.}

There is a limit to organizing information using a current FileSystem. Nowadays the information contained in an average computer is around 10GB, considering music, videos, documents and all. It is getting increasingly difficult to remember the navigational sequence (where did I put my file?). Therefore, something akin to a relational database is the way to go, where you don't have to remember where you put the file. The computer finds it for you. BTW, take a look at Microsoft's Windows Future Storage (WinFS) to be released with Longhorn. It is similar in kind to the journaling file system from Namesys, Reiser4. -- vhi

As of August 2009, typical desktop systems come with HDDs in the range of 1TB storage. The usage of that storage has increased - but not quite proportionally - for larger multi-media files, higher-quality imagery, and so on. At the 10GB mark, it was getting difficult to find all this data. With two orders of magnitude more storage, well... let's just say that nowadays people can't browse all the images and documents they have. It's often easier to look up a PDF on Google and download it again than it is to find it on local disk! We need something other than the FileSystem. Main memory should be a cache of the HDD, and perhaps the HDD itself should, for a large part, be a 'cache' of the network resources one wishes to access, as well as serving as redundant storage for network resources on behalf of other computers. (Only a relatively small fraction is needed for tracking active processes!) This would suggest favoring programming languages built on principles that support automatic caching of network resources, such as FunctionalReactiveProgramming. (Reactive LogicProgramming is also very promising.)

How about a database that is used to form filesystems, where a user could, for example, log in to find /Home filled with their own stuff, /System filled with system files (certain libraries, kernel, etc.), /Applications with applications they have access to, and /Devices with device files. This would be great for multilingual environments, since the database would store the information with its own internal references. Applications could get their own, especially for build programs (like make) for easier development and backwards compatibility. Case insensitivity would be great, along with a new shell (admit it, bash and sh suck). -- Pingveno

What improvements to shells would you like to see? (I've written shells, it's a long-term interest, so naturally I'm curious)
- One of my major complaints is network transparency. It'd be great to say "cd ftp://aserver.com/", be prompted for any passwords, then be able to use the remote filesystem like a local filesystem. I know something like that can be done with mount points, but not that easily.
  - Network transparency should be ConsideredHarmful -- it is NOT desirable to be able to transparently and accidentally save "myprivatesexualfantasies.doc" to what turns out to be a publicly-accessible Web server, after all. You also don't want to save important stuff where somebody else can mess with or delete it, or (especially on metered connections) keep reading over the net stuff that you could read from a local copy. The boundary between your local storage space and the net should be opaque, for many of the same reasons the walls of your house are, by and large, opaque and its windows have curtains and blinds. I agree, see TransparencyAndUniformity. -- .gz

Maybe we should base future 'shells' from InteractiveFiction:

Your Home
Your apartment is a mess: your garbage bin is overflowing with ancient files and pornography. In this room there are two cluttered desks and a bed. A clock sits on one desk and a calendar is on one wall. Your personal study and library is to the north, while to the east is your multi-media and photography shop. Apartment exit is south.

> i
You are carrying:
  one subversion tortoise [status: cranky]
  your standard file zapper
  a keyring (closed)
  an e-wallet (closed)
  a pda (blinking)
  a portable radio (off)

> x bed
It's your bed. You woke up there this morning. If you sleep there, your computer will also sleep.

> zap bin
(with file zapper)
What would you like to do with the bin?
  1. empty bin
  2. restore documents from bin to original location?

> 1
A brilliant ray flashes from your zapper to the bin, and with a puff of faint smoke, the bin is now empty. Your zapper will be able to restore the deleted objects for the next hour or so.

> look
In this room there are two cluttered desks and a bed. A clock sits on one desk and a calendar is on one wall. Your personal study and library is to the north, while to the east is your multi-media and photography shop. Apartment exit is south.

> x tortoise
Your very own subversion tortoise, version 1.5.5 build 14361. It is old, and cranky because it hasn't been updated recently.

> update tortoise
Downloading (in background).

> (notice:) Your subversion tortoise has been updated.

> i
  one subversion tortoise [status: happy]
  your multi-purpose object zapper
  a keyring (closed)
  an e-wallet (closed)
  a pda (blinking)
  a portable radio (off)

> listen
It is quiet.

> turn on radio
You power on your radio. You should now hear music (if not, check speaker volume). The radio is currently tuned to ".977 the Hitz channel" (station 1, shoutcast).

> reduce volume
Please clarify:
  1. reduce volume on portable radio
  2. reduce master volume

> 1.
The volume on your portable radio is now set to -20dB.

> set radio volume to -17
The volume on your portable radio is now set to -17dB.

> listen
Your portable radio is currently playing Katy Perry, "Waking up in Vegas".
You can't hear anything else.

> x pda
Your pda is blinking with a notice: you are scheduled for a meeting in conference room 624 at 9:30 (about 17 minutes). You clear the notice after acknowledging it.

> x pda
Your very own multi-function all-purpose personal organizer! Use 'help pda' to get a complete listing of functions.

> synchronize pda
Your virtual pda is already set to automatically synchronize with your iPhone, but checking for any recent updates (in background).

> look south
Apartment exit is to the south.

> x south
Apartment exit is to the south. Exiting the apartment will lead you to your online communities and allows you to browse the Internet via this interface.

See nadvsh

It's a thought. I suspect it could be done very well... InformLanguage and other professional InteractiveFiction languages offer a great deal of support for handling:

ambiguous references
intelligent categorizations and 'rules'
adapting 'verbs' to the objects
supporting prepositions (with, through, on, under, to) and command options.
menus of commands
adaptive prose for descriptions

Mostly, one needs more integration for multi-media, persistence, concurrency, and networking. Along with the ability to write new applications, as per a Mud or MOO, in a manner that won't hurt security. InteractiveFiction can extend to 2D, 2.5D, or even 3D by supporting InteractiveSceneGraph and more complex adaptive prose. There may be a few navigational aspects (i.e. go to your photo and media workshop in order to effectively print pictures or splice and edit images), but that wouldn't prevent access to the same functionality through other locations (e.g. the pda), and functionality can largely be automated by relationships... i.e. objects of given 'kinds' will automatically support certain functionality (or packages thereof), such as creating a tuner-object that can be attached to typical 'televisions' allowing them to tune into Youtube, with 'televisions' hooking multi-media & scenegraph caps of the shell.

This has some similarities to ToonTalk... except you get a wand instead of a multi-purpose object zapper...

Perhaps we should list examples of alternatives already in practice.

ErosOs
PlanNine
BeOs: file system was a relational database, user could add indexed metadata to any file objects.
MacOs: original file system was a B-Tree. Files were referenced by volume ID (which B-Tree), parent folder ID (primary key) and filename.
AppleNewton: persistent object store ("soup") with hardware support for object-level swapping. Probably wouldn't scale up to large disks.
PalmOs: uses simple named databases for persistent storage.

I am tired of hierarchical file systems. They grow into big messes over time

Why not try using shortcuts or virtual directories or the like? For Windows, simply create a set of shortcuts on the desktop. I can maintain a nice hierarchy for archival purposes while still having direct access to the subdirectories that are currently in use. When the project, calendar year, or other circumstances change, I merely redirect a link to the new target.

Symlinks suck. They solve nothing. The fact that you say you "merely" redirect symlinks to a new target, merely shows you don't understand. -- RichardKulisz

Why do they suck and solve nothing?

I think it's -partially anyway- about the fact that none of the steps you take to increase productivity are automated? Whenever a path changes, your shortcuts lose their value. Unless you update them manually. Which is in my opinion a very tedious task. I've been struggling with the Windows OS and productivity for years, and this is one point I always encounter. Also, under Windows, some apps allow you to double click shortcuts and enter the directory, while others treat the shortcut as a file and ask to overwrite the shortcut with the data you want to save. A nice example of how much the MS implementation sucks. It shouldn't even be possible to overwrite a shortcut with a data file, because they should be different types of entities. - PeterOdding?

And the default file prompts in many apps often don't recognize shortcuts. Thus, you are forced to do everything thru Windows Explorer if you rely on them.

All of these are reasons why shortcuts (MS symlinks) suck. But symlinks in general suck and one of the reasons, pointed out by Peter, is that symlinks have to be updated manually. This is a symptom of the underlying disease that symlinks are second class links and an inferior substitute for hard links.

What we really want is to be able to have multiple hard links pointing to the same directory so that the target directory has multiple different parent directories. - Actually, you can in general POSIX filesystems, but you have to patch ln to allow it (or write your own version). The reason POSIX specifies that ln mustn't allow that feature is because it is dangerous - your filesystem has to be a DAG of hard links (apart from the "magic" . and .. links) or BadThings? happen when you try to recurse through the filesystem, and it would be rather expensive to check before every move operation that you're not creating a cycle.
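
To make the cost of that check concrete, here is a minimal sketch in Python of a directory graph that allows several parents per directory but refuses any link that would close a cycle. The functions reachable and link_directory and the dict-of-sets layout are assumptions for illustration, not how any real filesystem stores its links.

```python
def reachable(children, start, target):
    """Return True if `target` can be reached from `start` by following links."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(children.get(node, ()))
    return False


def link_directory(children, parent, child):
    """Add a hard link parent -> child, refusing links that would close a cycle."""
    # A cycle appears exactly when `parent` is already somewhere below `child`.
    if reachable(children, child, parent):
        raise ValueError(f"linking {child!r} under {parent!r} would create a cycle")
    children.setdefault(parent, set()).add(child)


tree = {"/": {"home", "archive"}, "home": {"projects"}}
link_directory(tree, "archive", "projects")    # second parent: still a DAG
try:
    link_directory(tree, "projects", "home")   # home -> projects -> home
    cycle_rejected = False
except ValueError:
    cycle_rejected = True
```

The check walks everything below the child, so in the worst case each link or move touches the whole subtree, which is the expense mentioned above.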

I think a DynamicRelational system would be more appropriate for a file system alternative. The attributes desired may vary greatly from project-to-project and creating a formal schema every time could get a bit tedious.

One alternative to filesystems is to simply tag individual objects with metadata and reconstruct a directed graph based on that metadata. You'd start with a flat bag of objects, and you'd aggregate those objects that have a tag in common, then you'd go into the aggregates and create subaggregates and so on. For easier browsability, you probably wouldn't create aggregates that have less than 3 or 4 objects in them.
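Here is one way the aggregation might look, as a Python sketch. The threshold of 3, the function name aggregate, and the "items"/"groups" result shape are all assumptions made for the example, not an established technique.

```python
from collections import Counter

MIN_GROUP = 3  # assumed cut-off, following the "3 or 4 objects" suggestion


def aggregate(objects, used=frozenset()):
    """Build nested aggregates from a flat {name: set_of_tags} mapping.

    Returns {"items": [...], "groups": {tag: subtree}}. An object may show up
    under several tags, so the result is really a DAG drawn as a tree.
    """
    counts = Counter(tag for tags in objects.values() for tag in tags - used)
    groups = {}
    grouped = set()
    for tag, n in sorted(counts.items()):
        if n < MIN_GROUP:
            continue  # too small to be worth browsing into
        members = {name: tags for name, tags in objects.items() if tag in tags}
        groups[tag] = aggregate(members, used | {tag})
        grouped.update(members)
    loose = sorted(name for name in objects if name not in grouped)
    return {"items": loose, "groups": groups}


files = {
    "budget.xls": {"finance", "2005"},
    "taxes.pdf": {"finance", "2005"},
    "invoice.doc": {"finance"},
    "photo.jpg": {"holiday"},
}
tree = aggregate(files)
```

With this sample data only "finance" reaches the threshold, so it becomes the single aggregate and photo.jpg stays loose at the top level.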

How do you do such with binary and proprietary file formats? You risk ruining the file. Some kind of parallel structure may be needed to avoid harming the originals.

But what's this technique called? How well does it work in practice? Are there any fundamental limitations to it? Two obvious limitations are that categories wouldn't be persistent, independent objects so you couldn't annotate them. If you made categories into first-class objects things would become weird. Secondly, it would be greatly annoying to have to tag a collection of objects with AnonymousCollection4385. Which points to a hybrid of some kind.

What I'm wondering about is this fundamental question, is it ever justified to have the structure:

name1 / name2 / name1

where name1 and name1 are lexically identical but semantically the same?

Given the recent invention of a RailsFilesystem, wouldn't it be easy to implement a folksonomy-fs, like http://del.icio.us/? Implementing such a filesystem would involve no modifications to client applications: to tag a file currently in /personal/financial with the "todo" tag, simply copy it into the /todo folder.

There is one bug/feature that this does create though: given the non-authoritative existence of a file in a directory (is /foo/a.txt the same document as /bar/a.txt?) this mandates a global namespace for all files.

As a person striving for the efficient, yet accurate, use of information, the imposition of a global namespace can certainly be seen as a good thing. How many notes.txt documents does the average user have on their hard drive? Would it be so bad to force them to name them "notes taco.txt", "notes salsa.txt" and "notes chimichanga.txt"?

-- MichaelWalker

UniqueIdentifiers

It has been argued that names shouldn't be unique identifiers anyway. Especially since the real result would be for users to name their files "notes.txt", "notes2.txt" and "notes3.txt".

This would be true if one did not want to recover a unique file in a unique space, and would rely upon some other device for discrimination. For some, UniqueIdentifiers are vital and increase their ability to recover wanted documents. --- A file's full path is its unique identifier. For a file that isn't updated, or that shouldn't be considered the same anymore once it is modified, a hash works too. What should go in the example /todo directory is something more like a symlink anyway. Copying things leads to a) wasted disk space and b) copies getting out of sync.
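The hash idea is easy to try out. A tiny Python sketch follows; SHA-256 and the name content_id are just choices made for this example, since nobody above named a particular hash.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Identify a file by its content rather than by its name or path."""
    return hashlib.sha256(data).hexdigest()


# Two copies with identical content get the same identifier,
# whatever they happen to be called or wherever they live.
first = content_id(b"buy tacos")
copy = content_id(b"buy tacos")
edited = content_id(b"buy salsa")
```

The flip side is the one already mentioned: as soon as the file is edited the identifier changes, so this only suits files that are never updated.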

Tagging as a Workaround

I've been kicking around ways to live with messy file hierarchies where overhauling them is not an option. One approach I'm considering is a folder tagging system where a text file in each folder holds one or more key-words. A web-spider-like sifter periodically reads these tags, refreshes a database, and then a "category browser" application is used to search for such categories and launch selected files or open folders. Here's a draft schema:

 table: Folders
 ----------------
 folderID
 path
 descript

 table: Categories
 ----------------
 categID
 tagName
 descript
 aliases     // synonyms

 table: Folder_Categs
 ----------------
 folderRef
 categRef
 rank        // based on order listed
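To make the draft concrete, here is a minimal Python sketch of the sifter using SQLite. The tag file name ".tags", the one-keyword-per-line format, and the function names sift and folders_tagged are assumptions for illustration; the column types are guesses as well.

```python
import os
import sqlite3

TAG_FILE = ".tags"  # hypothetical name; the draft does not say what the file is called

SCHEMA = """
CREATE TABLE Folders (folderID INTEGER PRIMARY KEY, path TEXT UNIQUE, descript TEXT);
CREATE TABLE Categories (categID INTEGER PRIMARY KEY, tagName TEXT UNIQUE,
                         descript TEXT, aliases TEXT);
CREATE TABLE Folder_Categs (folderRef INTEGER REFERENCES Folders(folderID),
                            categRef INTEGER REFERENCES Categories(categID),
                            rank INTEGER);
"""


def sift(root, db):
    """Walk `root`, read each folder's tag file, and rebuild the three tables."""
    db.executescript(
        "DELETE FROM Folder_Categs; DELETE FROM Categories; DELETE FROM Folders;"
    )
    for path, _dirs, files in os.walk(root):
        if TAG_FILE not in files:
            continue
        with open(os.path.join(path, TAG_FILE), encoding="utf-8") as fh:
            tags = [line.strip() for line in fh if line.strip()]
        folder_id = db.execute(
            "INSERT INTO Folders (path) VALUES (?)", (path,)
        ).lastrowid
        # rank follows the order in which the keywords are listed in the file
        for rank, tag in enumerate(tags, start=1):
            db.execute("INSERT OR IGNORE INTO Categories (tagName) VALUES (?)", (tag,))
            categ_id = db.execute(
                "SELECT categID FROM Categories WHERE tagName = ?", (tag,)
            ).fetchone()[0]
            db.execute(
                "INSERT INTO Folder_Categs VALUES (?, ?, ?)", (folder_id, categ_id, rank)
            )
    db.commit()


def folders_tagged(db, tag):
    """Return folder paths carrying `tag`, best-ranked first."""
    rows = db.execute(
        "SELECT f.path FROM Folders f "
        "JOIN Folder_Categs fc ON fc.folderRef = f.folderID "
        "JOIN Categories c ON c.categID = fc.categRef "
        "WHERE c.tagName = ? ORDER BY fc.rank, f.path",
        (tag,),
    )
    return [row[0] for row in rows]
```

To use it, create a database with sqlite3.connect(), run db.executescript(SCHEMA) once, and call sift(root, db) on whatever schedule the spider runs. The "category browser" would then sit on top of queries like folders_tagged.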

Open issues:

Are tags entered manually or via a dialog box? If manual, how are typos fixed?
Can it be done at the file level and not just folder level?

Continuing difficulties with messy file systems have prompted me to think yet more about this issue. Here's an example user interface system for a multi-category-based file system to enhance existing hierarchical file systems.

 *******************Begin Sample Screen***************
 New File Category Dialog (upon save)

 File Name: presentation_stage_5.doc
 Path: foo/bar/etc/blah

 Select from existing key words or phrases:
 *--------------------
 | Budget
 | Customer Complaint
 | Database 
 | Planning
 *-------------------
 Filter above list using: [____](substring) [*clear*]

 Candidate List: 
 *-------------------
 | Budget         
 | Horse Racing
 | Crime        
 | Foo
 *-------------------

 [*Cancel*]  [*Submit*]

 *******************End Sample Screen***************

(Readers with Mozilla-derived browsers will see extra spaces due to a known WikiWikiBug.)

When the user clicks on the scroll-list near the top, the phrase appears in the candidate list below. Unlike the top list, the user can optionally key new entries into the bottom list. A down-arrow or Tab will allow them to move to the next entry point.

When the user presses Submit, any items not in the master category list are displayed for confirmation. The confirmation dialog could look similar to the one shown, perhaps even be the same, but with some re-ordering and highlighting.

The system administrator would determine if or which users can add phrases to the master list. Perhaps there can be a master list and a separate user-managed list, similar to an email system's global address list and local address list. Both would appear in the top list.

Perhaps users could flag phrases they feel should be considered for the master list. Either way, the administrator should review words used often but not in the master list using pre-canned reports for such studies.

It may be useful if synonyms can be tracked or marked. How this would happen needs some pondering.

The existing file path may also be used to pre-assign some phrase categories. Folders could be assigned categories and those categories would be automatically put into the candidate list for any file under the same branch. Thus, the path itself may do half the work.
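A small Python sketch of how the path could pre-fill the candidate list. The folder-to-category mapping and the name candidate_categories are invented for the example; a real system would presumably read the assignments from its database.

```python
from pathlib import PurePosixPath


def candidate_categories(file_path, folder_categories):
    """Collect the categories assigned to every folder above `file_path`.

    Outer folders are visited first, and duplicates are kept only once.
    """
    candidates = []
    for folder in reversed(list(PurePosixPath(file_path).parents)):
        for categ in folder_categories.get(str(folder), []):
            if categ not in candidates:
                candidates.append(categ)
    return candidates


# Hypothetical assignments on two folders of the sample path.
folder_categories = {
    "foo": ["Budget"],
    "foo/bar": ["Planning", "Budget"],
}
prefilled = candidate_categories(
    "foo/bar/etc/blah/presentation_stage_5.doc", folder_categories
)
```

Here the dialog would open with Budget and Planning already in the candidate list, and the user only adds whatever is specific to the file.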

Searching

The search screen would look very similar to the phrase assignment screen, except the second list would be the file match list and the first list would allow free-form phrase entry for those phrases not in the master or standard list(s). Some kind of type-ahead technology could be used to simplify category selection. Date range searches would also be possible:

   Create Date Between: [xx/xx/xxxx] and [xx/xx/xxxx]
   Modified Between:    [xx/xx/xxxx] and [xx/xx/xxxx]
   // first dates blank by default, 
   // second defaults to "now"

Whether some kind of priority ranking should be included needs some pondering. Such may overcomplicate the interface, so we have to be careful about it. It could be useful in cases where not enough exact matches can be found and the search engine wishes to select subsets of our phrase list to search on.

Perhaps have ordering drag-knobs on the right side of the list to place priorities. The search engine would narrow the search based on these if not enough matches are found.

A "search wider" option can be given to ignore progressively lower-ranking match phrases. Or just automatically include the lower-ranking search combinations at the bottom of the file match list. But this requires more horsepower.
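The "search wider" behavior could be sketched like this in Python. The in-memory file-to-phrases mapping, the min_matches parameter, and the function name are assumptions for the example.

```python
def search_wider(files, phrases, min_matches=1):
    """Match files against `phrases`, which are ordered highest priority first.

    If too few files carry every phrase, drop the lowest-ranking phrase and
    try again. Returns the phrases finally used and the matching file names.
    """
    wanted = list(phrases)
    while wanted:
        hits = sorted(name for name, tags in files.items() if set(wanted) <= tags)
        if len(hits) >= min_matches:
            return wanted, hits
        wanted.pop()  # ignore the lowest-ranking phrase and search wider
    return [], []


files = {
    "q3.xls": {"Budget", "Planning"},
    "memo.doc": {"Budget"},
    "log.txt": {"Database"},
}
narrow = search_wider(files, ["Budget", "Planning", "Database"])
wide = search_wider(files, ["Budget", "Planning", "Database"], min_matches=2)
```

Each widening step is another full pass over the candidates, which is the extra horsepower mentioned above.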

-- top

As usual, PornDrivesNewTech:

http://www.kuro5hin.org/story/2005/7/29/152355/970

There's a reason filesystems are hierarchical. A file system is NOT a NavigationalDatabase. The most crucial operations on a filesystem, the ones which define it, are "move a big hunk of stuff to a different physical location" (remove disk and carry), "delete a big hunk of stuff", and "duplicate a big hunk of stuff" (backup disk). These are what make the correct analogy a file system rather than a database -- the ability to easily and quickly throw out an entire file drawer, or move it to a different building. In a database, you can't do that because it breaks cross-references.

Given these as the fundamental operations -- given the physical fact that data will often be stored on different physical storage devices with different size limitations -- the purpose of the hierarchy is to sort data into big hunks which can be individually copied onto different physical media, or individually deleted. Quickly. In other words, the purpose of a file system hierarchy is cp -r and rm -r, period. To the extent that filesystems make this slow or difficult, they are not well-designed as filesystems, and are in fact trying to be databases.

Your filesystem hierarchy should be designed around the principle of nuking entire directories, or backing up entire directories, or mirroring entire directories. If you have data organization needs which do NOT fall along those lines, THEN you want a NavigationalDatabase.

When they grow to a sufficient size and complexity, then mirroring physical filing cabinets is insufficiently powerful. Cross references then become a necessity to find stuff and avoid duplication and stale copies. And what is the boundary of "stuff"? I agree that perhaps they are just a base tool, but they currently don't offer much in terms of standard interface/conventions to connect to a more powerful multi-category-friendly classification and tracking system(s). Some variation of SetTheory is just a better model/metaphor past a certain volume of info.

Generally speaking, all file systems are really Graphs, not Trees, but graphs are too hard to manage except with a ThreeDimensionalVisualizationModel. But to do that well, you have to start with a UnifiedDataModel, which drives a new kind of application development: the MashUp. And this is because AllDataRelatesToOtherData. Add the Internet and you have a DataEcosystem and a WebOS. -- MarkJanssen

Don't you mean "some file systems"? There are strictly hierarchical file systems -- those that support no mechanism for links or multiple references to a single file -- that are organised as a tree. Although every tree is a graph, I presume that's not what you intended. Now corrected, although I am ultimately trying to allude to the idea that, since the purpose of files is to hold data and AllDataRelatesToOtherData, a graph is the most natural form for a "filing" system.

They are tree-centric with light graph-ness sprinkled in as an after-thought.

PageAnchor Top4726

I believe that file systems will have to become more like ContentManagementSystems to be what people keep looking for. A standardized base "record" structure for objects/files would have to be agreed upon (similar to WebGodObjectDiscussion), and various tools could be put on top of that, such as browsers and Gopher-like tools (think file-browser on steroids). Locations of "stuff" would still have a hierarchical path, similar to a URL or URI, but convenient and standardized "graph-y" cross-linking would also be allowed, perhaps using "shortcuts" (which are something the web has not standardized on, unless you count redirection, which has not been standardized sufficiently). However, we don't want mission creep (CreepingFeaturitis) to try to morph it into a generic database or CRUD engine. We may want to allow standardized extension conventions for such, but not include them as part of the base standard.

The "shortcuts" could work as follows: You put a shortcut object in a given folder, and any path that uses that shortcut becomes virtualized at that point in the path.

 Directory of \\myServer\folder1\folder2
   foo.txt
   bar.html
   zaz.xlsx
   myMapShortcut.link    // map showing your home at google.com/maps?address=123blah

To see the google map, one could use the following URL/URI:

\\myServer\folder1\folder2\myMapShortcut
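A toy Python resolver shows what "virtualized at that point in the path" could mean. The nested-dict directory model, the Shortcut class, and the rule that the ".link" suffix may be omitted in the path are all assumptions made for this sketch.

```python
from dataclasses import dataclass

LINK_SUFFIX = ".link"


@dataclass
class Shortcut:
    target: str  # where the path continues once the shortcut is reached


def resolve(root, path):
    """Follow `path` through nested dicts; at a Shortcut, switch to its target.

    Any path components after the shortcut are appended to the target.
    A path that never meets a shortcut is returned unchanged.
    """
    parts = [p for p in path.split("\\") if p]
    node = root
    for i, part in enumerate(parts):
        if not isinstance(node, dict):
            raise KeyError(path)  # tried to descend into a plain file
        if part in node:
            entry = node[part]
        elif part + LINK_SUFFIX in node:
            entry = node[part + LINK_SUFFIX]
        else:
            raise KeyError(path)
        if isinstance(entry, Shortcut):
            return "/".join([entry.target, *parts[i + 1:]])
        node = entry
    return path


# The sample directory from above; plain files are stored as None.
tree = {"myServer": {"folder1": {"folder2": {
    "foo.txt": None,
    "bar.html": None,
    "zaz.xlsx": None,
    "myMapShortcut.link": Shortcut("google.com/maps?address=123blah"),
}}}}
mapped = resolve(tree, r"\\myServer\folder1\folder2\myMapShortcut")
plain = resolve(tree, r"\\myServer\folder1\folder2\foo.txt")
```

A browser built this way could inspect the resolved target before following it, which would be the natural place for the warning message.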

However, the security implications of such need to be evaluated. A warning message should perhaps be part of the browser.

One benefit of a WCMS-influenced approach is that one can create web-page-like interfaces AND/OR use an advanced file-browser-like utility to sift content, and switch between them as needed. For example, one may want to associate an image or thumbnail with an object/folder/file, along with a description paragraph or synopsis. In "summary" or "list" mode, a browser would only show the title and perhaps a small thumbnail, similar to a typical Windows file browser. But in "full" view (or semi-full view), it would look more like a primary (main) web page with synopses and/or bigger thumbnails. (Kind of like http://www.cnn.com or http://www.latimes.com ) And perhaps allow custom views via custom templates (both by browser user and by folder "authors").

One will probably also need "link records" that facilitate cross linking and/or merging of objects (folders or files). There will probably have to be "kinds" of links or associations to help the browser or browser user have a feel for what they are looking at. But these "kinds" would be loosely enforced such that one can be used in place of the other, and one "kind" may morph into the other anyhow. For example, if we declare an object to be a "link" (which is similar to a folder, but non-hierarchical), but later decide we want to supply our own content because the destination has moved, we can change that object to be a content object and put content in it directly. It would still function for the same purpose in most circumstances. There would be at least these object types/categories: "content" (object or file), "folder" (hierarchical relationship), and "link" (graph relationship). "Related" ("see also") may also be one.

Some kind of semi-simplified query language/notation may also be available to collapse hierarchies and cross-links into a "single listing" so that the user/browser doesn't have to dig into hierarchies. To borrow from DOS-like and SQL-like syntax, such "queries" could resemble:

  SELECT * FROM
  DIR \\x\foo\bar\*.* /s   [note: "/s" means include sub-folders]
  DIR \\yy\blah\*.* 
  DIR X:\miff\zerp\*.txt /s
  AND modifDate BETWEEN '2013-10-01' AND '2013-10-15'
  ORDER BY title, modifDate

These would all be appended together. Date ranges and other do-dads might also be permitted, but per above, we don't want to bloat up the standard. Perhaps include some basic/common query options, but allow an escape clause for custom or vendor-specific query commands so that fancy query options are possible without having to bloat up the standard. For example, a "basic" SQL clause similar to above may be part of the standard, but JOINs etc. perhaps should be considered add-ons. Remember, the content may come from a variety of sources such that complex queries may confound or overload the system. Appending, filtering, and sorting DIR-like results is reasonable. But features like JOIN may be very expensive or bandwidth hogs, depending on a given situation. (Servers may want to limit the size of requests to avoid overly large crawls, at least without special permissions. A protocol for warning messages if the quotas are exceeded may be a consideration.)
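The "append, filter, sort" core is small enough to sketch in Python against an ordinary local file system. The function name dir_query, the (root, pattern, recurse) source tuples, and the use of the file name as "title" are assumptions for illustration; remote sources and quotas are left out.

```python
from datetime import datetime
from pathlib import Path


def dir_query(sources, modified_between=None):
    """Append DIR-like listings from several sources, then filter and sort.

    `sources` is a list of (root, pattern, recurse) tuples, where recurse
    plays the role of the "/s" switch. Each result row is
    (title, modifDate, full_path), with the file name standing in for title.
    """
    rows = []
    for root, pattern, recurse in sources:
        base = Path(root)
        matches = base.rglob(pattern) if recurse else base.glob(pattern)
        for p in matches:
            if p.is_file():
                modified = datetime.fromtimestamp(p.stat().st_mtime)
                rows.append((p.name, modified, str(p)))
    if modified_between is not None:
        start, end = modified_between
        rows = [row for row in rows if start <= row[1] <= end]
    return sorted(rows)  # ORDER BY title, modifDate
```

A call such as dir_query([("some/root", "*.txt", True)], (datetime(2013, 10, 1), datetime(2013, 10, 15))) corresponds roughly to one DIR line plus the date clause of the sample query. Anything fancier, like a JOIN across sources, is deliberately absent.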

Sure, this sounds messy and organic, but sufficiently complex information repositories generally become organic whether we want them that way or not. It's better to find ways to work with organic relationships rather than try to avoid them, which is what our engineering tendencies often urge us toward.

See TopsFileSystemAlternative for more.

--top

FileSystem

CategoryOrganization CategoryFileSystem, CategorySpeculative