The End of Hierarchical File Storage

Most people never learned to organize files on their hard disks. If you’re reading this, I’m probably not talking about you. I’m talking about the kind of person who has every document they’ve ever created on their desktop, or scattered across their hard drive.

But even those of us who know how to use subdirectories and document trees have a problem. On my ‘media’ drive (which contains all of my MP3s, video clips, downloaded files, and so on), I have approximately 12,000 files. My home directory on my PowerBook, where I keep most everything else useful, has a mind-boggling 63,000 files in it, though lots of those are from CPAN builds and the like, and are hidden in directories I’ll never see. Even excluding those dotfile directories, however, I still have an unbelievable 57,000 files. There should be no overlap at all between this disk and the media drive, which means roughly 69,000 files that I have to look after. And that doesn’t count applications or system files.

With all of those files, the idea of creating a single hierarchical taxonomy to sort this data seems daunting to me, and possibly absurd. But I think that there’s a better way, a two-pronged solution to making it possible for me to find the veritable needle in my rather substantial haystack.

The first prong of this two-pronged method is full-text searching. Google is making this real on PCs, I am told, and Spotlight should be doing it for the next generation of OS X. Full-text search will certainly make it easy to navigate my approximately 600MB of archived e-mail and my thousands of IM and IRC logs. It should also work for all of my MagicPoint presentation slides, and possibly even for my browser cache, if the index is updated frequently enough. (Imagine all those “I just saw it today; where could it have possibly gone?” complaints vanishing.) For those of us who are digital packrats, I don’t think that I need to explain the advantages of full-text search.

But sometimes full-text search is too granular. What would be as useful, and certainly more useful for non-text data, would be tagging, a la del.icio.us and Flickr. Even less net-addicted people find that iTunes does a better job of finding music via the search feature than if they had it stored in any number of directories.

For the most part, I’ve given up on bookmarks in my Web browser. I had the same problems as with filenames, but the additional trouble of limited screen real estate and a more limited toolset for manipulating bookmarks. My solution has been to leave buttons on the toolbar for things I hit every day, a drop-down menu full of blogging-related stuff (send this to del.icio.us, send this to blo.gs, and so on), a menu of stuff I prefer to keep private (electronic banking, mostly), and to handle everything else via del.icio.us.

Not only do I post most anything interesting to my bookmarks now, but because I tag heavily, I can find anything almost instantaneously. The Quicksilver plug-in lets me bring up commonly-referenced pages with just a few keystrokes, and I can use either the web interface or Cocoal.icio.us to sort through and find anything else.

The nice thing about tags is that they’re arbitrary, and over-tagging causes little to no difficulty. If I like a Frank Rich column in the New York Times about Clint Eastwood’s Million Dollar Baby, I will typically tag it with “rich,” “frankrich,” “times,” “nytimes,” “nyt,” “eastwood,” “clinteastwood,” “milliondollarbaby,” “film,” “criticism,” and anything else that is relevant. Because tags are non-hierarchical, they’re quick to write: I don’t need to think logically about which of those is the primary piece of information I’ll need to sort it appropriately. I can combine tags in searches: if I remember a New York Times article about Million Dollar Baby, I don’t need to remember that it was Frank Rich who wrote it, or that it was about euthanasia.

It would be wonderful to tag arbitrary files. A standard Finder/Explorer interface could easily work for going through tags. To continue with our Frank Rich example, I might first click on ‘nyt.’ Not only would all of the files listed with that tag appear in the window, but also ‘folders’ with the names of all of the other tags that appear alongside ‘nyt.’ There would be folders for “politics,” “frankrich,” “maureendowd,” “terrorism,” “culture,” “williamsafire,” and so on. Clicking on the “frankrich” folder would show all files that have the tags “nyt” and “frankrich,” and all of the other tags that appear at the intersection of those two tags.
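To make that browsing idea concrete, here is a rough sketch in Python. Everything in it is made up for illustration: the TagIndex class, its tag and browse methods, and the file names are my own assumptions, not any real Finder or Explorer API. It only shows that "files with all selected tags, plus folders for the tags that co-occur with them" takes little more than a set intersection.

```python
from collections import defaultdict


class TagIndex:
    """Hypothetical in-memory tag index; not a real filesystem API."""

    def __init__(self):
        self._files_by_tag = defaultdict(set)   # tag  -> files carrying it
        self._tags_by_file = defaultdict(set)   # file -> tags it carries

    def tag(self, path, *tags):
        """Attach any number of tags to a file."""
        for t in tags:
            self._files_by_tag[t].add(path)
            self._tags_by_file[path].add(t)

    def browse(self, *selected):
        """Return (files, folders) for the tags clicked so far.

        files   -- every file carrying ALL of the selected tags
        folders -- the other tags found on those files, which a
                   browser window could show as clickable 'folders'
        """
        if selected:
            files = set.intersection(
                *(self._files_by_tag.get(t, set()) for t in selected)
            )
        else:
            files = set(self._tags_by_file)
        folders = set()
        for f in files:
            folders |= self._tags_by_file[f]
        folders -= set(selected)
        return sorted(files), sorted(folders)


# Invented example data, loosely following the Frank Rich example.
index = TagIndex()
index.tag("rich-eastwood.html", "nyt", "frankrich", "film")
index.tag("dowd-column.html", "nyt", "maureendowd", "politics")
index.tag("safire-column.html", "nyt", "williamsafire", "politics")
index.tag("other-review.html", "film", "criticism")

nyt_files, nyt_folders = index.browse("nyt")
rich_files, rich_folders = index.browse("nyt", "frankrich")
```

Clicking ‘nyt’ corresponds to browse("nyt"), and clicking the “frankrich” folder inside it corresponds to browse("nyt", "frankrich"). The order of the clicks doesn’t matter, which is exactly what a directory tree can’t give me.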

When saving files, below the filename there could be a field to type in space-delimited tags. Heck, I doubt I’d have much use for filenames if I could tag files. Why name a file “landlord-leaky-pipes-complaint” when you could put all of those items in as tags? Applications could automatically tag files with their date and the name of the application that created them, so I could look for “msword landlord kitchensink” and find the appropriate letter.
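A save dialog along those lines wouldn’t need much logic. The sketch below is only my guess at it: the function names tags_for_save and matches are hypothetical, the date is an arbitrary example, and I’m assuming tags are lowercased and the date is stored in ISO form.

```python
from datetime import date


def tags_for_save(typed, app_name, today=None):
    """Build the tag set for a file being saved (hypothetical helper).

    typed    -- whatever the user typed into the space-delimited tag field
    app_name -- name of the saving application, added automatically
    today    -- save date; defaults to the current date
    """
    today = today or date.today()
    tags = {t.lower() for t in typed.split()}
    tags.add(app_name.lower())
    tags.add(today.isoformat())   # e.g. "2005-02-01"
    return tags


def matches(file_tags, query):
    """True if the file carries every tag in a space-delimited query."""
    return set(query.lower().split()) <= file_tags


# Arbitrary example date, chosen only for illustration.
letter_tags = tags_for_save(
    "landlord kitchensink complaint", "msword", date(2005, 2, 1)
)
found = matches(letter_tags, "msword landlord kitchensink")
```

The user types three words, the application contributes two more tags for free, and the search succeeds without anyone ever having picked a filename.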

The big difficulty with tags is portability. Not in the sense of making the tags comprehensible to other people (I doubt that they would be any more trouble than filenames in that regard), but in transferring the metadata along with the file when it moves over the Internet, between people or between computers. Some file formats solve this problem by carrying metadata inside the file itself (MP3s, Word documents, and so on), but too many file types make no such provision for a tagging system to rely on internal metadata. Still, even if nobody else could use them, tags would replace hierarchical file storage for most of my needs.

3 thoughts on “The End of Hierarchical File Storage”

  1. Jon Posts Stuff

    Jon posts a lot of stuff. Some of it is silly, some of it is not. And then there was this article. I didn’t even read it and I knew it was silly… Just like Jon. I need a beer….

  2. Well, I like the idea of file tagging and am all for ways to do it. What I think I’d really like, though, is a filesystem that functions as an RDBMS with tags maybe in their own linked reference table.

  3. I thought about that, Amir — BeOS did it that way, actually — but as I thought, I became convinced that less is more: unstructured tagging makes it easier to do, and tagging profusely is trivial. Hierarchies have problems because you have to keep a whole logical system of organization in your head, and relational databases have the same problem. The mental ‘cost’ of doing that is IMHO too high; the mental cost of tagging is minimal, but gets 95% or so of the ‘finding what I’m looking for’ benefits.
