I seem to have version control systems on the brain somewhat these days. Until not very long ago, I didn’t think much of them: they’re obviously important to have around if you’re collaborating on software, but other than that, who cares? But then I started getting addicted to looking at diffs after having checked in code, and thought that, perhaps, they might be important if you’re working on software even if you aren’t collaborating. And then I started getting more involved in the administration of the server hosting this blog, and began to appreciate that they’re useful for storing other kinds of information: in particular, we store all the configuration information in Subversion repositories.
And then Jim Blandy pointed me at the figure in this discussion of ZFS and this figure explaining Subversion’s “bubble-up method”. (Hey Jim, start updating your blog and I’ll actually link to it! :-) ) And I started using Time Machine; as John Cowan reminded me, I’m using it in ways that are seriously lacking from the “catastrophic recovery” viewpoint of backup (if my house burns down, I lose that backup, too), but of course backups are useful for other reasons, namely to save you if you make a stupid mistake and accidentally delete something you wish you hadn’t.
So now I’m starting to wonder: are we moving to a world where VCS-like functionality will be considered a part of basic filesystem functionality? At the very least, every time I create a new directory, I now ask myself if I should place it under revision control; frequently the answer is “yes”. For example, I’ve now placed my GTD files under version control; this seems like a particularly good fit, because GTD is all about removing subconscious worries, and now I don’t have to worry that I’m deleting information as I check off steps in a project that I might want to refer to later! (I’m using Mercurial for that repository, initially for a change of pace and because it’s one of the cool new kids on the block, but after thinking about it more, I can easily imagine wanting to clone that repository to this laptop and make commits while I’m off the net.)
There are some interesting design decisions here. On a basic level: do I want to back up everything (with a revision history), or do I want some files/directories to be excluded? (Browser caches are one candidate for exclusion.) And do I want to be able to rewrite history? In particular, do I want to have the ability to remove confidential information from all backups?
How intentional should the creation of new revisions be? I can imagine a filesystem that stores every single change as a separate revision, or one that takes revisions periodically (as Time Machine does). But, in a VCS, revisions generally serve some sort of communication purpose: if I’m working on source code, I don’t want every time I save a file to be reflected in the revision history; I want control over when I commit. That’s not so clear in other instances; I’m thinking of setting up a cron job to do a commit on my GTD info every night, and reserving manual commits for special occasions.
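For what it’s worth, here’s a minimal sketch of what that nightly commit might look like, assuming the GTD directory is already a Mercurial repository and that cron runs the script once a night; the path and the commit-message format are just placeholders.

    #!/usr/bin/env python
    # Hypothetical nightly auto-commit for a GTD directory tracked by Mercurial.
    # Assumes `hg` is on the PATH and that GTD_DIR is already a repository;
    # the path below is just a placeholder.

    import datetime
    import subprocess

    GTD_DIR = "/home/me/gtd"

    def nightly_commit():
        # Only commit if the working copy has actually changed since last night.
        status = subprocess.run(
            ["hg", "status"], cwd=GTD_DIR, capture_output=True, text=True, check=True
        )
        if not status.stdout.strip():
            return  # nothing to record tonight

        message = "automatic nightly commit, " + datetime.date.today().isoformat()
        # --addremove picks up newly created and deleted files without a manual `hg add`.
        subprocess.run(
            ["hg", "commit", "--addremove", "-m", message], cwd=GTD_DIR, check=True
        )

    if __name__ == "__main__":
        nightly_commit()

A crontab entry pointing at the script once a night would take care of the automatic commits, while manual commits stay available for the special occasions.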
I’m also wondering what other lessons filesystems could learn from VCSes. For example, distributed VCSes are all the rage these days; could networked filesystems learn something from that? (Though I can’t quite envision how the merge problem will get solved there.) Are there situations (cloud computing?) where we can get some mileage from relaxing consistency guarantees and viewing different filesystems as repositories with a genetic relationship, with changes periodically pushed/pulled in one direction or another? Maybe we should pay more attention to asynchronous notifications, either in the filesystem world or the VCS world; I spend enough time looking at networking problems through a RESTful lens that I get the feeling that asynchronous notifications are a bit out of style, but I’m appreciating them at work more and more these days. (Hmm, Twitter suggests that they’re not out of style at all, doesn’t it?)
No big conclusions, and I’m sure other people have been aware of some of these issues for decades, but I’m enjoying noticing these things.
Plan 9 from Bell Labs had a file server that stored everything on a WORM drive, with new blocks queued up to write out once per day. Disk was treated as just a cache of the WORM drive contents.
I think they didn’t go quite far enough: there is no reason to restrict oneself to once a day. A modern journalling filesystem may have several copies of the same file already on disk. It might be fun to push this to see how far it could go by adopting a simple “never delete anything” mantra, while making use of “data deduplication” to avoid storing redundant blocks.
3/19/2008 @ 9:24 pm
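To make the “never delete anything, dedup the blocks” idea a bit more concrete, here is a minimal sketch of a content-addressed block store: each block is keyed by a hash of its contents, so blocks that recur across files or across revisions are only stored once. The block size and the in-memory dictionary are purely illustrative; a real filesystem would persist the blocks and worry about collisions and garbage (or, per the mantra, the lack thereof).

    import hashlib

    BLOCK_SIZE = 4096  # arbitrary block size, for illustration only

    class BlockStore:
        """Stores each distinct block once, keyed by a hash of its contents."""
        def __init__(self):
            self.blocks = {}  # content hash -> block bytes

        def put(self, data):
            key = hashlib.sha256(data).hexdigest()
            # If we've already seen this exact block, nothing new gets stored.
            self.blocks.setdefault(key, data)
            return key

        def get(self, key):
            return self.blocks[key]

    def store_file(store, path):
        """Record a file as a list of block hashes; shared blocks cost nothing extra."""
        keys = []
        with open(path, "rb") as f:
            while True:
                chunk = f.read(BLOCK_SIZE)
                if not chunk:
                    break
                keys.append(store.put(chunk))
        return keys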
Oh yeah, I’d forgotten about that! Hmm, maybe I should reread the Plan 9 papers one of these days…
3/20/2008 @ 10:44 am
What I don’t understand is this. If you have a bunch of different versions of the same file saved, what happens when you move, as I have, from the “I know what path my file is in” Unix model to the Mac model, where you don’t know what path anything is in and you just spotlight for its keywords every time you need it? Wouldn’t you constantly be accidentally making changes to the wrong version?
3/20/2008 @ 7:41 pm
The Mac has a traditional filesystem underneath, and Spotlight doesn’t (at least normally?) search through older Time Machine versions. (And then Time Machine lets you navigate through different versions of the file system over time.)
Though it does seem like directories are an idea that is starting to, in some circumstances, outlive its usefulness; I know that, with my GTD tradition, I wish that I was using a mail client that was based on tagging instead of folders. But that’s another blog post…
3/20/2008 @ 8:12 pm
Don’t forget — even VMS had automatic file revisioning, of a form!
I think of revision control in the developer’s source-control sense and automatic file-system-based revisioning as serving different purposes, and solving different problems, even though they are the same “thing” and may be built on the same technology. When programming, a “check in” is usually a positive statement about a piece of work: “This file (or this set of files) is in a state that merits being recorded”. (Different groups and different projects will have different standards for what merits being recorded; e.g., some might require that a check-in compiles, some that it pass unit tests, others may be more free-wheeling, etc.). The stricter a given development team’s requirements for check-in-ability, the longer the potential time between check-ins. In those between-check-in interstices a developer may be very happy to have the automatic revisions inherent in a filesystem. I know there have been times when I have changed my mind about how I was implementing something and wanted to back up, but not all the way to the latest check-in, and have been glad for emacs’s automatic incremental file backups.
I admit that I haven’t crossed over to the new Macs, so I don’t have a feel for how Time Machine works, though from what I’ve heard about it, it reminds me of NetApp’s snapshots (though snapshots don’t use a new disk).
3/31/2008 @ 5:41 am
It makes me happy that I have blog readers who remind me of the behavior of Plan 9 and VMS…
4/5/2008 @ 9:35 pm