Whenever you rip a CD with iTunes, it looks up information about the contents of the CD in Gracenote’s CDDB: as hard as it is to believe in this world of digital information, a CD contains no information about its contents other than the number and length of its tracks and, of course, the music data itself. I have political reservations about CDDB, unfortunately: it started as a free project, and many people (including myself) happily contributed data to the database, but now it’s private. Also, I believe they’ve filed software patents covering their techniques (and are trying to use those patents to squelch others).

Those issues aside, though, my experience with the database is much better than my experience with other online CD databases over the years (including earlier incarnations of CDDB). I have yet to stump it; while it’s not too surprising that it knows about Beatles albums, I’ve also tried foreign CDs (some Herbert Groenemeyer, the Katamari Damashii soundtrack), and it knew about them (giving me the option of either Japanese or English titles for the latter). And it hasn’t been stumped by some of the more obscure (I assume) CDs I’ve tried, such as the excellent Trilectic or this CD of erhu music by Zhao Rongchun that I can’t even find on Amazon. (Hmm: come to think of it, it’s entirely possible that I entered the data for the latter CD into their database myself years ago…)

Having said that, their information is by no means perfect. They list Herbert Groenemeyer variously as “Herbert Groenemeyer”, “Groenemeyer, Herbert”, or simply “Groenemeyer”. (With the genre listed as jazz, world, or pop.) And the information they list for (both of) the Glenn Gould Goldberg recordings is just bizarre: they have “Goldberg Variations” as the title of each track, with what should be the track names (“Variation XXI”) listed as the artist for the track!

I can edit this information myself in iTunes. But I wish it were easier to do so, and I can just see myself having to do the same thing again in a few years when re-ripping the CDs in some other context. Wouldn’t it be better if I were to keep my own private database about all the CDs that I own, and then use that to generate the metadata that the iPod needs? My understanding is that it keeps all this information in an XML file: surely I should be able to write software to generate that out of my own database? Which I would probably also store in an XML file: it seems well-suited to the purpose, and I really should learn about XML anyway.
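To make the idea concrete, here is a minimal sketch of writing a private catalog out as XML with Python's standard xml.etree.ElementTree module. The element and attribute names (catalog, album, track) and the sample data are my own assumptions for illustration; this is not the format that iTunes or the iPod actually uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical private catalog. The field names are invented for this
# sketch and are not anything iTunes or Gracenote defines.
albums = [
    {
        "title": "Goldberg Variations",
        "artist": "Glenn Gould",
        "tracks": ["Aria", "Variation I", "Variation II"],
    },
]


def catalog_to_xml(albums):
    """Build an XML tree with one <album> per entry and one <track> per title."""
    root = ET.Element("catalog")
    for album in albums:
        album_el = ET.SubElement(
            root, "album", title=album["title"], artist=album["artist"]
        )
        for number, name in enumerate(album["tracks"], start=1):
            track_el = ET.SubElement(album_el, "track", number=str(number))
            track_el.text = name
    return root


root = catalog_to_xml(albums)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Generating whatever file the player really wants would then be a second, separate translation step from this private format.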

So: how should I organize the information? The Herbert Groenemeyer example points at one key rule: every entity should have a unique representation in the database. So if I decide that, say, I should switch from “Bach” to “Bach, J. S.” or “Bach, Johann Sebastian”, everything should be updated seamlessly.
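One way to get that property, sketched here in Python, is to store each artist's name exactly once and have every track refer to the artist by a stable key. The class names, the key "bach", and the track titles are all hypothetical; this is just one possible arrangement.

```python
from dataclasses import dataclass


@dataclass
class Artist:
    artist_id: str     # stable key (hypothetical); never changes
    display_name: str  # the one place the spelling lives


@dataclass
class Track:
    title: str
    artist_id: str     # a reference to an Artist, not a copy of the name


artists = {"bach": Artist("bach", "Bach")}
tracks = [
    Track("Goldberg Variations: Aria", "bach"),
    Track("Goldberg Variations: Variation I", "bach"),
]


def display_artist(track):
    """Look the name up at display time instead of storing it per track."""
    return artists[track.artist_id].display_name


# Renaming touches one record, and every track picks up the change.
artists["bach"].display_name = "Bach, Johann Sebastian"
print([display_artist(t) for t in tracks])
```

The cost is an extra lookup whenever a name is shown, which seems like a fair price for never having three spellings of the same person.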

The Goldberg Variations example raises another point: a CD isn’t just a CD, it’s a distribution of a recording of a particular piece. I would probably want to have one entity representing the composition itself, another entity representing, say, Glenn Gould’s 1955 recording of the composition, and a third entity representing something even more concrete (the actual CD I own, as opposed to a record of it that was released in the 1950s). Exactly what this third entity should represent isn’t entirely clear, however, and one can imagine multiple levels here: a CD version is probably different from a record version, but is an older CD remastering of the recording different from a newer CD remastering of the recording? Is the same remastering but with a different booklet the same thing? Should I have this third entity at all, or just make my distinction at the level of recordings instead of releases?
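As a rough sketch of the three levels, assuming the third entity is kept and called a release: the class names, the medium field, and the free-text note field below are my own guesses at a starting point, and the open questions about remasterings and booklets are simply pushed into that note.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Composition:
    title: str
    composer: str


@dataclass(frozen=True)
class Recording:
    composition: Composition
    performer: str
    year: int


@dataclass(frozen=True)
class Release:
    recording: Recording
    medium: str     # "CD", "LP", ... (assumed vocabulary)
    note: str = ""  # remastering or booklet details, if they turn out to matter


goldberg = Composition("Goldberg Variations", "Bach, Johann Sebastian")
gould_1955 = Recording(goldberg, "Glenn Gould", 1955)
original_lp = Release(gould_1955, "LP")
my_cd = Release(gould_1955, "CD", note="the copy on my shelf")


def releases_of(composition, releases):
    """Find every release, on any medium, of any recording of a composition."""
    return [r for r in releases if r.recording.composition == composition]


print(len(releases_of(goldberg, [original_lp, my_cd])))
```

Both releases point at the same recording object, so a correction to the recording or to the composition shows up under each of them.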

And if I’m building a CD database, why not extend it to other media? I have all these books, and if my house burns down I’m going to want to have a record of them to show the insurance company. And every time I finish a book, I rate it at Amazon (and have, in the past, rated it at other web sites): I really would like to keep track of those ratings, both for my own purposes (so I don’t have to re-enter them again if I find a better web site for recommendations) and to, say, stick them on my web page (here are the last 20 books David read, with their ratings). How should the organization of the book data be similar to or different from that of the CD data: can we make a distinction between the story contained in a physical book (which is like a musical composition) versus an edition of that story (which is like a recording) versus a physical book or a specific printing of the book (which is like a CD)?
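The ratings half of this is simple enough to sketch. Assuming each finished book gets a record with a finish date and a rating (the 1 to 5 scale, the field names, and the placeholder titles and dates are all invented for illustration), the "last 20 books" list is just a sort:

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Reading:
    title: str      # placeholder titles below, not real data
    finished: date  # placeholder dates
    rating: int     # 1 to 5; the scale is an assumption


readings = [
    Reading("Placeholder Book A", date(2004, 11, 2), 4),
    Reading("Placeholder Book B", date(2004, 12, 5), 5),
    Reading("Placeholder Book C", date(2004, 10, 20), 3),
]


def last_read(readings, count=20):
    """Return the most recently finished books, newest first."""
    return sorted(readings, key=lambda r: r.finished, reverse=True)[:count]


for reading in last_read(readings, count=2):
    print(f"{reading.title}: {reading.rating}/5")
```

Because the ratings live in my own records, re-entering them at a new recommendation site would become an export job rather than a memory test.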

I could go on for a while (creating “virtual CDs”, populating the database by snarfing info from Amazon, the roles XSLT might play), but I’ll stop for now. I have some free time coming up at the end of the month; maybe I’ll try to cobble together some simple software over the holidays. Despite the iPod being the motivating example, though, I think I’ll start with organizing my book collection first. I guess I should go out and buy a book about XML (and XSLT), and one or more Java books (since, for example, the books that I have don’t talk about GUIs).
