Despite my earlier planning, I still haven’t done any work towards switching dbcdb to an SQL backend. Part of this is other things intervening in my life, but most of this is the inertia of starting something new, and that I haven’t broken tasks down enough.
Today is a holiday, so I can spend some of the day programming; let’s tackle those head-on. I did install MySQL a couple of months ago, but I’ve never really used it, and there’s a difference between reading about something and actually using it. So I’ll work through some of the exercises in Learning SQL today.
And the other thing that I’ll do is come up with a schema. I’ll do that in this post; doubtless many of you know more about SQL than I do, so hopefully I can get some useful suggestions.
All the pages are reachable via a unique number: it’s not that books and authors have a separate number 1, but rather item 1 is an author, item 2 is a book, etc. Let’s call these IDs; presumably I need a table mapping IDs to the kind of item they represent. (Is saying the word “mapping” bad form when discussing SQL? Should I say “relating” instead?)
That “kind of item” can just be a number: 1 = book, 2 = author, 3 = series, etc. If I wanted, I could create another table documenting that; should I do so? It’s not clear to me what immediate value it would have; I suppose that, in some future world, I could stick enough information in that table to be able to generically generate the index pages for each type, but that seems pretty far-fetched. So it would only serve as documentation. And, while I have nothing against documentation, it’s not obvious to me that it’s appropriate in this case. I guess a raw number is a bit stark, though; some sort of enum might be more appropriate.
I’ll presumably have one table for each kind of page. What will the author table have on it? The first column will be the ID. And then there’s the name; eventually we’ll want to separate out first and last names (and suffixes (Jr., III), and whether or not the author is Asian, non-Asian, or nonhuman, and probably other things I’ve forgotten), but right now I don’t need that extra information, so let’s just include one name field and refactor things later as necessary.
The page for an individual author also has links to the books and series by that author, but that information is best stored in the book / series tables instead of the author table.
What about books written by multiple authors? The only kind of compound author that is currently supported is a book with two authors, but it will probably be just as easy to design the schema so that it supports an arbitrary number of authors. So I guess I have a table whose first column is the compound author’s ID and whose second column is the ID for one of the individual authors making up the compound author. So a single compound author will be represented by multiple rows, all of whose entries in the first column are identical.
But order matters, too: Fred and George is a different author from George and Fred. We could say that, for Fred and George, the row with Fred’s ID in the second column comes before the row with George’s id in the second column. But does SQL have a notion of “before”? I’m not completely sure. And, if for some reason I wanted to edit the Fred and George entry to change it to, say, Fred, Bill, and George (unlikely, but whatever), then depending on the ordering in the database would make that a little more annoying. So maybe I’ll add a third column, which is the order of the individual within that compound author.
Now we have two tables representing authors, which bothers me a little bit. One possibility: I could add all individual authors to the compound author table: they could be a compound author containing only one author! Being a mathematician, I like using boundary cases like that. (Hmm: the zero-author boundary case is impossible to represent. Do I care? Probably not.) But, after thinking about it for a little while, I’m not so sure: it imposes an extra constraint on the data (that every individual author will also have a row as a compound author), and I don’t like that. I could try reinterpreting the data somehow (the individual author table doesn’t really represent authors, it represents people), but that would just be trying to dance around a real issue.
I guess the deciding factor is that, even if I leave out the one-author compound authors, I can still build up that table by doing a union of select id, id, 1 from the single author table with the multiple author table. Great; as long as I can create it as needed, then no need to stick in those extra rows explicitly. Hmm: is it possible to create a sorted list of authors (once I get the first name, last name split in place) that handles compound authors correctly by doing a single SQL query using that extended compound author table? Not obvious to me how to do it, but my SQL skills are nonexistent; it wouldn’t suprise me at all if it were possible.
One more serious point that arises from my trying to make a distinction between an author and a person: given that I don’t have any authors without any books listed, there’s another way to get the list of all authors from the database, namely by searching the book table and the series table for their author columns. (I’ll talk more about the book and series tables in a bit.) Hmm: maybe I should think of the author table as, in fact, a creator table: an author is somebody in the creator table who happens to have written a book. This isn’t a serious issue now, but it will become so at some point: when I add movies to the table and have my entire book and movie collections indexed, then I’ll want a single page for Hayao Miyazaki that refers to both his movies and the Nausicaa mangas.
If I really believed in that whole-heartedly, then I would probably also include video game developers in the creator table. And I will, again, have to deal with this eventually: I have both the game Katamari Damacy and its soundtrack CD, and I might want to list Namco as a creator for both of them. For now, though, I’ll stick video game developers in a separate table, and refactor as necessary when it comes to that.
Enough about authors; let’s move on to books. Books can have a title, an author, a last read date, a rating, an ISBN, and whether or not I own them. So we’ll have a table whose first column is the ID of the book. The second and third columns will be the title of the book, the third column being either the, a, an, or NULL. (That will help me generate the sorted list of books.) The fourth column will be the author’s ID. The other columns are pretty obvious.
Series will be similar, but will only go as far as title/author.
Ah, but what about volumes? Do I create a separate table for volumes, or do I add a series ID and volume number to the book table? If they’re in separate tables, then the tables will look awfully similar. And I never want to search only for, say, books that aren’t part of any series. And I should be able to go in either direction: if I have a combined table, then I can generate the volume and non-volume tables by searching for entries whose series ID isn’t or is NULL. While I can go in the other direction by sticking a couple of extra columns onto the non-volume table (for series and volume number), sticking an extra column on the volume table (author ID, which is always derived from the series), and doing a union.
I think I’m leaning towards having just one table. But the latter construction does point out one way in which there is unnecessarily duplicated information: if the series ID is non-NULL then the volumes’ author ID should match the series’ author ID. I can live with that; actually, probably what I would do is that, if the series ID is non-NULL, then you’re allowed to have a NULL author ID. (Which will, admittedly, make the mapping of book/volume to author a bit more complicated than it would otherwise be.) Also, it would open up the door to series where different volumes have different authors; doubtless there are examples of that on my bookshelf somewhere. Or maybe not; I’ve never been that into shared-world series, which is the obvious example. (There are certainly non-fiction examples where my current notion of series will have to be rethought, e.g. Springer GTM’s.)
Looking at all that info, I could still go either way. What do other people think, and why? This is clearly a situation where I could use help from people with more database design expertise.
Once I’ve got those in place (either as one or two tables; I can break them up or put them together if necessary anyways), can I generate a sorted list of books? That shouldn’t be too bad; I can create a four-column table with ID, name body, name prefix, volume suffix, and sort it using the last three columns.
Series should be pretty similar to books. And I don’t think video game systems, video game developers, and video games add much in the way of additional complication. (Other than the aforementioned issue of whether developers and authors should be lumped together as creators.)
I guess one further issue would be whether I should try to extract out the common fields in books and video games to a separate table. (Last read/played date, rating, whether or not I own it, ISBN/ASIN.) This is represented by a common class in the Java code; should the database match the code in that respect? I can’t think of any obvious reason to do so; for example, I can’t think of any situation when I would want to search that joint table. (I don’t, for example, want to produce a “media recently consumed” table. Though, now that I think about it, doing so could be sort of interesting. Hmm.) I guess I’ll just have similar fields in the book table and the video game table, then.
And how long should my strings be? When programming, I don’t like having arbitrary size limits in my code. But that’s not an option in SQL: I have to decide if I want the length of my strings to be 8 bits, 16 bits, 24 bits, 32 bits, or what. (64 bits!) 16 bits is certaintly enough, but I can’t think of any books I own with more than 255 characters in their name. So maybe I’ll go with 8 bits, and feel a little dirty.
Phew; that was a lot of typing. Congratulations to anybody who has read this far, and please give me any suggestions you might have. No wonder I ran into programmer’s block, if just doing a rough draft of the database schema took this much time. And I still have book exercises to work through; I’m not sure if I’ll finish all of those today (I don’t plan to spend all day typing, after all), but I should at least get started. I should also add investigating JDBC to the list somewhere.
Random question: do people say “database schemas” or “database schemata” more frequently? Google suggests the former. Is there any distinction between people who say one and people who say the other? I guess I don’t have to decide which camp I’m in until I’ve written two of the beasts…
There are no revisions for this post.