At work, our service has been in a pre-alpha mode for the last month or so. Which has been a fascinating experience, one that I’ve never been so heavily involved in before: in the StreamStar group, we were selling a product rather than running a service, so there wasn’t the same visibility into how well it was working at any given moment, and at Playdom, I joined the team for a game that had been around for more than long enough to be past the scaling issues. Whereas with the product I’m working on now, we’re feeling out those early issues. In general, we’re staying ahead of scaling problems, but we’re also trying to be very heads-up on alerts so we fix things before anybody else notices them.
Which means that people sometimes get woken up in the middle of the night. Initially, it was the same couple of people, but that’s obviously completely unfair (and goes against the Whole Team philosophy that I prefer), so now we’re rotating that job among all the developers. I had the pleasure of being on call a couple of weeks ago, and it was a much more interesting experience than I expected.
And yes, part of that interesting experience involved staying up later on a couple of evenings than I would have liked, and being woken up a couple of times. The problems there happened to involve a piece of code that I was relatively familiar with. In fact, during that week I’d already been experimenting in test deployments to see where and how it would fall over; I just didn’t manage to get far enough ahead of production scaling to give us quite enough breathing room! But at least it meant that I had some idea of what to look for and what to do, which certainly helped. (And don’t get me wrong, even when I was the primary person on call, a lot of other people were ready and eager to help as necessary, and somebody else implemented the initial stability fix.)
It turned out, though, that being the primary person on call when issues like that come up has an interesting effect on my brain. In the past, when people had talked to me about issues that had come up when they were on call, I was happy to help fix things, but my brain was treating the problem somewhat abstractly and as a distraction from what I’d been planning to do that day. Whereas running into problems when I’m on call rather quickly puts my brain into problem-solving mode, with the result that when I showed up at work the next day I knew exactly what I wanted to be working on, what steps I wanted to take, and where I needed to advocate for further work. So the upshot was that I felt a lot more focused for the rest of the week (well, when I wasn’t feeling sleep-deprived!), and it carried over to the next couple of weeks: I can program pretty effectively when I’m trying to answer the question of “what can I do to prevent other people from being woken up for similar reasons?”.
I am, of course, hoping not to always have to keep that motivation in the front of my mind when I’m programming. (And fully expecting that to be the case: we’re doing a very good job of leaping on small problems now and trying to take a systemic approach to solving them instead of a patchwork approach.) But it’s a helpful reminder of the virtues of Whole Team: exposing everybody to problems means that everybody has a chance to really feel their effects and get their brains directly interested in solving those problems.
And, as a side effect, it means that non-bug work can go a lot faster, too: in particular, it’s been great this week to have much faster startup speed for the component involved in the aforementioned issue, because that made it so much easier to run experiments. Goodness all around.