(This post was informed by a session at Agile Open California; many thanks to Llewellyn Falco and Matthew Carlson for their discussion and suggestions. But if there’s something in here that sounds wrong to you, blame me, not them!)

 

Agile software development has always had a strong focus on your software being correct. At the code level, Test-Driven Development requires you to build a backstop that verifies that your code does what the programmer intends it to do; at the level of user interaction, the Customer has the end-user expertise to say how the software should behave, and they collaborate with programmers to develop acceptance tests to ensure that the software does in fact behave that way. These techniques are incredibly powerful by themselves, and even more powerful when combined with other Extreme Programming techniques; they provide a very solid defense against large classes of bugs.

In fact, there are a lot of bugs where I will say that, if you’ve written that bug, you’re doing XP wrong: TDD plus pair programming plus simple design should have stopped you. But, over the course of the 2000s, faith in the Customer’s ability to predict the right behavior lessened significantly: A/B tests, in particular, became a standard part of the design toolkit, accepting that we won’t always be able to predict in advance what sort of behavior will best meet our needs.

Of course, XP (and other agile methodologies) has many practices that actively work to support A/B testing and related discovery mechanisms. Agile methodologies explicitly don’t say that they can predict the future indefinitely: instead, they tell you to work incrementally, in as small slices as possible, and one of the reasons for that is to give you as many situations as possible where you can try out working software in order to figure out what to do next. (That’s the whole “Responding to change over following a plan” part of the manifesto, coupled with “Working software over comprehensive documentation”.) So A/B tests slotted naturally into agile shops; I maintain, nonetheless, that A/B testing embraces uncertainty about what it means for behavior to be correct in a way that TDD does not.

Uncertainty about how humans will react to a design is one thing, though; uncertainty about how computers will react to code is another. And I certainly would not in general recommend throwing some code out there and running an A/B test to see if that code works! Having said that, TDD is better at functional requirements than at non-functional requirements; you can bridge some of that gap with forms of testing other than unit testing (e.g. performance tests, though those in turn can be in conflict with short continuous integration cycle times), but over the last few years, I’ve run into some situations where even that starts to break down.

 

For example, late last year I was working on some code to partition an incoming workload across servers. We wanted to do this in a way that colocated the workload coming from a given customer onto as few servers as possible, while avoiding overloading individual servers (so that we could keep up with the incoming data stream in real time), and in a way that supported autoscaling our servers up and down. I can come up with an algorithm for doing this that makes sense to me, and I can implement that algorithm in a TDD fashion; I’m a little embarrassed to say that I wrote a couple of bugs that slipped through to integration testing, but that’s not a problem with TDD: those bugs were directly attributable to my not paying enough attention to the warnings that the code was giving me that I wasn’t following simple design.
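
To make the shape of that problem concrete, here is a minimal, hypothetical sketch of the kind of greedy assignment heuristic I have in mind; it is not the actual algorithm from that project, and the server model, load estimates, and headroom parameter are all invented for illustration.

```python
# Hypothetical sketch of a greedy partition-assignment heuristic; the real
# algorithm, data structures, and tuning parameters were more involved.

from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    capacity: float               # maximum load the server should accept
    load: float = 0.0             # current estimated load
    customers: set = field(default_factory=set)


def assign_partition(servers, customer_id, partition_load, headroom=0.8):
    """Pick a server for one unit of a customer's workload.

    Prefer servers already serving this customer (to colocate), but never
    push a server past `headroom * capacity`; otherwise fall back to the
    least-loaded server. Returns None if every server is too full, which a
    caller might treat as a signal to scale up.
    """
    def fits(s):
        return s.load + partition_load <= headroom * s.capacity

    # First choice: a server that already holds this customer's workload.
    colocated = [s for s in servers if customer_id in s.customers and fits(s)]
    # Second choice: any server with enough spare capacity.
    candidates = colocated or [s for s in servers if fits(s)]
    if not candidates:
        return None  # autoscaling trigger: no server has room

    chosen = min(candidates, key=lambda s: s.load / s.capacity)
    chosen.load += partition_load
    chosen.customers.add(customer_id)
    return chosen
```

Even a toy like this surfaces the tuning questions: what headroom fraction is safe, and how do you estimate a partition’s load before you have watched it run?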

But the bigger question is whether that algorithm would actually lead to the desired system behavior: will servers handle the partitioned workload acceptably in practice, and, if they do, what are the best choices for the tuning parameters that the algorithm depends on? Will we keep up with the incoming data load under normal conditions? What will happen if a trial customer decides to load test us by suddenly switching from sending us data at a 15 GB/day rate to sending us data at a 5 TB/day rate? (Incidentally, the desired behavior there isn’t completely clear: in some situations customers want us to absorb all the data, while in others they want us to throttle the data so they don’t get overage charges.) What will happen if some of the servers doing this processing go down, or if there are network problems? And you have to consider efficiency in all of these calculations: we expect to have to overprovision somewhat in order to provide enough of a buffer to handle load spikes, but underused servers cost money.

You can, of course, simulate these scenarios. You can try simulating by sending inputs just to the partitioner component; if you do that, though, you have to somehow model how changes in the partitioner outputs will feed back into the partitioner’s input (in particular, CPU usage), and that modeling might be inaccurate. So you may instead decide to launch the entire system, with a simulated workload; that still requires you to model the variations in the customer’s data load, though, and that model in turn might be inaccurate. And there’s still the question of servers, networks, etc. going down; Chaos Monkey techniques are a good answer there, but, again, how much confidence do you have that the Chaos Monkey is going to hit all of your corner cases?
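
As a small illustration of where those modeling assumptions hide, here is a toy workload generator of the kind you might feed into a whole-system simulation; the rates, the spike shape, and the noise model are all made up, and they are exactly the kind of guesses that can turn out to be wrong.

```python
# Toy synthetic-workload generator; the numbers and the spike shape are
# invented for illustration, and are themselves a model that can be wrong.

import random


def customer_load_gb_per_hour(hour, base_gb_per_day=15.0,
                              spike_start=None, spike_gb_per_day=5000.0):
    """Return a noisy hourly load, optionally jumping to a load-test spike."""
    daily_rate = base_gb_per_day
    if spike_start is not None and hour >= spike_start:
        daily_rate = spike_gb_per_day       # trial customer turns on a firehose
    hourly = daily_rate / 24.0
    return hourly * random.uniform(0.8, 1.2)  # crude noise model


if __name__ == "__main__":
    # 48 simulated hours, with a sudden load test starting at hour 36.
    for hour in range(48):
        load = customer_load_gb_per_hour(hour, spike_start=36)
        print(f"hour {hour:2d}: {load:8.2f} GB/h")
```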

Ultimately: these are good testing techniques, but they take time, they take money, and they depend on models of non-local behavior in a way that means they don’t provide the same level of reliability in their results that TDD provides for the local behavior of code. When we were uncertain about how real humans would react to the visual/interactive design, one answer was to test it against real humans, potentially in production. So when is it responsible to test lower-level aspects of your code in production, in a way that is consistent with agile (though potentially looking more like modern agile than XP)? And, conversely, when does testing in production descend into cowboy coding?

 

Stepping back a bit, why do we test? Starting again with a TDD lens: TDD starts by forcing you to construct a precise hypothesis for what the desired behavior will be, and those precise hypotheses in turn support further experiments (refactoring, adding new functionality) that you hope won’t invalidate that behavior. But TDD also supports more imprecise feedback: for example, design feedback about how it feels to use your interfaces in practice. Also, precise questioning can be useful in situations outside of testing: in particular, numerically monitoring the state of your system (response times, queue sizes, etc.) can be crucial for understanding the behavior of running systems, as can health checks that fire off when numbers go outside of the desired range.
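
As a minimal sketch of what “numbers going outside of the desired range” can look like in code (the metric names and thresholds below are invented for illustration):

```python
# Hypothetical health-check sketch: compare a few observed metrics against
# desired ranges and report which ones are out of bounds.

from dataclasses import dataclass


@dataclass
class MetricCheck:
    name: str
    value: float
    low: float        # inclusive lower bound of the healthy range
    high: float       # inclusive upper bound of the healthy range

    def healthy(self):
        return self.low <= self.value <= self.high


def failing_checks(checks):
    """Return the checks whose values are outside their desired range."""
    return [c for c in checks if not c.healthy()]


if __name__ == "__main__":
    checks = [
        MetricCheck("p99_response_ms", value=850.0, low=0.0, high=500.0),
        MetricCheck("queue_depth", value=120.0, low=0.0, high=1000.0),
    ]
    for check in failing_checks(checks):
        # A real system would page or raise an alarm here instead of printing.
        print(f"ALARM: {check.name}={check.value} outside "
              f"[{check.low}, {check.high}]")
```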

TDD also teaches you to work in tiny increments: this helps you quickly detect when you’ve made a bad change (instead of having to wait for days or weeks), it helps you figure out quickly exactly where the bad change was (because only one thing changed, instead of lumping together dozens or hundreds of changes at once), it makes the rollback process easy (because you only have a little bit of code to revert), and there aren’t broader consequences (because you fix the failing test before you push your code). Also, stepping back a level, it teaches you to embrace the red: a failing test isn’t a catastrophe, and, in fact, it’s a sign that you have effective monitoring in place, that your safeguards are working as expected. (At least hopefully it is: it’s good to make sure that the test went red for the expected reason!)

 

Given the above, what would responsibly testing code in production look like? You should be able to quickly detect when you’ve made a bad change: so create some precise signals to monitor, and add alarms to let you know if you’ve gone outside of the desired parameters. If there is a problem, you should be able to figure out what code caused the problem: so continuously deploy your code, with each new push only introducing a minimal amount of change. And, if there is a problem, you want rollback to be easy, so design your deploy system with that in mind. (For example, one strategy is to deploy new code to new servers while leaving the old servers running, and monitor whether the new servers are behaving as well as the old servers; if not, redirect traffic back to the old servers.)
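
Here is a minimal sketch of the decision logic behind that last strategy, a canary-style comparison between the old and new fleets; the metric names and the tolerance are assumptions, and a real system would also account for sample sizes and noise.

```python
# Hypothetical canary-evaluation sketch: compare metrics from the new fleet
# against the old fleet and decide whether to keep going or roll back.

def canary_verdict(old_metrics, new_metrics, tolerance=0.10):
    """Return 'promote' if the new servers look no worse than the old ones.

    Both arguments map metric names (where lower is better, e.g. error rate
    or p99 latency) to observed values. The new fleet is allowed to be up to
    `tolerance` (10%) worse on any metric before we roll back.
    """
    for name, old_value in old_metrics.items():
        new_value = new_metrics.get(name)
        if new_value is None:
            return "rollback"          # missing data is treated as a failure
        if new_value > old_value * (1 + tolerance):
            return "rollback"
    return "promote"


if __name__ == "__main__":
    old = {"error_rate": 0.002, "p99_latency_ms": 310.0}
    new = {"error_rate": 0.009, "p99_latency_ms": 325.0}
    print(canary_verdict(old, new))    # -> "rollback" (error rate regressed)
```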

That isn’t the whole story, though: it’s great to be able to detect problems quickly and roll back, but, if users are affected by the problems in the meantime, that’s not enough. So you always need to keep this in mind while writing code: what are the consequences of a change in code going wrong? UI-only changes aren’t so bad: if the worst consequence of a change is that a flow of operations is a little more annoying to your users, then the downsides are temporary and recoverable, while the upside in terms of learning could be permanent. Having operations be a little bit slower is manageable as well: if you’re not sure how efficient an algorithm will be on real-world inputs, then it might well be beneficial to monitor it in production, being ready to flip the switch back if the slowdown exists and is noticeable. And slowdown that isn’t user-visible at all is even better: if you can get useful information from an experiment where the only potential downside is having to temporarily spend more on hardware resources, then go for it!
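
The “switch” there is just a feature flag plus a timing metric; a minimal sketch of the pattern might look like the following, where the flag store, metric recorder, and algorithm names are all hypothetical stand-ins.

```python
# Hypothetical feature-flag sketch: run the new algorithm behind a flag,
# time it, and keep the old path available so you can flip back quickly.

import time

FLAGS = {"use_new_algorithm": True}   # in practice, a remote flag service


def handle_request(items):
    if FLAGS["use_new_algorithm"]:
        start = time.monotonic()
        result = new_algorithm(items)
        elapsed_ms = (time.monotonic() - start) * 1000
        record_metric("new_algorithm_ms", elapsed_ms)   # watch this in prod
        return result
    return old_algorithm(items)


# Stand-ins so the sketch runs; the real implementations are whatever you
# are experimenting with.
def new_algorithm(items):
    return sorted(items)


def old_algorithm(items):
    return sorted(items)


def record_metric(name, value):
    print(f"metric {name}={value:.2f}")


if __name__ == "__main__":
    print(handle_request([3, 1, 2]))
```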

Data loss, however, is a completely different story. If code touches data, think about what extra testing you’ll need to do to get confidence before it runs in production, and think about your rollback steps: while discovering a data loss problem quickly and turning it off before it does any further harm is better than discovering it slowly, that’s cold comfort to people whose data was lost during the period while your code was in production. So think about worst-case scenarios: can an early stage in your pipeline store data someplace safe for disaster recovery purposes? Can you guarantee that data is stored redundantly by a downstream process before throwing away an upstream copy of the data? If you have to do data migrations, can you do double writes and double reads with consistency checks, enabling you to seamlessly switch back to the old store if you detect problems?
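
Here is a minimal sketch of that double-write/double-read shape; the store interfaces are invented, and a real migration would also need backfill of existing data, retry logic, and careful attention to write ordering.

```python
# Hypothetical dual-write/dual-read sketch for a data migration: the old
# store stays authoritative until the new store has proven itself.

def migrate_write(old_store, new_store, key, value, log_mismatch):
    """Write to both stores; the old store remains the source of truth."""
    old_store[key] = value
    try:
        new_store[key] = value
    except Exception as exc:            # never let the new store break writes
        log_mismatch(f"write to new store failed for {key}: {exc}")


def migrate_read(old_store, new_store, key, log_mismatch):
    """Read from both stores, compare, and serve the old store's answer."""
    old_value = old_store.get(key)
    new_value = new_store.get(key)
    if new_value != old_value:
        log_mismatch(f"mismatch for {key}: old={old_value!r} new={new_value!r}")
    return old_value                    # switch this only once mismatches stop


if __name__ == "__main__":
    old, new = {}, {}
    migrate_write(old, new, "user:42", {"name": "Ada"}, print)
    print(migrate_read(old, new, "user:42", print))
```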

And, of course, there are intermediate scenarios: while I said that latency might be acceptable, there’s a big difference between having an operation go from one second to five seconds versus having an operation go from one second to one hour. Also, keep in mind that not all problems are caused by code pushes: machines go bad, networks go down; even worse, machines can stay up while their disks stop working well, or networks can get partitioned.

 

Done right, though, testing in production can be actively beneficial. Returning to the manifesto, we prefer working software, we prefer responding to change: so embrace incremental development, and ask yourself: if you’re uncertain about the effect of a change, is production the best place to test that? Most of the time, the answer will be no, but sometimes the answer will be yes. So, if testing in production will responsibly and quickly give you feedback about a change, then go for it.

More subtly, though: testing in production helps you build antifragile systems. If your strategy is to always get things right, then you may be able to go a long way with that strategy, but if you fail, consequences can be catastrophic, because you won’t have practice dealing with problems. In contrast, if you build a system with production monitoring and testing in the forefront, then you’ll be able to quickly detect when things have gone wrong and you’ll have practice with remediation steps. (And, again, some causes of failure aren’t under your control at all, e.g. machines going bad, customer load spikes, malicious attacks on the system: practice those as well by artificially inducing those problems!)

When talking about this in my Agile Open California session, I led off with the question “how often should your pager go off on a well-functioning agile team?” One answer is “never”, and I’m pretty sympathetic to that answer, but I prefer a different one: your goal should be that your pager never goes off in response to circumstances that the customer notices, but that your pager should occasionally go off in response to quality degradation that the customer doesn’t notice. If you do that, you’ll have monitoring and alerting that notifies you of the problem, you’ll develop monitoring and logging to help you pinpoint the cause and procedures to get you back to a good state quickly, and you’ll learn what problems are frequent enough to deserve automated remediation.

And my belief is that, done correctly, a system that has gone through that will grow stronger through antifragility: you’ll be able to release software more quickly, you’ll be able to run experiments in the location that maximizes learning, and you’ll be able to do all of that while having your system become more and more robust over time.
