I’ve been falling behind in blogging here, but I did write up a note last week on the Sumo Logic blog about something I recently ran into while programming, and Sumo has kindly allowed me to publish a copy here as well.


When Sumo Logic receives metrics data, we put those metrics data points into a Kafka queue for processing. To help us distribute the load, that Kafka queue is broken up into multiple Kafka topic partitions; we therefore have to decide which partition is appropriate for a given metrics data point. Our logic for doing that has evolved over the last year in a way that spread the decision out over a few different classes; I thought it was time to put it all in one place.

My initial version had an interface like this:

def partitionFor(metricDefinition: MetricDefinition): TopicPartition

As I started filling out the implementation, though, I began to feel a little bit uncomfortable. The first twinge came when I was calculating which branch to go down in one of the methods: normally I try to focus on clarity when writing code, but at the volumes of data that Sumo Logic has to process, you have to keep efficiency in mind for code that is evaluated on every single data point. And I couldn’t convince myself that one particular calculation was fast enough to perform on every data point, given that its inputs didn’t actually depend on the specific data point.


So I switched over to a batch interface, pulling that potentially expensive branch calculation out to the batch level:

class KafkaPartitionSelector {
  def partitionForBatch(metricDefinitions: Seq[MetricDefinition]):
      Seq[TopicPartition] = {
    val perMetric = calculateWhetherToPartitionPerMetric()
    metricDefinitions.map {
      metric => partitionFor(metric, perMetric)
    }
  }

  private def partitionFor(metricDefinition: MetricDefinition,
                           perMetric: Boolean): TopicPartition = {
    if (perMetric) {
      ...
    } else {
      ...
    }
  }
}

That reduced the calculation in question from once per data point to once per batch, getting me past that first problem. But then I ran into a second such calculation that I needed, and a little after that I saw a call that could potentially translate into a network call; I didn’t want to do either of those on every data point! (The results of the network call are cached most of the time, but still.) I thought about adding them as arguments to partitionFor() and to the methods that partitionFor() calls, but passing around three separate arguments would make the code pretty messy.


To solve this, I reached a little further into my bag of tricks: this calls for a Method Object. Method Object is a design pattern that you can use when you have a method that calls a bunch of other methods and needs to pass the same values over and over down the method chain: instead of passing the values as arguments, you create a separate object whose member variables are the values that are needed in lots of places and whose methods are the original methods you want. That way, you can break your implementation up into methods with small, clean signatures, because the values that are needed everywhere are accessed transparently as member variables.
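To make the shape of that concrete, here’s a minimal sketch of the general pattern; the types and names are ones I’ve made up purely for illustration, not taken from the partitioning code:

// A self-contained sketch of Method Object, with made-up types.
object MethodObjectSketch {
  case class Document(title: String, body: String)
  case class Theme(accent: String)

  // Before: the same two values get threaded through every helper.
  def render(doc: Document, theme: Theme): String =
    renderHeader(doc, theme) + renderBody(doc, theme)
  private def renderHeader(doc: Document, theme: Theme): String =
    s"[${theme.accent}] ${doc.title}\n"
  private def renderBody(doc: Document, theme: Theme): String =
    doc.body

  // After: a Method Object holds the shared values as member variables,
  // so the individual methods get small, clean signatures.
  class Renderer(doc: Document, theme: Theme) {
    def render(): String = renderHeader() + renderBody()
    private def renderHeader(): String = s"[${theme.accent}] ${doc.title}\n"
    private def renderBody(): String = doc.body
  }
}

The caller writes new Renderer(doc, theme).render() instead of render(doc, theme), and none of the internal methods has to mention doc or theme in its signature.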

In this specific instance, the object I extracted had a slightly different flavor, so I’ll call it a “Batch Method Object”: if you’re performing a calculation over a batch, if every evaluation needs the same data, and if evaluating that data is expensive, then create an object whose member variables hold the data that’s shared by every evaluation in the batch, computed once per batch. With that, the implementation became:

class KafkaPartitionSelector {
  def partitionForBatch(metricDefinitions: Seq[MetricDefinition]):
      Seq[TopicPartition] = {
    val batchPartitionSelector = new BatchPartitionSelector
    metricDefinitions.map(batchPartitionSelector.partitionFor)
  }

  private class BatchPartitionSelector {
    private val perMetric = calculateWhetherToPartitionPerMetric()
    private val nextExpensiveCalculation = ...
    ...

    def partitionFor(metricDefinition: MetricDefinition): TopicPartition = {
      if (perMetric) {
        ...
      } else {
        ...
      }
    }

    ...
  }
}

One question that came up while doing this transformation was whether every single member variable in BatchPartitionSelector was going to be needed in every batch, no matter which path I went down. (Which was a potential concern, because they would all be initialized at BatchPartitionSelector creation time, every time this code processes a batch.) I looked at the paths and checked that most of the variables were used no matter the path, but there was one that only mattered on some of the paths. This gave me a tradeoff: should I wastefully evaluate all of them anyway, or should I mark that last one as lazy? I decided to go the route of evaluating all of them, because lazy variables are a little conceptually messy and they introduce locking behind the scenes, which has its own efficiency cost: those downsides seemed to me to outweigh the cost of doing the evaluation in question once per batch. If the potentially-unneeded evaluation had been more expensive (e.g. if it had involved a network call), however, then I would have made it lazy instead.
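For reference, the two options look roughly like this in Scala; the names here are made up, and it’s just a sketch of the tradeoff rather than the real selector:

// A standalone sketch of eager vs. lazy member initialization.
object LazyValSketch {
  def somewhatExpensiveCalculation(): Int = 42  // stand-in for the real work

  class EagerVersion {
    // Evaluated at construction time, once per batch, whether or not this
    // particular batch goes down a path that needs it.
    private val onlyNeededOnSomePaths = somewhatExpensiveCalculation()

    def usesIt: Int = onlyNeededOnSomePaths
  }

  class LazyVersion {
    // Evaluated at most once, and only if some path actually touches it; but
    // in Scala 2 the first access goes through synchronization, which has its
    // own cost on a hot path.
    private lazy val onlyNeededOnSomePaths = somewhatExpensiveCalculation()

    def usesIt: Int = onlyNeededOnSomePaths
  }
}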


The moral is: keep Method Object (and this Batch Method Object variant) in mind. It’s pretty rare that you need it, but in the right circumstances it really can make your code a lot cleaner.

Or, alternatively: don’t keep it in mind. Because you can actually deduce Method Object from more basic, more fundamental OO principles. Let’s do a thought experiment where I’ve gone down the route of performing shared calculations once at the batch level and then passing them down through various methods in the implementation: what would that look like? The code would have a bunch of methods that share the same three or four parameters (and there would, of course, be additional parameters specific to the individual methods). But whenever you see the same few pieces of data referenced or passed around together, that’s a smell that suggests that you want to introduce an object that has those pieces of data as member variables.
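Concretely, that pass-everything-down version might have looked something like the following sketch. To keep it self-contained I’ve made up the types and the partitioning arithmetic; perMetric corresponds to the real calculation from above, while partitionCount and routingTable are just placeholders for the other two shared values:

// A made-up sketch of the version that passes the shared values down the chain.
object PassEverythingDownSketch {
  case class MetricDefinition(name: String)
  case class TopicPartition(partition: Int)

  // Every helper in the chain takes the same shared values,
  // plus whatever arguments are specific to it.
  def partitionFor(metric: MetricDefinition,
                   perMetric: Boolean,
                   partitionCount: Int,
                   routingTable: Map[String, Int]): TopicPartition =
    if (perMetric) perMetricPartition(metric, partitionCount, routingTable)
    else defaultPartition(metric, partitionCount)

  private def perMetricPartition(metric: MetricDefinition,
                                 partitionCount: Int,
                                 routingTable: Map[String, Int]): TopicPartition =
    TopicPartition(routingTable.getOrElse(
      metric.name, Math.floorMod(metric.name.hashCode, partitionCount)))

  private def defaultPartition(metric: MetricDefinition,
                               partitionCount: Int): TopicPartition =
    TopicPartition(Math.floorMod(metric.name.hashCode, partitionCount))
}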

If we follow that route, we’d apply Introduce Parameter Object to create a new class that you pass around, called something like BatchParameters. That helps, because instead of passing the same three arguments everywhere, we’re only passing one argument everywhere. (Incidentally, if you’re looking for rules of thumb: in really well factored code, methods generally only take at most two arguments. It’s not a universal rule, but if you find yourself writing methods with lots of arguments, ask yourself what you could do to shrink the argument lists.) But then that raises another smell: we’re passing the same argument everywhere! And when you have a bunch of methods called in close proximity that all take exactly the same object as one of their parameters (not just an object of the same type, but literally the same object), frequently that’s a sign that the methods in question should actually be methods on the object that’s a parameter. (Another way to think of this: you should still be passing around that same object as a parameter, but the parameter should be called this and should be hidden from you by the compiler!)
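Applying Introduce Parameter Object to that sketch gives something like this (BatchParameters is the hypothetical name from above; the stand-in types come from the previous sketch):

object ParameterObjectSketch {
  import PassEverythingDownSketch.{MetricDefinition, TopicPartition}

  // The shared values now travel together as a single argument.
  case class BatchParameters(perMetric: Boolean,
                             partitionCount: Int,
                             routingTable: Map[String, Int])

  def partitionFor(metric: MetricDefinition,
                   params: BatchParameters): TopicPartition =
    if (params.perMetric) perMetricPartition(metric, params)
    else defaultPartition(metric, params)

  private def perMetricPartition(metric: MetricDefinition,
                                 params: BatchParameters): TopicPartition =
    TopicPartition(params.routingTable.getOrElse(
      metric.name, Math.floorMod(metric.name.hashCode, params.partitionCount)))

  private def defaultPartition(metric: MetricDefinition,
                               params: BatchParameters): TopicPartition =
    TopicPartition(Math.floorMod(metric.name.hashCode, params.partitionCount))
}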

And if you do that (I guess Move Method is the relevant term here?), moving the methods in question to BatchParameters, then BatchParameters becomes exactly the BatchPartitionSelector class from my example.
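Continuing the same sketch, that Move Method step looks like this; at that point BatchParameters is a batch method object in all but name:

object MoveMethodSketch {
  import PassEverythingDownSketch.{MetricDefinition, TopicPartition}

  // The former free-standing functions are now methods, so the shared values
  // are plain member accesses (the hidden `this` parameter from above).
  case class BatchParameters(perMetric: Boolean,
                             partitionCount: Int,
                             routingTable: Map[String, Int]) {
    def partitionFor(metric: MetricDefinition): TopicPartition =
      if (perMetric) perMetricPartition(metric)
      else defaultPartition(metric)

    private def perMetricPartition(metric: MetricDefinition): TopicPartition =
      TopicPartition(routingTable.getOrElse(
        metric.name, Math.floorMod(metric.name.hashCode, partitionCount)))

    private def defaultPartition(metric: MetricDefinition): TopicPartition =
      TopicPartition(Math.floorMod(metric.name.hashCode, partitionCount))
  }
}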


So yeah, Method Object is great. But more fundamental principles like “group data used together into an object” and “turn repeated function calls with a shared parameter into methods on that shared parameter” are even better.

And what’s even better than that is to remember Kent Beck’s four rules of simple design: those latter two principles are both themselves instances of Beck’s “No Duplication” rule. You just have to train your eyes to see duplication in its many forms.
