The Case Against Bandit Testing
January 24th, 2013
Many have asked me if Etsy does bandit testing. The short answer is that we don't, and as far as I know nobody is seriously considering changing that anytime soon. This has come up often enough that I should write down my reasoning.
First, let me be explicit about terminology. When we do tests at Etsy, they work like this:
- We have a fixed number of treatments that might be shown.
- We assign the weights of the treatments at the outset of the test, and we don't change them.
- We pick a sample size ahead of time that makes us likely to notice differences of consequential magnitude.
In addressing "bandit testing," I'm referring to any strategy that might involve adaptively re-weighting an ongoing test or keeping experiments running for indefinitely long periods of time.
Noel Welsh at Untyped has written a high-level overview of bandit testing. It's a reasonable introduction to the concept and the problems it addresses, although I view the benefits as more elusive than it presents. "It is well known in the academic community that A/B testing is significantly sub-optimal," it says, and I have no reason to doubt that this is true. But as I hope to explain, the domain in which this definition of "sub-optimal" applies is narrowly constrained.
Gremlins: The Ancient Enemy
At Etsy, we practice continuous deployment. We don't do releases in the classical sense of the word. Instead, we push code live a few dozen lines at a time. When we build a replacement for something, it lives beside its antecedent in production code until it's finished. And when the replacement is ready, we flip a switch to make it live. Cutting and pasting an entire class, making some small modifications, and then ramping it up is not unheard of at all. Actually, it's standard practice, the aesthetics of the situation be damned.
This methodology is occasionally attacked for its tendency to leave bits of dead code lying around. I think that this criticism is unfair. We do eventually excise dead code, thank you. And all other methods of operating a consumer website are inferior. That said, if you twist my arm and promise to quote me anonymously I will concede that yes, we do have a pretty epic pile of dead code at this point. I'm fine with this, but it's there.
My experience here has revealed what I take to be a fundamental law of nature. Given time, the code in the "off" branch no longer works. Errors in a feature ramped up to small percentages of traffic also have a way of passing unnoticed. For practitioners of continuous deployment, production traffic is the lifeblood of working code. Its denial is quickly mortal.
This relates to the discussion at hand in that bandit testing will ramp the losers of experiments down on its own, and keep them around at low volume indefinitely. The end result is a philosophical conundrum, of sorts. Are the losers of experiments losing because they are broken, or are they broken because they are losing?
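As a rough illustration of that ramp-down, here is a small simulation of epsilon-greedy, one of the simplest bandit strategies. The choice of algorithm, the conversion rates, and the 10% exploration rate are all assumptions made for the sake of the example; the point is only that a losing arm keeps receiving a thin trickle of traffic for as long as the test runs.

```python
import random


def epsilon_greedy_traffic(true_rates, visitors, epsilon=0.1, seed=0):
    """Simulate epsilon-greedy and return how many visitors each arm was shown to."""
    rng = random.Random(seed)
    shown = [0] * len(true_rates)
    converted = [0] * len(true_rates)
    for _ in range(visitors):
        # Explore at random with probability epsilon, or until every arm has data.
        if rng.random() < epsilon or 0 in shown:
            arm = rng.randrange(len(true_rates))
        else:
            # Otherwise exploit the arm with the best observed conversion rate.
            arm = max(range(len(true_rates)), key=lambda i: converted[i] / shown[i])
        shown[arm] += 1
        if rng.random() < true_rates[arm]:
            converted[arm] += 1
    return shown


if __name__ == "__main__":
    # Two hypothetical treatments converting at 5% and 3%.
    print(epsilon_greedy_traffic([0.05, 0.03], visitors=100_000))
```

Whichever arm falls behind still gets about half of the exploration traffic, which is around 5% of visitors in this setup. That is enough to keep the code path alive, and quite possibly too little for anyone to notice when it breaks.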
The beauty of Etsy's A/B testing infrastructure lies in its simplicity.
- Experiments are initiated with minimal modifications of our config file.
- Visitors are bucketed based on a single, fixed-width value in a persistent cookie.
One of the advantages of this parsimony is that new tests are "free," at least in the engineering sense of the word. They're not free if we are measuring the mass of their combined cognitive overhead. But they are free in that there are no capacity implications of running even hundreds of experiments at once. This is an ideal setup for those of us who maintain that the measurement of our releases ought to be the norm.
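A minimal sketch of what stateless bucketing of this kind could look like follows. Everything in it is hypothetical: the function name, the use of SHA-256, and the experiment and treatment names are assumptions for illustration, not Etsy's implementation. What it shows is that membership can be computed from the cookie and the configured weights alone, with nothing stored on the server.

```python
import hashlib


def assign_treatment(cookie_value, experiment, weights):
    """Map a visitor's persistent cookie value to a treatment.

    weights is a list of (treatment_name, weight) pairs whose weights sum to 1.0.
    """
    # Hash the experiment name with the cookie so experiments bucket independently.
    digest = hashlib.sha256(f"{experiment}:{cookie_value}".encode("utf-8")).digest()
    # Turn the first 8 bytes into a stable point in [0, 1).
    point = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for name, weight in weights:
        cumulative += weight
        if point < cumulative:
            return name
    # Guard against floating-point rounding when the weights sum to just under 1.0.
    return weights[-1][0]


if __name__ == "__main__":
    weights = [("control", 0.5), ("new_header", 0.5)]
    print(assign_treatment("a1b2c3d4", "header_redesign", weights))
```

Because the assignment is a pure function of the cookie and the config, every web server reaches the same answer on every request without coordinating, and adding another experiment costs one more hash.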
Bandit testing upsets this situation in an insidious way. As I explained above, once we weight our tests we don't tweak the proportions later. The reason for this is to maintain the consistency of what visitors are seeing.
Imagine the flow of traffic immediately before and after the initiation of an experiment on Etsy's header. For visitors destined for the new treatment, at first the header looks as it has for several years. Then in their next request, it changes without warning. Should we attribute the behavior of that visitor to the old header or to the new one? Reconciling this is difficult, and in our case we dodge it by throwing out visits that have switched buckets. (We are not even this precise. We just throw out data for the entire day if it's the start of the experiment.)
Bandit testing, in adjusting weights much more aggressively, exacerbates this issue. We would be forced to deal with it in one way or another.
We could try to establish a rule for what to do with visits that see inconsistent behavior. A universally-applicable heuristic for this is not straightforward. And even if feasible, this approach would necessitate making the analysis more complicated. Increasing complexity in analysis increases the likelihood of it being incorrect.
We could continue to ignore visits that see inconsistent behavior. Depending on specifics, this could discard a large amount of data. This decreases the power of the experiment, and undermines its ability to reach a correct conclusion.
We could attempt to ensure that visits only ever see one treatment, while re-weighting the test for fresh visitors. This sounds like a great idea, but ruins the notion of tests as an operational free lunch. Test variant membership, for Etsy, is independent across web requests. Introducing dependence brings tradeoffs that developers should be familiar with. We could keep test membership in a larger cookie, but if the cookie gets too large it will increase the number of packets necessary for user requests. We could record test membership on the server, but we would have to build, maintain, and scale that infrastructure. And every time we added an experiment, we would have to ask ourselves if it was really worth the overhead.
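To put a number on the consistency problem, here is a small sketch of what happens to existing visitors when a stateless test is re-weighted. It assumes a hypothetical scheme in which each visitor's cookie maps to a fixed point in [0, 1) and treatments occupy contiguous ranges of that interval in proportion to their weights. The names and weights are made up for illustration.

```python
def bucket(point, weights):
    """Return the treatment whose weight range contains the visitor's fixed point."""
    cumulative = 0.0
    for name, weight in weights:
        cumulative += weight
        if point < cumulative:
            return name
    # Guard against floating-point rounding when the weights sum to just under 1.0.
    return weights[-1][0]


def fraction_switched(old_weights, new_weights, visitors=10_000):
    """Fraction of evenly spread visitors whose treatment changes after re-weighting."""
    points = [i / visitors for i in range(visitors)]
    switched = sum(1 for p in points if bucket(p, old_weights) != bucket(p, new_weights))
    return switched / visitors


if __name__ == "__main__":
    before = [("control", 0.5), ("new_header", 0.5)]
    after = [("control", 0.9), ("new_header", 0.1)]
    print(fraction_switched(before, after))
```

Ramping the new treatment down from 50% to 10% in this scheme moves about 40% of all visitors from one experience to the other. A bandit that re-weights repeatedly does this over and over, so either those visits are reconciled somehow, or they are discarded, or membership has to be stored somewhere.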
On the Ridiculous Expectations of Runaway Victory
When we release any new feature, it is our hope that it will be a gigantic and undeniable success. Sadly (and as I have discussed at length before), this is practically never what happens. Successful launches are almost always characterized by an incremental improvement in some metric other than purchases or registrations.
Wins in terms of purchases do happen occasionally, and they make life considerably more bearable when they do. But they're exceedingly rare. What is not rare is the experience of releasing something that makes purchase conversion worse. This turns out to be very easy in an annoyingly asymmetrical way.
What we are usually aiming for with our releases is tactical progress on our longer-term strategic goals. Modest gains or even just "not extremely broken" is what we can rationally hope for. Given this background, bandit testing would be wildly inappropriate.
Regret Approaches Zero
Let me point out something that may not be obvious: when we test features on Etsy, we are not typically testing the equivalent of banner advertisements with a limited shelf life. Not that I am suggesting that there is anything wrong with doing so. Nor do I think this is the only scenario in which bandit testing is called for.
But new features and the redesign of existing features are different in several important ways. The unlikelihood of purchase or registration conversion wins means that "regret" in the vernacular sense is minimal to begin with, obviating the need for an algorithm that minimizes regret in the technical sense. And the fact that we are building features for the longer term implies that any regret accumulated during the course of an experiment is minor from the perspective of all history. From this vantage point, the elegant simplicity of not bandit testing wins out.
Is bandit testing right for you? I believe it is a question worth asking. It may be the case that you should (to borrow Noel's imagery) "join their merry band." And if so, master, be one of them; it is an honourable kind of thievery.
In the absence of practical constraints, I have no argument against this. But reality is never lacking in practical constraints.