The Case Against Bandit Testing
January 24th, 2013

Many have asked me if Etsy does bandit testing. The short answer is that we don’t, and as far as I know nobody is seriously considering changing that anytime soon. This has come up often enough that I should write down my reasoning.

First, let me be explicit about terminology. When we do tests at Etsy, they work like this:

  • We have a fixed number of treatments that might be shown.
  • We assign the weighting of the treatments at the outset of the test, and we don’t change them.
  • We pick a sample size ahead of time that makes us likely to notice differences of consequential magnitude.

In addressing “bandit testing,” I’m referring to any strategy that might involve adaptively re-weighting an ongoing test or keeping experiments running for indefinitely long periods of time.
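
To make that last point concrete, here is a minimal sketch of the up-front arithmetic this implies. It is illustrative only: the function and the conversion rates are invented, and it is just the textbook two-proportion power calculation, not our actual tooling.

    import math
    from statistics import NormalDist

    def visitors_per_variant(p_control, p_treatment, alpha=0.05, power=0.8):
        """Rough visitors needed per variant to detect the difference between
        two conversion rates with a two-sided z-test at the given power."""
        z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
        z_power = NormalDist().inv_cdf(power)
        variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
        effect = abs(p_treatment - p_control)
        return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)

    # Invented numbers: detecting a lift from 1.0% to 1.1% conversion takes
    # on the order of 160,000 visitors per variant.
    print(visitors_per_variant(0.010, 0.011))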

Noel Welsh at Untyped has written a high-level overview of bandit testing, here. It’s a reasonable introduction to the concept and the problems it addresses, although I think the benefits are more elusive than it suggests. “It is well known in the academic community that A/B testing is significantly sub-optimal,” it says, and I have no reason to doubt that this is true. But as I hope to explain, the domain in which this sense of “sub-optimal” applies is narrowly constrained.
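
For readers who have not encountered one, the defining property of a bandit scheme is that the traffic allocation is itself a function of the results observed so far, with the aim of minimizing “regret”: the conversions forgone by showing anything other than the best treatment. A toy epsilon-greedy sketch, deliberately simplified and not anything we run:

    import random

    def epsilon_greedy(shows, successes, epsilon=0.1):
        """Usually pick the best-performing variant so far; occasionally
        explore a random one. The effective weights drift as data arrives,
        which is precisely the behavior the rest of this post objects to."""
        if random.random() < epsilon:
            return random.randrange(len(shows))
        rates = [s / n if n else 0.0 for s, n in zip(successes, shows)]
        return max(range(len(rates)), key=rates.__getitem__)

    # Two variants with nearly identical (and invented) conversion rates.
    shows, successes = [0, 0], [0, 0]
    true_rates = [0.010, 0.011]
    for _ in range(100_000):
        arm = epsilon_greedy(shows, successes)
        shows[arm] += 1
        successes[arm] += random.random() < true_rates[arm]
    print(shows, successes)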

Gremlins: The Ancient Enemy

At Etsy, we practice continuous deployment. We don’t do releases in the classical sense of the word. Instead, we push code live a few dozen lines at a time. When we build a replacement for something, it lives beside its antecedent in production code until it’s finished. And when the replacement is ready, we flip a switch to make it live. Cutting and pasting an entire class, making some small modifications, and then ramping it up is not unheard of at all. Actually, it’s standard practice, the aesthetics of the situation be damned.
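
As a rough sketch of what “flipping a switch” can look like (the flag names and ramp mechanics here are invented for illustration, not our actual config format):

    # Both implementations ship in production; a config flag decides which
    # one runs, and at what percentage of traffic.
    FEATURE_CONFIG = {
        "new_header": {"enabled": True, "ramp_percent": 5},
    }

    def feature_enabled(name, user_bucket):
        """user_bucket is a stable value in [0, 100) derived from the visitor."""
        flag = FEATURE_CONFIG.get(name, {})
        return flag.get("enabled", False) and user_bucket < flag.get("ramp_percent", 0)

    def render_old_header():
        return "<div>the header as it has looked for years</div>"

    def render_new_header():
        return "<div>the replacement</div>"

    def render_header(user_bucket):
        if feature_enabled("new_header", user_bucket):
            return render_new_header()
        return render_old_header()   # the antecedent, still live in the codebase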

This methodology is occasionally attacked for its tendency to leave bits of dead code lying around. I think that this criticism is unfair. We do eventually excise dead code, thank you. And all other methods of operating a consumer website are inferior. That said, if you twist my arm and promise to quote me anonymously I will concede that yes, we do have a pretty epic pile of dead code at this point. I’m fine with this, but it’s there.

Continuous delivery: artist's conception. Photo credit: Ben Pollard

My experience here has revealed what I take to be a fundamental law of nature. Given time, the code in the “off” branch no longer works. Errors in a feature ramped up to small percentages of traffic also have a way of passing unnoticed. For practitioners of continuous deployment, production traffic is the lifeblood of working code. Its denial is quickly mortal.

The Various Species of “Hacker News” Readers

This relates to the discussion at hand in that bandit testing will ramp the losers of experiments down on its own, and keep them around at low volume indefinitely. The end result is a philosophical conundrum, of sorts. Are the losers of experiments losing because they are broken, or are they broken because they are losing?

Accounting Irregularities

The beauty of Etsy’s A/B testing infrastructure lies in its simplicity.

  • Experiments are initiated with minimal modifications of our config file.
  • Visitors are bucketed based on a single, fixed-width value in a persistent cookie.
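
A minimal sketch of how that kind of bucketing can work, with illustrative details rather than our actual implementation: hash the stable cookie value and map it onto the fixed weights, so that the variant is a pure function of the two.

    import hashlib

    def choose_variant(cookie_value, test_name, weights):
        """Deterministically map a visitor onto a variant.

        weights is an ordered mapping of variant -> percentage. Because the
        assignment is a pure function of the cookie and the (fixed) weights,
        no server-side state is needed and every request agrees."""
        digest = hashlib.md5(f"{test_name}:{cookie_value}".encode()).hexdigest()
        bucket = int(digest[:8], 16) % 100
        threshold = 0
        for variant, percent in weights.items():
            threshold += percent
            if bucket < threshold:
                return variant
        return "control"

    # Hypothetical experiment: a 90/10 split that stays 90/10 for its whole life.
    print(choose_variant("a1b2c3d4", "header_redesign", {"control": 90, "new": 10}))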

One of the advantages of this parsimony is that new tests are “free,” at least in the engineering sense of the word. They’re not free if we are measuring the mass of their combined cognitive overhead. But they are free in that there are no capacity implications of running even hundreds of experiments at once. This is an ideal setup for those of us who maintain that the measurement of our releases ought to be the norm.

Bandit testing upsets this situation in an insidious way. As I explained above, once we weight our tests we don’t tweak the proportions later. The reason for this is to maintain the consistency of what visitors are seeing.

Imagine the flow of traffic immediately before and after the initiation of an experiment on Etsy’s header. For visitors destined for the new treatment, at first the header looks as it has for several years. Then in their next request, it changes without warning. Should we attribute the behavior of that visitor to the old header or to the new one? Reconciling this is difficult, and in our case we dodge it by throwing out visits that have switched buckets. (We are not even this precise. We just throw out data for the entire day if it’s the start of the experiment.)

Bandit testing, in adjusting weights much more aggressively, exacerbates this issue. We would be forced to deal with it in one way or another.

Balancing the minimization of regret with the maximization of test adoption in your organization: artist's conception.

We could try to establish a rule for what to do with visits that see inconsistent behavior. A universally applicable heuristic for this is not straightforward. And even if one were feasible, it would make the analysis more complicated, and more complicated analyses are more likely to be done incorrectly.

We could continue to ignore visits that see inconsistent behavior. Depending on specifics, this could discard a large amount of data. This decreases the power of the experiment, and undermines its ability to reach a correct conclusion.

We could attempt to ensure that visits only ever see one treatment, while re-weighting the test for fresh visitors. This sounds like a great idea, but ruins the notion of tests as an operational free lunch. Test variant membership, for Etsy, is independent across web requests. Introducing dependence brings tradeoffs that developers should be familiar with. We could keep test membership in a larger cookie, but if the cookie gets too large it will increase the number of packets necessary for user requests. We could record test membership on the server, but we would have to build, maintain, and scale that infrastructure. And every time we added an experiment, we would have to ask ourselves if it was really worth the overhead.
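
To make the cookie option concrete, here is a rough sketch of storing per-test membership client-side; the format is hypothetical, but it shows where the size pressure comes from.

    import json

    # Persist each visitor's assigned variant per experiment, so that
    # re-weighting for fresh visitors never flips an existing visitor.
    memberships = {
        "header_redesign": "new",
        "search_ranking_test": "control",
        # ...one entry for every live experiment this visitor has seen.
    }

    cookie_value = json.dumps(memberships, separators=(",", ":"))
    print(len(cookie_value), "bytes")
    # Every experiment adds bytes to every subsequent request. Once the
    # request headers outgrow a single TCP packet (roughly 1.5 KB), every
    # request needs an additional packet, which is the tradeoff described above.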

On the Ridiculous Expectations of Runaway Victory

When we release any new feature, it is our hope that it will be a gigantic and undeniable success. Sadly (and as I have discussed at length before), this is practically never what happens. Successful launches are almost always characterized by an incremental improvement in some metric other than purchases or registrations.

Wins in terms of purchases do happen occasionally, and they make life considerably more bearable when they do. But they’re exceedingly rare. What is not rare is the experience of releasing something that makes purchase conversion worse. This turns out to be very easy in an annoyingly asymmetrical way.

What we are usually aiming for with our releases is tactical progress on our longer-term strategic goals. Modest gains, or even just “not extremely broken,” are what we can rationally hope for. Bandit algorithms earn their advantage by shifting traffic toward a runaway winner; when no such winner exists, there is little for them to exploit. Given this background, bandit testing would be wildly inappropriate.

Regret Approaches Zero

Let me point out something that may not be obvious: when we test features on Etsy, we are not typically testing the equivalent of banner advertisements with a limited shelf life. Not that I am suggesting that there is anything wrong with doing so. Nor do I think this is the only scenario in which bandit testing is called for.

But new features and the redesign of existing features are different in several important ways. The unlikelihood of purchase or registration conversion wins means that “regret” in the vernacular sense is minimal to begin with, obviating the need for an algorithm that minimizes regret in the technical sense. And the fact that we are building features for the longer term implies that any regret accumulated during the course of an experiment is minor from the perspective of all history. From this vantage point, the elegant simplicity of forgoing bandit testing wins out.

In Closing

Is bandit testing right for you? I believe it is a question worth asking. It may be the case that you should (to borrow Noel’s imagery) “join their merry band.” And if so, master, be one of them; it is an honourable kind of thievery.

In the absence of practical constraints, I have no argument against this. But reality is never lacking in practical constraints.


Testing to Cull the Living Flower
January 16th, 2013

I was once oblivious to A/B testing.

My first several years out of college were spent building a financial data website. The product, and the company, were run by salesmen. Subscribers paid tens of thousands of dollars per seat to use our software. That entitled them to on-site training and, in some cases, direct input on product decisions. We did giant releases that often required years to complete, and one by one we were ground to bits by long stretches of hundred-hour weeks.

Whatever I might think of this as a worthwhile human endeavor generally, as a business model it was on solid footing. And experimental rigor belonged nowhere near it. For one thing, design was completely beside the point: in most cases the users and those making the purchasing decisions weren’t the same people. Purchases were determined by a comparison of our feature set to a competitor’s. The price point implied that in-person training would smooth over any usability issues. Eventually, I freaked out and moved to Brooklyn.

When I got to Etsy in 2007, experimentation wasn’t something that was done. Although I had some awareness that the consumer web is a different animal, the degree to which this is true was lost on me at the time. So when I found the development model to be the same, I wasn’t appropriately surprised. In retrospect, I still wouldn’t rank waterfall methodology (with its inherent lack of iteration and measurement) in the top twenty strangest things happening at Etsy in the early days. So it would be really out of place to fault anyone for it.

Here is an official video produced by Etsy a few months after your author started work.

So anyway, in my first few years at Etsy the releases went as follows. We would plan something ambitious. We’d spend a lot of time (generally way too long, but that’s another story) building that thing (or some random other thing; again, another story). Eventually it’d be released. We’d talk about the release in our all-hands meeting, at which point there would be applause. We’d move on to other things. Etsy would do generally well, more than doubling in sales year over year. And then after about two years or so we would turn off that feature. And nothing bad would happen.

Some discussion about why this was possible is warranted. The short answer is that this could happen because Etsy’s growth was an externality. This is still true today, in 2013. We have somewhere north of 800,000 sellers, thousands of whom are probably attending craft fairs as we speak and promoting themselves. And also, our site. We’re lucky, but any site experiencing growth is probably in a similar situation: there’s a core feature set that is working for you. Cool. This subsidizes anything else you wish to do, and if you aren’t thinking about things very hard you will attribute the growth to whatever you did most recently. It’s easy to declare yourself to be a genius in this situation and call it a day. The status quo in our working lives is to confuse effort with progress.

An illustration of the problem of product development at Etsy, or generally when growth is an externality. Success and failure both superficially resemble success.

But I had stuck around at Etsy long enough to see behind the curtain. Eventually, the support tickets for celebrated features would reach critical mass, and someone would try to figure out if they were even worth the time. For a shockingly large percentage, the answer to this was “no.” And usually, I had something to do with those features.

I had cut my teeth at one job that I considered to be meaningless. And although I viewed Etsy’s work as extremely meaningful, as I still do, I couldn’t suppress the idea that I wasn’t making the most of my labor. Even if the situation allowed for it, I did not want to be deluded about the importance and the effectiveness of my life’s work.

Measurement is the way out of this. When growth is an externality, controlled experiments are the only way to distinguish a good release from a bad one. But to measure is to risk overturning the apple cart: it introduces the possibility of work being acknowledged as a regrettable waste of time. (Some personalities you may encounter will not want to test purely for this reason. But not, in my experience, the kind of personalities that wind up being engineers.)

Through my own experimentation, I have uncovered a secret that makes this confrontation palatable. Here it is: nearly everything fails. As I have measured the features I’ve built, it’s been humbling to realize how rare it is for them to succeed on the first attempt. I strongly suspect that this experience is universal, but it is not universally recognized or acknowledged.

If someone claims success without measuring it in an experiment, odds are pretty good that they are mistaken. Experimentation is the only way to separate reality from the noise, and to learn. And the only way to make progress is to incorporate the presumption of failure into the process.

Don’t spend six months building something if you can divide it into smaller, measurable pieces. The six month version will probably fail. Because everything fails. When it does, you will have six months of changes to untangle if you want to determine which parts work and which parts don’t. Small steps that are validated not to fail and that build on one another are the best way, short of luck, to actually accomplish our highest ambitions.

To paraphrase Marx: the demand to give up illusions is the demand to give up the conditions that require illusions. I don’t ask people to test because I want them to see how badly they are failing. I ask them to test so that they can stop failing.


Yes! The Deploy Dashboard Graphs "Screwed Users."
January 11th, 2013

In my post about real-time analysis I shared a screenshot of part of Etsy’s deployment dashboard. This is the dashboard that every engineer watches as he or she pushes code to production. A bunch of alert readers noticed some odd things about it:

Strange doings on Etsy's deployment dashboard

The screenshot is not doctored, so yes we do graph “Three-Armed Sweaters” and “Screwed Users.” I can explain. In fact, I can give you excruciating detail about it, if you’re interested! Here goes.

“Three-Armed Sweaters” refers to our error pages, which feature one of my favorite drawings in the world. It was done by Anda Corrie:

Although the sweater was purely theoretical at first, real versions have since been commissioned. These are handed out yearly to the Etsy engineer who brings the site down in the most spectacular fashion.

So the graph on the dashboard is just counting the number of times this page is shown. But in order to reduce the frequency of false alarms, the graph is actually based on requests to an image beacon hidden on the page. This excludes most crawlers and vulnerability scanners. Those constituencies have a habit of generating thousands of errors when nothing is malfunctioning. But lucky for us, they almost never waste bandwidth on images.
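
A rough sketch of the mechanism, using the Python statsd client for illustration; the beacon path, metric name, and handler are invented rather than lifted from our code.

    # The error page includes a hidden image beacon, something like:
    #   <img src="/beacon/sweater" height="1" width="1" alt="">
    # Crawlers and vulnerability scanners rarely fetch images, so counting
    # beacon requests instead of page renders filters most of them out.

    from statsd import StatsClient   # pip install statsd

    statsd = StatsClient("localhost", 8125)

    def serve_sweater_beacon(environ, start_response):
        """Minimal WSGI handler for the beacon: count the hit, return nothing."""
        statsd.incr("errors.sweater")          # one increment per page view
        start_response("204 No Content", [])
        return [b""]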

Now, there are many reasons why Etsy might not be working, and they don’t all result in our machines serving a sweater page. If our CDN provider can’t reach our production network, it will show an error page of its own instead. In these cases, our infrastructure may not even be seeing the requests. But we can still graph these errors by situating their image beacon on a wholly separate set of web machines.

The “screwed users” graph is the union of all of these conditions. So-called, presumably, because all of this nuance is relatively meaningless to outsiders. “Screwed users” also attempts to only count unique visitors over a trailing interval. This has the nice property of causing the screwed users and sweaters graphs to diverge in the event that a single person is generating a lot of errors. The internet, after all, is full of weird people who occasionally do weird things with scripts and browsers.
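
Counting unique visitors over a trailing interval can be done in a number of ways. Here is a simple in-memory sketch, hypothetical and ignoring the memory concerns that a probabilistic counter would address:

    import time
    from collections import OrderedDict

    class TrailingUniques:
        """Count distinct visitor ids seen in the last `window` seconds."""

        def __init__(self, window=300):
            self.window = window
            self.last_seen = OrderedDict()   # visitor id -> most recent timestamp

        def record(self, visitor_id, now=None):
            now = time.time() if now is None else now
            # Move this visitor to the "newest" end of the ordering.
            self.last_seen.pop(visitor_id, None)
            self.last_seen[visitor_id] = now
            # Expire visitors whose last error fell out of the window.
            cutoff = now - self.window
            while self.last_seen and next(iter(self.last_seen.values())) < cutoff:
                self.last_seen.popitem(last=False)
            return len(self.last_seen)

    # One noisy script hammering the error page still counts as a single
    # screwed user, so the two graphs diverge, as described above.
    uniques = TrailingUniques()
    for _ in range(1000):
        uniques.record("weird-bot-123")
    print(uniques.record("weird-bot-123"))   # 1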

This is what it looks like when many real users are seeing error pages. Both graphs spike in concert. In this case, the dark green vertical line shows a search deploy resolving the issue.

You now know exactly as much as I do about the graphing of web errors in real time. I assume that this is a tiny fraction of the world’s total knowledge pertaining to the graphing of web errors in real time. So you would be ill-advised to claim expert status on the basis of grasping everything I have explained here.

By the way, most of the software Etsy uses to produce these graphs is freely available. Here’s StatsD and Logster.