Manual Delivery
March 10th, 2014

The person on build rotation, or the nightly schlimazel I suppose, went into a hot 5’x8’ closet containing an ancient computer. This happened after everyone else had left, so around 8:30PM. Although in crunch time that was more like 11:30PM. And we were in crunch time at one point for a stretch of a year and a half. “That release left a mark,” my friend Matt used to say. In a halfhearted attempt at fairness to those who will take this post as a grave insult, I’ll concede that my remembrance of these details is the work of The Mark.

Anyway, the build happened after quitting time. This guaranteed that if anything went wrong, you were on your own. Failure in giving birth to the test build implied that the 20 people in Gurgaon comprising the QA department would show up for work in a matter of hours having nothing to do.

You used a tool called “VBBuild.” This was a GUI tool, rumored to be written by Russians:

VBBuild

VBBuild did mysterious COM stuff to create the DLLs that nobody at the time understood properly. It presented you with dozens of popups even when it was working perfectly, and you had to be present to dismiss each of them. The production of executable binary code was all smoke and lasers. And, apparently, popups.

Developers wrote code using the more familiar VB6 IDE. The IDE could run interpreted code as an interactive debugger, but it could not produce finished libraries in a particularly repeatable or practical way. So the release compilation was different in many respects from what programmers were doing at their desks. Were there problems that existed in one of these environments but not the other? Yes, sometimes. I recall that we had a single function that weighed in at around 70,000 lines. The IDE would give up and execute this function even if it contained clear syntax errors. That was the kind of discovery which, while exciting, was wasted in solitude somewhere past midnight as you attempted to lex and parse the code for keeps.

VB6
Isaiah 2:4: "And he shall displace VB6 in search engine results with a book written by vegans."

Developers weren’t really in the habit of doing complete pulls from source control. And who could blame them, since doing this whitescreened your machine for half an hour. They were also never in any particular hurry to commit, at least until it was time to do the test build. As there was no continuous integration at the time, this was the first time that all of the code was compiled in several days.

Often [ed: always] there were compilation errors to be resolved. We were using Visual Sourcesafe, so people could be holding an exclusive lock on files containing the errors. Typically, this problem was addressed by walking around the office an hour before build time and reminding everyone to check their files in. In the event that someone forgot [ed: every time], there was an administrative process for unlocking locked files. Not everyone had the necessary rights to do this, but happily, I did.

By design, the build tried to assume an exclusive lock on all of the code. As a result, nobody could work while the build was in progress. Sometimes, the person performing the build would check all of the files out and not check them back in. So your first act the morning after a build might be to walk over to the build closet and release the source files from their chains.

Visual Sourcesafe
The Visual Sourcesafe documentation strongly advised against its use on a team of more than four programmers, and apparently this was not a joke.

Deployment required dozens of manual steps that I will never be able to remember. When the build was done, you copied DLLs over to the test machines and registered them there. By “copied” I mean that you selected them in an Explorer window, pressed “Ctrl-C,” and then pressed “Ctrl-V” to paste them into another. There was no batch script worked out to do this more efficiently. Ok, this is a slight lie. There had been a script, but it was put out to pasture on account of a history of hideous malfunction. And popups. On remote machines sometimes, where they could only be dismissed by wind and ghosts.

Registration involved connecting to each machine with Remote Desktop and right clicking all the DLLs. You could skip a machine or just one library, and things would be very screwy indeed.

The production release, which happened roughly twice a year under ideal conditions, was identical to this but with the added complexity of about eight more servers receiving the build. And we might take the opportunity to add completely new machines, which would not necessarily have the same patch levels for, oh, like 700,000 Windows components that were relied upon.

Given eight or ten machines, the probability of a mistake on at least one of the servers approached unity. So the days and weeks following a production release were generally spent sussing out all of the minute differences and misconfigurations on the production machines. There would be catastrophic bugs that affected a tiny sliver of requests, under highly specific server conditions, and only if executed on one server out of eight. I was an expert at debugging in disassembly at the time. Upon leaving the job, I thought that this was pretty badass. But in the seven years since–do you know what? It’s never come up.
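The "approached unity" arithmetic can be sketched roughly as follows. The per-step error rate and the number of steps per machine are made-up numbers for illustration, not measurements from the story, and the sketch assumes each manual step fails independently of the others.

```python
def prob_any_mistake(per_step_error: float, steps_per_machine: int, machines: int) -> float:
    """Chance that at least one manual step goes wrong on at least one machine.

    Assumes every step fails independently with the same probability.
    """
    total_steps = steps_per_machine * machines
    # P(at least one failure) = 1 - P(every single step succeeds)
    return 1.0 - (1.0 - per_step_error) ** total_steps


# Hypothetical inputs: a 1% slip rate per step, two dozen steps per machine.
for machines in (1, 8, 10):
    p = prob_any_mistake(per_step_error=0.01, steps_per_machine=24, machines=machines)
    print(f"{machines:>2} machines: {p:.2f}")
```

Even a careful operator loses this game as the machine count grows, because the exponent is the total number of steps across all servers.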

Nonstandard & poorly reproducible builds is more like it am I right
"The code could be structured by cows and we would build it by hand."

At one point I wrote a new script to perform the deployment. It was an abomination of XML to be sure, but it got the job done without all of the popups. I started using it for the test build, with some success, and suggested that we use it for the production release. This was out of the question, I was told by one of my closer allies in the place. The production release was “too important to use a script.”

The operating systems and supporting libraries on the machines were also set up by hand, by a separate team, working from printed notes. The results were similar. This is kind of another story.

This all happened in 2003.


Scalding at Etsy
March 2nd, 2014

Here’s a presentation I gave about how Etsy wound up using Scalding for analysis. Given at the San Francisco Cascading Meetup.


The Case for Secrecy in Web Experiments
January 16th, 2014

For four months ending in early 2011, I worked on a team of six to redesign Etsy’s homepage. I don’t want to overstate the weight of this in the grand scheme of things, but hopes flew high. The new version was to look something like this:

There were a number of methodological problems with this, one of our very first web experiments. Our statistics muscles were out of practice, and we had a very difficult time fighting the forces of darkness who wanted to enact radical redesigns after five minutes of real-time data. We had no toolchain for running experiments to speak of. The nascent analytics pipeline jobs failed every single night.

But perhaps worst of all, we publicized the experiment. Well, “publicized” does not accurately convey the magnitude of what we did. We allowed visitors to join the treatment group using a magic URL. We proactively told our most engaged users about this. We tweeted the magic URL from the @Etsy account, which at that point had well over a million followers.

The magic URL was chosen to celebrate the CEO's 31st birthday. None of this was Juliet's fault.

This project was a disaster for many reasons. Nearly all of the core hypotheses turned out to be completely wrong. The work was thrown out as a total loss. Everyone involved learned valuable life lessons. I am here today to elaborate on one of these: telling users about the experiment as it was running was a big mistake.

The Diamond-Forging Pressure to Disclose Experiments

If you operate a website with an active community, and you do A/B testing, you might feel some pressure to disclose your work. And this seems like a proper thing to do, if your users are invested in your site in any serious way. They may notice anyway, and the most common reaction to change on a beloved site tends to be varying degrees of panic.

"If you can't beat 'em, join 'em."

As an honest administrator, your wish is to reassure your community that you have their best interest at heart. Transparency is the best policy!

Except in this case. I think there’s a strong argument to be made against announcing the details of active experiments. It turns out to be easier for motivated users to overturn your experiment than you may believe. And disclosing experiments is work, and work that comes before real data should be minimized.

Online Protests: Not Necessarily A Waste of Time

A fundamental reason that you should not publicize your A/B tests is that this can introduce bias that can affect your measurements. This can even overturn your results. There are many different ways for this to play out.

Most directly, motivated users can just perform positive actions on the site if they believe that they are in their preferred experiment bucket. Even if the control and treatment groups are very large, the number of people completing a goal metric (such as purchasing) may be just a fraction of that. And the anticipated difference between any two treatments might be slight. It’s not hard to imagine how a small group of people could determine an outcome if they knew exactly what to do.

Group     Visits   Conversions (organic)   Conversions (gamed)   Proportion
Control   10000    50                      10                    0.0060
New       10000    55                      0                     0.0055
Figure 1: In some cases a small group of motivated users can change an outcome, even if the sample sizes are large.
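The arithmetic behind Figure 1 is easy to check. This is a small sketch using the figure's numbers; the function name is just for illustration.

```python
def conversion_rate(visits: int, organic: int, gamed: int = 0) -> float:
    """Share of visits that converted, counting organic and gamed conversions alike."""
    return (organic + gamed) / visits


# Numbers from Figure 1.
control_organic_only = conversion_rate(10_000, organic=50)
control_as_measured = conversion_rate(10_000, organic=50, gamed=10)
new_as_measured = conversion_rate(10_000, organic=55)

print(f"control, organic only: {control_organic_only:.4f}")
print(f"control, as measured:  {control_as_measured:.4f}")
print(f"new, as measured:      {new_as_measured:.4f}")
print("new wins on organic behavior:", new_as_measured > control_organic_only)
print("new wins as measured:        ", new_as_measured > control_as_measured)
```

Ten extra conversions out of ten thousand visits are enough to flip which side appears to win.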

As the scope and details of an experiment become more fully understood, this gets easier to accomplish. But intentional, organized action is not the only possible source of bias.

Even if users have no preference as to which version of a feature wins, some will still be curious. If you announce an experiment, visitors will engage with the feature immediately who otherwise would have stayed away. This well-intentioned interest could ironically make a winning feature appear to be a loss. Here’s an illustration of what that looks like.

Group     Visits (oblivious)   Visits (rubbernecking)   Visits (total)   Conversions   Proportion
Control   500                  50                       550              30            0.055
New       500                  250                      750              35            0.047
Figure 2: An example in which 100 engaged users are told about a new experiment. They are all curious and seek out the feature. Those seeing the new treatment visit the new feature more often just to look at it, skewing measurement.
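Here is the same calculation for Figure 2, as a sketch. The "without rubbernecking" comparison assumes that the curious visits produced no conversions of their own, which is a simplification made for the example.

```python
def rate(conversions: int, oblivious_visits: int, rubbernecking_visits: int = 0) -> float:
    """Conversions divided by all visits counted in the denominator."""
    return conversions / (oblivious_visits + rubbernecking_visits)


# Numbers from Figure 2.
control_measured = rate(30, 500, rubbernecking_visits=50)
new_measured = rate(35, 500, rubbernecking_visits=250)

# Counterfactual: nobody was told, so only the oblivious visits happen.
# Assumes the rubbernecking visits added no conversions.
control_quiet = rate(30, 500)
new_quiet = rate(35, 500)

print(f"announced:   control {control_measured:.3f} vs new {new_measured:.3f}")
print(f"unannounced: control {control_quiet:.3f} vs new {new_quiet:.3f}")
```

The extra visits only inflate the denominator, and they inflate it more on the side people came to look at.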

These examples both involve the distortion of numbers on one side of an experiment, but many other scenarios are possible. Users may change their behavior in either group for no reason other than that they believe they are being measured.

Good experimental practice requires that you isolate the intended change as the sole variable being tested. To accomplish this, you randomly assign visitors the new treatment or the old, controlling for all other factors. Informing visitors that they’re part of an experiment places this central assumption in considerable jeopardy.
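One common way to do this kind of assignment is to hash a visitor identifier, so that the visitor has no say in which group they land in. The sketch below is an illustration under that assumption, not a description of the tooling used in this story; the function name, the experiment name, and the 50/50 split are all hypothetical.

```python
import hashlib


def assign_bucket(visitor_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically place a visitor in 'new' or 'control'.

    Hashing the experiment name together with the visitor id keeps a visitor
    in the same group on every request, and keeps assignments in different
    experiments unrelated to one another.
    """
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto a number in [0, 1).
    position = int.from_bytes(digest[:8], "big") / 2**64
    return "new" if position < treatment_share else "control"


# Hypothetical usage: roughly half of these made-up visitors see the new treatment.
buckets = [assign_bucket(f"visitor-{i}", "homepage-redesign") for i in range(10_000)]
print(buckets.count("new"), buckets.count("control"))
```

A magic URL is the opposite of this: it lets visitors choose their own group, which is exactly the self-selection that random assignment exists to prevent.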

Predicting Bias is Hard

“But,” you might say, “most users aren’t paying attention to our communiqués.” You may think that you can announce experiments, and only a small group of the most engaged people will notice. This is very likely true. But as I have already shown, the behavior of a small group cannot be dismissed out of hand.

Obviously, this varies. There are experiments in which a vocal minority cannot possibly bias results. But determining if this is true for any given experiment in advance is a difficult task. There is roughly one way for an experiment to be conducted correctly, and there are an infinite number of ways for it to be screwed.

A/B tests are already complicated: bucketing, data collection, experimental design, experimental power, and analysis are all vulnerable to mistakes. From this point of view, “is it safe to talk about this?” is just another brittle moving part.

Communication Plans are Real Work

Something I have come to appreciate over the years is the role of product marketing. I have been involved in many releases for which the act of explaining and gaining acceptance for a new feature constituted the majority of the effort. Launches involve a lot more than pressing a deploy button. This is a big deal.

Product marketing: this is serious business.

It also seems to be true that people who are skilled at this kind of work are hard to come by. You will be lucky to have a few of them, and this imposes limits on the number of major changes that you can make in any given year.

It makes excellent sense to avoid wasting this resource on quite-possibly-fleeting experiments. It will delay their deployment, steal cycles from launches for finished features, and it will do these things in the service of work that may never see the light of day!

Users will tend to view any experiment as presaging an imminent release, regardless of your intentions. Therefore, you will need to put together a relatively complete narrative explaining why the changes are positive at the outset. A “minimum viable announcement” probably won’t do. And you will need to execute this without the benefit of quantitative results to bolster your case.

Your Daily Reminder that Experiments Fail

Doing data-driven product work really does imply that you will not release changes that don’t meet some quantitative standard. In such an event you might tweak things and start over, or you might give up altogether. Announcing your running experiments is problematic given this reality.

Obviously, product costs will be compounded by communication costs. Every time you retool an experiment, you will have to bear the additional weight of updating your community. Adding marginal effort makes it more difficult for humans to behave rationally and objectively. We have a name for this well-known pathology: the sunk cost fallacy. We’ve put so much into this feature, we can’t just give up on it now.

The fear of admitting mistakes in public can be motivating.

Announcing experiments also has a way of raising the stakes. The prospect of backtracking with your users (and being perceived as admitting a mistake) only makes killing a bad feature less palatable. The last thing you need is additional temptation to delude yourself. You have plenty of this already. The danger of living in public is that it will turn a bad release that should be discarded into an inevitability.

Consistency and Expectations

Let’s say you’ve figured out workarounds for every issue I’ve raised so far. You are still going to want to run experiments that are not publicly declared.

Some experiments are inherently controversial or exploratory. It may be perfectly legitimate to try changes that you would never release to learn more about your site. Removing a dearly beloved feature temporarily for half of new registrations is a good example of this. By doing so, you can measure the effect of that feature on lifetime value, and make better decisions with your marketing budget.

Other experiments work only when they’re difficult to detect. Search ranking is a high-stakes arms race, and complete transparency can just make it easier for malicious users to gain unfair advantages. It’s likely you’re going to want to run experiments on search ranking without disclosing them.

It would be malpractice to give users the expectation that they will always know the state of running experiments. They will not have the complete picture. Leading them to believe otherwise can do more harm to your relationship than just having a consistent policy of remaining silent until features are ready for release.

What can you share?

Sharing too much too soon can doom your A/B tests. But this doesn’t mean that you are doomed to be locked in a steel cage match with your user base over them.

Forum moderators of the world: good luck.

You can do rigorous, well-controlled experiments and also announce features in advance of their release. You can give people time to acclimate to them. You can let users preview new functionality, and enable them at a slower pace. These practices all relate to how a feature is released, and they are not necessarily in conflict with how you decide which features should be released. It is important to decouple these concerns.

You can and should share information about completed experiments. “What happened in the A/B test” should be a regular feature of your release notes. If you really have determined that your new functionality performs better than what it replaces, your users should have this data.

Plain-language A/B test results can ease user anxiety in launches.

Counterintuitively, perhaps, trust is also improved by sharing the details of failed experiments. If you only tell users about your victories, they have no reason to believe that you are behaving objectively. Who’s to say that you aren’t just making up your numbers? Showing your scars (as I tried to do with my homepage story above) can serve as a powerful declaration against interest.

Successful Testing is Good Stewardship

Your job in product development, very broadly, is to make progress while striking a balance between short and long term concerns.

  • Users should be as happy as possible in the short term.
  • Your site should continue to exist in the long term.

The best interest of your users is ultimately served by making the correct changes to your product. Talking about experiments can break them, leading to both quantitative errors and mistakes of judgment.

I firmly believe that A/B tests in any organization should be as free, easy, and cheap as humanly possible. After all, running A/B tests is perhaps the only way to know that you’re making the right changes. Disclosing experiments as they are running is a policy that can alleviate some discontent in the short term. But the price of this is making experiments harder to run in the long term, and ultimately making it less likely that measurement will be done at all.

Thanks to Nell Thomas, Steve Mardenfeld, and Dr. Parker for their help on this.