The question of how long an A/B test needs to run comes up all the time, and the answer is that it really depends. It depends on how much traffic you have, on how you divide it between variants, on the base rates of the metrics you’re trying to change, and on how much you manage to change them. It also depends on what you deem to be acceptable rates for Type I errors (false positives) and Type II errors (false negatives).
In the face of this complexity, community concerns (“we don’t want too many people to see this until we’re sure about it”) and scheduling concerns (“we’d like to release this week”) can dominate. But letting them dominate can set you up for failure: you end up embarking on experiments that have little chance of detecting a change, positive or negative. Sometimes adjustments can be made to avoid this, and sometimes they can’t.
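To make "little chance of detecting a change" concrete, here is a minimal sketch of a power calculation for a fixed-length test. It is not taken from the tool; it assumes two equally sized groups, a two-sided two-proportion z-test under the normal approximation, and it ignores the (usually negligible) chance of a significant result in the wrong direction. The function name and the numbers in the usage lines are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(base_rate, new_rate, n_per_group, alpha=0.05):
    """Approximate chance of detecting the difference between two rates.

    Assumes equal group sizes and a two-sided z-test at level `alpha`
    (normal approximation; an illustrative sketch, not an exact test).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    delta = abs(new_rate - base_rate)
    p_bar = (base_rate + new_rate) / 2
    # Standard error of the difference under the null (pooled rate).
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_group)
    # Standard error of the difference under the alternative.
    se_alt = sqrt(
        (base_rate * (1 - base_rate) + new_rate * (1 - new_rate)) / n_per_group
    )
    return z.cdf((delta - z_alpha * se_null) / se_alt)


# Hypothetical scenario: 10,000 visitors a day split 50/50, so a one-week
# test has 35,000 visitors per group; a ten-week test has 350,000.
one_week = approx_power(0.02, 0.022, n_per_group=35_000)
ten_weeks = approx_power(0.02, 0.022, n_per_group=350_000)
print(f"one week: {one_week:.2f}, ten weeks: {ten_weeks:.2f}")
```

Under these assumptions, a one-week test looking for a move from 2% to 2.2% detects it less than half the time, which is the kind of result worth knowing before committing to a release date.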
To help with this, I built a tool that will let you play around with all of the inputs. You can find it here:
Here’s an example of what you might see using this tool:
The source code is available on GitHub here. The sample size estimate in use is the one described by Casagrande, Pike and Smith.
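For readers who want to see the shape of that estimate, the following is a rough sketch of the Casagrande, Pike and Smith approximation as I understand it, for two equally sized groups. It is not the tool's source code. The two-sided default, the function names, and the `days_needed` helper (which assumes an even split of whatever fraction of traffic enters the experiment) are my own assumptions for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def cps_sample_size(p1, p2, alpha=0.05, power=0.8, two_sided=True):
    """Approximate sample size per group for comparing two proportions.

    Sketch of the Casagrande-Pike-Smith approximation, assuming equal
    group sizes. `two_sided` is an assumption added here for convenience.
    """
    if p1 == p2:
        raise ValueError("p1 and p2 must differ")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - (alpha / 2 if two_sided else alpha))
    z_beta = z.inv_cdf(power)
    delta = abs(p1 - p2)
    p_bar = (p1 + p2) / 2
    a = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    # The square-root term is the correction that distinguishes this
    # estimate from the plain normal-approximation formula a / delta**2.
    n = a * ((1 + sqrt(1 + 4 * delta / a)) / (2 * delta)) ** 2
    return ceil(n)


def days_needed(n_per_group, daily_visitors, fraction_in_experiment=1.0):
    """Hypothetical helper: days until each of two equal groups reaches n."""
    per_group_per_day = daily_visitors * fraction_in_experiment / 2
    return ceil(n_per_group / per_group_per_day)


# Illustrative numbers only: detect a move from 2% to 2.2% with 80% power.
n = cps_sample_size(0.02, 0.022)
print(n, days_needed(n, daily_visitors=10_000))
```

The correction adds roughly 2 / |p1 - p2| visitors per group on top of the uncorrected estimate, so it matters most when the difference you are looking for is small.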
The following people were all great resources to me in building this: Steve Mardenfeld, James Lee, Kim Bost, William Chen, Roberto Medri, and Frank Harris. Peter Seibel wrote an internal tool a while back that got me thinking about this.