Dan McKinley urn:uuid:0f5320a2-eb2c-2b09-f70e-ca9845402e07 https://mcfunley.com/assets/images/favicon.png https://mcfunley.com/assets/images/favicon.png 2019-03-18T16:18:08+00:00 Some Recent Work 2017-05-08T00:00:00+00:00 2017-05-08T00:00:00+00:00 urn:uuid:6d6cb15a-f879-141b-247f-b7971b5bd79f <p>Here are some links to recent work I’ve done elsewhere.</p> <ul> <li><a href="https://blog.skyliner.io/ship-small-diffs-741308bec0d1">Ship Small Diffs</a> - I tried to transmute the anguish I feel looking at huge changesets into words.</li> <li><a href="https://hackernoon.com/mistakes-you-apparently-just-have-to-make-yourself-cc2dd2bfc25c">Mistakes You Apparently Just Have to Make Yourself</a> - Getting youngfolk to listen to you is harder than I realized.</li> <li><a href="https://blog.skyliner.io/fourteen-months-with-clojure-beb8b3e4bf00">Fourteen Months with Clojure</a> - Going back to my Lisp roots here.</li> <li><a href="http://pushtrain.club/">The Push Train</a> - Trying to frantically document some of the human element of making engineering function at a high level, which for whatever reason didn’t strike me as vital at the time.</li> <li><a href="https://speakerdeck.com/mcfunley/deploying-often-is-a-very-good-idea">Deploying Often is a Very Good Idea</a> - Conditional probability is extremely good.</li> <li><a href="https://blog.skyliner.io/you-cant-have-a-rollback-button-83e914f420d9">You Can’t Have a Rollback Button</a> - Please engrave “but what if you didn’t?” on my tombstone I guess.</li> <li><a href="https://blog.skyliner.io/a-simple-pattern-for-jobs-and-crons-on-aws-2f965e43932f">A Simple Pattern for Jobs and Crons on AWS</a> - Not only did I stoop to writing a practical post for once, I also wrote <a href="https://medium.com/@mcfunley/at-most-once-vs-at-least-once-f215dafd27e2">a followup</a>.</li> <li><a href="https://blog.skyliner.io/no-way-out-but-through-1db41c648697">No Way Out but Through</a> - More ranting and raving about deploying more than once a 
year.</li> </ul> Dan McKinley https://mcfunley.com/ The Unreasonable Effectiveness of Mathematics in Planning 2016-02-03T00:00:00+00:00 2016-02-03T00:00:00+00:00 urn:uuid:0db0d98a-fb9b-cc6a-94a8-c4506f128ce4 <p>I was speaking on a panel the other day that was handed the topic, “the challenges of balancing data-light product bets vs purely data driven incremental improvements.” <a href="https://twitter.com/skamille">Camille Fournier</a> was also a panelist and wrote up her thoughts <a href="http://whilefalse.blogspot.com/2016/01/qualitative-or-quantitative-but-always.html">here</a>. Camille’s take (which I think is right) is that even if you don’t have data to work from, you can still approach projects analytically.</p> <p>For me, the process of behaving analytically incorporates mathematical reasoning but not necessarily <em>data</em>. And I think this kind of spitballing is a useful activity, even if the numbers are made up. The reason for this is that human brains were forged on the African savanna where nothing is very fast, very large, or very small, cosmically speaking, and we are laughably ill-equipped for coping with orders of magnitude.</p> <figure> <img src="http://i.imgur.com/bIMKyZl.jpg" /> <figcaption>That is also why you think this looks awesome, but don't let that spoil it for you.</figcaption> </figure> <p>The kind of thinking I’m describing works like this: <em>“ok that’s a thing measured in thousands multiplied by a thing measured in tens of thousands, and then filtered through a rate of a few percent, are we even close?”</em> When permitted to skip this check on deficient intuition, most humans will sense their way to the wrong answers.</p> <p>But on the panel and in subsequent discussions, it’s been easy to run with the dichotomy that you’ve either got data to work from, or you’ve got nothing at all. 
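That kind of spot check is nothing more than back-of-the-envelope arithmetic. A minimal sketch, with every input invented for illustration:

```python
import math

def fermi(*factors):
    """Multiply rough funnel factors and report the order of magnitude."""
    value = math.prod(factors)
    return value, math.floor(math.log10(value))

# "a thing measured in thousands multiplied by a thing measured in tens
# of thousands, and then filtered through a rate of a few percent"
value, magnitude = fermi(5_000, 20_000, 0.03)
print(f"{value:,.0f} (~10^{magnitude})")  # 3,000,000 (~10^6)
```

The point is not the precision of the inputs; it's that writing the multiplication down at all keeps your intuition honest about the order of magnitude.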
The temptation is to jump into philosophical takes given examples of products or entire markets that could not have been calculated with foresight before they existed. While that’s valid, I think it doesn’t describe most of the situations that you encounter in the wild.</p> <h3 id="data-exists-and-we-dont-want-to-look">Data Exists, and We Don’t Want to Look</h3> <p>The daily grind at a company consists of building in proximity to a thing that’s satisfying some definition of “working.” Yes, there’s always the innovator’s dilemma to worry about and the prospect of weird new platforms that will enable use cases you don’t understand yet. But the degree to which we’re striking out into the <em>undiscovered country</em> is overstated.</p> <p>Companies release products that you’d figure shouldn’t have survived opportunity analysis all the time. They just don’t pitch them that way:</p> <blockquote> <p>This feature notifies pairs of individuals that have arranged an unlikely relationship on the internet beforehand. The notifications are delivered two or three times a year, and only if the parties are in close geographic proximity. And they both have an optional iOS app installed. And in this scenario one of the people is known to be in a cohort that tends to not have that iOS app installed. And then at the end of this funnel we’re hoping that some small percentage of these folks will wind up showing up online and buying a thing. <em>Later.</em></p> </blockquote> <p>I have a real launch in mind with that, but I’ve rendered it unrecognizable and absurd by describing it accurately. This isn’t a situation where the volume couldn’t be estimated. If it were, I’d have a harder time lampooning it. This is the neglected scenario: we have all the data we need, but instead of deploying it we shipped something doomed.</p> <p>When you hear people speak in defense of such things, they act out the same misdirection and head straight for the words we use when we’re discussing the iPod. 
<em>You can’t, like, quantify vision, man.</em> What they’re really espousing is the idea that product success obeys an uncertainty principle. If we look at things too closely, the magic disappears. And of course the good vibes would sublimate in this case, because the magic is nonsense.</p> <h3 id="the-hazards-of-narrative-arc">The Hazards of Narrative Arc</h3> <p>Of course, this is not what anyone is actually thinking. Nobody sets out to ignore data on purpose, hoping to improve their chances of failing. You just watched me retcon an ethos onto feral behavior. And in doing so, I am part of the problem.</p> <p>Everyone’s the hero of the novel they’re writing in their heads. That is the human condition. And having <a href="https://en.wikipedia.org/wiki/List_of_artistic_depictions_of_Steve_Jobs">saved a company by inventing a new market</a> is a great narrative arc, which is why we reach for it when we’re actually engaged in something mundane. <a href="/effective-web-experimentation-as-a-homo-narrans">We just systematically find stories too compelling</a>.</p> <p>It is rarely the case that vision can’t be at least sketched using arithmetic. Mathematics is the language we use to describe reality, and vision is generally assumed to have effects <em>in reality.</em> That’s what makes numeric methods more powerful than they should reasonably be. We’re constantly engaged in the art of self-deception, and they force you to snap out of it.</p> Dan McKinley https://mcfunley.com/ Do You Work at Amazon? 2016-01-26T00:00:00+00:00 2016-01-26T00:00:00+00:00 urn:uuid:95ebbab7-0de7-4d46-f064-940561c3ec29 <p><span class="coauthor">Please note that <a href="http://twitter.com/paradosso">Roberto Medri</a> is a coauthor on this post.</span></p> <p><a href="http://continuations.com">Albert Wenger</a> has long been one of the VCs I most admire. 
He was very present in the early days at Etsy, and sat in giving counsel on some, uh, <em>significantly astray</em> engineering team meetings. Albert is a smart, data-driven guy whose values roughly align with my own.</p> <p>That said, I have an axe to grind with his latest post, <a href="http://continuations.com/post/138017572565/dont-mind-the-share-price-hint-it-fluctuates">Don’t Mind the Share Price</a>. In it, Albert deploys the story of Amazon as a warning against focusing too much on how the market values a company. This is the story of Amazon:</p> <p><img src="http://i.imgur.com/PdcjSCu.png" alt="Amazon's historical stock price" /></p> <p>Amazon was riding high in the late 90s, then felt the DotCom burst roughly along with the rest of the tech sector. Albert points out that history has shamed anyone that might’ve judged Amazon on its share price fifteen years ago, since it’s returned north of 2000% in the years since.</p> <blockquote> <p>So whether you are running a tech company, working for one, or investing in one I highly recommend not reading too much into changes in share price. Focus instead on whether your company is making real progress.</p> </blockquote> <p>Albert is careful to stress that you should focus on fundamentals over fluctuations in the price, which is generally good advice. But I think the subtext is clear: <em>don’t be discouraged by even large declines in price, because you might be working at the next Amazon.</em></p> <p>This is a premise that we can investigate quantitatively.</p> <h3 id="the-odds-of-being-an-amazon">The Odds of Being an Amazon</h3> <p>Suppose that we’re working at a public company that’s experienced a decline in its share price of at least 50%, relative to a recent high price. 
We’d like to approximate the odds that this company is going to recover <a ref="#f1" href="#f1" class="footnote">[1]</a>.</p> <p>It turns out that since 2002, there have been <a href="https://github.com/mcfunley/shaken-stocks/blob/master/shaken-stocks.csv">2,132 companies traded on the NASDAQ</a> that fit this description. One of these is indeed Amazon. But how many others are like it?</p> <p>We can take this set of companies and categorize them. Let’s identify companies that wound up being completely wiped out—losing 90% of their remaining value or more—and then all other companies that declined in value. For companies that increased in value, we’ll differentiate those that beat the market (defined as the S&amp;P 500 Index) from those that didn’t. The idea being that you would have been better off just buying an index fund with your cash surplus from working for Google in a parallel universe. And finally, we’ll identify the <a href="https://www.youtube.com/watch?v=zbQTXFJL8lo">miraculous</a>: those companies that return 1000% or more, of which Amazon is one example.</p> <p>If we do that, it looks like this:</p> <table class="table table-striped"> <tr> <th>Category</th> <th>Count</th> <th>Percent</th> <th>Cumulative Percent</th> </tr> <tr> <td class="negative attention">Wiped Out</td> <td class="negative attention">239</td> <td class="negative attention">11.21%</td> <td class="negative attention">11.21%</td> </tr> <tr> <td class="negative">Declined</td> <td class="negative">794</td> <td class="negative">37.24%</td> <td class="negative">48.45%</td> </tr> <tr> <td class="negative">Beaten by Market</td> <td class="negative">344</td> <td class="negative">16.14%</td> <td class="negative">64.59%</td> </tr> <tr> <td class="positive">Beat Market</td> <td class="positive">661</td> <td class="positive">31.00%</td> <td class="positive">95.59%</td> </tr> <tr> <td class="positive attention">Miracle</td> <td class="positive attention">94</td> <td class="positive 
4.41%</td>">
attention">4.41%</td> <td class="positive attention">100.00%</td> </tr> </table> <p>Here we can see that about 65% of public companies that find themselves in this situation don’t recover. But 35% of companies do. These are tough odds, but definitely not impossible odds, right?</p> <h3 id="recovery-is-not-good-enough">Recovery is not Good Enough</h3> <p>Albert asks us to consider investors, officers, and employees of the company as having roughly identical situations. This is a mistake. Things are significantly worse in the case of employees <a ref="#f2" href="#f2" class="footnote">[2]</a> at a public company who have been issued options. In these cases, the company may very well recover, but we have to contemplate several other horrifying possibilities.</p> <ul> <li>Employees may have already exercised options at a strike price higher than the current market price. If so, they’re screwed if the company never recovers above that price. Even if the company beats the market from here out.</li> <li>The strike price may be below the current market price, meaning that the options are worth something. But employees may owe taxes (or AMT), forcing them to sell before the recovery.</li> <li>Options may be underwater and worthless. At least in this scenario, there is clarity.</li> </ul> <p>From these situations we can see that as an employee <a ref="#f3" href="#f3" class="footnote">[3]</a>, it makes sense to consider the odds that the company will not just recover, but will ultimately get back to where it was. 
That looks like this:</p> <table class="table table-striped"> <tr> <th>Category</th> <th>Count</th> <th>Percent</th> <th>Cumulative Percent</th> </tr> <tr> <td class="negative attention">Wiped Out</td> <td class="negative attention">239</td> <td class="negative attention">11.21%</td> <td class="negative attention">11.21%</td> </tr> <tr> <td class="negative">Declined</td> <td class="negative">794</td> <td class="negative">37.24%</td> <td class="negative">48.45%</td> </tr> <tr> <td class="negative">Beaten by Market</td> <td class="negative">344</td> <td class="negative">16.14%</td> <td class="negative">64.59%</td> </tr> <tr> <td class="negative">Recovered Below High Price</td> <td class="negative">210</td> <td class="negative">9.84%</td> <td class="negative">75.04%</td> </tr> <tr> <td class="positive">Beat Market</td> <td class="positive">441</td> <td class="positive">20.66%</td> <td class="positive">95.69%</td> </tr> <tr> <td class="positive attention">Miracle</td> <td class="positive attention">92</td> <td class="positive attention">4.31%</td> <td class="positive attention">100.00%</td> </tr> </table> <p>This makes it worse: <strong>75% of companies won’t recover using this definition</strong>. And only about 4% will make miraculous comebacks of Amazon’s order of magnitude.</p> <h3 id="are-you-making-progress">Are You Making Progress?</h3> <p>Remember that Albert provides us with an important caveat: we should “[f]ocus … on whether the company is making real progress.” But this can be tricky to surmise as an employee, for several reasons:</p> <ul> <li>You are in unavoidably close proximity to a coordinated propaganda campaign. It’s called <em>the company’s internal communications and morale efforts.</em> You may find yourself thinking unreasonably positively about these things.</li> <li>You are putting in hours at this company, and human nature compels us to confuse effort with progress.</li> <li>Remember that we’re talking about a public company. 
So unless you’re an officer, you’ll have a difficult time getting detailed information about how much progress the company is really making. And of course timing trades on such information would be <em>illegal</em>.</li> </ul> <p>We should agree that the outlook here is going to be hazy at best, and self-deception is a hazard.</p> <h3>The Base Rate Fallacy's Perverse Tyranny Over the American <a ref="#f4" href="#f4" class="footnote">[4]</a> Mind</h3> <p>If there is any line of reasoning that really drives me crazy, it’s the following:</p> <ul> <li>A series of cosmically unlikely events has unfolded.</li> <li>This is submitted as evidence that <em>it can happen to anyone.</em></li> </ul> <p>Examples of this are everywhere. Someone is going to win Powerball, therefore it makes sense to buy tickets. Barack Obama was elected president, therefore systematic racism is toothless. Mark Zuckerberg struck it rich, so you’ve just gotta have faith.</p> <figure> <img src="http://i.imgur.com/0Jb88Db.png" alt="By the way this guy also thinks that picking your own numbers gives you a higher chance of winning." /> <figcaption>By the way this guy also thinks that picking your own numbers gives you a higher chance of winning.</figcaption> </figure> <p>In looking to Amazon (or Google, Facebook, Netflix, or dear god <em>Apple</em>) as consolation in the event that a company has experienced a decline in share price, we make the following mistake. <strong>The probability that successful companies have stumbled in their past is not the probability that a company will succeed, having stumbled.</strong></p> <p>This isn’t a call for nihilism if you find yourself in such a situation. Far from it—it’s a call to realize that the odds are now against you, and to behave proactively.</p> <hr /> <p><em>The code and data for this article are available <a href="https://github.com/mcfunley/shaken-stocks">here, on GitHub</a>. It’s a bit sloppy and hastily written, sorry. 
We started from a dataset of companies traded on the NASDAQ that experienced a decline of 50% or more off of a previous high. Our dataset started around the year 2000.</em></p> <hr /> <ol class="foot-note-list"> <li> <a name="f1"></a> You may notice that I've switched questions, from "are you working at Amazon" to "is the company Amazon." Calculating the odds that you are working at Amazon would of course require a richer dataset that includes company headcounts, and I am a lazy man. </li> <li> <a name="f2"></a> Investors can more easily scale their commitment to the company by having a diverse portfolio. Employees and officers, however, give 100% of their labor to the company. And in the event that things go well, a large percentage of their net worth derives from the value of the company. Officers have a high floor on their returns, via guaranteed bonuses, parachute provisions, accelerated vesting schedules in the event of termination, and so on. Employees on the other hand are screwed. </li> <li> <a name="f3"></a> This refinement doesn't apply to all employees. Early employees probably have strike prices that are very low, and can make money despite a large drop in the share price. But at a newly-minted public company, <em>most</em> employees are probably new, and <em>most</em> employees are therefore affected. </li> <li><a name="f4"></a>I know that Albert Wenger is German.</li> </ol> Dan McKinley https://mcfunley.com/ Are My Push Notifications Driving Users Away? 
2015-11-24T00:00:00+00:00 2015-11-24T00:00:00+00:00 urn:uuid:a59ff9ac-0706-8ec9-179c-c94d942094ad <p>In response to <a href="https://twitter.com/kellan">Kellan’s</a> musing about push notifications on twitter, <a href="http://twitter.com/mccue">Adam McCue</a> asked an interesting question:</p> <blockquote align="center" class="twitter-tweet" lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/kellan">@kellan</a> <a href="https://twitter.com/mcfunley">@mcfunley</a> what's the best way to do this?</p>&mdash; Adam McCue (@mccue) <a href="https://twitter.com/mccue/status/669386580059099136">November 25, 2015</a></blockquote> <p>I quickly realized that fitting an answer into tweets was hopeless, so here’s a stab at it in longform.</p> <h3 id="how-would-we-do-this">How would we do this?</h3> <p>Let’s come up with a really simple way to figure this out for the case of a single irritating notification. This is limited, but the procedure described ought to be possible for anyone with a web-enabled mobile app. We need:</p> <ol> <li>A way to divide the user population into two groups: a treatment group that will see the ad notification, and a control group that won’t.</li> <li>A way to decide if users have disappeared or not.</li> </ol> <p>To make the stats as simple as possible, we need (1) to be random and we need (2) to be a <a href="http://homepages.wmich.edu/~bwagner/StatReview/Binomial/binomial%20probabilities.htm">binomial measure</a> (i.e. “yes or no,” “true or false,” “heads or tails,” etc).</p> <p>To do valid (simple) stats, we also want our trials to be <em>independent</em> of each other. If we send the same users the notifications over and over, we can’t consider each of those to be independent trials. It’s easy to intuit why that might be: I’m more likely to uninstall your app after the fifth time you’ve bugged me <a href="#f0" ref="#f0" class="footnote">[1]</a>. 
So we need to consider disjoint sets of users on every day of the experiment.</p> <figure> <img src="http://i.imgur.com/Dy6loZn.png" /> <figcaption>Does this hurt us or help us? <a href="http://store-xkcd-com.myshopify.com/products/try-science">Let's try science.</a></figcaption> </figure> <p>How to randomly select users to receive the treatment under these conditions is up to you, but one simple way that should be broadly applicable is just hashing the user ID. Say we need 100 groups of users: both a treatment and control group for 50 days. We can hash the space of all user IDs down to 100 buckets <a ref="#f1" href="#f1" class="footnote">[2]</a>.</p> <p>So how do we decide if users have disappeared? Well, most mobile apps make HTTP requests to a server somewhere. Let’s say that we’ll consider a user to be “bounced” if they don’t make a request to us again within some interval.</p> <p>Some people will probably look at the notification we sent (resulting in a request or two), but be annoyed and subsequently uninstall. We wouldn’t want to count such a user as happy. So let’s say we’ll look for usage between one day after the notification and six days after the notification. Users that send us a request during that interval will be considered “retained.”</p> <figure> <img src="http://i.imgur.com/b7Nl6Ve.png" /> <figcaption>Some examples of our binomial model. We'll call a user retained if they request data from us on any of days two through seven counting from the time of the notification. User 4 in this example is not retained because (s)he only requests data on the day the notification was sent.</figcaption> </figure> <p>To run the experiment properly you need to know how long to run it. That depends a lot on your personal details: how many people use your app, how often they use it, how valuable the ad notification is, and how severe uninstalls are for you. 
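The hashing scheme described above can be sketched in a few lines. This is a minimal, illustrative version: the hash function, bucket layout, and experiment name are all assumptions, not a prescription.

```python
import hashlib

def bucket(user_id, experiment, n_buckets=100):
    """Deterministically map a user to one of n_buckets.

    Mixing the experiment name into the hash keeps a user's assignment
    in this experiment independent of their assignments in others.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    return int(hashlib.md5(key).hexdigest(), 16) % n_buckets

# With 100 buckets, one possible layout: buckets 0-49 are the 50 daily
# treatment groups and buckets 50-99 are the matching control groups.
b = bucket(12345, "ad-notification")
in_treatment = b < 50
```

Because the assignment is a pure function of the user ID and experiment name, the same user lands in the same group every time, with no assignment state to store.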
For the sake of argument, let’s say:</p> <ul> <li>We can find disjoint sets of 10,000 users making requests to us on any given day, daily, for a long time.</li> <li>(As discussed) we’ll put 50% of them in the treatment group.</li> <li>Currently, 60% of people active on a given day will also be active between one and six days after that.</li> <li>We want to be 80% sure that if we move that figure by plus or minus 1%, we’ll know about it.</li> <li>We want to be 95% sure that if we measure a deviation of plus or minus 1%, it’s for real.</li> </ul> <p><a href="http://www.experimentcalculator.com/#lift=1&amp;conversion=60&amp;visits=10000&amp;percentage=50">If you plug all of that into the experiment calculator</a> <a href="#f2" ref="#f2" class="footnote">[3]</a> it will tell you that you need 21 days of data to satisfy those conditions. But since we use a trailing time interval in our measurement, we need to wait 28 days.</p> <h3 id="an-example-result">An example result</h3> <p>Ok, so let’s say we’ve run that experiment and we have some results. And suppose that they look like this:</p> <table class="table table-striped"> <tr> <th>Group</th> <th>Users</th> <th>Retained users</th> <th>Bounced users</th> </tr> <tr> <td>Treatment</td> <td>210,000</td> <td>110,144</td> <td>99,856</td> </tr> <tr> <td>Control</td> <td>210,000</td> <td>126,033</td> <td>83,967</td> </tr> </table> <p>Using these figures we can see that we’ve apparently decreased retention by 12.6%, and a <a href="https://gist.github.com/mcfunley/b7b9320e7f0bafcbaab2">test of proportions</a> confirms that this difference is statistically significant. Oops!</p> <h3 id="ive-run-the-experiment-now-what">I’ve run the experiment, now what?</h3> <p>You most likely have created the ad notification because you had some positive goal in mind. Maybe the intent was to get people to buy something. 
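For reference, a test of proportions like the one linked above can be sketched as a two-proportion z-test using only the standard library. In practice you'd reach for a stats package, but the arithmetic is simple enough to show here with the example counts from the table:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """z statistic for the difference between two proportions."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Retained counts from the example results: treatment vs. control.
z = two_proportion_z(110_144, 210_000, 126_033, 210_000)
print(abs(z) > 1.96)  # True: significant at the 95% confidence level
```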
If that’s the case, then you should do an additional computation to see if what you gained in positive engagement outweighs what you’ve lost in users.</p> <h3 id="i-dont-think-i-have-enough-data">I don’t think I have enough data.</h3> <p>You might not have 420,000 users to play with, but that doesn’t mean that the experiment is necessarily pointless. In our example we were trying to detect changes of <em>plus or minus one percent.</em> You can detect more dramatic changes in behavior with smaller sets of users. Good luck!</p> <h3 id="im-sending-reactivation-notifications-to-inactive-users-can-i-still-measure-uninstalls">I’m sending reactivation notifications to inactive users. Can I still measure uninstalls?</h3> <p>In our thought experiment, we took it as a given that users were likely to use your app. Then we considered the effect of push notifications on that behavior. But one reason you might be contemplating sending the notifications is that they’re <em>not</em> using it, and you are trying to reactivate them.</p> <p>If that’s the case, you might want to just measure reactivations instead. After all, the difference between a user who has your app installed but never opens it and a user that has uninstalled your app is mostly philosophical. But you may also be able to design an experiment to detect uninstalls. And that might be sensible if very, very infrequent use of your app can still be valuable.</p> <p>A procedure that might work for you here is to send two notifications. You could then use delivery failures of secondary notifications as a proxy metric for uninstalls.</p> <h3 id="i-want-to-learn-more-about-this-stuff">I want to learn more about this stuff.</h3> <p>As it happens, I recorded <a href="http://shop.oreilly.com/product/0636920040149.do">a video with O’Reilly</a> that covers things like this in more detail. 
You might also like <a href="http://www.evanmiller.org/">Evan Miller’s blog</a> and <a href="http://ai.stanford.edu/~ronnyk/ronnyk-bib.html">Ron Kohavi’s publications</a>.</p> <hr /> <ol class="footnote-list"> <li><a name="f0"></a><em>"How many notifications are too many?"</em> is a separate question, not considered here.</li> <li><a name="f1"></a>If you do many experiments, you want to avoid using the <em>same</em> sets of people as control and treatment, so include something based on the name of the experiment in the hash. If user 12345 is in the treatment for 50/50 experiment X, she should be only 50% likely (not 100% likely) to be in the treatment for some other 50/50 experiment Y.</li> <li><a name="f2"></a>The labeling on the tool is for experiments on a website. The math is the same though.</li> </ol> Dan McKinley https://mcfunley.com/ Choose Boring Technology (Expanded, Slide-Based Edition) 2015-07-27T00:00:00+00:00 2015-07-27T00:00:00+00:00 urn:uuid:1432e359-01f1-977f-dd5a-0da6a2c55d5c <p>I gave a spoken word version of <a href="/choose-boring-technology">Choose Boring Technology</a> at OSCON in Portland last week. Here are the slides:</p> <div class="speakerdeck-container"> <div class="speakerdeck-loading"></div> <script id="choose-boring-technology-deck" async="" class="speakerdeck-embed" data-id="454e3843ac184d3f8bcb0e4a50d3811a" data-ratio="1.31113956466069" src="//speakerdeck.com/assets/embed.js"></script> <script>$('#choose-boring-technology-deck').speakerdeck();</script> </div> Dan McKinley https://mcfunley.com/ Choose Boring Technology 2015-03-30T00:00:00+00:00 2015-03-30T00:00:00+00:00 urn:uuid:d62993ee-047e-c4a5-1b11-e986b22566b8 <p>Probably the single best thing to happen to me in my career was having had <a href="http://laughingmeme.org/">Kellan</a> placed in charge of me. I stuck around long enough to see Kellan’s technical decisionmaking start to bear fruit. 
I learned a great deal <em>from</em> this, but I also learned a great deal as a <em>result</em> of this. I would not have been free to become the engineer that wrote <a href="/data-driven-products-lean-startup-2014">Data Driven Products Now!</a> if Kellan had not been there to so thoroughly stick the landing on technology choices.</p> <figure> <img src="http://i.imgur.com/FRQKLCy.jpg" /> <figcaption>Being inspirational as always.</figcaption> </figure> <p>In the year since leaving Etsy, I’ve resurrected my ability to care about technology. And my thoughts have crystallized to the point where I can write them down coherently. What follows is a distillation of the Kellan gestalt, which will hopefully serve to horrify him only slightly.</p> <h3 id="embrace-boredom">Embrace Boredom.</h3> <p>Let’s say every company gets about three innovation tokens. You can spend these however you want, but the supply is fixed for a long while. You might get a few more <em>after</em> you achieve a <a href="http://rc3.org/2015/03/24/the-pleasure-of-building-big-things/">certain level of stability and maturity</a>, but the general tendency is to overestimate the contents of your wallet. Clearly this model is approximate, but I think it helps.</p> <p>If you choose to write your website in NodeJS, you just spent one of your innovation tokens. If you choose to use <a href="/why-mongodb-never-worked-out-at-etsy">MongoDB</a>, you just spent one of your innovation tokens. If you choose to use <a href="https://consul.io/">service discovery tech that’s existed for a year or less</a>, you just spent one of your innovation tokens. If you choose to write your own database, oh god, you’re in trouble.</p> <p>Any of those choices might be sensible if you’re a javascript consultancy, or a database company. But you’re probably not. 
You’re probably working for a company that is at least ostensibly <a href="https://www.etsy.com">rethinking global commerce</a> or <a href="https://stripe.com">reinventing payments on the web</a> or pursuing some other suitably epic mission. In that context, devoting any of your limited attention to innovating ssh is an excellent way to fail. Or at best, delay success <a ref="#f1" href="#f1" class="footnote">[1]</a>.</p> <p>What counts as boring? That’s a little tricky. “Boring” should not be conflated with “bad.” There is technology out there that is both boring and bad <a ref="#f2" href="#f2" class="footnote">[2]</a>. You should not use any of that. But there are many choices of technology that are boring and good, or at least good enough. MySQL is boring. Postgres is boring. PHP is boring. Python is boring. Memcached is boring. Squid is boring. Cron is boring.</p> <p>The nice thing about boringness (so constrained) is that the capabilities of these things are well understood. But more importantly, their failure modes are well understood. Anyone who knows me well will understand that it’s only with an overwhelming sense of malaise that I now invoke the spectre of Don Rumsfeld, but I must.</p> <figure> <img src="http://i.imgur.com/n8ElWr3.jpg" /> <figcaption>To be clear, fuck this guy.</figcaption> </figure> <p>When choosing technology, you have both known unknowns and unknown unknowns <a ref="#f3" href="#f3" class="footnote">[3]</a>.</p> <ul> <li>A known unknown is something like: <em>we don’t know what happens when this database hits 100% CPU.</em></li> <li>An unknown unknown is something like: <em>geez it didn’t even occur to us that <a href="http://www.evanjones.ca/jvm-mmap-pause.html">writing stats would cause GC pauses</a>.</em></li> </ul> <p>Both sets are typically non-empty, even for tech that’s existed for decades. 
But for shiny new technology the magnitude of unknown unknowns is significantly larger, and this is important.</p> <h3 id="optimize-globally">Optimize Globally.</h3> <p>I unapologetically think a bias in favor of boring technology is a good thing, but it’s not the only factor that needs to be considered. Technology choices don’t happen in isolation. They have a scope that touches your entire team, organization, and the system that emerges from the sum total of your choices.</p> <p>Adding technology to your company comes with a cost. As an abstract statement this is obvious: if we’re already using Ruby, adding Python to the mix doesn’t feel sensible because the resulting complexity would outweigh Python’s marginal utility. But somehow when we’re talking about Python and Scala or MySQL and Redis people <a href="http://martinfowler.com/bliki/PolyglotPersistence.html">lose their minds</a>, discard all constraints, and start raving about using the best tool for the job.</p> <p><a href="https://twitter.com/coda/status/580531932393504768">Your function in a nutshell</a> is to map business problems onto a solution space that involves choices of software. 
If the choices of software were truly without baggage, you could indeed pick a whole mess of locally-the-best tools for your assortment of problems.</p> <figure> <svg width="423px" height="420px" viewBox="0 0 423 420" version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:sketch="http://www.bohemiancoding.com/sketch/ns"> <!-- Generator: Sketch 3.2.2 (9983) - http://www.bohemiancoding.com/sketch --> <title>Crazy</title> <desc>Created with Sketch.</desc> <defs></defs> <g id="Page-1" stroke="none" stroke-width="1" fill="none" fill-rule="evenodd" sketch:type="MSPage"> <g id="Crazy" sketch:type="MSLayerGroup" transform="translate(1.000000, -4.000000)"> <ellipse id="Solutions" stroke="#979797" sketch:type="MSShapeGroup" cx="341.5" cy="229.5" rx="79.5" ry="193.5"></ellipse> <ellipse id="Problems" stroke="#979797" sketch:type="MSShapeGroup" cx="79.5" cy="229.5" rx="79.5" ry="193.5"></ellipse> <g id="arrows" transform="translate(45.000000, 77.000000)" stroke="#D0011B" stroke-width="3" fill="#D0011B" stroke-linecap="square"> <path d="M19.5,26.5 L255.502121,26.5" id="Line" sketch:type="MSShapeGroup"></path> <path id="Line-decoration-1" d="M255.5,26.5 C251.72,25.45 248.48,24.55 244.7,23.5 C244.7,25.6 244.7,27.4 244.7,29.5 C248.48,28.45 251.72,27.55 255.5,26.5 C255.5,26.5 255.5,26.5 255.5,26.5 Z"></path> <path d="M19.5,26.5 L245.5,84.5" id="Line-2" sketch:type="MSShapeGroup"></path> <path id="Line-2-decoration-1" d="M245.186355,84.419507 C241.786016,82.4628271 238.87144,80.7856729 235.471101,78.8289931 C234.94908,80.8630761 234.501633,82.6065758 233.979612,84.6406589 C237.901972,84.5632557 241.263995,84.4969101 245.186355,84.419507 C245.186355,84.419507 245.186355,84.419507 245.186355,84.419507 Z"></path> <path d="M19.5,26.5 L299.5,0.5" id="Line-3" sketch:type="MSShapeGroup"></path> <path id="Line-3-decoration-1" d="M299.296324,0.518912741 C295.435434,-0.177093062 292.126099,-0.773669465 288.265208,-1.46967527 C288.459373,0.621329291 
288.6258,2.41361891 288.819965,4.50462347 C292.486691,3.10962472 295.629598,1.9139115 299.296324,0.518912741 C299.296324,0.518912741 299.296324,0.518912741 299.296324,0.518912741 Z"></path> <path d="M19.5,26.5 L255.502121,26.5" id="Line-4" sketch:type="MSShapeGroup"></path> <path id="Line-4-decoration-1" d="M255.5,26.5 C251.72,25.45 248.48,24.55 244.7,23.5 C244.7,25.6 244.7,27.4 244.7,29.5 C248.48,28.45 251.72,27.55 255.5,26.5 C255.5,26.5 255.5,26.5 255.5,26.5 Z"></path> <path d="M63.5,79.5 L256.5,34.5" id="Line-5" sketch:type="MSShapeGroup"></path> <path id="Line-5-decoration-1" d="M256.327927,34.5401208 C252.408243,34.3758734 249.048513,34.2350899 245.128829,34.0708426 C245.605677,36.1159872 246.014403,37.8689684 246.49125,39.9141131 C249.934087,38.0332157 252.88509,36.4210181 256.327927,34.5401208 C256.327927,34.5401208 256.327927,34.5401208 256.327927,34.5401208 Z"></path> <path d="M63.5,79.5 L301.5,116.5" id="Line-6" sketch:type="MSShapeGroup"></path> <path id="Line-6-decoration-1" d="M300.651315,116.368062 C297.077479,114.749853 294.014192,113.362816 290.440356,111.744607 C290.117761,113.819681 289.84125,115.598316 289.518655,117.67339 C293.415086,117.216525 296.754884,116.824927 300.651315,116.368062 C300.651315,116.368062 300.651315,116.368062 300.651315,116.368062 Z"></path> <path d="M63.5,79.5 L254.5,209.5" id="Line-7" sketch:type="MSShapeGroup"></path> <path id="Line-7-decoration-1" d="M254.464216,209.475644 C251.930146,206.480751 249.758085,203.9137 247.224014,200.918806 C246.042418,202.654845 245.02962,204.142878 243.848024,205.878916 C247.563691,207.137771 250.748549,208.216789 254.464216,209.475644 C254.464216,209.475644 254.464216,209.475644 254.464216,209.475644 Z"></path> <path d="M0.5,115.5 L251.5,216.5" id="Line-8" sketch:type="MSShapeGroup"></path> <path id="Line-8-decoration-1" d="M250.981706,216.291443 C247.866929,213.906268 245.19712,211.861831 242.082342,209.476656 C241.298409,211.424847 240.626466,213.094725 239.842533,215.042916 
C243.741243,215.4799 247.082995,215.854459 250.981706,216.291443 C250.981706,216.291443 250.981706,216.291443 250.981706,216.291443 Z"></path> <path d="M54.5,176.5 L300.5,193.5" id="Line-10" sketch:type="MSShapeGroup"></path> <path id="Line-10-decoration-1" d="M299.914697,193.459552 C296.216079,192.151452 293.045835,191.030224 289.347217,189.722124 C289.202441,191.817128 289.078346,193.612845 288.93357,195.707849 C292.776964,194.920945 296.071303,194.246456 299.914697,193.459552 C299.914697,193.459552 299.914697,193.459552 299.914697,193.459552 Z"></path> <path d="M54.5,176.5 L288.5,273.5" id="Line-11" sketch:type="MSShapeGroup"></path> <path id="Line-11-decoration-1" d="M288.215373,273.382013 C285.125578,270.964562 282.477183,268.892461 279.387389,266.47501 C278.58323,268.41494 277.89395,270.077737 277.089791,272.017667 C280.983745,272.495188 284.321419,272.904492 288.215373,273.382013 C288.215373,273.382013 288.215373,273.382013 288.215373,273.382013 Z"></path> <path d="M11.5,231.5 L287.5,283.5" id="Line-12" sketch:type="MSShapeGroup"></path> <path id="Line-12-decoration-1" d="M286.658962,283.341544 C283.138722,281.609837 280.121373,280.125516 276.601133,278.393809 C276.212321,280.457502 275.879054,282.226381 275.490243,284.290073 C279.399294,283.958088 282.74991,283.673529 286.658962,283.341544 C286.658962,283.341544 286.658962,283.341544 286.658962,283.341544 Z"></path> <path d="M11.5,231.5 L249.5,223.5" id="Line-13" sketch:type="MSShapeGroup"></path> <path id="Line-13-decoration-1" d="M249.36566,223.504516 C245.552519,222.582095 242.284113,221.79145 238.470973,220.869029 C238.541521,222.967844 238.601991,224.766828 238.67254,226.865643 C242.415132,225.689248 245.623068,224.68091 249.36566,223.504516 C249.36566,223.504516 249.36566,223.504516 249.36566,223.504516 Z"></path> <path d="M0.5,115.5 L248.5,156.5" id="Line-9" sketch:type="MSShapeGroup"></path> <path id="Line-9-decoration-1" d="M248.138638,156.440259 C244.580524,154.78777 241.530711,153.371351 
237.972596,151.718862 C237.630068,153.790739 237.336473,155.566633 236.993945,157.63851 C240.894588,157.219122 244.237996,156.859647 248.138638,156.440259 C248.138638,156.440259 248.138638,156.440259 248.138638,156.440259 Z"></path> </g> <g id="problems" transform="translate(33.000000, 91.000000)" stroke="#979797" fill="#4990E2" sketch:type="MSShapeGroup"> <circle id="Oval-3" cx="30" cy="14" r="14"></circle> <circle id="Oval-4" cx="74" cy="66" r="14"></circle> <circle id="Oval-5" cx="14" cy="103" r="14"></circle> <circle id="Oval-6" cx="64" cy="163" r="14"></circle> <circle id="Oval-7" cx="23" cy="219" r="14"></circle> </g> <g id="Solutions" transform="translate(293.000000, 68.000000)" stroke="#979797" fill="#7ED321" sketch:type="MSShapeGroup"> <circle id="Oval-8" cx="26" cy="37" r="14"></circle> <circle id="Oval-9" cx="74" cy="69" r="14"></circle> <circle id="Oval-10" cx="14" cy="99" r="14"></circle> <circle id="Oval-11" cx="71" cy="129" r="14"></circle> <circle id="Oval-12" cx="18" cy="168" r="14"></circle> <circle id="Oval-13" cx="71" cy="205" r="14"></circle> <circle id="Oval-14" cx="22" cy="229" r="14"></circle> <circle id="Oval-15" cx="66" cy="14" r="14"></circle> <circle id="Oval-16" cx="58" cy="289" r="14"></circle> </g> <text id="Problems" sketch:type="MSTextLayer" font-family="Lato" font-size="18" font-weight="normal" fill="#000000"> <tspan x="43" y="18">Problems</tspan> </text> <text id="Technical-Solutions" sketch:type="MSTextLayer" font-family="Lato" font-size="18" font-weight="normal" fill="#000000"> <tspan x="262" y="18">Technical Solutions</tspan> </text> </g> </g> </svg> <figcaption>The way you might choose technology in a world where choices are cheap: "pick the right tool for the job."</figcaption> </figure> <p>But of course, the baggage exists. We call the baggage “operations” and to a lesser extent “cognitive overhead.” You have to monitor the thing. You have to figure out unit tests. You need to know the first thing about it to hack on it. 
You need an init script. I could go on for days here, and all of this adds up fast.</p> <figure> <svg width="423px" height="420px" viewBox="0 0 423 420" version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:sketch="http://www.bohemiancoding.com/sketch/ns"> <!-- Generator: Sketch 3.2.2 (9983) - http://www.bohemiancoding.com/sketch --> <title>Sane</title> <desc>Created with Sketch.</desc> <defs></defs> <g id="Page-1" stroke="none" stroke-width="1" fill="none" fill-rule="evenodd" sketch:type="MSPage"> <g id="Sane" sketch:type="MSLayerGroup" transform="translate(1.000000, -4.000000)"> <ellipse id="Solutions-3" stroke="#979797" sketch:type="MSShapeGroup" cx="341.5" cy="229.5" rx="79.5" ry="193.5"></ellipse> <ellipse id="Problems-2" stroke="#979797" sketch:type="MSShapeGroup" cx="79.5" cy="229.5" rx="79.5" ry="193.5"></ellipse> <g id="arrows" transform="translate(51.000000, 102.000000)" stroke="#D0011B" stroke-width="3" fill="#D0011B" stroke-linecap="square"> <path d="M13.5,1.5 L249.5,1.5" id="Line-14" sketch:type="MSShapeGroup"></path> <path id="Line-14-decoration-1" d="M249.5,1.5 C245.72,0.45 242.48,-0.45 238.7,-1.5 C238.7,0.6 238.7,2.4 238.7,4.5 C242.48,3.45 245.72,2.55 249.5,1.5 C249.5,1.5 249.5,1.5 249.5,1.5 Z"></path> <path d="M13.5,1.5 L248.5,120.5" id="Line-15" sketch:type="MSShapeGroup"></path> <path id="Line-15-decoration-1" d="M248.132239,120.313772 C245.23431,117.669362 242.75037,115.402724 239.852441,112.758314 C238.903738,114.631803 238.090564,116.237651 237.141861,118.111141 C240.988493,118.882062 244.285607,119.542851 248.132239,120.313772 C248.132239,120.313772 248.132239,120.313772 248.132239,120.313772 Z"></path> <path d="M57.5,54.5 L249.5,8.5" id="Line-17" sketch:type="MSShapeGroup"></path> <path id="Line-17-decoration-1" d="M249.078398,8.6010088 C245.157787,8.46060711 241.797264,8.34026282 237.876654,8.19986114 C238.365932,10.2420674 238.785314,11.9925299 239.274592,14.0347362 C242.705924,12.1329316 
245.647066,10.5028134 249.078398,8.6010088 C249.078398,8.6010088 249.078398,8.6010088 249.078398,8.6010088 Z"></path> <path d="M0.5,92.5 L240.5,137.5" id="Line-20" sketch:type="MSShapeGroup"></path> <path id="Line-20-decoration-1" d="M240.320814,137.466403 C236.79906,135.737776 233.780414,134.256096 230.25866,132.52747 C229.871654,134.591501 229.539934,136.360671 229.152928,138.424703 C233.061688,138.089298 236.412054,137.801808 240.320814,137.466403 C240.320814,137.466403 240.320814,137.466403 240.320814,137.466403 Z"></path> <path d="M57.5,52.5 L242.5,129.5" id="Line-18" sketch:type="MSShapeGroup"></path> <path id="Line-18-decoration-1" d="M242.1449,129.352202 C239.058585,126.930309 236.413173,124.854402 233.326858,122.432509 C232.51991,124.371281 231.828241,126.033085 231.021292,127.971856 C234.914555,128.454977 238.251637,128.869081 242.1449,129.352202 C242.1449,129.352202 242.1449,129.352202 242.1449,129.352202 Z"></path> <path d="M13.5,1.5 L248.5,183.5" id="Line-16" sketch:type="MSShapeGroup"></path> <path id="Line-16-decoration-1" d="M248.313733,183.355742 C245.968119,180.211065 243.957592,177.515627 241.611978,174.37095 C240.32613,176.031249 239.223974,177.454363 237.938125,179.114662 C241.569588,180.59904 244.68227,181.871364 248.313733,183.355742 C248.313733,183.355742 248.313733,183.355742 248.313733,183.355742 Z"></path> <path d="M0.5,92.5 L253.5,15.5" id="Line-19" sketch:type="MSShapeGroup"></path> <path id="Line-19-decoration-1" d="M253.061904,15.6333334 C249.139957,15.7294168 245.778289,15.8117739 241.856342,15.9078572 C242.467781,17.9168724 242.991872,19.6388854 243.603311,21.6479005 C246.913819,19.542802 249.751397,17.7384319 253.061904,15.6333334 C253.061904,15.6333334 253.061904,15.6333334 253.061904,15.6333334 Z"></path> <path d="M0.5,92.5 L244.5,191.5" id="Line-21" sketch:type="MSShapeGroup"></path> <path id="Line-21-decoration-1" d="M244.204221,191.379991 C241.09632,188.985863 238.432405,186.933753 235.324504,184.539624 C234.534968,186.485551 
233.858223,188.153489 233.068687,190.099416 C236.966124,190.547618 240.306784,190.93179 244.204221,191.379991 C244.204221,191.379991 244.204221,191.379991 244.204221,191.379991 Z"></path> <path d="M49.5,150.5 L258.5,19.5" id="Line-22" sketch:type="MSShapeGroup"></path> <path id="Line-22-decoration-1" d="M257.939322,19.8514296 C254.178828,20.9692764 250.955547,21.9274308 247.195052,23.0452775 C248.310345,24.8246376 249.26631,26.3498034 250.381603,28.1291635 C253.026805,25.2319566 255.29412,22.7486365 257.939322,19.8514296 C257.939322,19.8514296 257.939322,19.8514296 257.939322,19.8514296 Z"></path> <path d="M3.5,207.5 L265.5,22.5" id="Line-23" sketch:type="MSShapeGroup"></path> <path id="Line-23-decoration-1" d="M264.902063,22.9222075 C261.208605,24.2448071 258.042784,25.378464 254.349327,26.7010636 C255.560618,28.4165147 256.598868,29.8869013 257.81016,31.6023523 C260.292326,28.5643016 262.419897,25.9602582 264.902063,22.9222075 C264.902063,22.9222075 264.902063,22.9222075 264.902063,22.9222075 Z"></path> <path d="M3.5,207.5 L243.5,147.5" id="Line-24" sketch:type="MSShapeGroup"></path> <path id="Line-24-decoration-1" d="M243.125198,147.593701 C239.203396,147.491836 235.841853,147.404523 231.920052,147.302658 C232.429376,149.339957 232.865941,151.086214 233.375265,153.123513 C236.787742,151.188079 239.712721,149.529135 243.125198,147.593701 C243.125198,147.593701 243.125198,147.593701 243.125198,147.593701 Z"></path> <path d="M3.5,207.5 L244.5,201.5" id="Line-25" sketch:type="MSShapeGroup"></path> <path id="Line-25-decoration-1" d="M244.425346,201.501859 C240.620384,200.546263 237.358988,199.72718 233.554026,198.771584 C233.606292,200.870934 233.651091,202.670376 233.703357,204.769726 C237.456053,203.625972 240.67265,202.645612 244.425346,201.501859 C244.425346,201.501859 244.425346,201.501859 244.425346,201.501859 Z"></path> </g> <g id="problems-2" transform="translate(33.000000, 91.000000)" stroke="#979797" fill="#4990E2" sketch:type="MSShapeGroup"> <circle 
id="Oval-3" cx="30" cy="14" r="14"></circle> <circle id="Oval-4" cx="74" cy="66" r="14"></circle> <circle id="Oval-5" cx="14" cy="103" r="14"></circle> <circle id="Oval-6" cx="64" cy="163" r="14"></circle> <circle id="Oval-7" cx="23" cy="219" r="14"></circle> </g> <g id="Solutions-2" transform="translate(293.000000, 68.000000)" stroke="#979797" fill="#7ED321" sketch:type="MSShapeGroup"> <circle id="Oval-8" cx="26" cy="37" r="14"></circle> <circle id="Oval-9" cx="74" cy="69" r="14"></circle> <circle id="Oval-10" cx="14" cy="99" r="14"></circle> <circle id="Oval-11" cx="71" cy="129" r="14"></circle> <circle id="Oval-12" cx="18" cy="168" r="14"></circle> <circle id="Oval-13" cx="71" cy="205" r="14"></circle> <circle id="Oval-14" cx="22" cy="229" r="14"></circle> <circle id="Oval-15" cx="66" cy="14" r="14"></circle> <circle id="Oval-16" cx="58" cy="289" r="14"></circle> </g> <text id="Problems-3" sketch:type="MSTextLayer" font-family="Lato" font-size="18" font-weight="normal" fill="#000000"> <tspan x="43" y="18">Problems</tspan> </text> <text id="Technical-Solutions-2" sketch:type="MSTextLayer" font-family="Lato" font-size="18" font-weight="normal" fill="#000000"> <tspan x="262" y="18">Technical Solutions</tspan> </text> </g> </g> </svg> <figcaption>The way you choose technology in the world where operations are a serious concern (i.e., "reality"). </figcaption> </figure> <p>The problem with “best tool for the job” thinking is that it takes a myopic view of the words “best” and “job.” Your job is keeping the company in business, god damn it. And the “best” tool is the one that occupies the “least worst” position for as many of your problems as possible.</p> <p>It is basically always the case that the long-term costs of keeping a system working reliably vastly exceed any inconveniences you encounter while building it. 
Mature and productive developers understand this.</p> <h3 id="choose-new-technology-sometimes">Choose New Technology, Sometimes.</h3> <p>Taking this reasoning to its <em>reductio ad absurdum</em> would mean picking Java, and then trying to implement a website without using anything else at all. And that would be crazy. You need some means to add things to your toolbox.</p> <p>An important first step is to acknowledge that this is a process, and a conversation. New tech eventually has company-wide effects, so adding tech is a decision that requires company-wide visibility. Your organizational specifics may force the conversation, or <a href="https://twitter.com/mcfunley/status/578603932949164032">they may facilitate developers adding new databases and queues without talking to anyone</a>. One way or another you have to set cultural expectations that <strong>this is something we all talk about</strong>.</p> <p>One of the most worthwhile exercises I recommend here is to <strong>consider how you would solve your immediate problem without adding anything new</strong>. First, posing this question should detect the situation where the “problem” is that someone really wants to use the technology. If that is the case, you should immediately abort.</p> <figure> <img src="http://i.imgur.com/rmdSx.gif" /> <figcaption>I just watched a webinar about this graph database, we should try it out.</figcaption> </figure> <p>It can be amazing how far a small set of technology choices can go. The answer to this question in practice is almost never “we can’t do it,” it’s usually just somewhere on the spectrum of “well, we could do it, but it would be too hard” <a ref="#f4" href="#f4" class="footnote">[4]</a>. 
If you think you can’t accomplish your goals with what you’ve got now, you are probably just not thinking creatively enough.</p> <p>It’s helpful to <strong>write down exactly what it is about the current stack that makes solving the problem prohibitively expensive and difficult.</strong> This is related to the previous exercise, but it’s subtly different.</p> <p>New technology choices might be purely additive (for example: “we don’t have caching yet, so let’s add memcached”). But they might also overlap or replace things you are already using. If that’s the case, you should <strong>set clear expectations about migrating old functionality to the new system.</strong> The policy should typically be “we’re committed to migrating,” with a proposed timeline. The intention of this step is to keep wreckage at manageable levels, and to avoid proliferating locally-optimal solutions.</p> <p>This process is not daunting, and it’s not much of a hassle. It’s a handful of questions to fill out as homework, followed by a meeting to talk about it. I think that if a new technology (or a new service to be created on your infrastructure) can pass through this gauntlet unscathed, adding it is fine.</p> <h3 id="just-ship">Just Ship.</h3> <p>Polyglot programming is sold with the promise that letting developers choose their own tools with complete freedom will make them more effective at solving problems. This is a naive definition of the problems at best, and motivated reasoning at worst. The weight of day-to-day operational <a href="https://twitter.com/handler">toil</a> this creates crushes you to death.</p> <p>Mindful choice of technology gives engineering minds real freedom: the freedom to <a href="/effective-web-experimentation-as-a-homo-narrans">contemplate bigger questions</a>. Technology for its own sake is snake oil.</p> <p><em>Update, July 27th 2015: I’ve produced a talk based on this article. 
You can see it <a href="/choose-boring-technology-slides">here</a>.</em></p> <hr /> <ol class="footnote-list"> <li><a name="f1"></a>Etsy in its early years suffered from this pretty badly. We hired a bunch of Python programmers and decided that we needed to find something for them to do in Python, and the only thing that came to mind was creating a pointless middle layer that <a href="https://www.youtube.com/watch?v=eenrfm50mXw">required years of effort to amputate</a>. Meanwhile, the 90th percentile search latency was about two minutes. <a href="http://www.sec.gov/Archives/edgar/data/1370637/000119312515077045/d806992ds1.htm">Etsy didn't fail</a>, but it went several years without shipping anything at all. So it took longer to succeed than it needed to. </li> <li><a name="f2"></a>We often casually refer to the boring/bad intersection of doom as "enterprise software," but that terminology may be imprecise. </li> <li><a name="f3"></a>In saying this Rumsfeld was either intentionally or unintentionally alluding to <a href="http://en.wikipedia.org/wiki/I_know_that_I_know_nothing">the Socratic Paradox</a>. Socrates was by all accounts a thoughtful individual in a number of ways that Rumsfeld is not. </li> <li><a name="f4"></a><p>A good example of this from my experience is <a href="https://speakerdeck.com/mcfunley/etsy-activity-feed-architecture">Etsy's activity feeds</a>. When we built this feature, we were working pretty hard to consolidate most of Etsy onto PHP, MySQL, Memcached, and Gearman (a PHP job server). It was much more complicated to implement the feature on that stack than it might have been with something like Redis (or <a href="https://aphyr.com/posts/283-call-me-maybe-redis">maybe not</a>). But it is absolutely possible to build activity feeds on that stack.</p> <p>An amazing thing happened with that project: our attention turned elsewhere for several years. 
During that time, activity feeds scaled up 20x while <em>nobody was watching it at all.</em> We made no changes whatsoever specifically targeted at activity feeds, but everything worked out fine as usage exploded because we were using a shared platform. This is the long-term benefit of restraint in technology choices in a nutshell.</p> <p>This isn't an absolutist position: while storing activity feeds in memcached was judged to be practical, implementing full text search with faceting in raw PHP wasn't. So Etsy used Solr. </p> </li> </ol> Dan McKinley https://mcfunley.com/ Data Driven Products: Lean Startup 2014 2015-01-27T00:00:00+00:00 2015-01-27T00:00:00+00:00 urn:uuid:f99e4b0b-e3c6-2adc-4d38-eccef199f91a <p>Here’s a video of me doing a slightly-amended version of my <a href="/data-driven-products-now">Data Driven Products</a> talk at the <a href="http://leanstartup.co/">Lean Startup Conference</a> back in December.</p> <iframe class="video" src="//www.youtube.com/embed/SZOeV-S-2co?list=PL1M9pu1POlelJcmYWGv_Oq5FPr0J1XKa5" frameborder="0" allowfullscreen=""></iframe> <p>I am told I <a href="http://en.wikipedia.org/wiki/High_rising_terminal">upspeak</a>? You be the judge.</p> Dan McKinley https://mcfunley.com/ Thoughts on the Technical Track 2014-12-09T00:00:00+00:00 2014-12-09T00:00:00+00:00 urn:uuid:4468bf6c-533e-e431-97ad-16ad3a6bad8b <p>I saw <a href="http://lizthedeveloper.com/how-to-reward-skilled-coders-with-something-other-than-people-management">lizTheDeveloper’s post</a> about technical leadership at Simple and I realized that I’ve been meaning to write about this for a while. I hope to persuade you that there are a number of systemic biases working against a healthy technical career path. I don’t think that they’re insurmountable, and I don’t disagree with Liz’s post. 
But I’ve never heard of a company clearing all of these hurdles at once.</p> <p>I was the first person at Etsy with the title of “Principal Engineer,” which was the technical equivalent to a directorship (i.e., one level below CTO). I’m not saying this to toot my own horn, but rather so that it’s understood that the following comes from someone who was the beneficiary of an existing system.</p> <p>(Incidentally, I think Etsy is an example of a company whose heart is in the right place, and it’s not my intention to single them out.)</p> <h3 id="to-review-management-is-a-job">To Review, Management is a Job</h3> <p>My views on the merits of having a technical track align with those of many people in our industry. Management is a different job, with different skills. They’re not necessarily more <em>difficult</em> skills, they’re just <em>different</em>. By and large they’re unrelated to the day-to-day labor of the people who build technology products.</p> <p>It doesn’t make any sense to divert your technical talent into a discipline where they will need to stop doing technical work. (That’s in the event that they intend to be effective managers, which I concede might be an unrealistic expectation.)</p> <p>Other people have made this case, so I’ll just proceed as if we agree that there must be a way forward for people who are great programmers other than to simply graduate into not programming at all.</p> <p>Having that way forward is an ideal. There is always a gap between our ideals and reality, and we cannot act as though we’ve solved a problem simply by articulating it.</p> <p><a href="#asymmetry" name="asymmetry" class="major-section">Fundamental Asymmetries</a></p> <h3 id="management-just-happens">Management Just Happens</h3> <p>I have had management responsibility thrust upon me at least four times over the course of my career, and at no point has that been my goal. It just happens. Do you want to be a manager? 
I will now tell you the secret to becoming a manager in a growing company: <em>just wait.</em></p> <p>You have a manager. Eventually, your manager will accrue too many responsibilities, and they will freak out. They will need somebody to take over some of their reports, and that lucky warm body is you.</p> <figure> <img src="/assets/images/homer-manager.png" /> <figcaption>Good hair: also helpful.</figcaption> </figure> <p>It is entirely plausible to become a manager accidentally. It might even be the norm.</p> <h3 id="technical-track-promotions-are-post-hoc">Technical Track Promotions are Post-Hoc</h3> <p>The process for minting a new manager is: <em>crap, we need another manager</em>. There’s no symmetrical forcing function pushing people into the upper ranks of technical leadership.</p> <p>Mentorship and technical feedback are things everyone does on a functioning engineering team. A technical track “promotion” is merely additional recognition given to someone who is already performing that role notably well.</p> <p>If the job is already getting done, then filling the job is clearly not a pressing need. Technical promotions are something that happen when it’s convenient, which is generally never.</p> <h3 id="stumping">Stumping</h3> <p>Between the founding of the United States and the end of the 19th century, it was considered tacky for presidential candidates to personally campaign for the job. Instead, they staged an elaborate farce in which they reluctantly answered the call of the nation to serve. Trying to intentionally get a promotion into the technical track is pretty much just like this.</p> <figure> <img src="/assets/images/garfield.jpg" /> <figcaption>Getting promoted in the technical track is kind of like being James Garfield.</figcaption> </figure> <p>Your work must be recognized, and this is the rub. 
Let me rephrase: “someone with the power to bestow promotions has to be your fan.” To be promoted you have to be a good mentor, but you also have to worry about playing to an audience. That may be executives, or it may be your peers (and potential competitors). Regardless, you’re running a weird campaign in which actually saying anything directly about wanting the job would be gauche.</p> <p>The most qualified individual contributors may become <em>known</em> without ever really doing this on purpose, but that doesn’t say much for this as a tenable career goal of the sort that can be counted on.</p> <p><a href="#credibility" name="credibility" class="major-section">The Problem of Credibility</a></p> <h3 id="society-applies-to-idealistic-tech-companies-too">Society Applies to Idealistic Tech Companies, Too</h3> <p>American society is not a classless oasis. That’s a lie we tell ourselves. And the person who knows what everyone else gets paid and can fire you is not in your class.</p> <p>A technical job does not have equivalent prestige to a management position with an equivalent salary just because you say it does. Even if you conquer this within your own company, it’s not true in the rest of the industry, and it’s not true in the world at large. In the world our parents live in, it’s a big deal to be somebody else’s boss.</p> <p>You’re hiring people from the world at large all the time. Without continuous effort a technical track decays to its ground state, where the jobs are second class.</p> <h3 id="halfhearted-managers-are-the-worst">Halfhearted Managers are The Worst</h3> <p>The natural result of a system in which technical promotions can’t be counted on and are viewed as suspiciously-maybe-second-class anyway is that people who don’t really give a shit about management wind up going into management. 
Given the choice of waiting for a technical promotion that may never arrive and taking an offer to manage others, almost everyone is going to take the bird in the hand.</p> <figure> <img src="/assets/images/lumberg.jpg" /> <figcaption>Once you let the soulless suspendered lizard in the building, you are screwed.</figcaption> </figure> <p>Managers that have no passion for management are a blight on society. I can say this because I have been one of them. I was never a good manager, and for that I apologize to anyone that ever had to report to me.</p> <p>I am not an isolated case. Many people in management are frankly terrible at it. And they would rather have technical track jobs anyway, but they have no idea how to make the switch. A credible technical track is a great way to ensure a higher level of satisfaction and competency among the <em>managers</em>.</p> <h3 id="ratios-observed-in-the-wild-make-no-sense">Ratios Observed in the Wild Make No Sense</h3> <p>You don’t need to take my reasoning about the intrinsic pressure favoring management bloat at face value. You can actually look at the ratio of managers to technical employees at your company.</p> <p>At one point, I was alone at my level. There were five theoretically-equivalent directors at the time. The ratio was at least that bad on the lower rungs. (I have no idea if this is still true at that company, and it might not be.)</p> <p>For that to make sense, we’d have to believe a few things that don’t stand up to scrutiny. First, we’d have to believe in a very high proclivity among engineers to manage, and I think that betrays our expectations. 
Not very many of us got into this business with the hope of not actually building things.</p> <p>Second, we’d have to believe that although it took five directors to effectively manage the organization, only one technical leader was required to advise the same group on the details of the work they do every day.</p> <p><a name="improvements" href="#improvements" class="major-section">What Might Help?</a></p> <h3 id="promotions-should-not-be-miraculous-and-rare">Promotions Should Not Be Miraculous and Rare</h3> <p>Of course, it wouldn’t make logical sense to say that the ratio of individual contributors to managers at a given level must be 1:1. I honestly don’t know if 1:2 or 2:1 is closer to correct. The answer is probably contingent, and the relationship might not be linear.</p> <p>But I think it’s important for any company that takes the ideal of having a tenable technical track seriously to put a stake in the ground on this question. It’s hard to build a credible technical track, and we need a baseline to grade ourselves against.</p> <p>I don’t think that proceeding with the assumption that leaders will just naturally emerge produces the best results. Adding a self-imposed quota achieves accountability. It acknowledges the possibility that problems can lie in the system of recognition, and not only in the talents of the people in the pool for promotions.</p> <p><em>“Do we think that we hire smart people here? Yes? Then we should be able to find N of them worthy of promotion for every manager. If we can’t then the problem is most likely to be found in how we’re recognizing people for their work.”</em></p> <p>I know that the word “quota” is <em>verboten</em> for many, and I gleefully await your flames.</p> <h3 id="address-prestige-with-superpowers">Address Prestige with Superpowers</h3> <p>If we think about why managers and technical employees on even salary footing may be perceived to not truly be equals, it comes down to superpowers. 
The managers have special capabilities that the technical employees don’t: hiring, firing, compensation, and the like. Is it possible to give technical employees a different set of superpowers, to address the prestige problem?</p> <p>Maybe. I don’t think that I have seen this done correctly yet. If I had superpowers, they were:</p> <ul> <li>The ability to work on whatever I wanted.</li> <li>The ability to talk to anyone I wanted.</li> </ul> <p>These were indeed powerful, but using them to create positive action was difficult. It would have been easy for me to opt out of projects that I didn’t believe in and to do my own thing. I did often do my own thing. But I also worked on projects that I didn’t believe in, because I knew that opting out was a selfish act. One of my friends would just be forced to work on it in my place, and sometimes leadership is about jumping on grenades.</p> <figure> <img src="/assets/images/dark-knight.jpg" /> <figcaption>I guess there are worse superpowers. For example, the ability to allow oneself to be framed for the good of the city.</figcaption> </figure> <p>Talking to other teams made it possible for me to point out places where resources weren’t intelligently allocated. But this also begat mostly negative actions. “Hey, this isn’t the best way to use these folks,” I’d find myself saying all the time. It was draining, and a bummer.</p> <p>Giving the technical leadership deeper involvement in the planning process could address this. Of course that would involve dragging the technical leadership to meetings, which I admit is tricky.</p> <p><a name="closing" href="#closing" class="major-section">In Closing</a></p> <p>I hope I’ve demonstrated that creating a career path outside of management for technical employees is only the beginning of your problems. It’s a good and necessary step, but it’s not an achievement by itself.</p> <p>I’d love to hear from anyone with better ideas. 
These issues are difficult and I don’t claim to have all of the right answers.</p> Dan McKinley https://mcfunley.com/ Data Driven Products Now! 2014-09-18T00:00:00+00:00 2014-09-18T00:00:00+00:00 urn:uuid:e89c5588-5740-8e4f-715e-0cc2377e0fa9 <p>Back when I was at Etsy, I did a presentation internally about the craft of sizing opportunities. I finally got around to writing a public incarnation of that talk. Here it is:</p> <div class="speakerdeck-container"> <div class="speakerdeck-loading"></div> <script id="data-driven-products-now-deck" async="" class="speakerdeck-embed" data-id="13b6d210211a01327085562b5da4981b" data-ratio="1.0" src="//speakerdeck.com/assets/embed.js"></script> <script>$('#data-driven-products-now-deck').speakerdeck();</script> </div> Dan McKinley https://mcfunley.com/ Manual Delivery 2014-03-10T00:00:00+00:00 2014-03-10T00:00:00+00:00 urn:uuid:c153dff4-755b-8a55-4e30-3150a8fba544 <p>The person on build rotation, or the nightly <em>schlimazel</em> I suppose, went into a hot 5’x8’ closet containing an ancient computer. This happened after everyone else had left, so around 8:30PM. Although in crunch time that was more like 11:30PM. And we were in crunch time at one point for a stretch of a year and a half. “That release left a mark,” my friend Matt used to say. In a halfhearted attempt at fairness to those who will take this post as a grave insult, I’ll concede that my remembrance of these details is the work of The Mark.</p> <p>Anyway, the build happened after quitting time. This guaranteed that if anything went wrong, you were on your own. 
Failure in giving birth to the test build implied that the 20 people in Gurgaon comprising the QA department would show up for work in a matter of hours having nothing to do.</p> <p>You used a tool called “VBBuild.” This was a GUI tool, rumored to be written by Russians:</p> <p><img src="/assets/images/vbbuild.gif" alt="VBBuild" /></p> <p>VBBuild did mysterious COM stuff to create the DLLs that nobody at the time understood properly. It presented you with dozens of popups even when it was working perfectly, and you had to be present to dismiss each of them. The production of executable binary code was all smoke and lasers. And, apparently, popups.</p> <p>Developers wrote code using the more familiar VB6 IDE. The IDE could run interpreted code as an interactive debugger, but it could not produce finished libraries in a particularly repeatable or practical way. So the release compilation was different in many respects from what programmers were doing at their desks. Were there problems that existed in one of these environments but not the other? Yes, sometimes. I recall that we had a single function that weighed in at around 70,000 lines. The IDE would give up and execute this function even if it contained clear syntax errors. That was the kind of discovery which, while exciting, was wasted in solitude somewhere past midnight as you attempted to lex and parse the code for keeps.</p> <figure> <img src="/assets/images/vb6.jpg" alt="VB6" /> <figcaption>Isaiah 2:4: "And he shall displace VB6 in search engine results with a book written by vegans."</figcaption> </figure> <p>Developers weren’t really in the habit of doing complete pulls from source control. And who could blame them, since doing this whitescreened your machine for half an hour. They were also never in any particular hurry to commit, at least until it was time to do the test build. 
As there was no continuous integration at the time, this was the first time in several days that all of the code had been compiled.</p> <p>Often <em>[ed: always]</em> there were compilation errors to be resolved. We were using Visual Sourcesafe, so people could be holding an exclusive lock on files containing the errors. Typically, this problem was addressed by walking around the office an hour before build time and reminding everyone to check their files in. In the event that someone forgot <em>[ed: every time]</em>, there was an administrative process for unlocking locked files. Not everyone had the necessary rights to do this, but happily, I did.</p> <p>By design, the build tried to assume an exclusive lock on all of the code. As a result, nobody could work while the build was in progress. Sometimes, the person performing the build would check all of the files out and not check them back in. So your first act the morning after a build might be to walk over to the build closet and release the source files from their chains.</p> <figure> <img src="/assets/images/vss.gif" alt="Visual Sourcesafe" /> <figcaption>The Visual Sourcesafe documentation strongly advised against its use on a team of more than four programmers, and apparently this was not a joke.</figcaption> </figure> <p>Deployment required dozens of manual steps that I will never be able to remember. When the build was done, you copied DLLs over to the test machines and registered them there. By “copied” I mean that you selected them in an explorer window, pressed “Ctrl-C,” and then pressed “Ctrl-V” to paste them into another. There was no batch script worked out to do this more efficiently. Ok, this is a slight lie. There had <em>been</em> a script, but it was put out to pasture on account of a history of hideous malfunction. And popups. 
On remote machines sometimes, where they could only be dismissed by wind and ghosts.</p> <p>Registration involved connecting to each machine with Remote Desktop and right clicking all the DLLs. You could skip a machine or just one library, and things would be very screwy indeed.</p> <p>The production release, which happened roughly twice a year under ideal conditions, was identical to this but with the added complexity of about eight more servers receiving the build. And we might take the opportunity to add completely new machines, which would not necessarily have the same patch levels for, oh, like 700,000 windows components that were relied upon.</p> <p>Given eight or ten machines, the probability of a mistake on at least one of the servers approached unity. So the days and weeks following a production release were generally spent sussing out all of the minute differences and misconfigurations on the production machines. There would be catastrophic bugs that affected a tiny sliver of requests, under highly specific server conditions, and <em>only if executed on one server out of eight</em>. I was an expert at debugging in disassembly at the time. Upon leaving the job, I thought that this was pretty badass. But in the seven years since–do you know what? It’s never come up.</p> <figure> <img src="/assets/images/sandp.jpg" alt="Nonstandard &amp; poorly reproducible builds is more like it am I right" /> <figcaption>"The code could be <a href="http://www.bloomberg.com/news/2013-02-05/s-p-analyst-joked-of-bringing-down-the-house-ahead-of-collapse.html">structured by cows</a> and we would build it by hand."</figcaption> </figure> <p>At one point I wrote a new script to perform the deployment. It was an abomination of XML to be sure, but it got the job done without all of the popups. I started doing the test build with this with some success and suggested that we use it for the production release. 
This was out of the question, I was told by one of my closer allies in the place. The production release was “too important to use a script.”</p> <p>The operating systems and supporting libraries on the machines were also set up by hand, by a separate team, working from printed notes. The results were similar. This is kind of another story.</p> <p>This all happened in 2003.</p> Dan McKinley https://mcfunley.com/ Scalding at Etsy 2014-03-02T00:00:00+00:00 2014-03-02T00:00:00+00:00 urn:uuid:b2555cf4-db74-983b-cde4-1da747c34460 <p>Here’s a presentation I gave about how Etsy wound up using <a href="https://github.com/twitter/scalding">Scalding</a> for analysis. Given at the <a href="http://www.meetup.com/cascading/">San Francisco Cascading Meetup</a>.</p> <div class="speakerdeck-container"> <div class="speakerdeck-loading"></div> <script id="scalding-at-etsy-deck" async="" class="speakerdeck-embed" data-id="309f7f7083c90131707926064ba69595" data-ratio="1.0" src="//speakerdeck.com/assets/embed.js"></script> <script>$('#scalding-at-etsy-deck').speakerdeck();</script> </div> Dan McKinley https://mcfunley.com/ The Case for Secrecy in Web Experiments 2014-01-16T00:00:00+00:00 2014-01-16T00:00:00+00:00 urn:uuid:b055f44c-2b32-1e5b-a566-4e79beea5e83 <p>For four months ending in early 2011, I worked on a team of six to redesign Etsy’s homepage. I don’t want to overstate the weight of this in the grand scheme of things, but hopes flew high. The new version was to look something like this:</p> <figure> <a href="/assets/images/nhp2010-big.png"><img src="/assets/images/nhp2010-big.png" /></a> </figure> <p>There were a number of methodological problems with this, one of our very first web experiments. Our statistics muscles were out of practice, and we had a very difficult time <a href="/whom-the-gods-would-destroy-they-first-give-real-time-analytics">fighting the forces of darkness who wanted to enact radical redesigns after five minutes of real-time data</a>. 
We had no toolchain for running experiments to speak of. The nascent analytics pipeline jobs failed every single night.</p> <p>But perhaps worst of all, we publicized the experiment. Well, “publicized” does not accurately convey the magnitude of what we did. We allowed visitors to join the treatment group using a magic URL. We proactively told our most engaged users about this. We tweeted the magic URL from the <a href="http://www.twitter.com/etsy">@Etsy account</a>, which at that point had well over a million followers.</p> <figure> <a href="http://www.etsy.com/teams/7718/questions/discuss/6848711/page/3?post_id=60817018"><img src="/assets/images/nhp-forum-post.png" alt="The magic URL was chosen to celebrate the CEO&apos;s 31st birthday." /></a> <figcaption>The magic URL was chosen to celebrate the CEO's 31st birthday. None of this was Juliet's fault.</figcaption> </figure> <p>This project was a disaster for many reasons. Nearly all of the core hypotheses turned out to be completely wrong. The work was thrown out as a total loss. Everyone involved learned valuable life lessons. I am here today to elaborate on one of these: <em>telling users about the experiment as it was running was a big mistake.</em></p> <h3 id="the-diamond-forging-pressure-to-disclose-experiments">The Diamond-Forging Pressure to Disclose Experiments</h3> <p>If you operate a website with an active community, and you do A/B testing, you might feel some pressure to disclose your work. And this seems like a proper thing to do, if your users are invested in your site in any serious way. 
They may notice anyway, and the <a href="http://instagram.com/p/f3HLODBQdH/">most common reaction to change on a beloved site</a> tends to be varying degrees of panic.</p> <figure> <a href="http://www.businessinsider.com/mark-zuckerberg-joins-facebook-group-i-automatically-hate-the-new-facebook-home-page-2009-10"><img alt="If you can&apos;t beat &apos;em, join &apos;em" class="thinborder" src="/assets/images/mz-story.png" /></a> <figcaption>"If you can't beat 'em, join 'em."</figcaption> </figure> <p>As an honest administrator, your wish is to reassure your community that you have their best interest at heart. Transparency is the best policy!</p> <p>Except in this case. I think there’s a strong argument to be made against announcing the details of active experiments. It turns out to be easier for motivated users to overturn your experiment than you may believe. And disclosing experiments is work, and work that comes before real data should be minimized.</p> <h3 id="online-protests-not-necessarily-a-waste-of-time">Online Protests: Not Necessarily A Waste of Time</h3> <p>A fundamental reason that you should not publicize your A/B tests is that this can introduce <a href="http://en.wikipedia.org/wiki/Bias_(statistics)">bias</a> that can affect your measurements. This can even overturn your results. There are many different ways for this to play out.</p> <p>Most directly, motivated users can just perform positive actions on the site if they believe that they are in their preferred experiment bucket. Even if the control and treatment groups are very large, the number of people completing a goal metric (such as purchasing) may be just a fraction of that. And the anticipated difference between any two treatments might be slight. 
It’s not hard to imagine how a small group of people could determine an outcome if they knew exactly what to do.</p> <figure> <table class="table table-striped desktop"> <tr> <th>Group</th> <th>Visits</th> <th>Conversions (organic)</th> <th>Conversions (gamed)</th> <th>Proportion</th> </tr> <tr> <td>Control</td> <td>10000</td> <td>50</td> <td class="negative attention">10</td> <td class="positive">0.0060</td> </tr> <tr> <td>New</td> <td>10000</td> <td>55</td> <td>0</td> <td class="negative">0.0055</td> </tr> </table><table class="table table-striped mobile"> <tr> <th>Control</th> <th>New</th> </tr> <tr> <td>10000 visits</td> <td>10000 visits</td> </tr> <tr> <td>50 organic conversions</td> <td>50 organic conversions</td> </tr> <tr> <td class="negative attention">10 gamed conversions</td> <td>0 gamed conversions</td> </tr> <tr> <td class="positive">0.60% converted</td> <td class="negative">0.55% converted</td> </tr> </table> <figcaption>Figure 1: In some cases a small group of motivated users can change an outcome, even if the sample sizes are large.</figcaption> </figure> <p>As the scope and details of an experiment become more fully understood, this gets easier to accomplish. But intentional, organized action is not the only possible source of bias.</p> <p>Even if users have no preference as to which version of a feature wins, some will still be curious. If you announce an experiment, visitors will engage with the feature immediately who otherwise would have stayed away. This well-intentioned interest could ironically make a winning feature appear to be a loss. 
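The arithmetic behind Figure 1 is worth checking by hand. A few lines of Python, using only the numbers from the table above, show how ten gamed conversions flip the result:

```python
# Figure 1's numbers: two arms of 10,000 visits each. A handful of
# gamed conversions in the control group reverses the outcome.

def rate(conversions, visits):
    return conversions / visits

visits = 10_000

# Organic behavior alone: the new treatment wins.
control_organic = rate(50, visits)   # 0.005
new_organic = rate(55, visits)       # 0.0055
assert new_organic > control_organic

# Add just 10 gamed conversions to the control group, and the
# measured winner changes, despite the large sample sizes.
control_measured = rate(50 + 10, visits)  # 0.006
assert control_measured > new_organic
```

The point of the sketch is that the margin between treatments (5 conversions in 10,000 visits) is smaller than the effort a tiny motivated group can supply.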
Here’s an illustration of what that looks like.</p> <figure> <table class="table table-striped desktop"> <tr> <th>Group</th> <th>Visits (oblivious)</th> <th>Visits (rubbernecking)</th> <th>Visits (total)</th> <th>Conversions</th> <th>Proportion</th> </tr> <tr> <td>Control</td> <td>500</td> <td>50</td> <td>550</td> <td>30</td> <td class="positive">0.055</td> </tr> <tr> <td>New</td> <td>500</td> <td class="negative attention">250</td> <td>750</td> <td>35</td> <td class="negative">0.047</td> </tr> </table><table class="table table-striped mobile"> <tr> <th>Control</th> <th>New</th> </tr> <tr> <td>500 oblivious visits</td> <td>500 oblivious visits</td> </tr> <tr> <td>50 rubbernecking visits</td> <td class="negative attention">250 rubbernecking visits</td> </tr> <tr> <td>550 total visits</td> <td>750 total visits</td> </tr> <tr> <td>30 conversions</td> <td>35 conversions</td> </tr> <tr> <td class="positive">5.5% converted</td> <td class="negative">4.7% converted</td> </tr> </table> <figcaption>Figure 2: An example in which 100 engaged users are told about a new experiment. They are all curious and seek out the feature. Those seeing the new treatment visit the new feature more often just to look at it, skewing measurement.</figcaption> </figure> <p>These examples both involve the distortion of numbers on one side of an experiment, but <a href="http://en.wikipedia.org/wiki/Novelty_effect">many other scenarios</a> are possible. Users may change their behavior in either group for <a href="http://en.wikipedia.org/wiki/Hawthorne_effect">no reason other than that they believe they are being measured</a>.</p> <p>Good experimental practice requires that you isolate the intended change as the sole variable being tested. To accomplish this, you randomly assign visitors the new treatment or the old, controlling for all other factors. 
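Figure 2's dilution effect is also plain arithmetic. Here is a short sketch using the table's numbers:

```python
# Figure 2's numbers: 100 curious users are told about the test.
# Their rubbernecking visits pile disproportionately onto the new
# treatment, diluting its measured conversion rate.

control_visits = 500 + 50   # oblivious + rubbernecking
new_visits = 500 + 250

control_rate = 30 / control_visits   # about 5.5%
new_rate = 35 / new_visits           # about 4.7%

# The new treatment converted more visitors in absolute terms
# (35 vs 30), but the curiosity traffic makes it look like a loss.
assert new_rate < control_rate
```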
Informing visitors that they’re part of an experiment places this central assumption in considerable jeopardy.</p> <h3 id="predicting-bias-is-hard">Predicting Bias is Hard</h3> <p>“But,” you might say, “most users aren’t paying attention to our communiqués.” You may think that you can announce experiments, and only a small group of the most engaged people will notice. This is very likely true. But as I have already shown, the behavior of a small group cannot be dismissed out of hand.</p> <p>Obviously, this varies. There <em>are</em> experiments in which a vocal minority cannot possibly bias results. But determining if this is true for any given experiment in advance is a difficult task. There is roughly one way for an experiment to be conducted correctly, and there are an infinite number of ways for it to be screwed.</p> <p>A/B tests are already complicated: bucketing, data collection, experimental design, <a href="http://www.experimentcalculator.com">experimental power</a>, and analysis are all vulnerable to mistakes. From this point of view, <em>“is it safe to talk about this?”</em> is just another brittle moving part.</p> <h3 id="communication-plans-are-real-work">Communication Plans are Real Work</h3> <p>Something I have come to appreciate over the years is the role of product marketing. I have been involved in many releases for which the act of explaining and gaining acceptance for a new feature constituted the <em>majority</em> of the effort. Launches involve a lot more than pressing a deploy button. This is a big deal.</p> <figure> <iframe class="video" src="//player.vimeo.com/video/27836540?title=0&amp;byline=0&amp;portrait=0&amp;color=ffffff" frameborder="0" webkitallowfullscreen="" mozallowfullscreen="" allowfullscreen=""></iframe> <figcaption>Product marketing: this is serious business.</figcaption> </figure> <p>It also seems to be true that <a href="https://twitter.com/Nat_S">people who are skilled at this kind of work</a> are hard to come by. 
You will be lucky to have a few of them, and this imposes limits on the number of major changes that you can make in any given year.</p> <p>It makes excellent sense to avoid wasting this resource on quite-possibly-fleeting experiments. It will delay their deployment, steal cycles from launches for finished features, and it will do these things in the service of work that may never see the light of day!</p> <p>Users will tend to view any experiment as presaging an imminent release, regardless of your intentions. Therefore, you will need to put together a relatively complete narrative explaining why the changes are positive at the outset. A “minimum viable announcement” probably won’t do. And you will need to execute this without the benefit of quantitative results to bolster your case.</p> <h3 id="your-daily-reminder-that-experiments-fail">Your Daily Reminder that Experiments Fail</h3> <p>Doing data-driven product work really does imply that you will not release changes that don’t meet some quantitative standard. In such an event you might tweak things and start over, or you might give up altogether. Announcing your running experiments is problematic given this reality.</p> <p>Obviously, product costs will be compounded by communication costs. Every time you retool an experiment, you will have to bear the additional weight of updating your community. Adding marginal effort makes it more difficult for humans to behave rationally and objectively. We have a name for this well-known pathology: <a href="http://en.wikipedia.org/wiki/Sunk_costs">the sunk cost fallacy</a>. <em>We’ve put so much into this feature, we can’t just give up on it now.</em></p> <figure> <img src="/assets/images/pillory.jpg" /> <figcaption>The fear of admitting mistakes in public can be motivating.</figcaption> </figure> <p>Announcing experiments also has a way of raising the stakes. 
The prospect of backtracking with your users (and being perceived as admitting a mistake) only makes killing a bad feature less palatable. The last thing you need is additional temptation to delude yourself. You have plenty of this already. The danger of living in public is that it will turn a bad release that should be discarded into an inevitability.</p> <h3 id="consistency-and-expectations">Consistency and Expectations</h3> <p>Let’s say you’ve figured out workarounds for every issue I’ve raised so far. You are still going to want to run experiments that are not publicly declared.</p> <p>Some experiments are inherently controversial or exploratory. It may be perfectly legitimate to try changes that you would never release to learn more about your site. Removing a dearly beloved feature temporarily for half of new registrations is a good example of this. By doing so, you can measure the effect of that feature on lifetime value, and make better decisions with your marketing budget.</p> <p>Other experiments work only when they’re difficult to detect. Search ranking is a high-stakes arms race, and complete transparency can just make it easier for malicious users to gain unfair advantages. It’s likely you’re going to want to run experiments on search ranking without disclosing them.</p> <p>It would be malpractice to give users the expectation that they will always know the state of running experiments. They will not have the complete picture. Leading them to believe otherwise can do more harm to your relationship than just having a consistent policy of remaining silent until features are ready for release.</p> <h3 id="what-can-you-share">What Can You Share?</h3> <p>Sharing too much too soon can doom your A/B tests. But this doesn’t mean that you are doomed to be locked in a steel cage match with your user base over them.</p> <figure> <img src="/assets/images/cagematch.jpg" alt="Forum moderators of the world: good luck." 
/> <figcaption>Forum moderators of the world: good luck.</figcaption> </figure> <p>You can do rigorous, well-controlled experiments and also announce features in advance of their release. You can give people time to acclimate to them. You can let users preview new functionality, and enable them at a slower pace. These practices all relate to <em>how</em> a feature is released, and they are not necessarily in conflict with how you decide <em>which</em> features should be released. It is important to decouple these concerns.</p> <p>You can and should share information about completed experiments. “What happened in the A/B test” should be a regular feature of your release notes. If you really have determined that your new functionality performs better than what it replaces, your users should have this data.</p> <figure> <a href="https://www.etsy.com/teams/7716/announcements/discuss/12732278/page/1"><img src="/assets/images/nlp-announce.png" /></a> <figcaption>Plain-language A/B test results can ease user anxiety in launches.</figcaption> </figure> <p>Counterintuitively, perhaps, trust is also improved by sharing the details of failed experiments. If you only tell users about your victories, they have no reason to believe that you are behaving objectively. Who’s to say that you aren’t just making up your numbers? Showing your scars (as I tried to do with my homepage story above) can serve as a powerful declaration against interest.</p> <h3 id="successful-testing-is-good-stewardship">Successful Testing is Good Stewardship</h3> <p>Your job in product development, very broadly, is to make progress while striking a balance between short and long term concerns.</p> <ul> <li>Users should be as happy as possible in the short term.</li> <li>Your site should continue to exist in the long term.</li> </ul> <p>The best interest of your users is ultimately served by making the correct changes to your product. 
Talking about experiments can break them, leading to both quantitative errors and mistakes of judgment.</p> <p>I firmly believe that A/B tests in any organization should be as free, easy, and cheap as humanly possible. After all, <a href="/testing-to-cull-the-living-flower">running A/B tests is perhaps the only way to know that you’re making the right changes</a>. Disclosing experiments as they are running is a policy that can alleviate some discontent in the short term. But the price of this is making experiments harder to run in the long term, and ultimately making it less likely that measurement will be done at all.</p> <p class="acknowledgements"> Thanks to <a href="http://twitter.com/nellwyn">Nell Thomas</a>, <a href="http://twitter.com/stevemardenfeld">Steve Mardenfeld</a>, and <a href="http://hilaryparker.com/">Dr. Parker</a> for their help on this. </p> Dan McKinley https://mcfunley.com/ Growth Hacker TV Interview 2013-07-01T00:00:00+00:00 2013-07-01T00:00:00+00:00 urn:uuid:39551beb-2559-b828-c7fb-d1b3275391b8 <p><a href="https://www.growthhacker.tv/?v=88&amp;sp=cd76f0ddad7563229d12">Here’s an interview</a> I did with <a href="http://growthhacker.tv/">Growth Hacker TV</a> last week. 
We covered many topics:</p> <ul> <li>What exactly were you smoking when you made <a href="http://www.experimentcalculator.com">experimentcalculator.com</a>?</li> <li>How did Etsy get started running experiments?</li> <li>Do you really hate bandit testing?</li> <li>Do you really think performance never matters?</li> <li>Do you really hate real time analytics?</li> </ul> <p><a href="https://www.growthhacker.tv/?v=88&amp;sp=cd76f0ddad7563229d12">Check it out</a>, for the answer to these and many other thrilling questions.</p> Dan McKinley https://mcfunley.com/ Belated Network World Story 2013-06-12T00:00:00+00:00 2013-06-12T00:00:00+00:00 urn:uuid:5c33b5c5-53ce-362c-b93b-8bff6ab9077f <p>It’s old news by now, but a few months back I was on the cover of Network World along with my colleagues <a href="https://twitter.com/nellwyn">Nellwyn</a>, <a href="https://twitter.com/stevemardenfeld">Steve</a>, and <a href="https://www.facebook.com/dottie.matrix">Dottie</a>.</p> <figure> <a href="http://www.networkworld.com/cgi-bin/mailto/x.cgi?pagetosend=/news/2013/022513-etsy-big-data-266841.html&amp;pagename=/news/2013/022513-etsy-big-data-266841.html&amp;pageurl=http://www.networkworld.com/news/2013/022513-etsy-big-data-266841.html&amp;site=printpage&amp;nsdr=n" alt="Network World / Etsy cover story"><img src="/assets/images/network-world.png" /></a> </figure> <p>If you missed it when it came out, you can read the article <a href="http://www.networkworld.com/cgi-bin/mailto/x.cgi?pagetosend=/news/2013/022513-etsy-big-data-266841.html&amp;pagename=/news/2013/022513-etsy-big-data-266841.html&amp;pageurl=http://www.networkworld.com/news/2013/022513-etsy-big-data-266841.html&amp;site=printpage&amp;nsdr=n">here</a>.</p> Dan McKinley https://mcfunley.com/ My Magnum Opus (Reconsidered) 2013-06-10T00:00:00+00:00 2013-06-10T00:00:00+00:00 urn:uuid:5d336b49-36a3-bbb9-5b5a-90ee595b5b50 <p>I have been very fortunate in a number of respects. 
I have access to the twisted and talented mind of <a href="https://twitter.com/ericbeug">Eric Beug</a>. And not only do I have a broad mandate to behave like a lunatic, but I also have dozens of like-minded coworkers. I got paid to make this. That makes me a professional actor.</p> <iframe class="video" src="http://player.vimeo.com/video/63440604?title=0&amp;byline=0&amp;portrait=0&amp;color=ffffff" frameborder="0" webkitallowfullscreen="" mozallowfullscreen="" allowfullscreen=""></iframe> <p><br /> <a href="/my-magnum-opus">You may remember my slightly-lower-budget debut</a>.</p> Dan McKinley https://mcfunley.com/ How Long Should You Run Experiments? 2013-05-13T00:00:00+00:00 2013-05-13T00:00:00+00:00 urn:uuid:800e7e89-a3e4-c8f5-660e-c09915585a2f <p>The question of how long an A/B test needs to run comes up all the time. And the answer is that it really depends. It depends on how much traffic you have, on how you divide it up, on the base rates of the metrics you’re trying to change, and on how much you manage to change them. It also depends on what you deem to be acceptable rates for Type I and Type II errors.</p> <p>In the face of this complexity, community concerns (“we don’t want too many people to see this until we’re sure about it”) and scheduling concerns (“we’d like to release this week”) can dominate. But this can be setting yourself up for failure, by embarking on experiments that have little chance of detecting positive or negative changes. Sometimes adjustments can be made to avoid this. 
And sometimes adjustments aren’t possible.</p> <figure> <img src="/assets/images/played-yourself.jpg" alt="&quot;You ran an A/B test at one percent for a week&quot; - the seldom-heard, missing verse of &quot;You Played Yourself&quot;" /> <figcaption>"You ran an A/B test at one percent for a week" - the seldom-heard, missing verse of <em>You Played Yourself</em>.</figcaption> </figure> <p>To help with this, I built a tool that will let you play around with all of the inputs. You can find it here:</p> <h3 class="prominent-url"><a href="http://www.experimentcalculator.com">http://www.experimentcalculator.com</a></h3> <p>Here’s an example of what you might see using this tool:</p> <figure> <a href="http://www.experimentcalculator.com/#lift=2&amp;conversion=6.5&amp;visits=380"> <img src="/assets/images/excalc-example.png" alt="You can probably go ahead and not test this one. Or hey maybe this isn't worth the time." /> </a> </figure> <p>The source code <a href="https://github.com/mcfunley/experiment-model">is available on github here</a>. The sample size estimate in use is the one described by <a href="http://www.bios.unc.edu/~mhudgens/bios/662/2008fall/casagrande.pdf">Casagrande, Pike and Smith</a>.</p> <p>The following people were all great resources to me in building this: <a href="https://twitter.com/stevemardenfeld">Steve Mardenfeld</a>, <a href="https://twitter.com/jimmybot">James Lee</a>, <a href="https://twitter.com/kimbost">Kim Bost</a>, <a href="https://twitter.com/wzchen">William Chen</a>, <a href="https://twitter.com/paradosso">Roberto Medri</a>, and <a href="https://twitter.com/hirefrank">Frank Harris</a>. <a href="https://twitter.com/peterseibel">Peter Seibel</a> wrote an internal tool a while back that got me thinking about this.</p> Dan McKinley https://mcfunley.com/ The Case Against Bandit Testing 2013-01-24T00:00:00+00:00 2013-01-24T00:00:00+00:00 urn:uuid:27b7890a-6c35-776f-a696-2ee43b4acc20 <p>Many have asked me if Etsy does bandit testing. 
The short answer is that we don’t, and as far as I know nobody is seriously considering changing that anytime soon. This has come up often enough that I should write down my reasoning.</p> <p>First, let me be explicit about terminology. When we do tests at Etsy, they work like this:</p> <ul> <li>We have a fixed number of treatments that might be shown.</li> <li>We assign the weights of the treatments at the outset of the test, and we don’t change them.</li> <li>We pick a sample size ahead of time that makes us likely to notice differences of consequential magnitude.</li> </ul> <p>In addressing “bandit testing,” I’m referring to any strategy that might involve adaptively re-weighting an ongoing test or keeping experiments running for indefinitely long periods of time.</p> <p><a href="http://noelwelsh.com/">Noel Welsh</a> at Untyped has written a high-level overview of bandit testing, <a href="http://untyped.com/untyping/2011/02/11/stop-ab-testing-and-make-out-like-a-bandit/">here</a>. It’s a reasonable introduction to the concept and the problems it addresses, although I view the benefits as more elusive than it suggests. “It is well known in the academic community that A/B testing is significantly sub-optimal,” it says, and I have no reason to doubt that this is true. But as I hope to explain, the domain in which this definition of “sub-optimal” applies is narrowly constrained.</p> <h3 id="gremlins-the-ancient-enemy">Gremlins: The Ancient Enemy</h3> <p>At Etsy, we practice continuous deployment. We don’t do releases in the classical sense of the word. Instead, we push code live a few dozen lines at a time. When we build a replacement for something, it lives beside its antecedent in production code until it’s finished. And when the replacement is ready, we flip a switch to make it live. Cutting and pasting an entire class, making some small modifications, and then ramping it up is not unheard of at all.
Actually, it’s standard practice, the aesthetics of the situation be damned.</p> <p>This methodology is occasionally attacked for its tendency to leave bits of dead code lying around. I think that this criticism is unfair. We <em>do</em> eventually excise dead code, <em>thank you</em>. And all other methods of operating a consumer website are inferior. That said, if you twist my arm and promise to quote me anonymously I will concede that yes, we do have a pretty epic pile of dead code at this point. I’m fine with this, but it’s there.</p> <figure> <img src="/assets/images/cadillac-graveyard.jpg" alt="Continuous delivery: artist's conception" /> <figcaption>Continuous delivery: artist's conception.<span class="photo-credit">Photo credit: <a href="http://www.flickr.com/photos/ben_pollard/">Ben Pollard</a></span></figcaption> </figure> <p>My experience here has revealed what I take to be a fundamental law of nature. Given time, the code in the “off” branch no longer works. Errors in a feature ramped up to small percentages of traffic also have a way of passing unnoticed. For practitioners of continuous deployment, production traffic is the lifeblood of working code. Its denial is quickly mortal.</p> <p><img src="/assets/images/hacker-news-users.png" alt="The Various Species of &quot;Hacker News&quot; Readers" /></p> <p>This relates to the discussion at hand in that bandit testing will ramp the losers of experiments down on its own, and keep them around at low volume indefinitely. The end result is a philosophical conundrum, of sorts. 
<em>Are the losers of experiments losing because they are broken, or are they broken because they are losing?</em></p> <h3 id="accounting-irregularities">Accounting Irregularities</h3> <p>The beauty of Etsy’s A/B testing infrastructure lies in its simplicity.</p> <ul> <li>Experiments are initiated with minimal modifications of our config file.</li> <li>Visitors are bucketed based on a single, fixed-width value in a persistent cookie.</li> </ul> <p>One of the advantages of this parsimony is that new tests are “free,” at least in the engineering sense of the word. They’re not free if we are measuring the mass of their combined cognitive overhead. But they are free in that there are no capacity implications of running even hundreds of experiments at once. This is an ideal setup for those of us who maintain that the measurement of our releases ought to be the norm.</p> <p>Bandit testing upsets this situation in an insidious way. As I explained above, once we weight our tests we don’t tweak the proportions later. The reason for this is to maintain the consistency of what visitors are seeing.</p> <p>Imagine the flow of traffic immediately before and after the initiation of an experiment on Etsy’s header. For visitors destined for the new treatment, at first the header looks as it has for several years. Then in their next request, it changes without warning. Should we attribute the behavior of that visitor to the old header or to the new one? Reconciling this is difficult, and in our case we dodge it by throwing out visits that have switched buckets. (We are not even this precise. We just throw out data for the entire <em>day</em> if it’s the start of the experiment.)</p> <p>Bandit testing, in adjusting weights much more aggressively, exacerbates this issue. 
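</p> <p>To make the mechanics concrete, here is a sketch of this style of deterministic, cookie-based bucketing (the hashing scheme and names are illustrative, not Etsy’s actual code):</p>

```python
import hashlib

def bucket(visitor_id, test_name, weights):
    """Assign a visitor to a variant by hashing a persistent cookie value.
    Assignment is a pure function of the cookie and the weights, so no
    server-side state is needed."""
    digest = hashlib.sha256(f"{test_name}:{visitor_id}".encode()).hexdigest()
    point = int(digest, 16) % 100
    cumulative = 0
    for variant, weight in weights:
        cumulative += weight
        if point < cumulative:
            return variant
    return weights[-1][0]

before = [("control", 90), ("new_header", 10)]
after = [("control", 50), ("new_header", 50)]  # re-weighted, as a bandit would

switched = sum(
    bucket(f"v{i}", "header", before) != bucket(f"v{i}", "header", after)
    for i in range(1000)
)
print(f"{switched} of 1000 visitors changed buckets after re-weighting")
```

<p>With fixed weights, every return visit hashes to the same treatment. The moment the weights move, a large slice of returning visitors silently lands in a different variant, which is exactly the inconsistency at issue.</p> <p>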
We would be forced to deal with it in one way or another.</p> <figure> <img src="/assets/images/icarus.jpg" alt="Balancing the minimization of regret with the maximization of test adoption in your organization: artist's conception." /> <figcaption>Balancing the minimization of regret with the maximization of test adoption in your organization: artist's conception.</figcaption> </figure> <p>We could try to establish a rule for what to do with visits that see inconsistent behavior. A universally-applicable heuristic for this is not straightforward. And even if feasible, this approach would necessitate making the analysis more complicated. Increasing complexity in analysis increases the likelihood of it being incorrect.</p> <p>We could continue to ignore visits that see inconsistent behavior. Depending on specifics, this could discard a large amount of data. This decreases the power of the experiment, and undermines its ability to reach a correct conclusion.</p> <p>We could attempt to ensure that visits only ever see one treatment, while re-weighting the test for fresh visitors. This sounds like a great idea, but ruins the notion of tests as an operational free lunch. Test variant membership, for Etsy, is independent across web requests. Introducing dependence brings tradeoffs that developers should be familiar with. We could keep test membership in a larger cookie, but if the cookie gets <em>too</em> large it will increase the number of packets necessary for user requests. We could record test membership on the server, but we would have to build, maintain, and scale that infrastructure. And every time we added an experiment, we would have to ask ourselves if it was really worth the overhead.</p> <h3 id="on-the-ridiculous-expectations-of-runaway-victory">On the Ridiculous Expectations of Runaway Victory</h3> <p>When we release any new feature, it is our hope that it will be a gigantic and undeniable success. 
Sadly (and as I have discussed at length <a href="/testing-to-cull-the-living-flower">before</a>), this is practically never what happens. Successful launches are almost always characterized by an incremental improvement in some metric other than purchases or registrations.</p> <p>Wins in terms of purchases do happen occasionally, and they make life considerably more bearable when they do. But they’re exceedingly rare. What is not rare is the experience of releasing something that makes purchase conversion <em>worse.</em> This turns out to be very easy in an annoyingly asymmetrical way.</p> <p>What we are usually aiming for with our releases is tactical progress on our longer-term strategic goals. Modest gains or even just “not extremely broken” is what we can rationally hope for. Given this background, bandit testing would be wildly inappropriate.</p> <h3 id="regret-approaches-zero">Regret Approaches Zero</h3> <p>Let me point out something that may not be obvious: when we test features on Etsy, we are not typically testing the equivalent of banner advertisements with a limited shelf life. Not that I am suggesting that there is anything wrong with doing so. Nor do I think this is the only scenario in which bandit testing is called for.</p> <p>But new features and the redesign of existing features are different in several important ways. The unlikelihood of purchase or registration conversion wins means that “regret” in the vernacular sense is minimal to begin with, obviating the need for an algorithm that minimizes regret in the technical sense. And the fact that we are building features for the longer term implies that any regret accumulated during the course of an experiment is minor from the perspective of all history. From this vantage point, the elegant simplicity of <em>not bandit testing</em> wins out.</p> <h3 id="in-closing">In Closing</h3> <p>Is bandit testing right for you? I believe it is a question worth asking.
It may be the case that you should (to borrow Noel’s imagery) “join their merry band.” And if so, <em>master, be one of them; it is an honourable kind of thievery.</em></p> <p>In the absence of practical constraints, I have no argument against this. But reality is never lacking in practical constraints.</p> Dan McKinley https://mcfunley.com/ Testing to Cull the Living Flower 2013-01-16T00:00:00+00:00 2013-01-16T00:00:00+00:00 urn:uuid:e94b38d7-bdce-f9b8-be95-26b03cc0c89c <p>I was once oblivious to A/B testing.</p> <p>My first several years out of college were spent building a financial data website. The product, and the company, were run by salesmen. Subscribers paid tens of thousands of dollars per seat to use our software. That entitled them to on-site training and, in some cases, direct input on product decisions. We did giant releases that often required years to complete, and one by one we were ground to bits by long stretches of hundred-hour weeks.</p> <p>Whatever I might think of this as a worthwhile human endeavor generally, as a business model it was on solid footing. And experimental rigor belonged nowhere near it. For one thing, design was completely beside the point: in most cases the users and those making the purchasing decisions weren’t the same people. Purchases were determined by a comparison of our feature set to that of a competitor. The price point implied that training in person would smooth over any usability issues. Eventually, I freaked out and moved to Brooklyn.</p> <p>When I got to Etsy in 2007, experimentation wasn’t something that was done. Although I had some awareness that the consumer web is a different animal, the degree to which this is true was lost on me at the time. So when I found the development model to be the same, I wasn’t appropriately surprised.
In retrospect, I still wouldn’t rank waterfall methodology (with its inherent lack of iteration and measurement) in the top twenty strangest things happening at Etsy in the early days. So it would be really out of place to fault anyone for it.</p> <figure> <iframe class="video" src="http://www.youtube.com/embed/LwHN3lOMCR0" frameborder="0" allowfullscreen=""></iframe> <figcaption>Here is an official video produced by Etsy a few months after your author started work.</figcaption> </figure> <p>So anyway, in my first few years at Etsy the releases went as follows. We would plan something ambitious. We’d spend a lot of time (generally way too long, but that’s another story) building that thing (or some random other thing; again, another story). Eventually it’d be released. We’d talk about the release in our all-hands meeting, at which point there would be applause. We’d move on to other things. Etsy would do generally well, more than doubling in sales year over year. And then after about two years or so we would <em>turn off that feature</em>. And <em>nothing bad would happen</em>.</p> <p>Some discussion about why this was possible is warranted. The short answer is that this could happen because Etsy’s growth was an externality. This is still true today, in 2013. We have somewhere north of 800,000 sellers, thousands of whom are probably attending craft fairs as we speak and promoting themselves. And also, our site. We’re lucky, but any site experiencing growth is probably in a similar situation: there’s a core feature set that is working for you. Cool. This subsidizes anything else you wish to do, and if you aren’t thinking about things very hard you will attribute the growth to whatever you did most recently. It’s easy to declare yourself to be a genius in this situation and call it a day. 
The status quo in our working lives is to confuse effort with progress.</p> <figure> <img src="/assets/images/external-growth.png" alt="Growth at Etsy as an externality" /> <figcaption>An illustration of the problem of product development at Etsy, or generally when growth is an externality. Success and failure both superficially resemble success.</figcaption> </figure> <p>But I had stuck around at Etsy long enough to see behind the curtain. Eventually, the support tickets for celebrated features would reach critical mass, and someone would try to figure out if they were even worth the time. For a shockingly large percentage, the answer to this was “no.” And usually, I had something to do with those features.</p> <p>I had cut my teeth at one job that I considered to be meaningless. And although I viewed Etsy’s work as extremely meaningful, as I still do, I couldn’t suppress the idea that I wasn’t making the most of my labor. Even if the situation allowed for it, I did not want to be deluded about the importance and the effectiveness of my life’s work.</p> <p>Measurement is the way out of this. When growth is an externality, controlled experiments are the only way to distinguish a good release from a bad one. But to measure is to risk overturning the apple cart: it introduces the possibility of work being acknowledged as a regrettable waste of time. (Some personalities you may encounter will not want to test purely for this reason. But not, in my experience, the kind of personalities that wind up being engineers.)</p> <p>Through my own experimentation, I have uncovered a secret that makes this confrontation palatable. Here it is: <em>nearly everything fails</em>. As I have measured the features I’ve built, it’s been humbling to realize how rare it is for them to succeed on the first attempt.
I strongly suspect that this experience is universal, but it is not universally recognized or acknowledged.</p> <p>If someone claims success without measurement from an experiment, odds are pretty good that they are mistaken. Experimentation is the only way to separate reality from the noise, and to learn. And the only way to make progress is to incorporate the presumption of failure into the process.</p> <p>Don’t spend six months building something if you can divide it into smaller, measurable pieces. The six-month version will probably fail. Because <em>everything</em> fails. When it does, you will have six months of changes to untangle if you want to determine which parts work and which parts don’t. Small steps that are validated not to fail and that build on one another are the best way, short of luck, to actually accomplish our highest ambitions.</p> <p>To paraphrase Marx: <em>the demand to give up illusions is the demand to give up the conditions that require illusions</em>. I don’t ask people to test because I want them to see how badly they are failing. I ask them to test so that they can stop failing.</p> Dan McKinley https://mcfunley.com/ Yes! The Deploy Dashboard Graphs “Screwed Users.” 2013-01-11T00:00:00+00:00 2013-01-11T00:00:00+00:00 urn:uuid:71b8191e-6bb0-031d-527e-af7d1d71b1b8 <p>In my <a href="/whom-the-gods-would-destroy-they-first-give-real-time-analytics">post about real-time analysis</a> I shared a screenshot of part of Etsy’s deployment dashboard. This is the dashboard that every engineer watches as he or she pushes code to production. A bunch of alert readers noticed some odd things about it:</p> <p><img src="/assets/images/deploy-dash-wtf.png" alt="Strange doings on Etsy's deployment dashboard" /></p> <p>The screenshot is not doctored, so yes we do graph “Three-Armed Sweaters” and “Screwed Users.” I can explain. In fact, I can give you excruciating detail about it, if you’re interested!
Here goes.</p> <p>“Three-Armed Sweaters” refers to our error pages, which feature one of my favorite drawings in the world. It was done by <a href="http://www.etsy.com/shop/boosterseat">Anda Corrie</a>:</p> <figure> <img src="/assets/images/three-armed-sweater.png" /> <figcaption>Although purely theoretical at first, real versions of the sweater have since been commissioned. These are handed out yearly to the Etsy engineer that <a href="http://bits.blogs.nytimes.com/2012/07/18/one-on-one-chad-dickerson-ceo-of-etsy/">brings the site down in the most spectacular fashion</a>.</figcaption> </figure> <p>So the graph on the dashboard is just counting the number of times this page is shown. But in order to reduce the frequency of false alarms, the graph is <em>actually</em> based on requests to an image beacon hidden on the page. This excludes most crawlers and vulnerability scanners. Those constituencies have a habit of generating thousands of errors when nothing is malfunctioning. But lucky for us, they almost never waste bandwidth on images.</p> <p>Now, there are many reasons why Etsy might not be working, and they don’t all result in our machines serving a sweater page. If our CDN provider can’t reach our production network, it will show an error page of its own instead. In these cases, our infrastructure may not even be seeing the requests. But we can still graph these errors by situating their image beacon on a wholly separate set of web machines.</p> <p>The “screwed users” graph is the union of all of these conditions. So-called, presumably, because all of this nuance is relatively meaningless to outsiders. “Screwed users” also attempts to only count unique visitors over a trailing interval. This has the nice property of causing the screwed users and sweaters graphs to diverge in the event that a single person is generating a lot of errors.
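</p> <p>The trailing-interval uniqueness is the load-bearing trick, and it’s simple enough to sketch. This is a toy model (the real pipeline works off image beacon requests and StatsD, not an in-process structure like this):</p>

```python
from collections import deque

class ErrorCounters:
    """Count every error page hit ("sweaters"), but only unique
    visitors over a trailing window ("screwed users")."""
    def __init__(self, window=60):
        self.window = window
        self.events = deque()  # (timestamp, visitor_id) pairs

    def record(self, ts, visitor_id):
        self.events.append((ts, visitor_id))
        while self.events and self.events[0][0] <= ts - self.window:
            self.events.popleft()  # expire hits outside the window

    def sweaters(self):
        return len(self.events)

    def screwed_users(self):
        return len({visitor for _, visitor in self.events})

c = ErrorCounters(window=60)
for t in range(50):
    c.record(t, "weird-script")  # one visitor erroring constantly
c.record(50, "real-user")
print(c.sweaters(), c.screwed_users())  # 51 hits, but only 2 people
```

<p>One noisy script inflates the first number and barely moves the second, so the two graphs diverge exactly when they should.</p> <p>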
The internet, after all, is full of weird people who occasionally do weird things with scripts and browsers.</p> <figure> <img class="extra-vspace" src="/assets/images/search-issue.png" /> <figcaption>This is what it looks like when many real users are seeing error pages. Both graphs spike in concert. In this case, the dark green vertical line shows a search deploy resolving the issue.</figcaption> </figure> <p>You now know exactly as much as I do about the graphing of web errors in real time. I assume that this is a tiny fraction of the world’s total knowledge pertaining to the graphing of web errors in real time. So you would be ill-advised to claim expert status on the basis of grasping everything I have explained here.</p> <p>By the way, most of the software Etsy uses to produce these graphs is freely available. Here’s <a href="https://github.com/etsy/statsd">StatsD</a> and <a href="https://github.com/etsy/logster">Logster</a>.</p> Dan McKinley https://mcfunley.com/ Whom the Gods Would Destroy, They First Give Real-time Analytics 2013-01-09T00:00:00+00:00 2013-01-09T00:00:00+00:00 urn:uuid:a7bc92d8-d224-d947-5450-2554d23d476c <blockquote class="quotation"> <p>Homer: There's three ways to do things. The right way, the wrong way, and the <em>Max Power</em> way!</p> <p>Bart: Isn't that the wrong way?</p> <p>Homer: Yeah. But faster!</p> <p class="attribution">- "Homer to the Max"</p> </blockquote> <p>Every few months, I try to talk someone down from building a real-time product analytics system. When I’m lucky, I can get to them early.</p> <p>The turnaround time for most of the web analysis done at Etsy is at least 24 hours. This is a ranking source of grousing. Decreasing this interval is periodically raised as a priority, either by engineers itching for a challenge or by others hoping to make decisions more rapidly.
There are companies out there selling instant usage numbers, so why can’t we have them?</p> <p>Here’s an excerpt from a manifesto demanding the construction of such a system. This was written several years ago by an otherwise brilliant individual, whom I respect. I have made a few omissions for brevity.</p> <blockquote> <p><strong><em>We believe in…</em></strong></p> <ol> <li><strong>Timeliness</strong>. I want the data to be at most 5 minutes old. So this is a near-real-time system.</li> <li><strong>Comprehensiveness</strong>. No sampling. Complete data sets.</li> <li><strong>Accuracy</strong> (how precise the data is). Everything should be accurate.</li> <li><strong>Accessibility</strong>. Getting to meaningful data in Google Analytics is awful. To start with it’s all 12 - 24 hours old, and this is a huge impediment to insight &amp; action.</li> <li><strong>Performance</strong>. Most reports / dashboards should render in under 5 seconds.</li> <li><strong>Durability</strong>. Keep all stats for all time. I know this can get rather tough, but it’s just text.</li> </ol> </blockquote> <p>The 23-year-old programmer inside of me is salivating at the idea of building this. The burned out 27-year-old programmer inside of me is busy writing an email about how all of these demands, taken together, probably violate the <a href="http://en.wikipedia.org/wiki/CAP_theorem">CAP theorem</a> somehow and also, hey, did you know that accuracy and precision are different?</p> <p>But the 33-year-old programmer (who has long since beaten those demons into a bloody submission) sees the difficulty as irrelevant at best. Real-time analytics are <em>undesirable</em>. While there are many things wrong with our infrastructure, I would argue that the waiting is not one of those things.</p> <p>Engineers might find this assertion more puzzling than most. 
I am sympathetic to this mindset, and I can understand why engineers are predisposed to see instantaneous A/B statistics as self-evidently positive. We monitor everything about our site in real time. Real-time metrics and graphing are the key to deploying 40 times daily with relative impunity. <a href="http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/">Measure anything, measure everything</a>!</p> <figure> <img src="/assets/images/deploy-dash.png" alt="Part of the deploy dashboard at Etsy. We love up-to-the-minute graphs." /> <figcaption>Part of the deploy dashboard at Etsy. We love up-to-the-minute graphs.</figcaption> </figure> <p>This line of thinking is a trap. It’s important to divorce the concepts of operational metrics and product analytics. Confusing <em>how we do things</em> with <em>how we decide which things to do</em> is a fatal mistake.</p> <p>So what is it that makes product analysis different? Well, there are many ways to screw yourself with real-time analytics. I will endeavor to list a few.</p> <p>The first and most fundamental way is to disregard statistical significance testing entirely. This is a rookie mistake, but it’s one that’s made all of the time. Let’s say you’re testing a text change for a link on your website. Being an impatient person, you decide to do this over the course of an hour. You observe that 20 people in bucket A clicked, but 30 in bucket B clicked. Satisfied, and eager to move on, you choose bucket B. There are probably thousands of people doing this right now, <em>and they’re getting away with it.</em></p> <p>This is a mistake because there’s no measurement of how likely it is that the observation (20 clicks vs. 30 clicks) was due to chance. Suppose that we weren’t measuring text on hyperlinks, but instead we were measuring two quarters to see if there was any difference between the two when flipped. As we flip, we could see a large gap between the number of heads received with either quarter. 
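</p> <p>The check being skipped in the link example is cheap. A sketch, with bucket sizes of 1,000 visitors assumed purely for illustration:</p>

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(clicks_a, n_a, clicks_b, n_b):
    """Two-sided p-value for a difference in click rates, using the
    pooled normal approximation."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# 20 clicks vs. 30 clicks, 1,000 visitors in each bucket:
print(two_proportion_p_value(20, 1000, 30, 1000))  # ~0.15, not significant
```

<p>A difference at least that large shows up around 15% of the time even when the two links perform identically. That is noise, not evidence.</p> <p>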
But since we’re talking about quarters, it’s more natural to suspect that that difference might be due to chance. Significance testing lets us ascertain how likely it is that this is the case.</p> <p>A subtler error is to do significance testing, but to halt the experiment as soon as significance is measured. This is <a href="http://www.evanmiller.org/how-not-to-run-an-ab-test.html">always a bad idea</a>, and the problem is exacerbated by trying to make decisions far too quickly. Funny business with timeframes can coerce most A/B tests into statistical significance.</p> <figure> <img src="/assets/images/real-time-screwed.png" alt="A simulation of flipping two fair coins. In the green regions, the difference in the number of heads is measured to be significant. If we stopped flipping in those regions, we would (incorrectly) decide the coins were different." /> <figcaption>A simulation of flipping two fair coins. In the green regions, the difference in the number of heads is measured to be significant. If we stopped flipping in those regions, we would (incorrectly) decide the coins were different.</figcaption> </figure> <p>Depending on the change that’s being made, making <em>any</em> decision based on a single day of data could be ill-conceived. Even if you think you have plenty of data, it’s not farfetched to imagine that user behavior has its own rhythms. A conspicuous (if basic) example of this is that Etsy sees 30% more orders on Tuesdays than it does on Sundays.</p> <figure> <img class="extra-vspace" src="/assets/images/dod-sales.png" alt="Gratuitous infographic courtesy Brendan Sudol" /> <figcaption>Gratuitous infographic courtesy <a href="http://www.thenitpickster.com/">Brendan Sudol</a>.</figcaption> </figure> <p>While the sale count itself might not skew a random test, user demographics could be different day over day. 
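</p> <p>The coin-flip figure above comes from exactly this kind of exercise. A minimal re-creation (my own sketch, not the original simulation):</p>

```python
import random

def peeking_false_positive_rate(trials=1000, flips=500, threshold=1.96):
    """Flip two fair coins, run a significance test after every flip,
    and stop at the first 'significant' difference. Returns how often
    identical coins get declared different."""
    random.seed(0)
    false_positives = 0
    for _ in range(trials):
        heads_a = heads_b = 0
        for n in range(1, flips + 1):
            heads_a += random.random() < 0.5
            heads_b += random.random() < 0.5
            pooled = (heads_a + heads_b) / (2 * n)
            se = (2 * pooled * (1 - pooled) / n) ** 0.5  # pooled z-test
            if se > 0 and abs(heads_a - heads_b) / n > threshold * se:
                false_positives += 1  # stopped in a "green region"
                break
    return false_positives / trials

print(peeking_false_positive_rate())  # far above the nominal 0.05
```

<p>Testing after every flip gives randomness hundreds of opportunities to clear the bar, so the 5% error rate the threshold is supposed to guarantee gets inflated several times over.</p> <p>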
Or very likely, you could see a major difference in user behavior immediately upon releasing a change, only to watch it evaporate as users learn to use new functionality. Given all of these concerns, the conservative and reasonable stance is to only consider tests that last a few days or more.</p> <p>One could certainly have a real-time analytics system without making any of these mistakes. (To be clear, I find this unlikely. Idle hands stoked by a stream of numbers are the devil’s playthings.) But unless the intention is to make decisions with this data, one might wonder what the purpose of such a system could possibly be. Wasting the effort to erect complexity for which there is no use case is perhaps the worst of all of these possible pitfalls.</p> <p>For all of these reasons I’ve come to view delayed analytics as positive. The turnaround time also imposes a welcome pressure on experimental design. People are more likely to think carefully about how their controls work and how they set up their measurements when there’s no promise of immediate feedback.</p> <p>Real-time web analytics is a seductive concept. It appeals to our desire for instant gratification. But the truth is that there are very few product decisions that can be made in real time, if there are any at all. Analysis is difficult enough already, without attempting to do it at speed.</p> Dan McKinley https://mcfunley.com/