Dan McKinley
Math, Programming, and Minority Reports

You Can't Have a Rollback Button
February 28th, 2017

The internet is a big truck. It’s really hard to drive it backwards.

I’ve worked with deploy systems in the past that have a prominent “rollback” button, or a console incantation with the same effect. The presence of one of these is reassuring, in that you can imagine that if something goes wrong you can quickly get back to safety by undoing your last change.

But the rollback button is a lie. You can’t have a rollback button that’s safe when you’re deploying a running system.

[Image: a buffalo blocking the road in Yellowstone]
The majestic bison is insouciant when monopolizing the push queue, stuck in a debug loop, to the annoyance of his colleagues.

The Old Version does not Exist

The fundamental problem with rolling back to an old version is that web applications are not self-contained, and therefore they do not have versions. They have a current state. The state consists of the application code and everything that it interacts with: databases, caches, browsers, and concurrently-running copies of itself.

[Image: the cover of Niklaus Wirth's Algorithms + Data Structures = Programs]
What they don’t tell you in school is the percentage of your life as a working programmer that will be spent dealing with the “plus” sign.

You can roll back the SHA the webservers are running, but you can’t roll back what they’ve inflicted on everything else in the system. Well, not without a time machine. If you have a time machine, please use the time machine. Otherwise, the remediation has to occur in the direction of the future.

A Demonstration

Contriving an example of a fault that can’t be rolled back is trivial. We can do this by starting with a Python script that emulates a simple read-through cache:

# version1.py
from pymemcache.client.base import Client

c = Client(('localhost', 11211))
db = {'a': 1}

def read_through(k):
    v = c.get(k)
    if not v:
        # let’s pretend this reads from the database.
        v = db[k]
        c.set(k, v)
    return int(v)

print('value: %d' % read_through('a'))

We can verify that this works fine:

$ python version1.py
value: 1

Now let’s consider the case of pushing some bad code over top of it. Here’s an updated version:

# version2.py
from pymemcache.client.base import Client

c = Client(('localhost', 11211))
db = {'a': 1}

def read_through(k):
    v = c.get(k)
    if not v:
        # let’s pretend this reads from the database.
        v = db[k]
        c.set(k, v)
    return int(v)

def write_through(k, val):
    c.set(k, val)
    db[k] = int(val)

# mess up the cache lol
write_through('a', 'x')
print('value: %d' % read_through('a'))

That corrupts the cache, and promptly breaks:

$ python version2.py
ValueError: invalid literal for int() with base 10: 'x'

At this point, red sirens are going off all over the office and support reps are sprinting in the direction of our desks. So we hit the rollback button, and:

$ python version1.py
ValueError: invalid literal for int() with base 10: b'x'

Oh no! It’s still broken! We can’t resolve this problem by rolling back. We’re lucky that in this case nothing has been made worse, but that could just as easily have happened. There’s no guarantee that the path from v1 to v2 and then back to v1 isn’t actively destructive.

A working website can eventually be resurrected by writing some new code to cope with the broken data:

def read_through(k):
    v = c.get(k)
    if not v:
        # let’s pretend this reads from the database.
        v = db[k]
        c.set(k, v)

    try:
        return int(v)
    except ValueError:
        # n.b. we screwed up some of the cached values on $DATE,
        # this remediates
        v = db[k]
        c.set(k, v)
        return int(v)
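
For anyone without a memcached server handy, the whole sequence can be replayed in one process. This is only a sketch: FakeCache is a hypothetical dict-backed stand-in (not part of pymemcache) that returns bytes from get() the way the real client does, and the function names with version suffixes are invented here to keep all three "deploys" in a single file.

```python
# Self-contained replay of the demonstration. FakeCache is a hypothetical
# stand-in for memcached, not part of pymemcache.

class FakeCache:
    """Dict-backed cache whose get() returns bytes, like a memcached client."""

    def __init__(self):
        self._store = {}

    def get(self, k):
        return self._store.get(k)

    def set(self, k, v):
        self._store[k] = str(v).encode()


c = FakeCache()      # shared state: survives every "deploy" below
db = {'a': 1}        # pretend database


def read_through_v1(k):
    v = c.get(k)
    if not v:
        v = db[k]
        c.set(k, v)
    return int(v)


def bad_deploy_v2():
    # The mistake from version2.py, reduced to the part that matters here:
    # a non-numeric value lands in the cache.
    c.set('a', 'x')


def read_through_v3(k):
    v = c.get(k)
    if not v:
        v = db[k]
        c.set(k, v)
    try:
        return int(v)
    except ValueError:
        # Cached value is corrupt: reload from the database and repair the cache.
        v = db[k]
        c.set(k, v)
        return int(v)


def attempt(fn, k):
    """Return the value read, or the string 'ValueError' if the read blew up."""
    try:
        return fn(k)
    except ValueError:
        return 'ValueError'


results = []
results.append(attempt(read_through_v1, 'a'))   # v1 on a clean cache
bad_deploy_v2()                                 # v2 goes out and corrupts the cache
results.append(attempt(read_through_v1, 'a'))   # "rollback" to v1
results.append(attempt(read_through_v1, 'a'))   # pressing rollback again changes nothing
results.append(attempt(read_through_v3, 'a'))   # roll forward with the fix
results.append(attempt(read_through_v1, 'a'))   # v1 works only because v3 repaired the cache
print(results)
```

The code that ran is the same in the first three reads; only the state underneath it changed. The old code starts working again only after newer code has repaired the cache.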

You might dispute the plausibility of a mistake as transparently daft as this. But in my career I’ve carried out conceptually similar acts of cache destruction many times. I’m not saying I’m a great programmer. But then again maybe you aren’t, either.
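The cache example left things no worse after the rollback. The earlier claim that a v1 to v2 to v1 round trip can be actively destructive can be sketched too. Everything below is hypothetical (the record layout, the function names, the sample user): v2 starts storing an extra field, and the rolled-back v1, which rewrites whole records, erases it the next time it saves.

```python
# Hypothetical example: all names here are invented for illustration.
db = {}   # stands in for a shared datastore that outlives any one deploy


def save_profile_v1(user_id, name):
    # v1 only knows about "name" and rewrites the whole record on every save.
    db[user_id] = {'name': name}


def save_profile_v2(user_id, name, email):
    # v2 adds an "email" field to the same record.
    db[user_id] = {'name': name, 'email': email}


save_profile_v1('u1', 'Ada')                     # v1 is live
save_profile_v2('u1', 'Ada', 'ada@example.com')  # v2 ships; a user adds an email
after_v2 = dict(db['u1'])

save_profile_v1('u1', 'Ada L.')                  # rollback: v1 handles a rename
after_rollback = dict(db['u1'])

print(after_v2)
print(after_rollback)
```

After the rollback the record no longer has an email at all. Redeploying v2 later does not bring it back, because the old code quietly destroyed data it did not know existed.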

A Sharp Knife, Whose Handle is also a Knife

Adding a rollback button is not a neutral design choice. It affects the code that gets pushed. If developers incorrectly believe that their mistakes can be quickly reversed, they will tend to take more foolish risks. It might be hard to talk them out of it.

Mounting a rollback button within easy reach (as opposed to git revert, which you probably have to google) means that it’s more likely to be pressed carelessly in an emergency. Panic buttons are for when you’re panicking.

Practice Small Corrections

Pushbutton rollback is a bad idea. The only sensible thing to do is change the way we organize our code for deployment.

Complete deployment rollbacks are high-G maneuvers. The implications of initiating one given a nontrivial set of changes are impossible to reason about. You may decide that one is called for, but you should do this as a last resort.
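As one illustration of what a small correction might look like (an assumption on my part, not something spelled out above), risky code can ship behind a runtime switch, so that the correction is turning off one behavior rather than reverting a whole deploy. FLAGS and both parser functions below are hypothetical names made up for this sketch.

```python
# Hypothetical sketch: FLAGS and both parser functions are invented for illustration.
FLAGS = {'new_price_parser': False}   # in practice this would live in config, not code


def parse_price_old(text):
    # Existing behavior, known to work.
    return int(text)


def parse_price_new(text):
    # New behavior being introduced: tolerate a leading currency symbol.
    return int(text.lstrip('$'))


def parse_price(text):
    # Both code paths are deployed; the flag decides which one runs.
    if FLAGS['new_price_parser']:
        return parse_price_new(text)
    return parse_price_old(text)


FLAGS['new_price_parser'] = True
with_flag_on = parse_price('$12')

FLAGS['new_price_parser'] = False     # the "small correction": one switch, one behavior
with_flag_off = parse_price('12')

print(with_flag_on, with_flag_off)
```

Turning the switch off is no more of a time machine than the rollback button: anything the new path wrote while it was on is still out there. The difference is that the change being reversed is one you can actually reason about.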
