I got paid to make this. That makes me a professional Actor.
Two New Projects
Ahoy! I have two new projects up on github.
- oplogutils - A set of utilities for manipulating the MongoDB oplog.
- etsy-python - A Python library for using the Etsy API.
The Python library is pretty trivial and self-explanatory, but I realized that I had copied it into enough projects that it deserved to be packaged with setuptools.
The rationale for oplogutils is a little involved. Basically, this is what you need in some worst-case recovery scenarios. Those scenarios will be explained in more detail by John Allspaw and myself in upcoming posts on Code as Craft.
MongoDB at Etsy
Please check out my new post over at Code as Craft, MongoDB at Etsy.
There will be at least two more installments in this series, one by John Allspaw and another one by myself.
Java Unix Domain Sockets Library
I have taken over the maintenance of JUDS from Klaus Trainer. You can find it at its new home here.
Scala Parser for PHP Serialized Data
I hope you never need this. But if you do, you can get it from me on github here.
It’s a little craptacular at the moment, but it works. I will maintain and enhance it depending on demand.
Code as Craft
Please take a second to subscribe to the new Etsy engineering blog, Code as Craft. There’s not much there yet except a few introductory posts by Chad, but there is content in the pipeline. My coworkers are uniformly brilliant and amazing and will undoubtedly have many interesting things to say.
I personally will be writing a series of posts about our large-scale flirtations with MongoDB. I also hope to find some time to write about Scala, Buildbot, EC2, Python, Postgres vs. MySQL, and maybe some more abstract topics.
See also: this plug by Fred Wilson.
Programming Must Be An Open System
I decided I needed an hour break from Scala hacking, and I am about halfway finished with Sean Carroll's From Eternity to Here, which goes on at great length about entropy as it relates to time's arrow. So for the fun of it I whipped up a simulation of an Ehrenfest Urn using Processing.js.
Check it out here (requires a browser supporting canvas).
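For anyone who hasn't met the model: an Ehrenfest urn is two containers holding a fixed number of balls, and at each step one ball, chosen uniformly at random, moves to the other container. What follows is a rough Python sketch of that rule, not the Processing.js code behind the demo. The function name, the parameters, and the choice to start with every ball on the left are all assumptions made for illustration.

```python
import random


def ehrenfest(n_balls=100, steps=2000, seed=0):
    """Return the number of balls in the left urn after each step."""
    rng = random.Random(seed)
    left = n_balls  # low-entropy start: every ball is in the left urn
    history = [left]
    for _ in range(steps):
        # Pick one ball uniformly at random. It sits in the left urn with
        # probability left / n_balls; whichever urn it is in, move it.
        if rng.randrange(n_balls) < left:
            left -= 1
        else:
            left += 1
        history.append(left)
    return history


if __name__ == "__main__":
    counts = ehrenfest()
    print("start:", counts[0], "end:", counts[-1])
```

Starting from the lopsided state, the count drifts toward an even split and then just fluctuates around it, which is the arrow-of-time point the book is making.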
Now that I'm done, there are a couple of things I find amazing about this.
- I felt like I needed a break from being paid to write Scala. As opposed to, like, Visual Basic, Java, C++, or something like that.
- Within an hour I was able to download Processing, learn the basics, and hack this animated demo together. And put it on a web page, viewable by pretty much anybody that I care to reach. Back in the early aughts I probably spent fifty or sixty hours just trying to figure out how to step debug JavaScript.
From eternity to here apparently involves a massive improvement in the state of the programming art. Everything is amazing and nobody's happy.
PGProxy: A Testing Proxy for Postgres
I've released a package that I've been using for well over a year now for the purpose of writing functional test suites against applications using Postgres. The executive summary is that PGProxy allows you to write tests that are transactionally isolated from one another, without doing anything special in your application code.
This project is aimed more at functional tests of a website (using something like Selenium) than at unit tests of a single class or module. For unit tests, using mock objects or other strategies is more viable. But you could certainly use PGProxy in those scenarios as well.
You can get it on github here: http://github.com/mcfunley/pgproxy.
PGProxy is written in Python using Twisted, and has its own extensive set of unit and functional tests.
Why You Would Want This
If you have a sufficiently large set of functional tests written by a sufficiently large team, eventually naively-written tests will begin to interfere with one another. A really simple example of two tests that would interfere with one another would be:
- Test A: Tests a workflow for user "steve"
- Test B: Tests deleting the account for user "steve"
Clearly, if we don't take any special precautions in our test suite, Test A will never succeed if it is run after Test B. And if we require that Test A always runs before Test B, then we're forced to recreate our fixture database in between test runs. And this is just two tests; when you're talking about thousands, the potential interactions can be huge. Not to mention that depending on how big your fixture databases are, creation at the start of every run can be a pain.
So eventually you will probably decide it would be better to restore the fixture data to a known state in between test cases. There are a few different ways to try to accomplish this:
1. Restore the fixture database between test cases using something like CREATE DATABASE .. WITH TEMPLATE.
2. Have each test be responsible for undoing whatever it does.
3. Have each test be responsible for creating any data that it's going to use.
4. Don't write tests for user "steve," but rather locate a user in the fixture data that meets the criteria that you need.
5. Make each test work inside of a transaction, and roll the transaction back when the test is completed.
There are problems with all of these approaches. 1) is very slow. 2) is very tedious for developers and error prone. 3) is similarly tedious and can be slow, depending on how much logic creating your entities entails. 4) is really just moving the goalposts, because tests are still going to interfere with each other. 5) works to the extent that you can have all of your database access code share a connection and to the extent that your code does not try to use its own transactions.
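To make the transaction-per-test approach (the last option above) concrete, here is a minimal sketch of the easy case, where everything shares one connection. It uses Python's built-in sqlite3 module as a stand-in for Postgres so that it runs anywhere. The table, the helper names, and the two example cases are all made up for illustration.

```python
import sqlite3


def make_fixture_db():
    # isolation_level=None puts the sqlite3 module in autocommit mode, so
    # the explicit BEGIN / ROLLBACK below is the only transaction control.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE users (username TEXT PRIMARY KEY)")
    conn.execute("INSERT INTO users VALUES ('steve')")
    return conn


def usernames(conn):
    rows = conn.execute("SELECT username FROM users ORDER BY username")
    return [row[0] for row in rows]


def run_isolated(conn, case):
    """Run one test case inside a transaction that is always rolled back."""
    conn.execute("BEGIN")
    try:
        case(conn)
    finally:
        conn.execute("ROLLBACK")


def case_delete_account(conn):
    # "Test B": deletes steve.
    conn.execute("DELETE FROM users WHERE username = 'steve'")
    assert usernames(conn) == []


def case_workflow(conn):
    # "Test A": needs steve to exist.
    assert usernames(conn) == ["steve"]


conn = make_fixture_db()
run_isolated(conn, case_delete_account)  # Test B runs first...
run_isolated(conn, case_workflow)        # ...and Test A still finds steve.
```

This only works because both cases use the same connection and neither issues its own BEGIN or COMMIT, which are exactly the two limitations described above.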
But! I am writing this to tell you about an exciting new option, namely, "do something crazy." PGProxy is that crazy thing, and it works pretty well.
How it Works
As mentioned above, your test case can only work inside of a transaction if you are able to use a single connection per database per test case. If your test case is making calls to multiple processes that all want to use your fixture database, that is a pretty difficult thing to do.
PGProxy solves the problem by, you guessed it, proxying all of your database connections. So if you have a PHP site running in Apache and a Scala service running in Jetty, they can now share database connections in your tests. And consequently, they can share a transaction.

The other issue that comes up in using transactions to make test cases is that it's pretty common for the code that you're testing to want to use its own transactions. PGProxy solves this by rewriting transaction usage within the test case into SAVEPOINT usage. In other words, if you have a test that runs this SQL:
BEGIN;
update users set username='chuck' where username='steve';
COMMIT;
BEGIN;
update users set password='foo' where username='chuck';
ROLLBACK;
PGProxy will rewrite that to this:
BEGIN; -- my test case
SAVEPOINT x;
update users set username='chuck' where username='steve';
RELEASE SAVEPOINT x;
SAVEPOINT y;
update users set password='foo' where username='chuck';
ROLLBACK TO SAVEPOINT y;
ROLLBACK; -- my test case
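The rewriting rule itself can be sketched in a few lines of Python. To be clear, this is not PGProxy's code: the real proxy works on the Postgres wire protocol, while this toy version operates on a list of SQL strings. The function name and the generated savepoint names (sp1, sp2, ...) are assumptions made for illustration. The outer BEGIN and ROLLBACK for the test case are left to the test harness.

```python
import itertools


def rewrite_for_test(statements):
    """Turn an application's own transaction statements into savepoint
    statements so they can run inside one outer, per-test transaction."""
    counter = itertools.count(1)
    current = None  # savepoint standing in for the app's open transaction
    rewritten = []
    for stmt in statements:
        keyword = stmt.strip().rstrip(";").strip().upper()
        if keyword in ("BEGIN", "START TRANSACTION"):
            current = "sp%d" % next(counter)
            rewritten.append("SAVEPOINT %s" % current)
        elif keyword in ("COMMIT", "END") and current is not None:
            rewritten.append("RELEASE SAVEPOINT %s" % current)
            current = None
        elif keyword == "ROLLBACK" and current is not None:
            rewritten.append("ROLLBACK TO SAVEPOINT %s" % current)
            current = None
        else:
            # Ordinary statements (and any unmatched COMMIT or ROLLBACK)
            # pass through unchanged.
            rewritten.append(stmt)
    return rewritten


app_sql = [
    "BEGIN",
    "update users set username='chuck' where username='steve'",
    "COMMIT",
    "BEGIN",
    "update users set password='foo' where username='chuck'",
    "ROLLBACK",
]

if __name__ == "__main__":
    print(";\n".join(rewrite_for_test(app_sql)) + ";")
```

A COMMIT becomes a RELEASE, so the application's changes stay visible for the rest of the test but never outlive the outer ROLLBACK.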
Running the Proxy
There are a few ways to run the proxy. If you are writing your test suite using Python (i.e., using unittest), you can set up your test runner like this:
from __future__ import with_statement
import os
import pgproxy

this_dir = os.path.realpath(os.path.dirname(__file__))
pidfile = os.path.join(this_dir, 'pgproxy.pid')
logfile = os.path.join(this_dir, 'pgproxy.log')

def run():
    # this will shut down the proxy when the tests complete.
    with pgproxy.run(pidfile=pidfile, logfile=logfile):
        run_test_suite()

def run_test_suite():
    # this should actually run your tests
    pass

if __name__ == '__main__':
    run()
Or you can use something like this script to start a standalone pgproxy process:
#! /usr/bin/env python
import pgproxy
import os
this_dir = os.path.realpath(os.path.dirname(__file__))
pidfile = os.path.join(this_dir, 'pgproxy.pid')
logfile = os.path.join(this_dir, 'pgproxy.log')
pgproxy.run(listenPort=5433, serverAddr=('localhost', 5432),
            pidfile=pidfile, logfile=logfile)
In both of these cases, PGProxy is configured to accept connections on port 5433, and to connect to the Postgres server running on port 5432. In these examples you would tell your application to connect to port 5433.
In order to run PGProxy, you need Twisted version 8.1.0 or later.
Test Harness Integration
PGProxy accepts two special queries that signal the start and the end of tests. Your test suite will need to invoke these in setUp and tearDown (or whatever the equivalents are in the language / framework that you're using). Here's a unittest example:
import unittest

class TestCase(unittest.TestCase):
    # self.query is assumed to be a helper of your own that sends a SQL
    # string over a connection to the proxy.
    def setUp(self):
        self.query("BEGIN TEST '%s'" % self._testMethodName)

    def tearDown(self):
        self.query("ROLLBACK TEST '%s'" % self._testMethodName)

    def test_something(self):
        # now this has a transaction and can't do any serious damage
        # to the fixture data.
        pass
As you may have noticed above, the BEGIN/ROLLBACKs are sent to Postgres with comments stating which test is running, which can be pretty handy if you find yourself needing to look at the Postgres logs to debug something. Here you would see:
BEGIN; -- test_something
ROLLBACK; -- test_something
Since setUp and tearDown are frequently overridden by developers for other purposes, I generally like to use a metaclass to wrap test cases in transactions instead. This way if a developer forgets to call the base test case's setUp method, it's no big deal for the rest of the suite. I'll leave that as an exercise. You get the idea.
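For the impatient, one way the metaclass idea could look is sketched below. It is written for Python 3 (note the metaclass= syntax) and is only an illustration: TransactionalMeta, TransactionalTestCase, and the list-recording query helper are made-up names, and a real suite would send the statements to the proxy instead of appending them to a list.

```python
import functools
import unittest


class TransactionalMeta(type):
    """Wrap every test* method between BEGIN TEST and ROLLBACK TEST."""

    def __new__(mcs, name, bases, namespace):
        for attr, value in list(namespace.items()):
            if attr.startswith("test") and callable(value):
                namespace[attr] = mcs._wrap(attr, value)
        return super().__new__(mcs, name, bases, namespace)

    @staticmethod
    def _wrap(test_name, method):
        @functools.wraps(method)
        def wrapper(self, *args, **kwargs):
            self.query("BEGIN TEST '%s'" % test_name)
            try:
                return method(self, *args, **kwargs)
            finally:
                # Runs even when the test fails or raises.
                self.query("ROLLBACK TEST '%s'" % test_name)
        return wrapper


class TransactionalTestCase(unittest.TestCase, metaclass=TransactionalMeta):
    sent = []  # stand-in for a real connection to the proxy

    def query(self, sql):
        # Hypothetical helper: record the statement instead of sending it.
        self.sent.append(sql)


class ExampleTest(TransactionalTestCase):
    def setUp(self):
        # Deliberately does not call the base setUp; the test is wrapped anyway.
        pass

    def test_something(self):
        self.query("update users set username='chuck' where username='steve'")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTest)
result = unittest.TestResult()
suite.run(result)
```

Because the wrapping happens when the class is created, overriding setUp or tearDown in a subclass cannot skip it.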
A Word About Twisted
Lord knows I have a complicated opinion of all things Twisted, and maybe someday I will write something about that. And by "someday" I mean I am almost certainly never going to, because I have enough trouble staying out of nerd fights on the internet.
But I have to say that for this project, with the precise set of requirements that it had, and taking as a given my pre-existing wealth of experience with Twisted, things worked out great. This was a from-scratch rewrite of my first version, which was written using asyncore. The asyncore version was riddled with obscure race conditions, and it turned out to be much easier to just rewrite the damned thing using an event-driven framework than it ever was to debug the original. There was not a better choice of Python framework for this project, though I did toy with the idea of using Scala. Anyway, I hope this praise for Twisted doesn't come across as excessively faint.
The End
Enjoy! Don't hesitate to drop me a line if you find this useful or have bug reports.
Functional vs. Imperative Red-Black Tree Insertion
MarkCC posted an insert algorithm for red-black trees in Haskell yesterday. Now, I don't want to come across as anything resembling an expert here, but I felt honor-bound to raise the point that there is a far simpler algorithm described in Chris Okasaki's 1999 paper Red-Black Trees in a Functional Setting.
It goes basically like this:
data Color = Red | Black deriving (Show, Eq)

data RedBlackTree a =
    Empty | Node Color (RedBlackTree a) a (RedBlackTree a)
    deriving (Show, Eq)

insert :: Ord a => a -> RedBlackTree a -> RedBlackTree a
insert elem t = makeBlack (insert' elem t)
  where makeBlack (Node _ a y b) = Node Black a y b
        insert' elem Empty = Node Red Empty elem Empty
        insert' elem s@(Node color a y b)
          | elem < y = balance color (insert' elem a) y b
          | elem > y = balance color a y (insert' elem b)
          | otherwise = s

balance Black (Node Red (Node Red a x b) y c) z d =
    Node Red (Node Black a x b) y (Node Black c z d)
balance Black (Node Red a x (Node Red b y c)) z d =
    Node Red (Node Black a x b) y (Node Black c z d)
balance Black a x (Node Red (Node Red b y c) z d) =
    Node Red (Node Black a x b) y (Node Black c z d)
balance Black a x (Node Red b y (Node Red c z d)) =
    Node Red (Node Black a x b) y (Node Black c z d)
balance color a x b = Node color a x b
I will not try to replicate the contents of the paper here; it's very short, so I encourage you to go read it. The gist of it is as follows.
The Haskell algorithm is simpler because in functional languages, destructive updates to the tree are impossible. This renders the (relatively complex) optimizations that are commonplace in imperative implementations pointless.
The functional implementation has other advantages, namely: beauty, simplicity, persistence, and all of the other things we choose when we give up pointer manipulation.
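Persistence is easy to see even outside Haskell. Below is a rough Python transliteration of the same insert, using immutable tuples of the form (color, left, value, right) for nodes and None for the empty tree. The representation and the helper names (is_red, to_list, black_height) are choices made for this sketch, not anything from the paper.

```python
RED, BLACK = "R", "B"


def is_red(t):
    return t is not None and t[0] == RED


def balance(color, left, value, right):
    # The four cases: a black node with a red child that has a red child.
    if color == BLACK:
        if is_red(left) and is_red(left[1]):
            _, (_, a, x, b), y, c = left
            return (RED, (BLACK, a, x, b), y, (BLACK, c, value, right))
        if is_red(left) and is_red(left[3]):
            _, a, x, (_, b, y, c) = left
            return (RED, (BLACK, a, x, b), y, (BLACK, c, value, right))
        if is_red(right) and is_red(right[1]):
            _, (_, b, y, c), z, d = right
            return (RED, (BLACK, left, value, b), y, (BLACK, c, z, d))
        if is_red(right) and is_red(right[3]):
            _, b, y, (_, c, z, d) = right
            return (RED, (BLACK, left, value, b), y, (BLACK, c, z, d))
    return (color, left, value, right)


def insert(value, tree):
    def ins(t):
        if t is None:
            return (RED, None, value, None)
        color, a, y, b = t
        if value < y:
            return balance(color, ins(a), y, b)
        if value > y:
            return balance(color, a, y, ins(b))
        return t

    _, a, y, b = ins(tree)
    return (BLACK, a, y, b)  # the root is always repainted black


def to_list(t):
    return [] if t is None else to_list(t[1]) + [t[2]] + to_list(t[3])


def black_height(t):
    """Return the black height of t, asserting both red-black invariants."""
    if t is None:
        return 1
    color, left, _, right = t
    if color == RED:
        assert not is_red(left) and not is_red(right), "red node, red child"
    lh, rh = black_height(left), black_height(right)
    assert lh == rh, "unequal black heights"
    return lh + (1 if color == BLACK else 0)


old = None
for v in [5, 3, 8]:
    old = insert(v, old)
new = insert(4, old)  # builds a new tree; old is left exactly as it was
```

After this runs, to_list(old) is still [3, 5, 8] while to_list(new) is [3, 4, 5, 8]. Nothing was copied defensively. The old version survives simply because nothing can modify it.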
Job Advertisement
If you know me, used to work with me, or are unfazed by the ill-considered, hotheaded tirades I infrequently post here, please consider getting in touch. If you do, and it works out, here is what I personally feel comfortable promising you:
- You will work on a relatively small team on one of the largest websites in the world.
- Your work will be interesting. There will be problems of scale that you just don't have at most other places. You will get to use the cool kid NoSQL databases, learn about web operations, and other fun stuff.
- You will like your coworkers, and they will be uniformly brilliant and talented.
- You will work on something that by and large makes a positive impact on people's lives.
- You will not miss your stinky financial industry job even a little bit.