I hope you never need this. But if you do, you can get it from me on github here.
It’s a little craptacular at the moment, but it works. I will maintain and enhance it depending on demand.
Please take a second to subscribe to the new Etsy engineering blog, Code as Craft. There’s not much there yet except a few introductory posts by Chad, but there is content in the pipeline. My coworkers are uniformly brilliant and amazing and will undoubtedly have many interesting things to say.
I personally will be writing a series of posts about our large-scale flirtations with MongoDB. I also hope to find some time to write about Scala, Buildbot, EC2, Python, Postgres vs. MySQL, and maybe some more abstract topics.
See also: this plug by Fred Wilson.
I decided I needed an hour break from Scala hacking, and I am about halfway finished with Sean Carroll's From Eternity to Here, which goes on at great length about entropy as it relates to time's arrow. So for the fun of it I whipped up a simulation of an Ehrenfest Urn using Processing.js.
Check it out here (requires a browser supporting canvas).
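For anyone who wants to poke at the model without a browser, here is a minimal Python sketch of an Ehrenfest urn. It is not the Processing.js code linked above; the function name, its parameters, and the choice to start with every ball in one urn are assumptions made for illustration. At each step one ball is picked uniformly at random and moved to the other urn.

```python
import random


def ehrenfest(n_balls, steps, seed=0):
    """Simulate an Ehrenfest urn and return the left-urn count after each step."""
    rng = random.Random(seed)
    left = n_balls              # low-entropy start: every ball in the left urn
    history = [left]
    for _ in range(steps):
        # The chosen ball sits in the left urn with probability left / n_balls.
        if rng.randrange(n_balls) < left:
            left -= 1           # it hops from left to right
        else:
            left += 1           # it hops from right to left
        history.append(left)
    return history


if __name__ == '__main__':
    counts = ehrenfest(n_balls=100, steps=2000)
    print(counts[0], counts[-1])
```

Starting from the lopsided state, the count drifts toward an even split and then wobbles around it, which is the entropy story in miniature.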
Now that I'm done, there are a couple of things I find amazing about this.
From eternity to here apparently involves a massive improvement in the state of the programming art. Everything is amazing and nobody's happy.
I've released a package that I've been using for well over a year now for the purpose of writing functional test suites against applications using Postgres. The executive summary is that PGProxy allows you to write tests that are transactionally isolated from one another, without doing anything special in your application code.
This project is aimed more at functional tests of a website (using something like Selenium) than at unit tests of a single class or module. For unit tests, mock objects or other strategies are usually more viable, but you could certainly use PGProxy there as well.
You can get it on github here: http://github.com/mcfunley/pgproxy.
PGProxy is written in Python using Twisted, and has its own extensive set of unit and functional tests.
If you have a sufficiently large set of functional tests written by a sufficiently large team, eventually naively-written tests will begin to interfere with one another. A really simple example of two tests that would interfere with one another would be:
Clearly, if we don't take any special precautions in our test suite, Test A will never succeed if it is run after Test B. And if we require that Test A always runs before Test B, then we're forced to recreate our fixture database in between test runs. And this is just two tests. When you're talking about thousands, the number of potential interactions can be huge. Not to mention that, depending on how big your fixture databases are, creating them at the start of every run can be a pain.
So eventually you will probably decide it would be better to restore the fixture data to a known state in between test cases. There are a few different ways to try to accomplish this:
There are problems with all of these approaches. 1) is very slow. 2) is very tedious for developers and error prone. 3) is similarly tedious and can be slow, depending on how much logic creating your entities entails. 4) is really just moving the goalposts, because tests are still going to interfere with each other. 5) works to the extent that you can have all of your database access code share a connection and to the extent that your code does not try to use its own transactions.
But! I am writing this to tell you about an exciting new option, namely, "do something crazy." PGProxy is that crazy thing, and it works pretty well.
As mentioned above, your test case can only work inside of a transaction if you are able to use a single connection per database per test case. If your test case is making calls to multiple processes that all want to use your fixture database, that is a pretty difficult thing to do.
PGProxy solves the problem by, you guessed it, proxying all of your database connections. So if you have a PHP site running in Apache and a Scala service running in Jetty, they can now share database connections in your tests. And consequently, they can share a transaction.

The other issue that comes up in using transactions to make test cases is that it's pretty common for the code that you're testing to want to use its own transactions. PGProxy solves this by rewriting transaction usage within the test case into SAVEPOINT usage. In other words, if you have a test that runs this SQL:
BEGIN;
update users set username='chuck' where username='steve';
COMMIT;
BEGIN;
update users set password='foo' where username='chuck';
ROLLBACK;
PGProxy will rewrite that to this:
BEGIN; -- my test case
SAVEPOINT x;
update users set username='chuck' where username='steve';
RELEASE SAVEPOINT x;
SAVEPOINT y;
update users set password='foo' where username='chuck';
ROLLBACK TO SAVEPOINT y;
ROLLBACK; -- my test case
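To make the idea concrete, here is a toy sketch of that kind of rewriting in Python. It is not PGProxy's actual implementation (the real thing works on the Postgres wire protocol inside Twisted); the function name, the keyword sets, and the sp_N savepoint names are all assumptions made for illustration. It only recognizes statements that consist of nothing but a transaction keyword.

```python
import itertools

BEGIN_WORDS = {'BEGIN', 'BEGIN WORK', 'BEGIN TRANSACTION', 'START TRANSACTION'}
COMMIT_WORDS = {'COMMIT', 'COMMIT WORK', 'COMMIT TRANSACTION', 'END'}
ROLLBACK_WORDS = {'ROLLBACK', 'ROLLBACK WORK', 'ROLLBACK TRANSACTION', 'ABORT'}


def rewrite_for_test(test_name, statements):
    """Wrap statements in one outer transaction and turn the application's
    own transaction control into savepoint control inside it."""
    counter = itertools.count(1)
    open_savepoints = []                 # innermost savepoint is last
    out = ['BEGIN; -- %s' % test_name]
    for statement in statements:
        text = statement.strip()
        # Normalize case, whitespace and the trailing semicolon for matching.
        keyword = ' '.join(text.rstrip(';').upper().split())
        if keyword in BEGIN_WORDS:
            name = 'sp_%d' % next(counter)
            open_savepoints.append(name)
            out.append('SAVEPOINT %s;' % name)
        elif keyword in COMMIT_WORDS and open_savepoints:
            out.append('RELEASE SAVEPOINT %s;' % open_savepoints.pop())
        elif keyword in ROLLBACK_WORDS and open_savepoints:
            out.append('ROLLBACK TO SAVEPOINT %s;' % open_savepoints.pop())
        elif keyword in COMMIT_WORDS or keyword in ROLLBACK_WORDS:
            # Nothing open: drop it so the outer test transaction survives.
            continue
        else:
            out.append(text)             # ordinary SQL passes through untouched
    out.append('ROLLBACK; -- %s' % test_name)
    return out


if __name__ == '__main__':
    for line in rewrite_for_test('my test case', [
            'BEGIN;',
            "update users set username='chuck' where username='steve';",
            'COMMIT;',
            'BEGIN;',
            "update users set password='foo' where username='chuck';",
            'ROLLBACK;']):
        print(line)
```

The stack of open savepoints is what lets a COMMIT or ROLLBACK from the application affect only its own slice of work, while the outer ROLLBACK still throws everything away at the end of the test.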
There are a few ways to run the proxy. If you are writing your test suite using Python (i.e., using unittest), you can set up your test runner like this:
from __future__ import with_statement
import os
import pgproxy

this_dir = os.path.realpath(os.path.dirname(__file__))
pidfile = os.path.join(this_dir, 'pgproxy.pid')
logfile = os.path.join(this_dir, 'pgproxy.log')

def run():
    # this will shut down the proxy when the tests complete.
    with pgproxy.run(pidfile=pidfile, logfile=logfile):
        run_test_suite()

def run_test_suite():
    # this should actually run your tests
    pass

if __name__ == '__main__':
    run()
Or you can use something like this script to start a standalone pgproxy process:
#! /usr/bin/env python
import pgproxy
import os
this_dir = os.path.realpath(os.path.dirname(__file__))
pidfile = os.path.join(this_dir, 'pgproxy.pid')
logfile = os.path.join(this_dir, 'pgproxy.log')
pgproxy.run(listenPort=5433, serverAddr=('localhost', 5432),
            pidfile=pidfile, logfile=logfile)
In both of these cases, PGProxy is configured to accept connections on port 5433, and to connect to the Postgres server running on port 5432. In these examples you would tell your application to connect to port 5433.
In order to run PGProxy, you need Twisted version 8.1.0 or later.
PGProxy accepts two special queries that signal the start and the end of tests. Your test suite will need to invoke these in setUp and tearDown (or whatever the equivalents are in the language / framework that you're using). Here's a unittest example:
import unittest

class TestCase(unittest.TestCase):
    # self.query is assumed to run SQL over a connection that goes through the proxy.
    def setUp(self):
        self.query("BEGIN TEST '%s'" % self._testMethodName)

    def tearDown(self):
        self.query("ROLLBACK TEST '%s'" % self._testMethodName)

    def test_something(self):
        # now this has a transaction and can't do any serious damage
        # to the fixture data.
        pass
As you may have noticed above, the BEGIN/ROLLBACKS are sent to postgres with comments stating which test is running, which can be pretty handy if you find yourself needing to look at the postgres logs to debug something. Here you would see:
BEGIN; -- test_something
ROLLBACK; -- test_something
Since setUp and tearDown are frequently overridden by developers for other purposes, I generally like to use a metaclass to wrap test cases in transactions instead. This way if a developer forgets to call the base test case's setUp method, it's no big deal for the rest of the suite. I'll leave that as an exercise. You get the idea.
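For anyone who would rather not do the exercise, here is one possible sketch of such a metaclass. It is written for Python 3 (the metaclass syntax differs in Python 2), and the names TransactionalMeta and TransactionalTestCase, as well as the query hook, are assumptions made for illustration rather than anything shipped with PGProxy.

```python
import functools
import unittest


class TransactionalMeta(type):
    """Wrap every test* method between BEGIN TEST and ROLLBACK TEST."""

    def __new__(mcs, name, bases, namespace):
        for attr, value in list(namespace.items()):
            if attr.startswith('test') and callable(value):
                namespace[attr] = mcs._wrap(attr, value)
        return super().__new__(mcs, name, bases, namespace)

    @staticmethod
    def _wrap(test_name, method):
        @functools.wraps(method)
        def wrapper(self, *args, **kwargs):
            self.query("BEGIN TEST '%s'" % test_name)
            try:
                return method(self, *args, **kwargs)
            finally:
                # Runs even when the test fails or raises.
                self.query("ROLLBACK TEST '%s'" % test_name)
        return wrapper


class TransactionalTestCase(unittest.TestCase, metaclass=TransactionalMeta):
    def query(self, sql):
        # Hypothetical hook: send sql over a connection that goes through PGProxy.
        raise NotImplementedError
```

Because the wrapping happens when the class is created, a subclass that overrides setUp or tearDown without calling the base implementation still gets its transaction.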
Lord knows I have a complicated opinion of all things Twisted, and maybe someday I will write something about that. And by "someday" I mean I am almost certainly never going to, because I have enough trouble staying out of nerd fights on the internet.
But I have to say that for this project, with the precise set of requirements that it had, and taking as a given my pre-existing wealth of experience with Twisted, things worked out great. This was a from-scratch rewrite of my first version, which was written using asyncore. The asyncore version was riddled with obscure race conditions, and it turned out to be much easier to just rewrite the damned thing using an event-driven framework than it ever was to debug the original. There was not a better choice of Python framework for this project, though I did toy with the idea of using Scala. Anyway, I hope this praise for Twisted doesn't come across as excessively faint.
Enjoy! Don't hesitate to drop me a line if you find this useful or have bug reports.
MarkCC posted an insert algorithm for red-black trees in Haskell yesterday. Now I don't want to come across as anything resembling an expert here but I felt honor-bound to raise the point that there is a far simpler algorithm described in Chris Okasaki's 1999 paper Red-Black Trees in a Functional Setting.
It goes basically like this:
data Color = Red | Black deriving (Show, Eq)

data RedBlackTree a =
    Empty | Node Color (RedBlackTree a) a (RedBlackTree a)
    deriving (Show, Eq)

insert :: Ord a => a -> RedBlackTree a -> RedBlackTree a
insert elem t = makeBlack (insert' elem t)
  where makeBlack (Node _ a y b) = Node Black a y b
        insert' elem Empty = Node Red Empty elem Empty
        insert' elem s@(Node color a y b)
          | elem < y = balance color (insert' elem a) y b
          | elem > y = balance color a y (insert' elem b)
          | otherwise = s

balance Black (Node Red (Node Red a x b) y c) z d =
    Node Red (Node Black a x b) y (Node Black c z d)
balance Black (Node Red a x (Node Red b y c)) z d =
    Node Red (Node Black a x b) y (Node Black c z d)
balance Black a x (Node Red (Node Red b y c) z d) =
    Node Red (Node Black a x b) y (Node Black c z d)
balance Black a x (Node Red b y (Node Red c z d)) =
    Node Red (Node Black a x b) y (Node Black c z d)
balance color a x b = Node color a x b
I will not try to replicate the contents of the paper here. It's very short, so I encourage you to go read it. The gist of it is as follows.
The Haskell algorithm is simpler because in functional languages, destructive updates to the tree are impossible. This renders the (relatively complex) optimizations that are commonplace in imperative implementations pointless.
The functional implementation has other advantages, namely: beauty, simplicity, persistence, and all of the other things we choose when we give up pointer manipulation.
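To show that the persistence comes from the no-mutation discipline rather than from Haskell itself, here is a rough Python port of the same insert. This is a sketch for illustration only: representing a node as a (color, left, value, right) tuple with None for the empty tree is my assumption, and to_list is a helper added just to inspect the result.

```python
RED, BLACK = 'R', 'B'


def is_red(node):
    return node is not None and node[0] == RED


def balance(color, left, value, right):
    """Rebuild a black node that has a red child with a red child."""
    if color == BLACK:
        if is_red(left):
            _, ll, lv, lr = left
            if is_red(ll):
                _, a, x, b = ll
                return (RED, (BLACK, a, x, b), lv, (BLACK, lr, value, right))
            if is_red(lr):
                _, b, y, c = lr
                return (RED, (BLACK, ll, lv, b), y, (BLACK, c, value, right))
        if is_red(right):
            _, rl, rv, rr = right
            if is_red(rl):
                _, b, y, c = rl
                return (RED, (BLACK, left, value, b), y, (BLACK, c, rv, rr))
            if is_red(rr):
                _, c, z, d = rr
                return (RED, (BLACK, left, value, rl), rv, (BLACK, c, z, d))
    return (color, left, value, right)


def insert(value, tree):
    """Return a new tree containing value; the old tree is left untouched."""
    def ins(node):
        if node is None:
            return (RED, None, value, None)
        color, left, v, right = node
        if value < v:
            return balance(color, ins(left), v, right)
        if value > v:
            return balance(color, left, v, ins(right))
        return node
    _, left, v, right = ins(tree)
    return (BLACK, left, v, right)      # the root is always painted black


def to_list(tree):
    """In-order traversal, used here only to inspect the contents."""
    if tree is None:
        return []
    _, left, v, right = tree
    return to_list(left) + [v] + to_list(right)


if __name__ == '__main__':
    before = insert(5, insert(3, None))
    after = insert(4, before)
    print(to_list(before), to_list(after))
```

Every insert builds new tuples along the search path and shares the rest, so the tree you had before the insert is still a perfectly good tree afterwards.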
If you know me, used to work with me, or are unfazed by the ill-considered, hotheaded tirades I infrequently post here, please consider getting in touch. If you do, and it works out, here is what I personally feel comfortable promising you:
Inspired by Peter Norvig, I have created a quick hack to generate haikus from Etsy listing titles. Check it out here.
Technical details for those who are interested:
All in all, about three or four hours of effort. Note that Etsy has nothing to do with these haikus, and doesn’t endorse this app.
DBAPI2 is all well and good. To a point. But if you have the usual website scaling problem, namely the one where you have a master database that worked fine when you were tiny and–dear god–not so well right now, the idea of interchangeable database libraries is basically a crock.
Before I am inundated with hate mail let me dial back my rhetoric a little bit. The existence of an API that works with altogether different databases is a wonderful thing and without it, things like Django or SQLAlchemy would not be possible. So rest assured I am not a complete maniac. I am not even really here to talk about dbapi2. I am just saying that 1) no two libraries are the same, 2) given sufficient scale this will matter to you, 3) the devil is in the details, and 4) the devil likes screwing things with white hot pokers.
Database client drivers intended for the same database can do drastically different things. By Python standards, the Postgres driver situation is completely schizo. There are a lot of them available - there are five dedicated Postgres drivers listed on the wiki, as opposed to just one for MySQL. People might choose different drivers for licensing reasons, for religious reasons, randomly (because they never did any analysis like I am about to do), or for completely inscrutable reasons because they are just plain out of their minds. You really would not believe how much blood I have seen spilled over Postgres client drivers.
Here, let me show you what I am talking about. Examine the following Python program, which runs an identical operation on a pyPgSQL connection and a psycopg2 connection.
#! /usr/bin/env python
from __future__ import with_statement
from contextlib import closing
from pyPgSQL import PgSQL as pypgsql
import psycopg2

test_dsn = 'host=127.0.0.1 port=5432 user=dan dbname=postgres'

def test_select(c):
    with closing(c.cursor()) as cr:
        cr.execute('select 1')
        print cr.fetchall()

def test():
    with closing(pypgsql.connect(test_dsn)) as c:
        test_select(c)
    with closing(psycopg2.connect(test_dsn)) as c:
        test_select(c)

if __name__ == '__main__':
    test()
Here's what happens when pyPgSQL runs that select.
select version()
BEGIN WORK
DECLARE "PgSQL_0062AF80" CURSOR FOR select 1
FETCH 1 FROM "PgSQL_0062AF80"
SELECT typname, -1 , typelem FROM pg_type WHERE oid = 23
FETCH ALL FROM "PgSQL_0062AF80"
CLOSE "PgSQL_0062AF80"
ROLLBACK WORK
First note that pyPgSQL issues a SELECT VERSION() command for every new connection. Why's it do that? Well since I've already dug through the source I can tell you that it does this to see if it has to do something wacky for PostgreSQL 7.1 and below. There's no way to disable this without patching the library.
This is not an enormous problem if the connection is pooled and reused, but it immediately becomes one if you want to use an out-of-process pool like PgBouncer. Every pyPgSQL connection that you make to pgbouncer will run this query, and in that scenario you are probably making zillions.
We get a transaction, even though I don't remember having asked for one. Since we never commit, it's rolled back. As default behavior this bites–more on this in a bit.
Finally, get a load of this:
SELECT typname, -1 , typelem FROM pg_type WHERE oid = 23;
That is pyPgSQL asking postgres what the name of the type associated with OID 23 is. I can tell you what it is without looking: it's an int4. The OIDs of built-in types are hardcoded (see catalog/pg_type.h in the postgres source), so this is worse than pointless.
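To illustrate what skipping that round trip could look like, here is a minimal sketch of a static lookup table with a fallback. This is not code from either driver. The class name and the lookup_in_catalog callback are hypothetical; the OIDs listed are the well-known values for a handful of built-in types.

```python
# OIDs of a few built-in PostgreSQL types, which are fixed in the system catalog.
BUILTIN_TYPE_NAMES = {
    16: 'bool',
    20: 'int8',
    21: 'int2',
    23: 'int4',
    25: 'text',
    700: 'float4',
    701: 'float8',
    1043: 'varchar',
    1082: 'date',
    1114: 'timestamp',
}


class TypeNames(object):
    """Resolve type OIDs to names, asking the server only for unknown OIDs."""

    def __init__(self, lookup_in_catalog):
        # lookup_in_catalog is a hypothetical callback that would run
        # something like: SELECT typname FROM pg_type WHERE oid = %s
        self._lookup = lookup_in_catalog
        self._names = dict(BUILTIN_TYPE_NAMES)

    def name(self, oid):
        if oid not in self._names:
            # Only user-defined or unlisted types cost a query, and only once.
            self._names[oid] = self._lookup(oid)
        return self._names[oid]
```

With something like this, a plain select 1 would never touch pg_type at all.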
Well that was a shitshow. Now, what does psycopg2 do?
SET DATESTYLE TO 'ISO'
SHOW client_encoding
SHOW default_transaction_isolation
BEGIN; SET TRANSACTION ISOLATION LEVEL READ COMMITTED
select 1
ROLLBACK
This is marginally more acceptable. Which, to my point, means it's radically different. Right off the bat it is obvious that psycopg2 does not use a server-side cursor by default. Apparently you can argue either side of this, but regardless it's a significant difference between the two libraries.
It would be better if there were a way to tell psycopg2 what the client encoding and default transaction isolation levels are, rather than have it query this with each connection. Again, this is fine unless you want to use an external connection pool. Note that READ COMMITTED is the default, which makes the SET unnecessary, but it is issued anyway. (And since Postgres only really supports two isolation levels, it would be likewise pointless if the server setting were READ UNCOMMITTED).
As with pyPgSQL, the transaction-as-default-behavior thing is thoroughly brutal. As far as I can tell, this is not something the DBAPI2 PEP demands. There are many reasons why I think this is a bad idea, but they all boil down to the fact that transactions are not free. If you're executing a stored procedure, or just reading data, this boilerplate is superfluous. And if you want to get the most out of your database, you will have to turn this off. The syntax for disabling it is, of course, totally different and underdocumented in both libraries.
Note that PDO does nothing like this. PDO expects you to know what a transaction is. It gives you a prepared statement that you might not want, but that's a different problem.
If you've read this far, maybe you'd like to know what my advice is. Well, first of all, do not take the choice of driver lightly, and understand what you're getting into. You almost definitely want to use psycopg2, because on balance it is the least evil option*. If it comes to it, you can always patch out the unnecessary SHOW statements (although honestly, I'm not sure this will ever become an issue like pyPgSQL's stupid selects). However, you should take care to manage your use of transactions from the start of your project. That is the kind of thing that will be really painful to change after you have hundreds of queries implicitly relying on the default behavior.
* Also note that the pyPgSQL source mixes tabs and spaces. FML.
It is I who wrote RichardIsAFuckingIdiotControl, voted the "best comment in source code … ever encountered" in this StackOverflow question. I didn't submit it, nor do I know the person that did. (I came across it on reddit, or something. I don't have the will or personal bandwidth to participate in something like StackOverflow.) This happened at a former employer.
I am not sure that I am proud of this. Actually I'm pretty sure that I'm not proud of it. But I'll explain what the deal was for the record anyway. I am also not sure that this deserves the "accolades" that it received. After all it's not technically a comment (although it contains some doozies) and I would never have chosen pure vitriol over, say, something from Ritchie's Odd Comments and Strange Doings in Unix.
Obviously what sets this apart from other snippets in the genre is the over-the-top hatred of a very specific colleague. So let me tell you a little about Richard.
First, his name wasn't Richard. Whoever submitted the sample was wise enough to change that (thanks, guy). I didn't even use his given name in the original source, although it was a moniker that most people would have understood. "Richard" was no longer employed by the company when I wrote this. He had recently been fired for repeatedly showing up in the early to mid-afternoon drunk and coked out of his mind (I guess nobody told him that we real programmers show up on time and drink at our desks).
Richard was a recent college grad with roughly a 1.8 GPA from a decent-but-not-prestigious CS program. Miraculously (if you're Richard), someone decided he was a "cultural fit" and therefore deserved $70K per year. For readers who have only ever been exposed to polite society, it is important to note that this is standard practice in the financial industry and is considered "normal." Around this time, I had gotten a little bent out of shape about hiring for various reasons that I won't go into, and I was the asshole. Go figure.
Personally, there is only one word that describes the kid. That word is broseph. If I were to call him a violent, drug-addled menace, it would not be hyperbole. Let me say it again, "cultural fit."
After Richard's exit, I had to take over his code. I wound up rewriting almost all of it from scratch, and this class was serving as a blast shield around the volatile remains of what I could salvage. Honestly, at this point I was bored to tears, and a lot of what I did was probably total crap. If there's any justice, right now someone's writing a profanity-laden class with my name all over it.
But anyway, let’s go over the code. I’ll add some color, where I can remember what was going on.
// The main problem is the BindCompany() method,
// which he hoped would be able to do everything. I hope he dies.
I am firmly opposed to the death penalty, but if a piano fell on his head I would be sincerely happy. (And again, for some reason I'm the asshole.)
public void BindCompany(int companyId) { }
// snip
private void MakeSureNobodyAccidentallyGetsBittenByRichardsStupidity()
{
    // Make sure nobody is actually using that fucking bindcompany method
    MethodInfo m = this.GetType().GetMethod("BindCompany", BindingFlags.DeclaredOnly |
        BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic);
    if (m != null)
    {
        throw new RichardIsAFuckingIdiotException("No!! Don't use the fucking BindCompany method!!!");
    }
    // P.S. this method is a joke ... the rest of the class is fucking serious
}
There was some deeply-entrenched reason why I could not change the definition of the BindCompany method. It had to be there, but the code paths that called it were all fundamentally flawed, or something like that. I decided to be funny and use reflection to raise an exception if anyone happened to redefine it in a derived class.
/// <summary>
/// This returns true if this control is supposed to be doing anything
/// at all for this request. Richard thought it was a good idea to load
/// the entire website during every request and have things turn themselves
/// off. He also thought bandanas and aviator sunglasses were "fuckin'
/// gnarly, dude."
/// </summary>
protected bool IsThisTheRightPageImNotSureBecauseRichardIsDumb()
{
    return Request.QueryString["Section"] == this.MenuItemKey;
}
One of the really crazy things about this application was that for every web request, it would actually load several hundred control classes and call methods on them. Maybe two or three of these would actually be necessary. They would all determine (based on the URL, I think) if they were supposed to be drawing anything. It would have been much easier to, god I don't know, just call the methods that were necessary to draw each page. I still have nightmares about this ridiculous contraption every now and then.
I guess the StackOverflow snippet doesn't capture this, but the best thing about Richard's code was that he loved property getters and setters. No, wait, that's not quite right. Lots of people love getters and setters, but Richard seemed to be in love with getters and setters. So much so that about 70% of his logic took place in them. More than once I deleted code that looked like this:
foo.x = foo.x;
Only to break entire pages, because the side effects of that assignment were doing everything. Anyway, I hope you can all see where I was coming from now.
So, I have finally fixed this stupid site. I spent years devising plans for rewriting it in lisp, or something, then one day read this essay about perfectionism and decided I was being completely ridiculous. Yes, that is my dog in the header, thank you for asking.
Some would have balked at the prospect of hand editing years of predominantly embarrassing posts, but I soldiered on through good times and bad. I am still messing with the layout and I haven’t even looked at the damn thing in Internet Explorer yet. (If you have IE—I don’t at home—let me know if something is really messed up.)
Evidently, once upon a time I had a career writing a lot of code for various Microsoft platforms. I feel like that part of my life needs to remain there for posterity, even though I am very happily participating in the open source, OMG-scale web world these days.
If I leave those posts there, perhaps someday someone can explain them to me. Very few of them make any sense to me now. I do remember fielding the e-mail from the ignoramus I-banker that prompted me to write Significant Digits for the Inummerate. With a little luck, his life is now ruined forever.
I thought I would miss debugging obscure threading issues, and rooting through core dumps, and staring at disassembly, but I was wrong. I have come to appreciate the sublime beauty of fork, the challenge of writing code for epic scale, having the damn source code, and solving problems that matter to people that are not evil mutants. I haven’t hand-edited XML in two years. Life is beautiful. Hopefully with the whole “blog” issue out of the way I will be able to think of something interesting to talk about.