DBAPI2 is all well and good. To a point. But if you have the usual website scaling problem, namely the one where you have a master database that worked fine when you were tiny and–dear god–not so well right now, the idea of interchangeable database libraries is basically a crock.
Before I am inundated with hate mail let me dial back my rhetoric a little bit. The existence of an API that works with altogether different databases is a wonderful thing and without it, things like Django or SQLAlchemy would not be possible. So rest assured I am not a complete maniac. I am not even really here to talk about dbapi2. I am just saying that 1) no two libraries are the same, 2) given sufficient scale this will matter to you, 3) the devil is in the details, and 4) the devil likes screwing things with white hot pokers.
Database client drivers intended for the same database can do drastically different things. By Python standards, the Postgres driver situation is completely schizo. There are a lot of them available - there are five dedicated Postgres drivers listed on the wiki, as opposed to just one for MySQL. People might choose different drivers for licensing reasons, for religious reasons, randomly (because they never did any analysis like I am about to do), or for completely inscrutable reasons because they are just plain out of their minds. You really would not believe how much blood I have seen spilled over Postgres client drivers.
Here, let me show you what I am talking about. Examine the following python program, which runs an identical operation on a pyPgSQL connection and a psycopg2 connection.
#! /usr/bin/env python
from __future__ import with_statement
from contextlib import closing
from pyPgSQL import PgSQL as pypgsql
import psycopg2
test_dsn = 'host=127.0.0.1 port=5432 user=dan dbname=postgres'
def test_select(c):
with closing(c.cursor()) as cr:
cr.execute('select 1')
print cr.fetchall()
def test():
with closing(pypgsql.connect(test_dsn)) as c:
test_select(c)
with closing(psycopg2.connect(test_dsn)) as c:
test_select(c)
if __name__ == '__main__':
test()
Here's what happens when pyPgSQL runs that select.
select version() BEGIN WORK DECLARE "PgSQL_0062AF80" CURSOR FOR select 1 FETCH 1 FROM "PgSQL_0062AF80" SELECT typname, -1 , typelem FROM pg_type WHERE oid = 23 FETCH ALL FROM "PgSQL_0062AF80" CLOSE "PgSQL_0062AF80" ROLLBACK WORK
First note that pyPgSQL issues a SELECT VERSION() command for every new connection. Why's it do that? Well since I've already dug through the source I can tell you that it does this to see if it has to do something wacky for PostgreSQL 7.1 and below. There's no way to disable this without patching the library.
This is not an enormous problem if the connection is pooled and reused, but it immediately becomes one if you want to use an out-of-process pool like PgBouncer. Every pyPgSQL connection that you make to pgbouncer will run this query, and in that scenario you are probably making zillions.
We get a transaction, even though I don't remember having asked for one. Since we never commit, it's rolled back. As default behavior this bites–more on this in a bit.
Finally, get a load of this:
SELECT typname, -1 , typelem FROM pg_type WHERE oid = 23;
That is pyPgSQL asking postgres what the name of the type associated with OID 23 is. I can tell you what it is without looking, it's an int4. The OID's of built-in types are hardcoded (see catalog/pg_types.h in the postgres source), so this is worse than pointless.
Well that was a shitshow. Now, what does psycopg2 do?
SET DATESTYLE TO 'ISO' SHOW client_encoding SHOW default_transaction_isolation BEGIN; SET TRANSACTION ISOLATION LEVEL READ COMMITTED select 1 ROLLBACK
This is marginally more acceptable. Which, to my point, means it's radically different. Right off the bat it is obvious that psycopg2 does not really support cursors. Apparently you can argue either side of this, but regardless it's a significant difference between the two libraries.
It would be better if there were a way to tell psycopg2 what the client encoding and default transaction isolation levels are, rather than have it query this with each connection. Again, this is fine unless you want to use an external connection pool. Note that READ COMMITTED is the default, which makes the SET unecessary, but it is issued anyway. (And since Postgres only really supports two isolation levels, it would be likewise pointless if the server setting were READ UNCOMMITTED).
As with pyPgSQL, the transaction-as-default-behavior thing is thoroughly brutal. As far as I can tell, this is not something the DBAPI2 PEP demands. There are many reasons why I think this is a bad idea, but they all boil down to the fact that transactions are not free. If you're executing a stored procedure, or just reading data, this boilerplate is superfluous. And if you want to get the most out of your database, you will have to turn this off. The syntax for disabling it is, of course, totally different and underdocumented in both libraries.
Note that PDO does nothing like this. PDO expects you to know what a transaction is. It gives you a prepared statement that you might not want, but that's a different problem.
If you've read this far, maybe you'd like to know what my advice is. Well, first of all, do not take the choice of driver lightly, and understand what you're getting into. You almost definitely want to use psycopg2, because on balance it is the least evil option*. If it comes to it, you can always patch out the unnecessary SHOW statements (although honestly, I'm not sure this will ever become an issue like pyPgSQL's stupid selects). However, you should take care to manage your use of transactions from the start of your project. That is the kind of thing that will be really painful to change after you have hundreds of queries implicitly relying on the default behavior.
* Also note that the pyPgSQL source mixes tabs and spaces. FML.
Tags: Angry, Databases, Performance, PostgreSQL, Python
I remember I wanted to patch the .c part of a postgres driver but instead just cached those extra SELECT results with a horrible monkeypatch (because the library was written in a way that complicated monkeypatches). It was awful.
But don’t worry, at least you’re not in memory-leak-and-encoding-hell with mysqldb
Prefork all the way dude.
Hi, was curious if you saw how Psycopg2 handled cursors if you didn’t do fetch all, or selected from a relation that had multiple rows.
Also, it looks like py-postgresql might be nice for python 3.0
@Mike I mean the PG “DECLARE x CURSOR FOR …” syntax (http://www.postgresql.org/docs/8.4/interactive/sql-declare.html). I’m not totally sure I understand the question. But although psycopg2 does have a cursor object (it’s required by DBAPI2), as well as some fancy cursor API in python-land, as far as I can tell it never uses this syntax.
Yeah py-postgresql does look promising if for no other reason than they have not completely thrown in the towel on things like the project website and online documentation. But since most people and linux distros will be on 2.X well into the next decade, I don’t know what good this will do anybody for a while.
psycopg2 does support cursors; it is just, like you say, highly underdocumented. It should be in a FAQ.
You need to ask for a named cursor:
eg:
with closing(c.cursor(name=’somename’)) as cr:
the name is the name given to the cursor.
If the drivers didn’t issue a BEGIN, then you would be in autocommit mode, in which case *every single* statement is wrapped in a separate, implicit, BEGIN/COMMIT (or BEGIN/ROLLBACK on error). Postgres never runs a statement that’s not in a transaction. One large transaction is more efficient than lots of little ones.
Reading the PostgreSQL docs, it looks like cursors without a transactional context are no good unless you use some additional query options, so pyPgSQL is probably doing it the best it can. “Unless WITH HOLD is specified, the cursor created by this command can only be used within the current transaction. Thus, DECLARE without WITH HOLD is useless outside a transaction block: the cursor would survive only to the completion of the statement.”
With that said, thanks for the insight on this stuff. I hadn’t really done any query logging with python drivers so this is good to know. If you run more than one query in a single session, are some of the upfront queries skipped?
@adam yes that’s right, I think the way I have that worded is pretty confusing. Here’s what I am actually trying to complain about.
If you use one of these as part of your own application (and do not cause autocommit to be turned on), at some point it will seem sensible to you to wrap the database access code in a function that calls .commit(). Which will issue a COMMIT WORK for you that you don’t need in the case of reads.
@mcgee yes, these libs all anticipate being pooled or reused so you generally get more reasonable results running multiple queries.
So for example, in the case of pyPgSQL, it will cache the OID to typname relationship in a dict on the connection. You would only see it ask what an int4 is once per session.
That can be a can of worms in itself, of course. If you hold a pyPgSQL connection open indefinitely and DROP/CREATE something, you’ll get errors as you’ve just changed an OID.
It should be pointed out that pyPgSQL effectivly stopped being developed ~6 years ago. As much as I understand the author is now putting his energy into SQLLite. So some PostgreSQL 7.x-isms may be excused… and I definitely do NOT recommend to use a driver 6 years unmaintained for a database which had > 5 major releases since then.
From the psycopg* psycopg1 is also discontinued…
Disclaimer: mese uses psycopg2, evangelizing Python and PostgreSQL as a match in heaven.
Curious if you looked at http://pybrary.net/pg8000/
Never seen pg8000 - cool, it fills a niche. I could have used that for something else I had to do related to unit testing. I would definitely stick to something based on libpq for production as much as possible though.
> pg8000’s name comes from the belief that it is probably about the 8000th PostgreSQL interface for Python. pg8000 is distributed under the terms of a modified BSD license.
At least they acknowledge the situation the right way
Hello,
The worst problem in psycopg2 has always been the lack of documentation. This has recently changed: please check http://initd.org/psycopg/docs/
psycopg2 supports PostgreSQL’s server-side cursor (DECLARE syntax): the topic is covered in the documentation.