psycopg2 and large queries

Laszlo Nagy · Dec 18, 2008

psycopg2 is said to be db api 2.0 compilant, but apparent it is buggy.
By default, when I create a cursor with

cur = conn.cursor()

then it creates a cursor that will fetch all rows into memory, even if
you call cur.fetchone() on it. (I tested it, see below.)

I was looking for psycopg2 documentation, but I found nothing. However,
I found some posts telling that named cursors do support fetching a
single row at a time. Here is how to create a named cursor:

cur = conn.cursor('mycursor')

This is very strange, because DB API 2.0 does not have this feature. Why
this feature was created, and how to use it? Not documented.

Anyway, I wrote a program that uses named/nameless cursors, sets
cur.arraysize to different values before callig cur.execute, then
fetches all rows with fetchone(). I used the following query on a bigger
table:

select name from product limit 1000000

Here are the test results for fetchone():

Named cursor, arraysize=1 2613 rec/sec, query opens in 0.005 sec,
uses only a few memory
Named cursor, arraysize=1000 2831 rec/sec, query opens in 0.005 sec,
uses only a few memory
Nameless cursor, arraysize=1 41335 rec/sec, query opens in 28 sec, uses
100MB+ memory
Nameless cursor, arraysize=1 39602 rec/sec, query opens in 25.8 sec,
uses 100MB+ memory

It is clear that named cursors have very bad performance, I cannot use
them. Nameless cursors cannot be used either, because they are stressing
the system, put 100% disk I/O and big memory usage, without any good reason.

The only one solution I found is to use named cursors, and use
fetchmany(100) instead of fetchone(). This results in fast opening
(0.005 sec) of the cursor, and good speed (30 000 rec/sec, about 75% of
the nameless version). (Do I really need to implement buffering in
Python to be efficient???)

However, this is also not usable, because named cursors do not have a
".description" property! You can try this:

cur = conn.cursor('mycursor')
cur.execute('select name from product limit 1000000')
print repr(cur.description) # -> None

This is unacceptable! This is not DB API 2.0 compilant. I have to know
the names of the columns, how can I do that?

What am I doing wrong? Please help me!

Thanks,

Laszlo

p.s.: I tried to subscribe to the psycopg mailing list, but the
confirmation email did not arrive in 6 hours...

Paul Boddie · Dec 18, 2008

psycopg2 is said to be db api 2.0 compilant, but apparent it is buggy.
By default, when I create a cursor with

cur = conn.cursor()

then it creates a cursor that will fetch all rows into memory, even if
you call cur.fetchone() on it. (I tested it, see below.)

Yes, I filed a bug against psycopg2 on this very subject, although the
project doesn't seem to have a bug-tracker any more.

I was looking for psycopg2 documentation, but I found nothing. However,
I found some posts telling that named cursors do support fetching a
single row at a time. Here is how to create a named cursor:

cur = conn.cursor('mycursor')

This is very strange, because DB API 2.0 does not have this feature. Why
this feature was created, and how to use it? Not documented.

The reason is that PostgreSQL supports server-side cursors through a
DECLARE ... CURSOR statement, but that statement can only be used with
certain SQL statements. Other DB-API modules employ simple but
imperfect tricks to guess whether the statement being issued is
compatible with DECLARE ... CURSOR before automatically creating a
cursor, but the psycopg2 maintainers refused to introduce such a
feature as part of the default behaviour. I didn't pursue the avenue
of making a suitable set of patches to satisfy both them and myself,
and I subsequently went back to using pyPgSQL instead.

[...]

It is clear that named cursors have very bad performance, I cannot use
them. Nameless cursors cannot be used either, because they are stressing
the system, put 100% disk I/O and big memory usage, without any good reason.

The only one solution I found is to use named cursors, and use
fetchmany(100) instead of fetchone(). This results in fast opening
(0.005 sec) of the cursor, and good speed (30 000 rec/sec, about 75% of
the nameless version). (Do I really need to implement buffering in
Python to be efficient???)

You really don't want to be traversing large data sets using fetchone,
anyway. My approach (using pyPgSQL) involves fetchmany and then
looping over each batch of results, if I really have to process the
data in Python; most of the time I can do the processing in the
database itself.

However, this is also not usable, because named cursors do not have a
".description" property! You can try this:

cur = conn.cursor('mycursor')
cur.execute('select name from product limit 1000000')
print repr(cur.description) # -> None

This is unacceptable! This is not DB API 2.0 compilant. I have to know
the names of the columns, how can I do that?

What am I doing wrong? Please help me!

I'm not really a user of the description property, so I can't advise
you there. Strictly, psycopg2 is DB-API compliant, but the interaction
between result sets and cursors really doesn't live up to what the
specification suggests is possible.

Paul

Steve Holden · Dec 18, 2008

Paul Boddie wrote:

[...]>

You really don't want to be traversing large data sets using fetchone,
anyway. My approach (using pyPgSQL) involves fetchmany and then
looping over each batch of results, if I really have to process the
data in Python; most of the time I can do the processing in the
database itself.

Hmm, pypgsql doesn't provide a 2.5 Windows installer. I take it you
aren't a Windows user ... ?

regards
Steve

Paul Boddie · Dec 18, 2008

Hmm, pypgsql doesn't provide a 2.5 Windows installer. I take it you
aren't a Windows user ... ?

Well, there are plenty of PostgreSQL modules around these days, and
even if pyPgSQL isn't suitable, I'm sure that there must be one which
can be made to work on Windows and to support server-side cursors. See
here for more:

http://wiki.python.org/moin/PostgreSQL

Paul

Laszlo Nagy · Dec 18, 2008

Well, there are plenty of PostgreSQL modules around these days, and
even if pyPgSQL isn't suitable, I'm sure that there must be one which
can be made to work on Windows and to support server-side cursors. See
here for more:

http://wiki.python.org/moin/PostgreSQL

I'm just looking for something that can replace psycopg2, because of the
bug mentioned in my original post. Here are my options:

- psycopg1: development stalled
- psycopg2: memory bug and/or not db api compilant (see my original post)
- pyPgSQL: looks like that the last release was in mid 2006 (and it
depends on mxDateTime -> licensing problems)
- mxODBC: commercial
- PyGresSQL: apparently, it does not support fetching one row, only
fetching all rows (see:
http://www.pygresql.org/pg.html#getresult-get-query-values-as-list-of-tuples),
so this is not an option. (Yes, it also has a db api compilant module,
but it is only a wrapper around pgqueryobject, and it does not support
server side cursors...)

So there are many options, but do we have one that:

1. supports server side cursors (fetch some rows after opening the
query, but not all of them)
2. open source, with no licensing problems
3. actively maintained

Thanks,

Laszlo

D'Arcy J.M. Cain · Dec 18, 2008

- PyGresSQL: apparently, it does not support fetching one row, only
fetching all rows (see:
http://www.pygresql.org/pg.html#getresult-get-query-values-as-list-of-tuples),
so this is not an option. (Yes, it also has a db api compilant module,
but it is only a wrapper around pgqueryobject, and it does not support
server side cursors...)

I'm not sure where you get that. The DB-API compliant execute function
uses server side cursors with fetch.

M.-A. Lemburg · Dec 18, 2008

I'm just looking for something that can replace psycopg2, because of the
bug mentioned in my original post. Here are my options:

- psycopg1: development stalled
- psycopg2: memory bug and/or not db api compilant (see my original post)
- pyPgSQL: looks like that the last release was in mid 2006 (and it
depends on mxDateTime -> licensing problems)

mxDateTime is open-source, just like everything you find in our
eGenix mx Base distribution:

http://www.egenix.com/products/python/mxBase/

mxODBC and mxODBC Connect are commercial products, work well
with PostgreSQL and are actively maintained:

http://www.egenix.com/products/python/

Why not give them a try ?

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Dec 18 2008)________________________________________________________________________
2008-12-02: Released mxODBC.Connect 1.0.0 http://python.egenix.com/

::: Try our new mxODBC.Connect Python Database Interface for free ! ::::

eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
http://www.egenix.com/company/contact/

Paul Boddie · Dec 18, 2008

I'm just looking for something that can replace psycopg2, because of the
bug mentioned in my original post. Here are my options:

- psycopg1: development stalled
- psycopg2: memory bug and/or not db api compilant (see my original post)

If you want, I can try and dig up those patches I made.

- pyPgSQL: looks like that the last release was in mid 2006 (and it
depends on mxDateTime -> licensing problems)

Well, Debian ships mxDateTime, so I suppose the licence is DFSG-
compliant, at least. Whether that's enough only you can say, but
porting the dependent parts to use the Python standard library
datetime implementation might be possible.

Paul

alito · Dec 19, 2008

I was looking for psycopg2 documentation, but I found nothing. However,
I found some posts telling that named cursors do support fetching a
single row at a time. Here is how to create a named cursor:

cur = conn.cursor('mycursor')

This is very strange, because DB API 2.0 does not have this feature. Why
this feature was created, and how to use it? Not documented.

The feature was created to use server-side cursors. The DB API
doesn't specify how to create them. I've got no idea why it isn't
documented.

However, this is also not usable, because named cursors do not have a
".description" property! You can try this:

cur = conn.cursor('mycursor')
cur.execute('select name from product limit 1000000')
print repr(cur.description) # -> None

They do have a description attribute, but it is only populated after
you fetch a row. eg try
cur = conn.cursor(name='mycursor')
cur.execute('select name from blah')
cur.fetchone()
print cur.description

This is unacceptable!

Really? I accepted it just fine.

p.s.: I tried to subscribe to the psycopg mailing list, but the
confirmation email did not arrive in 6 hours...

psycopg website and development in general seems to be comatose.
Emails to the list are being silently dropped. Also, don't believe the
front page, development stopped happening on that svn branch a while
ago and moved to a bzr branch, which has failed to respond for the
last few weeks. Maybe the relevant people are on holidays. If this
persists for a while someone will fork it, since quite a few people
are interested in its continued development.

Laszlo Nagy · Dec 19, 2008

They do have a description attribute, but it is only populated after
you fetch a row. eg try
cur = conn.cursor(name='mycursor')
cur.execute('select name from blah')
cur.fetchone()
print cur.description

Oh, great. I should have known. Thanks. Maybe I can live with psycopg2,
because combining server side (named) cursors with fetchmany(100) gives
fast query opening and relatively fast row fetching.

Thank you for all!

Laszlo

Cursor navigation	2	Jul 19, 2005
[ANN] pysqlite 2.0.0 final released!	0	May 16, 2005
[ANN] pysqlite 2.1.0 released	2	Jan 15, 2006
comp.lang.c Answers to Frequently Asked Questions (FAQ List)	15	Apr 1, 2006
comp.lang.c Answers to Frequently Asked Questions (FAQ List)	1	Feb 1, 2004

psycopg2 and large queries

Laszlo Nagy

Paul Boddie

Steve Holden

Paul Boddie

Laszlo Nagy

D'Arcy J.M. Cain

M.-A. Lemburg

Paul Boddie

alito

Laszlo Nagy

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads