PEP 249 (database api) -- executemany() with iterable?


Roy Smith

PEP 249 says about executemany():

Prepare a database operation (query or command) and then
execute it against all parameter sequences or mappings
found in the sequence seq_of_parameters.

are there any plans to update the api to allow an iterable instead of
a sequence?
 

Jon Clements

PEP 249 says about executemany():

        Prepare a database operation (query or command) and then
        execute it against all parameter sequences or mappings
        found in the sequence seq_of_parameters.

are there any plans to update the api to allow an iterable instead of
a sequence?

I'm not understanding (probably me). Example?
 

Roy Smith

I'm not understanding (probably me). Example?

I have a dictionary, d, which has a million items in it, and want to
do something like:

executemany("update foo set bar = %s where id = %s",
d.iteritems())

If executemany accepted an iterable, that would work. But, it only
accepts a sequence, so I need to do:

executemany("update foo set bar = %s where id = %s", d.items())

which generates a million-item temporary list. Or am I
misunderstanding the PEP?
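
One workaround under the current sequence-only API is to feed executemany()
fixed-size chunks of the iterator, so only one chunk is ever held in memory --
a rough sketch, assuming a DB-API cursor named cur:

    from itertools import islice

    def executemany_chunked(cur, sql, iterable, chunk_size=10000):
        # Materialize at most chunk_size parameter tuples at a time.
        it = iter(iterable)
        while True:
            chunk = list(islice(it, chunk_size))
            if not chunk:
                break
            cur.executemany(sql, chunk)

    executemany_chunked(cur, "update foo set bar = %s where id = %s",
                        d.iteritems())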
 

Jon Clements

I have a dictionary, d, which has a million items in it, and want to
do something like:

    executemany("update foo set bar = %s where id = %s",
d.iteritems())

If executemany accepted an iterable, that would work.  But, it only
accepts a sequence, so I need to do:

    executemany("update foo set bar = %s where id = %s", d.items())

which generates a million-item temporary list.  Or am I
misunderstanding the PEP?

Interesting, but here's my guess...

Replace d.items() with itertools.repeat( ('a', 'b') )

So, if you have a sequence, which has a length and known size, at
least you can have an attempt at the DB operations: whether the
transaction fails or not is another thing...In short, a sequence is
finite, while an iterable may be infinite.
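
A toy illustration of that difference (the cursor cur and table t are assumed
for the sake of the example, not from the thread):

    import itertools

    finite = [('a', 'b')] * 3                # a sequence: len() works, the loop ends
    endless = itertools.repeat(('a', 'b'))   # an iterable: no len(), never ends

    # cur.executemany("insert into t values (%s, %s)", finite)    # fine
    # cur.executemany("insert into t values (%s, %s)", endless)   # would loop forever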

That's just my guess and makes sense to me at least!

Jon.
 

Terry Reedy

PEP 249 says about executemany():

Prepare a database operation (query or command) and then
execute it against all parameter sequences or mappings
found in the sequence seq_of_parameters.

are there any plans to update the api to allow an iterable instead of
a sequence?

That question would best be addressed to the PEP author,
Marc-André Lemburg <mal at lemburg.com>
 

Ned Deily

Jon Clements said:
Interesting, but here's my guess...

Replace d.items() with itertools.repeat( ('a', 'b') )

So, if you have a sequence, which has a length and known size, at
least you can have an attempt at the DB operations: whether the
transaction fails or not is another thing...In short, a sequence is
finite, while an iterable may be infinite.

Also, keep in mind that PEP 249 DB adapter implementations are typically
wrappers around a lower-level client library for a particular DB
implementation and that most of those client APIs - written in C - will
likely require all of the items to be passed in one call. If so, the DB
adapter would need to immediately evaluate the iterable and produce a
list in memory anyway.
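
A hypothetical sketch of what such an adapter's executemany() might have to do
internally (the _batch_execute helper name is made up for illustration):

    def executemany(self, operation, seq_of_parameters):
        # If the underlying C client wants the whole batch in one call,
        # any iterable has to be turned into a list here anyway.
        params = list(seq_of_parameters)
        return self._batch_execute(operation, params)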
 

Jon Clements

Interesting, but here's my guess...

Replace d.items() with itertools.repeat( ('a', 'b') )

So, if you have a sequence, which has a length and known size, at
least you can have an attempt at the DB operations: whether the
transaction fails or not is another thing...In short, a sequence is
finite, while an iterable may be infinite.

That's just my guess and makes sense to me at least!

Jon.

Actually, thinking about it some more, I would take the following
approach:
(this is only loosely to do with the Python DB API, mind you...)

1) Start a transaction
2) Create a temporary table
3) Bulk insert your million update records into the temp table (from my
understanding of the PEP, executemany() is allowed to repeatedly call
execute() unless it can do something "cleverer")
4) Execute an update with a from clause joining your main table and the
temp table (pretty sure that's ANSI standard, and DBs should support
it -- embedded ones may not, though, but if you're dealing with 1mil
records, I'm taking a guess you're not dealing with embedded)
5) End the transaction

Far more efficient, as a repeated execute of 'update' hits the DB once
per row, while a single update statement with a from clause should give
the DB a chance to optimise it.
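
A rough sketch of that recipe, assuming a MySQL-style connection conn, Roy's
dictionary d mapping id to bar, and a guessed column type (the UPDATE ... JOIN
form is MySQL-flavoured rather than strict ANSI):

    cur = conn.cursor()
    cur.execute("CREATE TEMPORARY TABLE foo_updates"
                " (id INTEGER PRIMARY KEY, bar TEXT)")
    cur.executemany("INSERT INTO foo_updates (id, bar) VALUES (%s, %s)",
                    d.items())
    # One set-based statement instead of a million single-row updates.
    cur.execute("""UPDATE foo
                   JOIN foo_updates ON foo.id = foo_updates.id
                   SET foo.bar = foo_updates.bar""")
    conn.commit()
    cur.execute("DROP TABLE foo_updates")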

Hope that makes sense, lemme know.

Cheers,

Jon.
 

Antoine Pitrou

That question would best be addressed to the PEP author,
Marc-André Lemburg <mal at lemburg.com>

Or, rather, to python-ideas. Python development isn't supposed to happen
in private.

Regards

Antoine.
 

Petite Abeille

4) Execute an update with a from clause joining your main table and the
temp table (pretty sure that's ANSI standard, and DBs should support
it -- embedded ones may not, though, but if you're dealing with 1mil
records, I'm taking a guess you're not dealing with embedded)

Not ANSI. Perhaps with a subquery, which is ANSI-compliant.
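
The ANSI-compliant shape of step 4, sketched against the same hypothetical
foo / foo_updates tables:

    cur.execute("""
        UPDATE foo
           SET bar = (SELECT u.bar FROM foo_updates u WHERE u.id = foo.id)
         WHERE EXISTS (SELECT 1 FROM foo_updates u WHERE u.id = foo.id)
    """)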
 

J. Gerlach

On 12.10.2010 17:10, Roy Smith wrote:
[A]re there any plans to update the api to allow an iterable instead of
a sequence?

sqlite3 (standard library, Python 2.6.6, Windows 32-bit) does that already::

import sqlite3 as sql

connection = sql.connect(":memory:")

cursor = connection.execute("""
    CREATE TABLE test (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        text TEXT
    );""")
connection.commit()
cursor.executemany("""
    INSERT INTO test (text) VALUES ( ? );
    """,
    # A generator expression - delivers one row at a time
    ( ("hello nr %03d!" % i,) for i in xrange(100)))
connection.commit()
cursor.execute("SELECT * FROM test")

for id_, text in cursor.fetchall():
    print text, id_
 

John Nagle

Actually, thinking about it some more, I would take the following
approach:
(this is only loosely to do with the Python DB API, mind you...)

1) Start a transaction
2) Create a temporary table
3) Bulk insert your million update records into the temp table (from my
understanding of the PEP, executemany() is allowed to repeatedly call
execute() unless it can do something "cleverer")
4) Execute an update with a from clause joining your main table and the
temp table (pretty sure that's ANSI standard, and DBs should support
it -- embedded ones may not, though, but if you're dealing with 1mil
records, I'm taking a guess you're not dealing with embedded)
5) End the transaction

Far more efficient, as a repeated execute of 'update' hits the DB once
per row, while a single update statement with a from clause should give
the DB a chance to optimise it.

Hope that makes sense, lemme know.

Right. In general, if you find yourself making millions of
SQL database requests in a loop, you're doing it wrong.

Big database loads are usually done by creating a text file
with the desired data, then using a LOAD DATA INFILE command.
This (in MySQL) is tens to hundreds of times faster than
doing individual INSERT or UPDATE commands.
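
A rough sketch of that pattern with MySQLdb, assuming d maps id to bar, a
table foo(id, bar) already exists, and LOCAL INFILE is enabled on both client
and server:

    import csv
    import MySQLdb

    conn = MySQLdb.connect(db="mydb", local_infile=1)  # local_infile support assumed

    with open("/tmp/foo_load.tsv", "wb") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        for id_, bar in d.iteritems():
            writer.writerow([id_, bar])

    cur = conn.cursor()
    cur.execute("""LOAD DATA LOCAL INFILE '/tmp/foo_load.tsv'
                   INTO TABLE foo
                   FIELDS TERMINATED BY '\\t'
                   (id, bar)""")
    conn.commit()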

John Nagle
 

Lawrence D'Oliveiro

In general, if you find yourself making millions of
SQL database requests in a loop, you're doing it wrong.

I’ve done this. Not millions, but certainly on the order of tens of
thousands.

Big database loads are usually done by creating a text file
with the desired data, then using a LOAD DATA INFILE command.

May not always be flexible enough.

This (in MySQL) is tens to hundreds of times faster than doing individual
INSERT or UPDATE commands.

Why should that be? The database cannot read a text file any faster than I
can.
 

Steve Howell

I’ve done this. Not millions, but certainly on the order of tens of
thousands.


May not always be flexible enough.


Why should that be? The database cannot read a text file any faster than I
can.

Bulk-load strategies usually solve one or more of these problems:

network latency
index maintenance during the upload
parsing of the SQL
reallocating storage

They usually come at some cost or inconvenience:

only work locally
indexes not available during the bulk load
requires non-SQL data format
up-front cost to preallocate storage

The exact benefits and costs will vary according to the system, of
course. Also, you might employ multiple strategies; for example, for
a medium-sized problem, there may be no need to sacrifice the
familiarity and convenience of SQL if you can arrange for large
batches of SQL to be submitted from a local client. If indexes are
the problem, and you don't need 24/7 availability, you can drop them
and rebuild them.
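
For that last point, a minimal sketch (assuming MySQL, a table foo, and a
secondary index bar_idx on foo(bar), none of which come from the thread):

    cur.execute("DROP INDEX bar_idx ON foo")
    # ... run the large batch of INSERTs/UPDATEs here ...
    cur.execute("CREATE INDEX bar_idx ON foo (bar)")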
 

Jon Clements

On 12.10.2010 17:10, Roy Smith wrote:
[A]re there any plans to update the api to allow an iterable instead of
a sequence?

sqlite3 (standard library, Python 2.6.6, Windows 32-bit) does that already::

import sqlite3 as sql

connection = sql.connect(":memory:")

cursor = connection.execute("""
    CREATE TABLE test (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        text TEXT
    );""")
connection.commit()
cursor.executemany("""
    INSERT INTO test (text) VALUES ( ? );
    """,
    # A generator expression - delivers one row at a time
    ( ("hello nr %03d!" % i,) for i in xrange(100)))
connection.commit()
cursor.execute("SELECT * FROM test")

for id_, text in cursor.fetchall():
    print text, id_

What happens if you do itertools.repeat(0) instead of xrange(100)?
 

J. Gerlach

On 13.10.2010 14:26, Jon Clements wrote:
On 12.10.2010 17:10, Roy Smith wrote:
[A]re there any plans to update the api to allow an iterable instead of
a sequence?

[sqlite3 example snipped]

What happens if you do itertools.repeat(0) instead of xrange(100)?

And there I see why I don't want ``executemany()`` to take a generator
-- the never-ending ones will do some damage.

But on a similar note: would an ``execute_n_repeats(statement, iterable,
maximum_rows)`` be better?
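
Such a helper could be written on top of the existing API today; a sketch of
the proposed (hypothetical) execute_n_repeats(), assuming a DB-API cursor:

    from itertools import islice

    def execute_n_repeats(cursor, statement, iterable, maximum_rows):
        # Cap a possibly endless iterable before handing it to executemany().
        cursor.executemany(statement, list(islice(iterable, maximum_rows)))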


Jörg
 

Steven D'Aprano

I’ve done this. Not millions, but certainly on the order of tens of
thousands.

"I've done it" doesn't disprove that it is the wrong thing to do. It may
simply mean that you too are doing it wrong.

I have no opinion either way, I'm just saying.
 

John Nagle

I’ve done this. Not millions, but certainly on the order of tens of
thousands.

It's a scaling issue. Millions of INSERT or UPDATE requests can
take hours. That's when you need the speedup of bulk loading.

May not always be flexible enough.

True; you can't do anything with LOAD DATA INFILE but load data.
If you need selects or joins within inserts, you may have to do it
the long way.

Why should that be? The database cannot read a text file any faster than I
can.

Because the indices are updated in one big update, rather than
after each change to the database.

Also note that there are some issues with doing a huge volume of
updates in one MySQL InnoDB transaction. The system has to keep the
data needed to undo the updates, and there's a limit on the amount of
pending transaction history that can be stored.

It's common to load data into a new, empty table, then, once
the reload has succeeded, do a RENAME like CURRENT->OLD, NEW->CURRENT.
Rename of multiple tables is atomic and interlocked with other
operations, so you can replace an entire table on a live server.
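
In MySQL terms the swap looks something like this (a sketch, assuming the
fresh data was loaded into foo_new):

    # One atomic statement swaps both names; readers never see a missing table.
    cur.execute("RENAME TABLE foo TO foo_old, foo_new TO foo")
    cur.execute("DROP TABLE foo_old")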

I have some bulk databases which are updated from external
sources. The PhishTank database is updated with UPDATE statements
every three hours. But the Open Directory database is updated by
downloading a big gzipped file of XML, creating a new database
table, then renaming. That load takes hours, once a week.

(All this applies to MySQL, and to some extent, Postgres.
If you're using SQLite, it's different. But a million records
is big for SQLite, which is, after all, a "lite" database.
At 10,000 records, you don't have to worry about any of this
stuff. At 1,000,000 records, you do.)

John Nagle
 

Lawrence D'Oliveiro

Steve said:
Bulk-load strategies usually solve one or more of these problems:

network latency

That’s not an issue. This is a bulk operation, after all.

index maintenance during the upload

There are usually options to temporarily turn this off.

parsing of the SQL

Not usually a problem, as far as I can tell.

reallocating storage

You mean for the SQL? Again, not usually a problem, as far as I can tell.
 

Lawrence D'Oliveiro

Also note that there are some issues with doing a huge volume of
updates in one MySQL InnoDB transaction. The system has to keep the
data needed to undo the updates, and there's a limit on the amount of
pending transaction history that can be stored.

How does “load data” avoid this? Is that not a transaction too?

It's common to load data into a new, empty table, then, once
the reload has succeeded, do a RENAME like CURRENT->OLD, NEW->CURRENT.
Rename of multiple tables is atomic and interlocked with other
operations, so you can replace an entire table on a live server.

Is that to avoid impacting searches on the existing data while loading the
new records?

Seems to me this isn’t going to help, since both old and new tables are on
the same disk, after all. And it’s the disk access that’s the bottleneck.
 

Martin Gregorie

How does “load data” avoid this? Is that not a transaction too?

Not usually. It's faster because there's no journalling overhead. The
loader takes out an exclusive table lock, dumps the data into the table,
rebuilds indexes and releases the lock. I can't comment about MySQL
(don't use it) but this has been the case on the RDBMS databases I have
used.

Seems to me this isn’t going to help, since both old and new tables are
on the same disk, after all. And it’s the disk access that’s the
bottleneck.

There's a lot of overhead in journalling - much more than in applying
changes to a table. The before and after images *must* be flushed to disk
on commit. In UNIX terms fsync() must be called on the journal file(s)
and this is an expensive operation on all OSes because committing a
series of small transactions can cause the same disk block to be written
several times. However, the table pages can safely be left in the DBMS
cache and flushed as part of normal cache operation since, after a crash,
the table changes can always be recovered from a journal roll-forward. A
good DBMS will do that automatically when it's restarted.
 
