PEP 249 (database api) -- executemany() with iterable?

S

Steve Howell

In message



That’s not an issue. This is a bulk operation, after all.


There are usually options to temporarily turn this off.


Not usually a problem, as far as I can tell.


You mean for thr SQL? Again, not usually a problem, as far as I can tell.

If you are avoiding network latency and turning off indexes, then you
are using some kind of a bulk-load strategy.

If you are not concerned about parsing costs or storage churn, then
you are simply evaluating the costs of a non-bulk-oriented strategy
and determining that they are minimal for your needs, which is fine.
 
L

Lawrence D'Oliveiro

M.-A. said:
However, even with iterables, please keep in mind that pushing
the data row-per-row over a network does not result in good
performance, so using an iterable will make you update slower.

cursor.executemany() is meant to allow the database module
to optimize sending bulk data to the database and ideally,
it will send the whole sequence to the database in one go.

You seem to be assuming that using an iterator precludes buffering.

What’s wrong with evaluating the iterator to produce as many records as the
API implementation finds convenient to send at once?
 
L

Lawrence D'Oliveiro

In message
Steve said:
If you are avoiding network latency and turning off indexes, then you
are using some kind of a bulk-load strategy.

I thought we were talking about “load data†versus using simple variants of
“insertâ€. At least, that’s what I was talking about. What were you talking
about?
If you are not concerned about parsing costs or storage churn ...

What is this “storage churn†thing, and why are you raising it now?
 
L

Lawrence D'Oliveiro

Not usually. Its faster because there's no journalling overhead.

So what happens if there’s an error part way? Do you end up with some
records inserted and others not?
The loader takes out an exclusive table lock, dumps the data into the
table, rebuilds indexes and releases the lock.

That’s not acceptable in a situation where other processes are concurrently
making queries on the table.
There's a lot of overhead in journalling - much more than in applying
changes to a table. The before and after images *must* be flushed to disk
on commit. In UNIX terms fsync() must be called on the journal file(s)
and this is an expensive operation on all OSes because committing a
series of small transactions can cause the same disk block to be written
several times. However, the table pages can safely be left in the DBMS
cache and flushed as part of normal cache operation since, after a crash,
the table changes can always be recovered from a journal roll-forward. A
good DBMS will do that automatically when its restarted.

And how does this address the point of whether to load the new records into
a new table or the same table?
 
M

Martin Gregorie

So what happens if there’s an error part way? Do you end up with some
records inserted and others not?
Since the load is usually used to populate an empty table, you simply
empty the table and do it again if it doesn't complete for some reason.
If you're doing a partial load you'd normally dump the table and/or take
a backup beforehand.
That’s not acceptable in a situation where other processes are
concurrently making queries on the table.
Obviously, but you're unlikely to be running anything while you're doing
a bulk load. If the database is used to support interactive processing
you'd normally only use bulk loading during recovery after a disk failure
or as part of a database migration or restructuring exercise.

In fact the only case I know of where bulk loads are a normal part of the
data life cycle is for data warehousing, where bulk loads are the norm.
The last month's transactions are removed from the online system and bulk
loaded into the data warehouse for analysis. In this case the bulk loads
are either done overnight or during a weekend. However, any RDBMS that
has been optimised for data warehousing will almost certainly have the
ability to segment its fact table. This has two benefits:

- you can bulk load a new segment offline and put it online when the
load is complete. This can be done during normal operation.

- the DBMS can automatically split a query and run parallel copies
against each segment before combining the result sets for return
to the program that made the query.
And how does this address the point of whether to load the new records
into a new table or the same table?
That's system design choice. There are a limited set of options and its
up to the designer to choose one:

- if you want to load data into a new table and rename CURRENT->OLD,
NEW->CURRENT you have to design all queries to operate on both OLD
and CURRENT tables

- if your DBMS allows table segmentation and gives the ability to offline
segments you can bulk load as I already described.

- you can stop the system during a quiet time while you do a bulk load
and/or minimise overheads by using very large transaction.

- you can bite the bullet, accept the performance hit and simply run a
relatively slow load process using small transactions during normal
operation.

There are different overheads associated with each choice. A good system
designer will understand them, know their relative weight for the target
hardware and database, and use this information to optimise the design.

What I said about transactional and journalling overheads was
explanation: unless you know that you'll never understand why you should
expect bulk loading to be a lot faster than doing it with transactions
designed to logically group related operations or (much, much worse) to
use implied commits. This really is the worse of all worlds since it
maximises journalling overhead.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,582
Members
45,065
Latest member
OrderGreenAcreCBD

Latest Threads

Top