Python, MS SQL, and batch inserts

ericwoodworth

Hi,
I have a Python script I'm writing that grabs some data from a
COM object, does a little formatting, and then inserts that data into
a MS SQL 2003 DB. Because I'm using COM objects I'm importing
win32com.client. That also allows me to use ADODB.Connection and
ADODB.Command objects for working with SQL.

The program works fine but it's a little slow. Inserting ~5500
rows of data takes about 10 seconds using a DB that is on the same
machine running the script.

I've done some general searches on how to speed this up, and in
other languages people suggest sending batches of inserts off at a
time instead of executing one insert at a time. For Java and .NET,
people recommend using a StringBuilder class to create strings
quickly. I don't know of such a function in Python, so I tried grouping
my inserts into a single string using string += syntax. I knew that
would be terrible but I wanted to see how terrible. Final results: it
was pretty terrible. The script went from taking ~18 sec to taking
240 sec. The overhead of recreating the strings was monstrous. No real
surprise there.

So I then loaded the commands into a list and at the end I
used the string join() method to create the string. This was far faster
than using += to create my strings, but it still took twice as long as
just running my inserts one at a time. So I'm looking for
suggestions.
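For reference, the append-then-join idiom looks roughly like this (the table, columns, and 'rows' variable are invented for illustration):

    # += copies the whole string on every pass (quadratic cost);
    # appending to a list and joining once at the end is linear.
    statements = []
    for a, b in rows:
        statements.append(
            "INSERT INTO mytable (a, b) VALUES ('%s', '%s')" % (a, b))
    big_sql = ";\n".join(statements)    # one join at the end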

Basically I have 5000 SQL inserts that I want to do as quickly as
possible. This is purely academic, as I can live with the 18 seconds
the script needs to run (9 to talk to the COM object and format the
data and 10 to write to SQL), but I'm still curious how to improve on
what I have running.

Thanks in advance for any help,
Eric
 
Philip Semanchuk

Are you sure your logjam is in Python? Inserting 5500 rows can take a
few seconds if you're COMMITting after each INSERT. Wrap the whole
thing in an explicit transaction and see if that helps.
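With raw ADO through win32com, that would look something like this (the connection string and variable names are guesses, not the OP's actual code):

    import win32com.client

    conn = win32com.client.Dispatch("ADODB.Connection")
    conn.Open("Provider=SQLOLEDB;Data Source=.;"
              "Initial Catalog=mydb;Integrated Security=SSPI")

    conn.BeginTrans()                  # one transaction for the whole batch
    try:
        for sql in insert_statements:  # the ~5500 INSERT strings
            conn.Execute(sql)
        conn.CommitTrans()             # single commit at the end
    except Exception:
        conn.RollbackTrans()           # undo everything on any failure
        raise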

Also, toss in a few print statements containing timestamps so you know
more about where the script is spending time.
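Something as simple as this works (a sketch, with the interesting steps stubbed out):

    import time

    t0 = time.time()
    # ... fetch and format the COM data ...
    print("formatted: %.3f sec" % (time.time() - t0))
    # ... run the INSERTs ...
    print("inserted:  %.3f sec" % (time.time() - t0))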


bye
Philip
 
ericwoodworth

I'm not 100% sure it's Python and not SQL, but I do suspect there's a
better way to do this than just serial inserts. I could be wrong
about that, which is what I'm trying to explore.

I already use timestamps, and this is what I see:
at 9 secs in, I've gotten my data, formatted it, and placed it on the
list
at 9.047 secs in, the string.join() is done and I have my command
string
at 35 secs, the program ends. So somehow my SQL is taking a lot longer
when I format it as a single string.

How would I make the whole thing one transaction? Just insert BEGIN
TRANSACTION at the start and COMMIT at the end? Is that enough to do
it?
 
Philip Semanchuk

> I'm not 100% sure it's Python and not SQL, but I do suspect there's a
> better way to do this than just serial inserts. I could be wrong
> about that, which is what I'm trying to explore.

I don't think the SQL standard provides a way to do bulk inserts and
as a result most DB vendors have extended the standard to address this
common need (e.g. the COPY command in Postgres).

If you're doing a mass insert to populate a blank table it also often
helps to postpone index creation until after the table is populated.
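A sketch of that, reusing the ADO connection from the earlier sketch (the index and table names are invented):

    # Drop (or skip creating) secondary indexes, load the rows,
    # then build the indexes once over the full table.
    conn.Execute("DROP INDEX ix_mytable_b ON mytable")
    for sql in insert_statements:
        conn.Execute(sql)
    conn.Execute("CREATE INDEX ix_mytable_b ON mytable (b)")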

> I already use timestamps, and this is what I see:
> at 9 secs in, I've gotten my data, formatted it, and placed it on the
> list
> at 9.047 secs in, the string.join() is done and I have my command
> string

You said you're inserting ~5500 rows, so are you calling .join() on a
list of 5500 items? If so, 9 seconds seems painfully slow unless
you're on old hardware.

> at 35 secs, the program ends. So somehow my SQL is taking a lot longer
> when I format it as a single string.
>
> How would I make the whole thing one transaction? Just insert BEGIN
> TRANSACTION at the start and COMMIT at the end? Is that enough to do
> it?

That's the basic idea, but your database adapter (i.e. the Python
wrapper that talks to SQL Server) might want you to do it another way.
For instance, the connection object has a .commit() method, and using
that is probably a better idea than calling cursor.execute("COMMIT").

Per the Python DB API, your connection should be opened in
transactional mode by default. ("Note that if the database supports an
auto-commit feature, this must be initially off.")
http://www.python.org/dev/peps/pep-0249/

In other words, if your adapter is DB-API compliant, then the only
reason you're not using a transaction is that you're explicitly
turning transactions off.
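For example, with adodbapi, the DB-API wrapper over ADO that ships with pywin32 (the connection string and table are guesses):

    import adodbapi

    conn = adodbapi.connect("Provider=SQLOLEDB;Data Source=.;"
                            "Initial Catalog=mydb;Integrated Security=SSPI")
    cur = conn.cursor()

    # One parameterized statement, many parameter sets, one commit.
    cur.executemany("INSERT INTO mytable (a, b) VALUES (?, ?)", rows)
    conn.commit()    # the single commit for all ~5500 rows
    conn.close()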



bye
P
 
ericwoodworth

> I forget the name of the SQL Server bulk loader, but for large loads, I
> used to populate a fresh table with the bulk data, then do UPDATEs and
> INSERTs to get the data spread out into the main tables. You (the OP)
> might try a scheme like that.
>
> --Scott David Daniels
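(A rough sketch of that staging-table scheme; every name here is invented, and 'conn' is an open ADODB.Connection:)

    # Create an empty staging table with the same shape, load it,
    # then move the rows across with two set-based statements.
    conn.Execute("SELECT a, b INTO staging FROM maintable WHERE 1 = 0")
    for sql in staging_inserts:            # or bcp / BULK INSERT into it
        conn.Execute(sql)
    conn.Execute("UPDATE m SET m.b = s.b "
                 "FROM maintable m JOIN staging s ON m.a = s.a")
    conn.Execute("INSERT INTO maintable (a, b) "
                 "SELECT s.a, s.b FROM staging s "
                 "WHERE NOT EXISTS (SELECT 1 FROM maintable m "
                 "WHERE m.a = s.a)")
    conn.Execute("DROP TABLE staging")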

Hmm... I think I'm going to move my question over to a SQL forum, because
this is starting to feel like a SQL rather than a Python issue to me.

Three times now, after letting the system "rest" while I go off and do
other things and then run my script, it completes in 10 seconds. If I
drop the tables and start fresh immediately after that, it takes 35
seconds. If I drop the tables and wait an hour and then run the
script, it'll finish in 10 seconds again.

That makes me think it's a SQL config or optimization issue more than
a Python issue.

Oh, and the times I listed above were totals from the start of
execution, so the string.join() was taking 0.047 seconds to run. It
was taking 9 seconds to get my data from the COM object and format it,
but the join was quite fast.
 
ericwoodworth

Also, if I restart SQL it will respond very quickly as well... down to
10 secs again. Not sure why.
 
pruebauno

> bcp (bulk copy) was the name of the bulk loader. I just remembered.
>
> Sorry, no more exciting advice. It does sound like a good idea to go
> to a SQL Server group.
>
> --Scott David Daniels

For MS SQL try:

BULK INSERT AdventureWorks.Sales.SalesOrderDetail
FROM '<drive>:\<path>\<filename>'
WITH (ROWTERMINATOR = '\n')
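One way to drive that from Python is to dump the rows to a file the SQL Server machine can read and hand the whole load off in a single statement (the path, table, and layout are invented):

    import csv
    import win32com.client

    path = r"C:\temp\rows.txt"
    f = open(path, "wb")    # binary mode for Python 2's csv module
    csv.writer(f, delimiter="\t", lineterminator="\n").writerows(rows)
    f.close()

    conn = win32com.client.Dispatch("ADODB.Connection")
    conn.Open("Provider=SQLOLEDB;Data Source=.;"
              "Initial Catalog=mydb;Integrated Security=SSPI")
    conn.Execute("BULK INSERT mytable FROM '%s' "
                 "WITH (FIELDTERMINATOR = '\\t', ROWTERMINATOR = '\\n')"
                 % path)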
 
Dennis Lee Bieber

> Because I'm using COM objects I'm importing
> win32com.client. That also allows me to use ADODB.Connection and
> ADODB.Command objects for working with SQL.

Unfortunately, ADO DB isn't a Python DB-API interface, so the common
"standard" recommendations won't apply...
> ...other languages people suggest sending batches of inserts off at a
> time instead of executing 1 insert at a time. For Java and .NET,
> people recommend using a StringBuilder class to create strings
> quickly. I don't know of such a function in Python, so I tried grouping
> my inserts into a single string using string += syntax. I knew that
> would be terrible but I wanted to see how terrible. Final results: it
> was pretty terrible. The script went from taking ~18 sec to taking
> 240 sec. The overhead of recreating the strings was monstrous.

You don't mean to say you are creating full SQL insert statements
WITH THE DATA!

Surely ADO DB has a means of doing parameterized SQL!

How about...
http://msdn.microsoft.com/en-us/library/ms677502(VS.85).aspx

If I understand the cryptic M$ documentation, you'd create a
parameterized SQL statement which you feed to a command object; have the
statement "prepared" (basically, preparsed and prepped for use, but
without any real data), then fill in a parameter object with the data of
the fields, and execute the command object with those parameters; refill
the parameters and repeat the execute as needed.
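A sketch of that through win32com (the constants are the standard ADO values; the connection string, table, and parameter types are guesses):

    import win32com.client

    adCmdText, adParamInput = 1, 1         # standard ADO constants
    adInteger, adVarChar = 3, 200

    conn = win32com.client.Dispatch("ADODB.Connection")
    conn.Open("Provider=SQLOLEDB;Data Source=.;"
              "Initial Catalog=mydb;Integrated Security=SSPI")

    cmd = win32com.client.Dispatch("ADODB.Command")
    cmd.ActiveConnection = conn
    cmd.CommandType = adCmdText
    cmd.CommandText = "INSERT INTO mytable (a, b) VALUES (?, ?)"
    cmd.Prepared = True                    # parse and plan the statement once

    p_a = cmd.CreateParameter("a", adVarChar, adParamInput, 50)
    p_b = cmd.CreateParameter("b", adInteger, adParamInput)
    cmd.Parameters.Append(p_a)
    cmd.Parameters.Append(p_b)

    conn.BeginTrans()
    for a, b in rows:                      # the formatted COM data
        p_a.Value = a
        p_b.Value = b
        cmd.Execute()                      # refill and re-execute per row
    conn.CommitTrans()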
--
Wulfraed Dennis Lee Bieber KD6MOG
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff)
HTTP://www.bestiaria.com/
 
Lawrence D'Oliveiro

Dennis said:
> How about...
> http://msdn.microsoft.com/en-us/library/ms677502(VS.85).aspx

Eeeek! That whole page seems to take pride in violating the common
recommendations for the use of DB-API compliant database adapters.

Let's see you come up with better solutions.
 
