Newbie Question: python mysqldb performance question

cjl

Group:

I'm new to python and new to mysql.

I have a csv file that is about 200,000 rows that I want to add to a
mysql database. Yes, I know that I can do this directly from the
mysql command line, but I am doing it through a python script so that
I can munge the data before adding it.

I have the following example code:

import sys
import csv
import MySQLdb

conn = MySQLdb.connect(db="database", host="localhost", user="root",
                       passwd="password")
c = conn.cursor()

reader = csv.reader(open(sys.argv[1]))
for row in reader:
    data1, data2, data3, data4 = row
    data = (data1, data2, data3, data4)
    c.execute("""insert into datatable values (%s, %s, %s, %s)""",
              data)
    conn.commit()

This takes a really long time to execute, on the order of minutes.
Directly importing the csv file into mysql using 'load infile' takes
seconds.

What am I doing wrong? What can I do to speed up the operation?

Thanks in advance,
cjl
 
Dennis Lee Bieber

reader = csv.reader(open(sys.argv[1]))
for row in reader:
    data1, data2, data3, data4 = row
    data = (data1, data2, data3, data4)

That set is redundant in this example -- it's the very same as saying

    data = row
    c.execute("""insert into datatable values (%s, %s, %s, %s)""",
              data)
    conn.commit()

This takes a really long time to execute, on the order of minutes.
Directly importing the csv file into mysql using 'load infile' takes
seconds.

"load infile", for files that are local to the server (if so configured),
means all processing is done directly by the server. Using Python, you are
going through a TCP/IP connection, with send/response traffic for each
command.


Step one: Don't commit after every single row... Feed all the rows
and then commit (among other things, that means an error in the middle
of reading the CSV rolls back all database activity, allowing you to try
again without having duplicate records in the database).

Step two: Don't process the rows one-at-a-time. Build up batches,
and use .executemany() on the batch.
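
A minimal sketch of both steps, reusing the conn and c objects from the
original post (the batch size of 1000 is an arbitrary choice, and the
column layout is assumed to match the original insert):

batch = []
reader = csv.reader(open(sys.argv[1]))
for row in reader:
    batch.append(tuple(row))       # munge the row here before appending
    if len(batch) >= 1000:         # arbitrary batch size
        c.executemany("insert into datatable values (%s, %s, %s, %s)",
                      batch)
        batch = []
if batch:                          # flush the final, partial batch
    c.executemany("insert into datatable values (%s, %s, %s, %s)", batch)
conn.commit()                      # one commit, after all rows are loaded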
--
Wulfraed Dennis Lee Bieber KD6MOG
(e-mail address removed) (e-mail address removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/
 
rurpy

conn = MySQLdb.connect(db="database", host="localhost", user="root",
                       passwd="password")
c = conn.cursor()

reader = csv.reader(open(sys.argv[1]))
for row in reader:
    data1, data2, data3, data4 = row
    data = (data1, data2, data3, data4)
    c.execute("""insert into datatable values (%s, %s, %s, %s)""",
              data)
    conn.commit()

This takes a really long time to execute, on the order of minutes.
Directly importing the csv file into mysql using 'load infile' takes
seconds.

What am I doing wrong? What can I do to speed up the operation?

In addition to the previous poster's suggestions,
if you have indexes, foreign keys, or other constraints
on the table, and you are sure that you are loading
"good" data, you may want to drop them before doing
the inserts, and recreate them after. Updating indexes
and checking constraints can be time-consuming.
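
As a rough sketch (the index name idx_data1 and column data1 are made up
for illustration; adapt them to your actual schema):

# drop the (hypothetical) index before the bulk insert...
c.execute("ALTER TABLE datatable DROP INDEX idx_data1")

# ...run the batched inserts here...

# ...then rebuild the index once, after all the data is in
c.execute("ALTER TABLE datatable ADD INDEX idx_data1 (data1)")
conn.commit()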

But you will probably never get your code to run
as fast as the mysql "load" command, which (I'm
guessing here, don't use mysql much) skips all
the sql machinery and writes directly to the table.
 
John Nagle

cjl said:
Group:

I'm new to python and new to mysql.

I have a csv file that is about 200,000 rows that I want to add to a
mysql database. Yes, I know that I can do this directly from the
mysql command line, but I am doing it through a python script so that
I can munge the data before adding it.

I have the following example code:

conn = MySQLdb.connect(db="database", host="localhost", user="root",
                       passwd="password")
c = conn.cursor()

reader = csv.reader(open(sys.argv[1]))
for row in reader:
    data1, data2, data3, data4 = row
    data = (data1, data2, data3, data4)
    c.execute("""insert into datatable values (%s, %s, %s, %s)""",
              data)
    conn.commit()

This takes a really long time to execute, on the order of minutes.
Directly importing the csv file into mysql using 'load infile' takes
seconds.

What am I doing wrong? What can I do to speed up the operation?

"LOAD INFILE" is generally faster than doing
many inserts. There are ways to speed things up, especially if
you can reconfigure MySQL to use more memory. You may want to
load the data without indexing it, then build the indices. Ask
in a MySQL group, or read the manual.

If you can use LOAD INFILE, do so. Preferably from an empty
database. Then it can sort the records and insert them all at once.
You can create a file for LOAD INFILE from Python, then issue the
LOAD INFILE command.
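
A minimal sketch of that approach, assuming the munged rows are written to
a temporary file named munged.csv and that LOCAL INFILE is enabled on both
client and server (the local_infile flag and the single-table layout are
assumptions, not something from the original post):

import csv
import sys
import MySQLdb

conn = MySQLdb.connect(db="database", host="localhost", user="root",
                       passwd="password", local_infile=1)
c = conn.cursor()

# Munge the rows in Python, writing a clean temporary CSV.
out = open("munged.csv", "w")
writer = csv.writer(out)
for row in csv.reader(open(sys.argv[1])):
    writer.writerow(row)   # apply any cleanup/transformation here
out.close()

# Let the server do the bulk load in one pass.
c.execute("""LOAD DATA LOCAL INFILE 'munged.csv'
             INTO TABLE datatable
             FIELDS TERMINATED BY ','""")
conn.commit()

That keeps the munging in Python while leaving the actual insert work to
the server.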

A few minutes isn't a "really long time" for that. I had to do
15,000,000 INSERT operations a few months back, and it took three days.

John Nagle
 
