DBI Performance Issues

Chris H.

I'm currently working on an issue for work that involves processing a pipe-delimited text file consisting of 356,400 lines of data with 19 fields in each row.

The issue I'm having is not the performance of reading and splitting this data, but of writing to the database. Using DBI/DBD::mysql, with 'localhost' as the server the script connects to, I get the following results:
time ./ushbh-dev.pl hbh.txt
Reading input file - 08/25/06 10:19:09
Updating us_hourly table - 08/25/06 10:19:09
1000 records processed.

real 0m1.096s
user 0m0.424s
sys 0m0.008s

------------------------

Using 'servername.tld' from a remote machine, I get the following performance:

time ./ushbh-dev.pl hbh.txt
Reading input file - 08/25/06 10:17:49
Updating us_hourly table - 08/25/06 10:17:49
1000 records processed.

real 1m11.606s
user 0m0.250s
sys 0m0.034s

------------------------

The issue seems to be with remote connections, either through the DBI module or the DBD::mysql driver. The same thing appears with the Sybase DBI driver going from a Unix machine to an MSSQL machine.

Are there any remote-connection variables or performance-enhancing calls that I'm missing?

The SQL query that's being run is:

update foo set var1 = 'foo', var2 = 'foo2' where var3 = 'bar' and var4 = 'bar2';

The table is indexed on var3 and var4, which gave a huge increase in performance once it was indexed properly.
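
For reference, here is a minimal, untested sketch of how a per-row update like that is presumably driven from DBI, with the statement prepared once and executed per input line. The table and column names come from the query above; the field positions, connection details, and index name are assumptions for illustration:

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Connection details are placeholders.
my $dbh = DBI->connect('DBI:mysql:database=test;host=localhost',
                       'user', 'password', { RaiseError => 1 });

# The composite index on the WHERE columns is what makes each update cheap:
# $dbh->do('CREATE INDEX idx_var3_var4 ON foo (var3, var4)');

# Prepare once, execute once per line of the pipe-delimited file.
my $sth = $dbh->prepare(
    'UPDATE foo SET var1 = ?, var2 = ? WHERE var3 = ? AND var4 = ?'
);

while (my $line = <>) {
    chomp $line;
    my @fields = split /\|/, $line;      # 19 pipe-delimited fields per row
    $sth->execute(@fields[0 .. 3]);      # which fields map to which column is assumed
}

$dbh->disconnect;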

This is data that needs to be updated every 15 minutes, so dropping the data and re-inserting is not an option.

Thanks for any help or insight you can provide on this issue.
 
xhoster

Chris H. said:
I'm currently working on an issue for work that involves processing a pipe-delimited text file consisting of 356,400 lines of data with 19 fields in each row.

The issue I'm having is not the performance of reading and splitting this data, but of writing to the database. Using DBI/DBD::mysql, with 'localhost' as the server the script connects to, I get the following results:

Reading input file - 08/25/06 10:19:09
Updating us_hourly table - 08/25/06 10:19:09
1000 records processed.

real 0m1.096s
user 0m0.424s
sys 0m0.008s

------------------------

Using 'servername.tld' from a remote machine, I get the following performance:

Reading input file - 08/25/06 10:17:49
Updating us_hourly table - 08/25/06 10:17:49
1000 records processed.

real 1m11.606s
user 0m0.250s
sys 0m0.034s

------------------------

How fast is your network connection? I see barely any change in
performance at all between local and remote.

The issue seems to be with remote connections, either through the DBI module or the DBD::mysql driver.

I would suggest it is at a lower level, at the network or at the compiled binaries that DBD::mysql links against. Have you implemented the same thing in another language (C, Java, Python) and seen better performance? If not, I doubt your problem has anything to do with Perl or Perl modules per se.

Xho
 
Chris H.

How fast is your network connection? I see barely any change in
performance at all between local and remote.

The real machines will be 100 Mbit to 100 Mbit via LAN. The testing machines I used go from my home cable connection (8 Mbit down / 2 Mbit up) to a 100 Mbit colo. It's easy to blame the cable connection, however the traffic going across the line during the updates is a whopping 7 kbit/s, which a 14.4k modem could handle. That's why I'm at a loss for ideas. It's as if it's refusing to use the CPU or the network hardware during remote tests. As I don't have iptraf on the actual machine that will run this, I can't say for certain how much bandwidth it's using on the LAN, but I'd imagine not much, if any, more than in my home-to-colo tests.

I would suggest it is at a lower level, at the network or at the compiled binaries that DBD::mysql links against. Have you implemented the same thing in another language (C, Java, Python) and seen better performance? If not, I doubt your problem has anything to do with Perl or Perl modules per se.

Xho

I have not implemented this in another language; however, there are a few differences. First, the live script is run from a Sun Solaris machine using DBI's Sybase ODBC driver to connect to an MSSQL database on a Windows 2003 Server machine. The dev script I've modified to suit my environment, which consists of MySQL. The tables I have set up are identical, and the data file was scp'ed from the live machine and run against both databases to check for performance differences.

The only other difference I can find is that on my colo box, running MySQL locally and running the script locally with 'localhost' as the server to connect to will use 50% CPU for the Perl script and 50% CPU for the mysql process.

Running it from my home machine to the MySQL box using 'domain.tld' as the server to connect to, it uses less than 0.5% CPU, as if it's not even working or attempting to work.

This effect is seen in the live version of the script too: with the remote server in the connect string, the script on the Solaris box will use less than 1% CPU, and the MSSQL server machine under Task Manager reports a CPU usage of 8%-14% (it doesn't run this database exclusively).

I realize there are quite a few variables at play here, but I'm hoping it's something really simple I'm missing that's blocking it from just running at its true speed.

I can't say how the DBI or DBD drivers on the Solaris machine are compiled, as I didn't set them up. The DBI and DBD drivers on both my home machine and the colo box are standard Gentoo-compiled versions. As far as the ebuilds are concerned, they set no additional options beyond the defaults.

Thanks.
 
xhoster

Chris H. said:
The real machines will be 100 Mbit to 100 Mbit via LAN. The testing machines I used go from my home cable connection (8 Mbit down / 2 Mbit up) to a 100 Mbit colo. It's easy to blame the cable connection, however the traffic going across the line during the updates is a whopping 7 kbit/s, which a 14.4k modem could handle. That's why I'm at a loss for ideas.

You are not making the important distinction between throughput and latency. Your Perl script sends the SQL command, then waits to get a response. While waiting, it does nothing--it doesn't use CPU or bandwidth by speculatively "sending ahead" more update statements. So you are limited by latency, not by throughput. With an insert, you can circumvent this by using the MySQL-specific multiple-row insert statement to add many rows per network round-trip. Maybe you can use the MySQL-specific "INSERT ... ON DUPLICATE KEY UPDATE" syntax to accomplish the same thing--just a suggestion, I've never looked into it in detail myself. However, if your real situation will have the two machines on the same LAN, then latency will probably not be a problem. You simply can't do benchmarking on such disparate setups and expect the results to be meaningful.
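
As a rough, untested sketch of that idea, a batched "INSERT ... ON DUPLICATE KEY UPDATE" from DBI might look something like the following. The table and column names are taken from the query earlier in the thread; the batch size and the assumption that (var3, var4) form a unique key are illustrative only:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('DBI:mysql:database=test;host=servername.tld',
                       'user', 'password', { RaiseError => 1 });

# Each element is [var1, var2, var3, var4] for one input line.
my @rows = (
    [ 'foo',  'foo2', 'bar',  'bar2' ],
    [ 'foo3', 'foo4', 'bar3', 'bar4' ],
);

# Send many rows per network round-trip instead of one statement per row.
my $batch_size = 500;
while (my @batch = splice @rows, 0, $batch_size) {
    my $placeholders = join ', ', ('(?, ?, ?, ?)') x @batch;
    my $sth = $dbh->prepare(
        "INSERT INTO foo (var1, var2, var3, var4) VALUES $placeholders"
      . " ON DUPLICATE KEY UPDATE var1 = VALUES(var1), var2 = VALUES(var2)"
    );
    $sth->execute(map { @$_ } @batch);
}

$dbh->disconnect;

This only behaves like the original UPDATE if (var3, var4) really is a unique key; otherwise rows that don't exist yet would be inserted rather than skipped.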

Xho
 
Chris H.

You are not making the important distinction between throughput and latency. Your Perl script sends the SQL command, then waits to get a response. While waiting, it does nothing--it doesn't use CPU or bandwidth by speculatively "sending ahead" more update statements. So you are limited by latency, not by throughput. With an insert, you can circumvent this by using the MySQL-specific multiple-row insert statement to add many rows per network round-trip. Maybe you can use the MySQL-specific "INSERT ... ON DUPLICATE KEY UPDATE" syntax to accomplish the same thing--just a suggestion, I've never looked into it in detail myself. However, if your real situation will have the two machines on the same LAN, then latency will probably not be a problem. You simply can't do benchmarking on such disparate setups and expect the results to be meaningful.

Xho

I understand the differences in the setup, but I don't see latency playing that much of a role in this issue. Ping response from Solaris to Windows on the LAN (live setup) is 2.73 ms. Ping response from my home machine to my colo box is 17.5 ms. My issue is that the two setups perform the same in regards to how slowly they are getting the information to the server. I can understand it being slower, that's a given, but what I'm having an issue with is that if it takes just as long to update 10 records remotely as 1,000 locally, then something is going wrong somewhere. I'd suspect it should be faster than that.

Thanks again for all the information thus far; it is giving me different ideas to try. I'll definitely look at the 'INSERT ... ON DUPLICATE KEY UPDATE' documentation to see if it'll work.
 
xhoster

Chris H. said:
I understand the differences in the setup, but I don't see latency playing that much of a role in this issue. Ping response from Solaris to Windows on the LAN (live setup) is 2.73 ms. Ping response from my home machine to my colo box is 17.5 ms.

Is that under load or under no load? Anyway, you reported 71 seconds for 1000 records, which is 71 ms per record. 71 ms is only fourfold more than 17.5 ms, so I would say that latency is definitely in the ballpark to be a problem. (I would expect MySQL to have more overhead than ping--4 times more would not greatly surprise me.)

My issue is that the two setups perform the same in regards to how slowly they are getting the information to the server. I can understand it being slower, that's a given, but what I'm having an issue with is that if it takes just as long to update 10 records remotely as 1,000 locally, then something is going wrong somewhere. I'd suspect it should be faster than that.

I'm afraid you lost me. You have a localhost MySQL connection doing 1000 records per second, and an over-cable-modem MySQL connection doing 14 records per second. Do you have a *mysql* connection on a LAN? If so, what speed did that provide?

Xho
 
Peter J. Holzer

DBI also has an execute_array method, but I don't know if this is implemented by DBD::mysql. If it is, it can improve performance quite dramatically--I've measured a factor of 50 in an (admittedly very simple) benchmark with Oracle.
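
If DBD::mysql does support it, a minimal sketch of execute_array with column-wise binding might look like this (connection details and data are placeholders; the statement reuses the column names from the query earlier in the thread):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('DBI:mysql:database=test;host=servername.tld',
                       'user', 'password', { RaiseError => 1 });

my $sth = $dbh->prepare(
    'UPDATE foo SET var1 = ?, var2 = ? WHERE var3 = ? AND var4 = ?'
);

# One array per placeholder, each holding that column's values for all rows.
my @var1 = ('foo',  'foo3');
my @var2 = ('foo2', 'foo4');
my @var3 = ('bar',  'bar3');
my @var4 = ('bar2', 'bar4');

my @tuple_status;
$sth->execute_array(
    { ArrayTupleStatus => \@tuple_status },
    \@var1, \@var2, \@var3, \@var4,
);
print scalar(@tuple_status), " tuples executed\n";

Note that unless the driver implements a true bulk interface, DBI's default execute_array simply calls execute once per tuple, which wouldn't save any network round-trips.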
I understand the differences in the setup, but I don't see latency playing that much of a role in this issue. Ping response from Solaris to Windows on the LAN (live setup) is 2.73 ms.

This seems very slow for a 100 Mbit network. 0.27 ms is more what I would expect.

Ping response from my home machine to my colo box is 17.5 ms. My issue is that the two setups perform the same in regards to how slowly they are getting the information to the server. I can understand it being slower, that's a given, but what I'm having an issue with is that it takes just as long to update 10 records remotely as 1,000 locally...

ping localhost reports an RTT of 0.05 ms on my (rather old) laptop. I assume your server is at least as fast, so the latency of your broadband connection is at least 350 times longer than that of the loopback interface. I don't find it very surprising that a mostly latency-bound job takes 71 times as long. Assuming there are four round trips per insert (I don't know why there should be so many) and an actual processing time of 0.8 ms, we arrive almost exactly at your measurements:

broadband: 4 * 17.5 ms + 0.8 ms = 70.8 ms
loopback:  4 * 0.05 ms + 0.8 ms =  1.0 ms

If this theory is correct, you should expect

4 * 2.73 ms + 0.8 ms = 11.72 ms

per insert (or about 85 inserts/second) over the LAN.
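
A quick way to sanity-check that prediction from the LAN client would be to time a handful of single-row statements; a rough sketch using Time::HiRes (connection details and bind values are placeholders):

use strict;
use warnings;
use DBI;
use Time::HiRes qw(gettimeofday tv_interval);

my $dbh = DBI->connect('DBI:mysql:database=test;host=servername.tld',
                       'user', 'password', { RaiseError => 1 });

my $sth = $dbh->prepare(
    'UPDATE foo SET var1 = ?, var2 = ? WHERE var3 = ? AND var4 = ?'
);

my $n  = 100;
my $t0 = [gettimeofday];
$sth->execute('foo', 'foo2', 'bar', 'bar2') for 1 .. $n;
my $elapsed = tv_interval($t0);

printf "%.2f ms per statement (%d statements in %.2f s)\n",
       1000 * $elapsed / $n, $n, $elapsed;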


hp
 
