speeding up cgi perl


skeldoy

Hey! I am working on an application that features a huge database
(mysql) and some cgi (perl) for listing, sorting, searching,
dupechecking and more. I see that the configuration for the mysql is
pretty much spot on - most of the data is cached and so the mysqld-
process isn't really doing much in terms of causing bottlenecks. But I
believe that the perl-code may be the bottleneck here. I have turned
off buffering completely and I render pretty much just the things that
are needed. Still it can take up to a minute to print (in html) a
query that returns in a second on the mysql-console.

The output from the cgi is around 15MB for every operation the user
does, so I see the potential for slowness right there, in the sheer
amount of data that has to be produced and transmitted over the net,
but still I don't really understand what I have done to make the cgi
so slow.

The cgi is mostly taking a parameter like $query and doing a "select
from db where value like "%$query%"" and returning that in pretty <td>
$_</td>-form. That seems to work reasonably fast. But when I do a
"select * from db" things tend to get really slow when dealing with
15000-entries++ (even though mysql has it all cached and spits it out
in a split second). The cgi-process sits there, spitting out html to
the client, using up 95% of the cpu-time of one of the cores and using
50MBs of memory or so. I have no idea what it does. I have replaced
most of the " with '. And I can't really see that I am doing something
that needn't be done. Is there an issue with creating multiple
database-connections (DBD::mysql) that I should be aware of?

If somebody has experience in doing huge db's with perl, can you
please give me some pointers? Is this a code issue or is it a network-
issue or is it a browser issue? Does anyone have any tips for doing
huge databases with perl?
 

smallpond

"select * from db"

If you're sending back everything, why not just render the page once and
return static html? You can schedule updates when the db changes.
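
A rough sketch of what that could look like, say as a cron job (the DSN,
credentials, table name and output path below are all just placeholders,
not anything from your setup):

#!/usr/bin/perl
# Regenerate the static listing page; run from cron or after DB updates.
use strict;
use warnings;
use DBI;
use HTML::Entities qw(encode_entities);

my $dbh = DBI->connect('DBI:mysql:database=mydb', 'user', 'pass',
                       { RaiseError => 1 });
my $sth = $dbh->prepare('select * from db');
$sth->execute;

open my $out, '>', '/var/www/html/listing.html.tmp' or die $!;
print {$out} "<html><body><table>\n";
while (my $row = $sth->fetchrow_arrayref) {
    print {$out} '<tr>',
        (map { '<td>' . encode_entities($_ // '') . '</td>' } @$row),
        "</tr>\n";
}
print {$out} "</table></body></html>\n";
close $out or die $!;

# Atomic swap so readers never see a half-written page.
rename '/var/www/html/listing.html.tmp', '/var/www/html/listing.html' or die $!;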
 

Peter J. Holzer

Hey! I am working on an application that features a huge database
(mysql) and some cgi (perl) for listing, sorting, searching,
dupechecking and more. I see that the configuration for the mysql is
pretty much spot on - most of the data is cached and so the mysqld-
process isn't really doing much in terms of causing bottlenecks. But I
believe that the perl-code may be the bottleneck here. I have turned
off buffering completely and I render pretty much just the things that
are needed. Still it can take up to a minute to print (in html) a
query that returns in a second on the mysql-console.

The output from the cgi is around 15MB for every operation the user
does, so I see the potential for slowness right there, in the sheer
amount of data that has to be produced and transmitted over the net,
but still I don't really understand what I have done to make the cgi
so slow.

The cgi is mostly taking a parameter like $query and doing a "select
from db where value like "%$query%""

You should use placeholders instead for security, but that won't affect
speed much.
and returning that in pretty <td> $_</td>-form. That seems to work
reasonably fast. But when I do a "select * from db" things tend to get
really slow when dealing with 15000-entries++ (even though mysql has
it all cached and spits it out in a split second). The cgi-process
sits there, spitting out html to the client, using up 95% of the
cpu-time of one of the cores and using 50MBs of memory or so. I have
no idea what it does.

To find out where it spends the time, use Devel::NYTProf.

The simplest method to use it is usually to just invoke it from the
commandline:

export REQUEST_METHOD=GET
export QUERY_STRING='whatever'
perl -d:NYTProf yourscript.cgi

This creates a file nytprof.out in the current directory which you can
convert to a nice HTML report with the nytprofhtml command.

There are also other ways for invoking it, see the docs.


CGI does have considerable overhead per row. However, for just 15000
rows I'd still expect that to be finished in a few seconds at most.
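
If the profile does point at the row loop, a common way to cut that
overhead down is to build each row with a single join and print in chunks
rather than printing cell by cell. A rough sketch, assuming an already
connected $dbh and HTML::Entities for escaping (the table name is just a
placeholder):

use HTML::Entities qw(encode_entities);

my $sth = $dbh->prepare('select * from db');
$sth->execute;

my @chunk;
while (my $row = $sth->fetchrow_arrayref) {
    push @chunk, '<tr>'
        . join('', map { '<td>' . encode_entities($_ // '') . '</td>' } @$row)
        . '</tr>';
    if (@chunk >= 500) {           # flush every 500 rows, not per cell
        print join("\n", @chunk), "\n";
        @chunk = ();
    }
}
print join("\n", @chunk), "\n" if @chunk;
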
If somebody has experience in doing huge db's with perl, can you
please give me some pointers?

Is 700 GB and 6 billion rows in the largest table huge enough? I don't
use mysql for that one, though.
Is this a code issue or is it a network-
issue or is it a browser issue?

Since you wrote that your CGI script uses 95% CPU time during the time
it runs it is unlikely to be a browser problem. But browsers in general
are rather slow at rendering tables. For 15000 rows, the browser may
easily take a few dozen seconds to render the table (depending on the
speed of your workstation, how much memory you have, etc.) Try accessing
the CGI with wget or a similar tool - is it much faster than with the
browser?

hp
 

xhoster

skeldoy said:
Hey! I am working on an application that features a huge database
(mysql) and some cgi (perl) for listing, sorting, searching,
dupechecking and more.

Isn't dupechecking more a back-office thing best done from the command
line, not CGI?
I see that the configuration for the mysql is
pretty much spot on - most of the data is cached and so the mysqld-
process isn't really doing much in terms of causing bottlenecks. But I
believe that the perl-code may be the bottleneck here. I have turned
off buffering completely

What does that mean? $|=1? mysql_use_result?

I wouldn't know how to identify *all* sources of buffering present in
a complex system, let alone know how to disable them completely.
and I render pretty much just the things that
are needed. Still it can take up to a minute to print (in html) a
query that returns in a second on the mysql-console.

It takes less than a second to stream the entire <big_num> of data
to your screen on the mysql-console? Or do you somehow tell it to
run the query but suppress the actual results?
The output from the cgi is around 15MB for every operation the user
does, so I see the potential for slowness right there, in the sheer
amount of data that has to be produced and transmitted over the net,
but still I don't really understand what I have done to make the cgi
so slow.

Does the CGI have to be invoked by POST, or will GET work?

Use wget to fetch the page. See how long it takes when the results just
have to be fetched and stored and not rendered.

And once you do that and have the results stored as a static file, you
might as well put that in a location that is served up via http and open it
with the client browser. See how long it takes to pull the data over
the net and render it in a browser when it is coming from a static file
rather than a CGI.
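
If wget isn't convenient, a few lines of Perl do the same timing from the
command line (the URL below is only a placeholder):

use strict;
use warnings;
use LWP::UserAgent;
use Time::HiRes qw(time);

my $ua  = LWP::UserAgent->new;
my $t0  = time;
my $res = $ua->get('http://yourserver/cgi-bin/yourscript.cgi?query=whatever');
printf "status=%s bytes=%d seconds=%.2f\n",
    $res->code, length($res->decoded_content // ''), time - $t0;
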
The cgi is mostly taking a parameter like $query and doing a "select
from db where value like "%$query%"" and returning that in pretty <td>
$_</td>-form. That seems to work reasonably fast. But when I do a
"select * from db" things tend to get really slow when dealing with
15000-entries++ (even though mysql has it all cached and spits it out
in a split second). The cgi-process sits there, spitting out html to
the client, using up 95% of the cpu-time of one of the cores

You can try adding -d:DProf or -d:SmallProf (or your favorite profiler) to
the CGI shebang line. Or come up with an adapter to run the CGI in a
command-line environment, and profile it from there.
and using
50MBs of memory or so.

On most modern computers, 50MB is a rounding error. Is that the case for
your computer?

I have no idea what it does. I have replaced
most of the " with '.
???

And I can't really see that I am doing something
that needn't be done. Is there an issue with creating multiple
database-connections (DBD::mysql) that I should be aware of?

*Anything* can be done incorrectly, in which case there are things to be
aware of. Are you using multiple MySQL connections in your CGI? If so,
how are you doing it?

If somebody has experience in doing huge db's with perl, can you
please give me some pointers?

Profile, profile, profile. If that doesn't work, comment out chunks
of code and see if that affects speed (for example, do all the work except
for the actual printing of the html table.)
Is this a code issue or is it a network-
issue or is it a browser issue?

If the CGI script itself is using 95% of a CPU, then it probably isn't
a browser or network issue. Unless your CPU utilization reporting tool
accounts for I/O waits as CPU usage (I've seen some that do).
Does anyone have any tips for doing
huge databases with perl?

15,000 records is not huge. It is just barely even medium. Unless you
have giant blob or text fields. If you do have a large database with
DBD::mysql, it is important to know the benefits and drawbacks of
mysql_use_result.
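
For what it's worth, mysql_use_result can be switched on per statement
handle, roughly like this; it streams rows as you fetch them instead of
buffering the whole result set on the client, at the cost of holding
server-side resources until the last fetch (the table name is a
placeholder):

my $sth = $dbh->prepare('select * from db', { mysql_use_result => 1 });
$sth->execute;
while (my $row = $sth->fetchrow_arrayref) {
    # emit one row of HTML here
}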

Xho

 

skeldoy

... why not just render the page once and return static html? You can
schedule updates when the db changes.

I think that I am somewhat doing that by configuring the mysql to
cache almost everything it can. The server basically is just one big
mysql-cache and some memory for the cgi-process to run. But I see
your point. I can do it to test whether the browser may be the issue.
 

skeldoy

You should use placeholders instead for security, but that won't affect
speed much.

I know. But it is intended to be used by inhouse people, and if they
mess it up, I will just restore it. :)
To find out where it spends the time, use Devel::NYTProf.
The simplest method to use it is usually to just invoke it from the
commandline:
export REQUEST_METHOD=GET
export QUERY_STRING='whatever'
perl -d:NYTProf yourscript.cgi
This creates a file nytprof.out in the current directory which you can
convert to a nice HTML report with the nytprofhtml command.
There are also other ways for invoking it, see the docs.

Good tip! I will try it out and see what it says. Thanks man, this is
really good stuff right there.
CGI does have considerable overhead per row. However, for just 15000
rows I'd still expect that to be finished in a few seconds at most.

I know! When I do a "mysql -u root \n select * from db" it returns
pretty much at the speed that my gfx-adapter can print it out, and
within a couple of secs if I pipe it to a file.
Is 700 GB and 6 billion rows in the largest table huge enough? I don't
use mysql for that one, though.

Is it a mysql issue that makes you use another db?
Since you wrote that your CGI script uses 95% CPU time during the time
it runs it is unlikely to be a browser problem. But browsers in general
are rather slow at rendering tables. For 15000 rows, the browser may
easily take a few dozen seconds to render the table (depending on the
speed of your workstation, how much memory you have, etc.) Try accessing
the CGI with wget or a similar tool - is it much faster than with the
browser?

A really good question. Never really thought about seeing whether the
data-transmission is the issue in that way.
Just another thing for me to do..

Really appreciate your time/help on this. Good pointers there. I got
pretty tunnel-visioned looking at the code and tuning the mysql there.
Never occurred to me that there are basic ways of testing the
data-transmission (i.e. curl/wget). And the Devel::NYTProf package is a
definite plus in finding the culprit here. Thanks again! :)
 

skeldoy

Isn't dupechecking more a back-office thing best done from the command
line, not CGI?

I know. I got some very particular non-technical customers who want
to oversee the entire process just to pass the time. I think the
bureaucrats have too little work and too much time on their hands. But
hey! I can't really tell them what THEY want to do, right? :)
What does that mean?  $|=1?  mysql_use_result?

$|=1 on everything, everywhere. The mysql doesn't really buffer output
anymore, but the whole thing is in memory - so it is, as I said,
probably not the culprit.
I wouldn't know how to identify *all* sources of buffering present in
a complex system, let alone know how to disable them completely.

I know, but I thought there might be something trivial that I didn't
know about. Good to hear though.
It takes less than a second to stream the entire <big_num> of data
to your screen on the mysql-console?  Or do you somehow tell it to
run the query but suppress the actual results?

If I render it on screen from the mysql - I get bottlenecked by the
graphics-adapter/Xorg. When I pipe it to a file it gets done in like a
couple of seconds, and I figure that is how long it really takes to do
the query.
Does the CGI have to be invoked by POST, or will GET work?

I use GET.
Use wget to fetch the page.  See how long it takes when the results just
have to be fetched and stored and not rendered.

I will.
And once you do that and have the results stored as a static file, you
might as well put that in a location that is served up via http and open it
with the client browser.  See how long it takes to pull the data over
the net and render it in a browser when it is coming from a static file
rather than a CGI.

The next step. I agree. I plainly just have to be more thorough - I see
your point there.
You can try adding -d:DProf or -d:SmallProf (or your favorite profiler) to
the CGI shebang line.  Or come up with an adapter to run the CGI in a
command-line environment, and profile it from there.

Yeah. Mister Peter mentioned Devel::NYTProf, which I intend to run.
But I will definitely try out DProf and SmallProf like you say.
Glad some of you actually know about these things (like profiling and
such other stuff that I am totally blind to).
On most modern computers, 50MB is a rounding error.  Is that the case for
your computer?

Huh? Is this a new feature of the von Neumann machine that slipped
my attention?
*Anything* can be done incorrectly, in which case there are things to be
aware of.  Are you using multiple MySQL connections in your CGI?  If so,
how are you doing it?

doing multiple "my $db1 = the connection stuff ; my $db2 = the
connection stuff ; my $db...
I open them all up in the beginning of the code and keep them over the
length of the run.
Strange thing is that when I parse the data into the DB I do it in the
exactly same manner and the
multiple connections speed that up. Can't really see how multiple
connections would suddenly be an issue,
but strange things happen in the Perl-world, so I have all
possibilities open.
Profile, profile, profile.  If that doesn't work, comment out chunks
of code and see if that affects speed (for example, do all the work except
for the actual printing of the html table.)

Again with the profiling. I will, man. Thanks for that. Never actually
been a profiling guy. I just code and it normally works the way I
intended. Will definitely try it out now.
If the CGI script itself is using 95% of a CPU, then it probably isn't
a browser or network issue.  Unless your CPU utilization reporting tool
accounts for I/O waits as CPU usage (I've seen some that do).

I don't really know. All I have done is run top on the linux-server
that runs it.
Not really into the whole userland thing - can it be an apache issue??
15,000 records is not huge.  It is just barely even medium.  Unless you
have giant blob or text fields.  If you do have a large database with
DBD::mysql, it is important to know the benefits and drawbacks of
mysql_use_result.

I know 15000 is supposed to be nada. But 15000 is the magic number
where things tend to slow down.
You must excuse me, I don't normally write in English, so I may tend to
write awkwardly.
I will read up on mysql_use_result though. Thanks for the tip. I see
there is much more documentation to dive into.

Thanks for all the tips and pointers and not least your time!
 

Bart Lateur

skeldoy said:
I know. But it is intended to be used by inhouse people, and if they
mess it up, I will just restore it. :)

Tsk.

It's not hard, and here is how:

my $sth = $dbh->prepare("select * from db where value like concat('%', ?, '%')");
$sth->execute($query);

How hard was that?
 

skeldoy

Update: Found out (with curl) that the problem is in the browsers. The
queries complete and download in a second or so, but it takes forever
to render in all browsers I have tested.

Thanks for your help guys!
 

Peter J. Holzer

Is it a mysql issue that makes you use another db?

Actually for us, mysql would be "another db". We are traditionally an
Oracle shop, so using Oracle was the natural thing to do for us. We are
using mysql for a few smaller databases, and I'm planning to look at
postgres again - the inverted indices could be really useful for our
application.

hp
 

John

skeldoy said:
Update: Found out (with curl) that the problem is in the browsers. The
queries complete and download in a second or so, but it takes forever
to render in all browsers I have tested.

Thanks for your help guys!

If it is a browser problem, try googling 'website optimisation'.
There's a lot you can do to speed up the display of <table>s, and you
mentioned earlier that you use <td>s.
O'Reilly have a book called 'Website Optimization'.
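
One well-known trick for big tables is table-layout: fixed, which lets the
browser lay out rows as they arrive instead of measuring every cell first.
Since the markup comes from the CGI anyway it is a small change there; the
column widths below are only an example:

print qq{<table style="table-layout: fixed; width: 100%">\n};
print qq{<col style="width: 20%"><col style="width: 40%"><col style="width: 40%">\n};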

Regards
John
 

Bart Lateur

Peter said:
Actually for us, mysql would be "another db". We are traditionally an
Oracle shop, so using Oracle was the natural thing to do for us. We are
using mysql for a few smaller databases, and I'm planning to look at
postgres again - the inverted indices could be really useful for our
application.

Well, allegedly, Postgres is a lot closer to Oracle than Mysql is. So
you'll probably feel a lot more at home in Postgres than in Mysql.

One thing I miss is that Postgres doesn't have packages.
 
