Parsing large web server logfiles efficiently


ashutosh.gaur

Hi
I'm a perl newbie. I've been given the task of parsing through very
large (500MB) web server log files in an efficient manner. I need to
parse about 8 such files in parallel and create corresponding csv files
as output. This needs to be done every hour. In other words, the entire
parsing of about 8 files should complete well within 30 minutes. The
remaining 30 minutes are required for other database related activities
that need to be performed on the csv files generated by the perl
script.

Following is a snippet of my perl routine....

open(INFO, $in_file);
open(DAT, $out_file);

while (<INFO>) {

    my ($host, $ident_user, $auth_user, $date, $time,
        $time_zone, $method, $url, $protocol, $status, $bytes,
        $referer, $agent);

    ($host, $ident_user, $auth_user, $date, $time,
     $time_zone, $method, $url, $protocol, $status, $bytes,
     $referer, $agent) =
        /^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?) (\S+)" (\S+) (\S+) "([^"]+)" "([^"]+)"$/
        or next;

    my $decrypt_url = <decrypting subroutine> $url;

    print DAT $host, $ident_user, $auth_user, $date, $time,
        $time_zone, $method, $decrypt_url, $protocol, $status,
        $bytes, $referer, $agent, "\n";
}

---------------------------------------------------------------------------------------------------
This script takes about 50 minutes to process all 8 files. I need
some suggestions to improve the performance and bring the processing
time down.

The hardware is a good 8-CPU (1.2 GHz) machine with 8GB of memory.
This machine will be used solely for file processing and running one
more application (Informatica).

thanks
Ash
 

l v

Hi
I'm a perl newbie. I've been given the task of parsing through very
large (500MB) web server log files in an efficient manner. I need to
parse about 8 such files in parallel and create corresponding csv files
as output. This needs to be done every hour. In other words, the entire
parsing of about 8 files should complete well within 30 minutes. The
remaining 30 minutes are required for other database related activities
that need to be performed on the csv files generated by the perl
script.

Following is a snippet of my perl routine....

open(INFO, $in_file);
open(DAT, $out_file);

while (<INFO>) {

my ($host, $ident_user, $auth_user, $date, $time,
$time_zone, $method, $url, $protocol, $status, $bytes,
$referer, $agent);

You can declare your variables in the next statement. so delete the
above statement by adding my to the beginning of the line.
*my* ($host, $ident_user, $auth_user, $date, $time,
$time_zone, $method, $url, $protocol, $status, $bytes,
$referer, $agent) =
/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?)
(\S+)" (\S+) (\S+) "([^"]+)" "([^"]+)"$/
or next;

Try replacing the rexexp with split() on space(s) into an array.
my $decrypt_url = <decrypting subroutine> $url;

print DAT $host, $ident_user, $auth_user, $date, $time,
$time_zone, $method, $decrypt_url, $protocol, $status,
$bytes, $referer, $agent, "\n";

you can then use print map { "$_," } @array or join() to add in your
commas for your CSV output.
}
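Putting the split() and join() ideas together, a rough, untested sketch
might look like this (it assumes $in_file and $out_file are already set,
that no field contains a space you need to keep, and it leaves the
decrypting step out):

open my $info, '<', $in_file  or die "could not open '$in_file': $!";
open my $dat,  '>', $out_file or die "could not open '$out_file': $!";

while ( my $line = <$info> ) {
    chomp $line;
    my @fields = split / +/, $line;   # split on runs of spaces
    next unless @fields;
    # The bracketed timestamp and the quoted request/referer/agent fields
    # still carry their [ ] and " characters and would need extra clean-up.
    print {$dat} join( ',', @fields ), "\n";
}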


I'm sure there are much more efficient ways, but something to start
with.

Len
 

Tad McClellan

open(INFO, $in_file);


You should always, yes *always*, check the return value from open():

open(INFO, $in_file) or die "could not open '$in_file' $!";
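The same goes for the output handle. A minimal sketch using lexical
filehandles and the three-argument form of open (assuming $in_file and
$out_file hold plain path names, i.e. the ">" is no longer embedded in
$out_file):

open my $info, '<', $in_file  or die "could not open '$in_file' for reading: $!";
open my $dat,  '>', $out_file or die "could not open '$out_file' for writing: $!";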
 

MikeGee

Hi
I'm a perl newbie. I've been given the task of parsing through very
large (500MB) web server log files in an efficient manner.
[snip]

Another approach to take is substituting commas for spaces in the
string rather than capturing all the fields. If your fields never
contain spaces then:

tr/ /,/

Couple that with sysread/syswrite, and you should get some big
improvements.
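For instance, a minimal, untested sketch of dropping tr/// into the
existing line-by-line loop (leaving the decrypting step aside):

while (<INFO>) {
    tr/ /,/;        # plain character translation, no regex engine involved
    print DAT $_;
}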
 

it_says_BALLS_on_your forehead

MikeGee said:
[snip]

Another approach to take is substituting commas for spaces in the
string rather than capturing all the fields. If your fields never
contain spaces then:

tr/ /,/

Couple that with sysread/syswrite, and you should get some big
improvements.

can you explain this more? why does this improve performance? isn't
this only for fixed-length unbuffered input?
 

xhoster

Hi
I'm a perl newbie. I've been given the task of parsing through very
large (500MB) web server log files in an efficient manner. I need to
parse about 8 such files in parallel and create corresponding csv files
as output. This needs to be done every hour. In other words, the entire
parsing of about 8 files should complete well within 30 minutes. The
remaining 30 minutes are required for other database related activities
that need to be performed on the csv files generated by the perl
script.

So you parse the file into a csv file, and then re-parse the csv file to do
the database stuff with it? Wouldn't it be more efficient to do the database
stuff directly in the script below?

Following is a snippet of my perl routine....

open(INFO, $in_file);
open(DAT, $out_file);

You should check the success of these. I assume the string in $out_file
begins with a ">"?

while (<INFO>) {

my ($host, $ident_user, $auth_user, $date, $time,
$time_zone, $method, $url, $protocol, $status, $bytes,
$referer, $agent);

($host, $ident_user, $auth_user, $date, $time,
$time_zone, $method, $url, $protocol, $status, $bytes,
$referer, $agent)

You could combine the my into the same statement as the assignment.
Probably not much faster, but certainly more readable.
=
/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?)
(\S+)" (\S+) (\S+) "([^"]+)" "([^"]+)"$/
or next;

my $decrypt_url = <decrypting subroutine> $url;

print DAT $host, $ident_user, $auth_user, $date, $time,
$time_zone, $method, $decrypt_url, $protocol, $status,
$bytes, $referer, $agent, "\n";
}
This script takes about 50 minutes to process all the 8 files.

That script only processes one file. How do the other 7 get processed?

It takes me less than 3 minutes to process one 635 MB file on a single-CPU
3 GHz machine.
I need
some suggestions to improve the performance and bring the processing
time down.

Take out the print DAT. How long does it take? Take out the decrypting
subroutine, too. How long does it take now? Take out the regex. How long
does it take now?

You could try changing the regex to a split, with some post-processing of
the elements. But from some testing I've done, I doubt that will save more
than 20% or so, and that is without the necessary post-processing.
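For example, something along these lines (an untested sketch; quoted
fields that can themselves contain spaces, such as the agent string,
would still need to be re-joined, which is part of that post-processing
cost):

while ( my $line = <INFO> ) {
    chomp $line;
    my @f = split / /, $line;
    next if @f < 12;

    my ( $host, $ident_user, $auth_user ) = @f[ 0 .. 2 ];
    ( my $date_time = $f[3] ) =~ tr/[//d;   # strip the leading [ from date:time
    ( my $time_zone = $f[4] ) =~ tr/]//d;   # strip the trailing ] from the zone
    my ( $date, $time ) = split /:/, $date_time, 2;
    ( my $method    = $f[5] ) =~ tr/"//d;   # strip the leading " from the method
    my $url = $f[6];
    ( my $protocol  = $f[7] ) =~ tr/"//d;   # strip the trailing " from the protocol
    my ( $status, $bytes ) = @f[ 8, 9 ];
    # @f[10 .. $#f] holds the quoted referer and agent pieces, still to be
    # cleaned up and re-joined.

    print DAT join( ',', $host, $ident_user, $auth_user, $date, $time,
                    $time_zone, $method, $url, $protocol, $status, $bytes ), "\n";
}
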
The hardware is a good 8 ( 1.2GHz ) CPU machine with 8GB of memory.

So, when the program is running, what is happening on the machine? Is the
CPU pegged? Is the network bandwidth pegged? Is the disk bandwidth
pegged? Are you using all 8 CPUs?

Xho
 

RedGrittyBrick

[snip]
my $decrypt_url = <decrypting subroutine> $url;

I wonder if that mysterious "decrypting" subroutine is where the
bottleneck is? What does it do?
 

Ash

Hi Xho,
I did some modifications based on your suggestions and some of my own.
I tried reading the entire file into a single scalar and removed the
regex. All I did was to substitute the spaces with commas:
--------------------------------------------------------------
undef $/;
$_ = <INFO>;

# Replace all spaces with commas
s/ /,/g;
---------------------------------------------------------------
The time came down to 45 seconds for a file. However, doing it this
way, I'll not be able to apply the decrypting subroutine. Moreover,
though it didn't occur, there is a possibility of memory problems with
this approach. The other approach, of reading the file into an array and
using the same regex, didn't improve the performance much (just about a
5 minute gain).

I did not understand your questions regarding pegging. Is there a way I
can peg the CPU and the bandwidths? How can I make sure that my script
uses all the available CPUs?

This script runs in parallel for all 8 files.

The decrypting subroutine is currently being developed by a separate
team. I'm not sure how efficient it will be. What I wanted was to make
my script efficient before even plugging that routine in.

thanks to all of you for your inputs
Ash
 

xhoster

Ash said:
Hi Xho,
I did some modifications based on your suggestions and some on my own.

Hi Ash,

I don't think you understood my suggestions. Those suggestions were on
ways to *diagnose* the problems more accurately, not ways to fix them. If
it gets much faster when you comment out the print, then you know the print
is the problem. If it gets much faster when you comment out the decrypt,
you know that that is the problem. etc.
I tried reading the entire file into a single scalar and removed the
regex. All I did was to substitute the spaces with commas:
--------------------------------------------------------------
undef $/;
$_ = <INFO>;

# Replace all spaces with commas
s/ /,/g;

Just one file, or all 8 in parallel? If the latter, then we now know that
reading the files from disk (or at least slurping them) is not the
bottleneck, but we don't know much more than that.
However, doing it this
way, I'll not be able to apply the decrypting subroutine.

Right. There is no point in testing the performance of s/ /,/g as that
doesn't do what needs to be done. I wanted you to take out the regex
entirely. Read the line, and then throw it away and go read the next line.
If that is much faster than reading the line, doing the regex, throwing
away the result of the regex and just going to the next line, then you know
the regex is the bottleneck. Then you will know where to focus your
efforts.
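Concretely, the stripped-down timing pass could look something like this
(just a sketch; the point is only to measure, so the loop deliberately
does nothing with each line):

use Time::HiRes qw(time);

my $t0 = time;
open my $info, '<', $in_file or die "could not open '$in_file': $!";
while ( my $line = <$info> ) {
    # pass 1: do nothing at all
    # pass 2: put the original match back in here (and nothing else) and
    #         compare the two timings -- the difference is the regex cost
}
close $info;
printf "elapsed: %.1f seconds\n", time - $t0;
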
Moreover,
though it didn't occur, there is a possibility of memory problems with
this approach.

Absolutely. You want to test the line-by-line approach; there is no point
in testing the slurping approach, as that is not a viable alternative.
The other approach of moving into an array and using the
same regex didn't improve the performance much (just about 5 minute
gain)

I did not understand your questions regarding pegging. Is there a way I
can peg the cpu and the bandwidths?

By "pegging" I mean using all of the resource which is available, so that
that resource becomes the bottleneck.
How can I make sure that my script
use all the available CPUs?

You use OS-specific tools to do that. On a unix-like system, "top" is a good
one. However, how you interpret the results of "top" is OS-dependent. On
Solaris, I think it should list each of the 8 processes as getting nearly
12.5% of the CPU. If not, then it is probably not CPU bound, but rather
IO bound.

This script runs in parallel for all 8 files.

And it takes 50 minutes for the last of the 8 to finish? How long does
it take if you only process 4 in parallel? (If it still takes 50 minutes,
that suggests you are CPU bound. If it takes substantially less, that
suggests you are IO bound.)
The decrypting subroutine is currently being developed by a separate
team. I'm not sure how efficient it will be. What I wanted was to make
my script efficient before even plugging that routine in.

So what are you currently using, just an empty dummy subroutine?

Xho
 

Ash

Hi Xho
I don't think you understood my suggestions. Those suggestions were on
ways to *diagnose* the problems more accurately, not ways to fix them. If
it gets much faster when you comment out the print, then you know the print
is the problem. If it gets much faster when you comment out the decrypt,
you know that that is the problem. etc.

Going by the methodical approach you suggested, I figured out that it
is the regex that's the bottleneck. A straightforward line-by-line read
followed by a write for 8 files running in parallel took a little over
2 minutes. Putting the regex back brought the time back up to about 50 minutes.
Just one file, or all 8 in parallel? If the latter, then we now know that
reading the files from disk (or at least slurping it) is not the
bottleneck, but we don't know much more than that.

Reading/writing is not the bottleneck.
[snip]
And it takes 50 minutes for the last of the 8 to finish? How long does
it take if you only process 4 in parallel? (If it still takes 50 minutes,
that suggests you are CPU bound. If it takes substantially less, that
suggests you are IO bound.)

The process (with regex) takes 47 minutes for a single file and about
50 minutes for 8 parallel files. So, it's CPU bound.

So what are you currently using, just an empty dummy subroutine?

Right now I don't even have a dummy routine there. I'm just throwing the
encrypted result into the output file.

This is a sample log entry
-------------------------------------------------------------------------------
151.205.97.52 - - [23/Aug/2005:11:56:31 +0000] "GET
/liveupdate-aka.symantec.com/common$20client$20core_103.0.3_english_livetri.zip
HTTP/1.1" 304 162 "-" "Symantec LiveUpdate" "-"
-------------------------------------------------------------------------------

The url /liveupdate-aka.s.....
will be encrypted. I tried a few other regexes to parse the line but
didn't get anything that would improve performance.

thanks
Ash
 

xhoster

....
The url /liveupdate-aka.s.....
will be encrypted. I tried a few other regexes to parse the line but
didn't get anything that would improve performance.

I'm afraid there will be no easy road for you here. At this point, I'd
first consider throwing more hardware at the problem. If not that, then
dropping from Perl into C to do the parsing. There may be existing Perl XS
modules (i.e. Perl modules fronting for C code) which will help to do that,
like Text::CSV_XS, but I'm not sure that even that is very promising.

Xho
 

MikeGee

it_says_BALLS_on_your forehead said:
MikeGee said:
[snip]

Another approach to take is substituting commas for spaces in the
string rather than capturing all the fields. If your fields never
contain spaces then:

tr/ /,/

Couple that with sysread/syswrite, and you should get some big
improvements.

can you explain this more? why does this improve performance? isn't
this only for fixed-length unbuffered input?

First, if you are simply replacing all occurrences of a space with a
comma, tr/// is the way to go. tr/// does not use regular expressions
and is therefore much faster.

In the simple case where tr/ /,/ is sufficient:

<untested>
# read fixed-size chunks, translate spaces to commas in place, write out
while ( sysread($in_fh, $_, $read_size) ) {
    tr/ /,/;
    syswrite($out_fh, $_, length);
}
</untested>

I suggest sysread() & syswrite() because I read that someone
implementing a slurp function got the best performance with them.

You should experiment with $read_size to find the optimal value. I
would start with 1k and go up in multiples of two.

This results in many fewer IO operations, and no regexp. I bet it
screams.
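One way to run that experiment, as a rough sketch (the log path is a
placeholder, and syswrite is left out so the timing reflects read plus
translate only):

use strict;
use warnings;
use Time::HiRes qw(time);

my $in_file = 'access.log';                          # placeholder path
for my $read_size ( map { 1024 * 2**$_ } 0 .. 7 ) {  # 1k, 2k, ... 128k
    open my $in_fh, '<', $in_file or die "could not open '$in_file': $!";
    my $t0 = time;
    my $chunk;
    while ( sysread( $in_fh, $chunk, $read_size ) ) {
        $chunk =~ tr/ /,/;
    }
    printf "%7d bytes per read: %.2f s\n", $read_size, time - $t0;
    close $in_fh;
}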
 

Dr.Ruud

Ash wrote:
The url /liveupdate-aka.s..... will be encrypted.

By encrypted, do you mean that each $20 is changed into %20, or do you
mean something more spectacular?
 

Samwyse

Hi
I'm a perl newbie. I've been given the task of parsing through very
large (500MB) web server log files in an efficient manner. I need to
parse about 8 such files in parallel and create corresponding csv files
as output. This needs to be done every hour. In other words, the entire
parsing of about 8 files should complete well within 30 minutes. The
remaining 30 minutes are required for other database related activities
that need to be performed on the csv files generated by the perl
script.

Others have answered the Perl question, so I'll focus on your 30
minutes. Actually, you have an hour in which to do things. Set up your
parser to open $fname.log for reading and $fname.csv for writing, as you
are probably doing. Have a separate process (possibly forked from the
converter) run the database-related activities; arrange for that process
to initially rename $fname.csv to, say, active_$fname.csv. This gives
you an hour to convert and an hour to do other stuff.
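A rough sketch of that hand-off (do_database_work is a placeholder for
whatever the real DB step turns out to be, and $fname is assumed to be
set by the converter):

my $csv    = "$fname.csv";
my $active = "active_$fname.csv";

rename $csv, $active or die "rename $csv -> $active failed: $!";

my $pid = fork();
die "fork failed: $!" unless defined $pid;
if ( $pid == 0 ) {
    # child: run the database-related activities on the renamed file
    do_database_work($active);   # placeholder for the real DB step
    exit 0;
}
# parent: go start on the next hour's log; reap the child with waitpid later.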

You can further arrange the file names so that they contain an embedded
timestamp; this could allow these processes to run more or less
indefinitely. That way, as long as you can process 24 hours worth of
data within a 24 hour period, you can afford to fall behind during the
busiest parts of the day.

Note that in the latter case, you'll need enough disk to hold several
hours' worth of data, and in either case you'll need enough CPU power to
run the parsing and the database-related programs simultaneously.

An eight-way processor is probably overkill, since these programs have
to wait on I/O from time to time. For example, if one-eighth of the
time is spent waiting, then seven CPUs could handle eight parallel
processes.
 

J. Gleixner

Hi
I'm a perl newbie. I've been given the task of parsing through very
large (500MB) web server log files in an efficient manner. I need to
parse about 8 such files in parallel and create corresponding csv files
as output. This needs to be done every hour. In other words, the entire
parsing of about 8 files should complete well within 30 minutes. The
remaining 30 minutes are required for other database related activities
that need to be performed on the csv files generated by the perl
script.

In addition to the other suggestions, other approaches to speed things
up would be:

o If you're using Apache you could have it write the log information
directly to your DB. Then you could encode the URL column, possibly as
it's written, or once it's in your DB, or encode it when you write out
your CSV file. http://www.outoforder.cc/projects/apache/mod_log_sql/

o Have a daemon process implement a 'tail -f' (File::Tail) on the logs,
parse the output, modify it, and write it to another log/CSV file or to
the DB.

o If neither of those is an option and the end result is to have the
data in a DB, then skip the CSV file and add the data directly to the DB
as you're parsing the logs.
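A minimal sketch of that last option using DBI (the DSN, credentials,
table and column names are all assumptions, and decrypt_url is a stub
standing in for the other team's routine):

use strict;
use warnings;
use DBI;

sub decrypt_url { return $_[0] }   # placeholder until the real routine arrives

my $in_file = 'access.log';        # placeholder path
my $dbh = DBI->connect( 'dbi:Oracle:mydb', 'user', 'password',
                        { RaiseError => 1, AutoCommit => 0 } );

my $sth = $dbh->prepare(
    'INSERT INTO access_log (host, log_time, method, url, status, bytes)
     VALUES (?, ?, ?, ?, ?, ?)'
);

open my $info, '<', $in_file or die "could not open '$in_file': $!";
while (<$info>) {
    my ( $host, $ident_user, $auth_user, $date, $time, $time_zone,
         $method, $url, $protocol, $status, $bytes, $referer, $agent )
        = /^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?) (\S+)" (\S+) (\S+) "([^"]+)" "([^"]+)"$/
        or next;
    $sth->execute( $host, "$date:$time $time_zone", $method,
                   decrypt_url($url), $status, $bytes );
}
$dbh->commit;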
 
