Parsing large web server logfiles efficiently

Discussion in 'Perl Misc' started by ashutosh.gaur@gmail.com, Jan 14, 2006.

  1. Guest

    Hi
    I'm a perl newbie. I've been given the task of parsing through very
    large (500MB) web server log files in an efficient manner. I need to
    parse about 8 such files in parallel and create corresponding csv files
    as output. This needs to be done every hour. In other words, the entire
    parsing of about 8 files should complete well within 30 minutes. The
    remaining 30 minutes are required for other database related activities
    that need to be performed on the csv files generated by the perl
    script.

    Following is a snippet of my perl routine....

    open(INFO, $in_file);
    open(DAT, $out_file);

    while (<INFO>) {

    my ($host, $ident_user, $auth_user, $date, $time,
    $time_zone, $method, $url, $protocol, $status, $bytes,
    $referer, $agent);

    ($host, $ident_user, $auth_user, $date, $time,
    $time_zone, $method, $url, $protocol, $status, $bytes,
    $referer, $agent) =
    /^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?)
    (\S+)" (\S+) (\S+) "([^"]+)" "([^"]+)"$/
    or next;

    my $decrypt_url = <decrypting subroutine> $url;

    print DAT $host, $ident_user, $auth_user, $date, $time,
    $time_zone, $method, $decrypt_url, $protocol, $status,
    $bytes, $referer, $agent, "\n";
    }

    ---------------------------------------------------------------------------------------------------
    This script takes about 50 minutes to process all the 8 files. I need
    some suggestions to improve the performance and bring the processing
    time down.

    The hardware is a good 8 ( 1.2GHz ) CPU machine with 8GB of memory.
    This machine will be used solely for file processing and running one
    more application (Informatica)

    thanks
    Ash
    , Jan 14, 2006
    #1

  2. l v Guest

    wrote:
    > Hi
    > I'm a perl newbie. I've been given the task of parsing through very
    > large (500MB) web server log files in an efficient manner. I need to
    > parse about 8 such files in parallel and create corresponding csv files
    > as output. This needs to be done every hour. In other words, the entire
    > parsing of about 8 files should complete well within 30 minutes. The
    > remaining 30 minutes are required for other database related activities
    > that need to be performed on the csv files generated by the perl
    > script.
    >
    > Following is a snippet of my perl routine....
    >
    > open(INFO, $in_file);
    > open(DAT, $out_file);
    >
    > while (<INFO>) {
    >
    > my ($host, $ident_user, $auth_user, $date, $time,
    > $time_zone, $method, $url, $protocol, $status, $bytes,
    > $referer, $agent);


    You can declare your variables in the same statement as the assignment,
    so delete the declaration above and just add "my" to the beginning of
    the assignment below.

    >
    > *my* ($host, $ident_user, $auth_user, $date, $time,
    > $time_zone, $method, $url, $protocol, $status, $bytes,
    > $referer, $agent) =
    > /^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?)
    > (\S+)" (\S+) (\S+) "([^"]+)" "([^"]+)"$/
    > or next;


    Try replacing the regexp with split() on space(s) into an array.

    >
    > my $decrypt_url = <decrypting subroutine> $url;
    >
    > print DAT $host, $ident_user, $auth_user, $date, $time,
    > $time_zone, $method, $decrypt_url, $protocol, $status,
    > $bytes, $referer, $agent, "\n";


    You can then use print map { "$_," } @array, or join(), to add the
    commas for your CSV output.
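
    For what it's worth, here is a minimal sketch of that idea (my own
    illustration, not l v's code; it assumes no field contains an embedded
    space, which is not true of the quoted request, referer and agent fields
    in a real combined-format log, so those would still need extra handling):

    while (my $line = <INFO>) {
        chomp $line;
        my @fields = split / /, $line;          # split on single spaces
        print DAT join(',', @fields), "\n";     # rejoin with commas for CSV
    }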

    > }
    >
    > ---------------------------------------------------------------------------------------------------
    > This script takes about 50 minutes to process all the 8 files. I need
    > some suggestions to improve the performance and bring the processing
    > time down.
    >

    [snip]
    >
    > thanks
    > Ash


    I'm sure there are much more efficient ways, but something to start
    with.

    Len
    l v, Jan 14, 2006
    #2

  3. Tad McClellan

    <> wrote:

    > open(INFO, $in_file);



    You should always, yes *always*, check the return value from open():

    open(INFO, $in_file) or die "could not open '$in_file' $!";
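
    The same check with a three-argument open and a lexical filehandle, as a
    stylistic sketch (not from Tad's post):

    open(my $info, '<', $in_file)
        or die "could not open '$in_file' $!";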


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Jan 14, 2006
    #3
  4. MikeGee Guest

    wrote:
    > Hi
    > I'm a perl newbie. I've been given the task of parsing through very
    > large (500MB) web server log files in an efficient manner. I need to
    > parse about 8 such files in parallel and create corresponding csv files
    > as output. This needs to be done every hour. In other words, the entire
    > parsing of about 8 files should complete well within 30 minutes. The
    > remaining 30 minutes are required for other database related activities
    > that need to be performed on the csv files generated by the perl
    > script.
    >
    > Following is a snippet of my perl routine....
    >
    > open(INFO, $in_file);
    > open(DAT, $out_file);
    >
    > while (<INFO>) {
    >
    > my ($host, $ident_user, $auth_user, $date, $time,
    > $time_zone, $method, $url, $protocol, $status, $bytes,
    > $referer, $agent);
    >
    > ($host, $ident_user, $auth_user, $date, $time,
    > $time_zone, $method, $url, $protocol, $status, $bytes,
    > $referer, $agent) =
    > /^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?)
    > (\S+)" (\S+) (\S+) "([^"]+)" "([^"]+)"$/
    > or next;
    >
    > my $decrypt_url = <decrypting subroutine> $url;
    >
    > print DAT $host, $ident_user, $auth_user, $date, $time,
    > $time_zone, $method, $decrypt_url, $protocol, $status,
    > $bytes, $referer, $agent, "\n";
    > }
    >
    > ---------------------------------------------------------------------------------------------------
    > This script takes about 50 minutes to process all the 8 files. I need
    > some suggestions to improve the performance and bring the processing
    > time down.
    >
    > The hardware is a good 8 ( 1.2GHz ) CPU machine with 8GB of memory.
    > This machine will be used solely for file processing and running one
    > more application (Informatica)
    >
    > thanks
    > Ash


    Another approach to take is substituting commas for spaces in the
    string rather than capturing all the fields. If your fields never
    contain spaces then:

    tr/ /,/

    Couple that with sysread/syswrite, and you should get some big
    improvements.
    MikeGee, Jan 15, 2006
    #4
  5. it_says_BALLS_on_your forehead

    MikeGee wrote:
    > wrote:
    > > [snip]

    >
    > Another approach to take is substituting commas for spaces in the
    > string rather than capturing all the fields. If your fields never
    > contain spaces then:
    >
    > tr/ /,/
    >
    > Couple that with sysread/syswrite, and you should get some big
    > improvements.


    Can you explain this more? Why does this improve performance? Isn't
    this only for fixed-length, unbuffered input?
    it_says_BALLS_on_your forehead, Jan 15, 2006
    #5
  6. Guest

    wrote:
    > Hi
    > I'm a perl newbie. I've been given the task of parsing through very
    > large (500MB) web server log files in an efficient manner. I need to
    > parse about 8 such files in parallel and create corresponding csv files
    > as output. This needs to be done every hour. In other words, the entire
    > parsing of about 8 files should complete well within 30 minutes. The
    > remaining 30 minutes are required for other database related activities
    > that need to be performed on the csv files generated by the perl
    > script.


    Parse the file into a CSV file, and then re-parse the CSV file to do
    database stuff with it? Wouldn't it be more efficient to do the database
    stuff directly in the script below?
    >
    > Following is a snippet of my perl routine....
    >
    > open(INFO, $in_file);
    > open(DAT, $out_file);


    You should check the success of these. I assume the string in $out_file
    begins with a ">"?

    >
    > while (<INFO>) {
    >
    > my ($host, $ident_user, $auth_user, $date, $time,
    > $time_zone, $method, $url, $protocol, $status, $bytes,
    > $referer, $agent);
    >
    > ($host, $ident_user, $auth_user, $date, $time,
    > $time_zone, $method, $url, $protocol, $status, $bytes,
    > $referer, $agent)


    You could combine the my into the same statement as the assignment.
    Probably not much faster, but certainly more readable.

    > =
    > /^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?)
    > (\S+)" (\S+) (\S+) "([^"]+)" "([^"]+)"$/
    > or next;
    >
    > my $decrypt_url = <decrypting subroutine> $url;
    >
    > print DAT $host, $ident_user, $auth_user, $date, $time,
    > $time_zone, $method, $decrypt_url, $protocol, $status,
    > $bytes, $referer, $agent, "\n";
    > }
    >
    > -------------------------------------------------------------------------
    > --------------------------


    > This script takes about 50 minutes to process all the 8 files.


    That script only processes one file. How do the other 7 get processed?

    It takes me less than 3 minutes to process one 635 MB file on a single
    CPU 3 GHz machine.

    > I need
    > some suggestions to improve the performance and bring the processing
    > time down.


    Take out the print DAT. How long does it take? Take out the decrypting
    subroutine, too. How long does it take now? Take out the regex. How long
    does it take now?
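
    One way to do that kind of measurement in isolation (my own illustration,
    not code from this thread) is the core Benchmark module, run against a
    representative line:

    use Benchmark qw(cmpthese);

    # Hypothetical sample line in combined log format.
    my $sample = '151.205.97.52 - - [23/Aug/2005:11:56:31 +0000] '
               . '"GET /some/path HTTP/1.1" 304 162 "-" "Some Agent"';

    cmpthese(-3, {
        regex => sub {
            my @f = $sample =~ /^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?) (\S+)" (\S+) (\S+) "([^"]+)" "([^"]+)"$/;
        },
        noop  => sub { my $copy = $sample },    # baseline: no parsing at all
    });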

    You could try changing the regex to a split, with some post-processing of
    the elements. But from some testing I've done, I doubt that will save more
    than 20% or so, and that is without the necessary post-processing.
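
    A rough, untested sketch of that split-plus-post-processing idea (my own
    illustration, not Xho's code): split on the double quotes first, then
    split the unquoted chunks on whitespace. Lines whose quoted fields contain
    an escaped '"', or which carry extra trailing fields, would need
    additional handling.

    while (my $line = <INFO>) {
        chomp $line;
        # host ident auth [date:time zone] "request" status bytes "ref" "agent"
        my ($pre, $request, $mid, $referer, undef, $agent) = split /"/, $line;
        next unless defined $agent;

        my ($host, $ident_user, $auth_user, $datetime, $time_zone)
            = split ' ', $pre;
        $datetime  =~ tr/[//d;                  # strip the leading '['
        $time_zone =~ tr/]//d;                  # strip the trailing ']'
        my ($date, $time) = split /:/, $datetime, 2;

        my ($method, $url, $protocol) = split ' ', $request;
        my ($status, $bytes)          = split ' ', $mid;

        # ... decrypt $url and print the CSV row as before ...
    }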

    > The hardware is a good 8 ( 1.2GHz ) CPU machine with 8GB of memory.


    So, when the program is running, what is happening on the machine? Is the
    CPU pegged? Is the network bandwidth pegged? Is the disk bandwidth
    pegged? Are you using all 8 CPUs?

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
    , Jan 15, 2006
    #6
  7. RedGrittyBrick

    wrote:
    > Hi
    > I'm a perl newbie. I've been given the task of parsing through very
    > large (500MB) web server log files in an efficient manner. I need to
    > parse about 8 such files in parallel and create corresponding csv files
    > as output. This needs to be done every hour. In other words, the entire
    > parsing of about 8 files should complete well within 30 minutes. The
    > remaining 30 minutes are required for other database related activities
    > that need to be performed on the csv files generated by the perl
    > script.
    >
    > Following is a snippet of my perl routine....
    >
    > open(INFO, $in_file);
    > open(DAT, $out_file);
    >
    > while (<INFO>) {
    >
    > my ($host, $ident_user, $auth_user, $date, $time,
    > $time_zone, $method, $url, $protocol, $status, $bytes,
    > $referer, $agent);
    >
    > ($host, $ident_user, $auth_user, $date, $time,
    > $time_zone, $method, $url, $protocol, $status, $bytes,
    > $referer, $agent) =
    > /^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?)
    > (\S+)" (\S+) (\S+) "([^"]+)" "([^"]+)"$/
    > or next;
    >
    > my $decrypt_url = <decrypting subroutine> $url;


    I wonder if that mysterious "decrypting" subroutine is where the
    bottleneck is? What does it do?

    >
    > print DAT $host, $ident_user, $auth_user, $date, $time,
    > $time_zone, $method, $decrypt_url, $protocol, $status,
    > $bytes, $referer, $agent, "\n";
    > }
    >
    > ---------------------------------------------------------------------------------------------------
    > This script takes about 50 minutes to process all the 8 files. I need
    > some suggestions to improve the performance and bring the processing
    > time down.
    >
    > The hardware is a good 8 ( 1.2GHz ) CPU machine with 8GB of memory.
    > This machine will be used solely for file processing and running one
    > more application (Informatica)
    RedGrittyBrick, Jan 16, 2006
    #7
  8. Ash Guest

    Hi Xho,
    I did some modifications based on your suggestions and some on my own.
    I tried reading the entire file into a scalar (slurp mode) and removed the
    regex. All I did was substitute each space with a comma:
    --------------------------------------------------------------
    undef $/;
    $_ = <INFO>;

    # Replace every space with a comma
    s/ /,/g;
    ---------------------------------------------------------------
    The time came down to 45 seconds for a file. However, doing it this
    way, I'll not be able to apply the decrypting subroutine. Moreover,
    though it didn't occur, there is a possibility of memory problems with
    this approach. The other approach of moving into an array and using the
    same regex didn't improve the performance much (just about 5 minute
    gain)

    I did not understand your questions regarding pegging. Is there a way I
    can peg the CPU and the bandwidths? How can I make sure that my script
    uses all the available CPUs?

    This script runs in parallel for all 8 files.
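
    For reference, one way to drive the 8 parses in parallel from a single
    Perl driver (purely an assumed setup; the thread does not show how the 8
    jobs are actually launched) is a simple fork loop:

    use strict;
    use warnings;

    my @logs = glob("/path/to/logs/*.log");     # placeholder path
    my @pids;
    for my $log (@logs) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            # Child: run the parser (hypothetical script name) on one file.
            exec($^X, "parse_log.pl", $log) or die "exec failed: $!";
        }
        push @pids, $pid;
    }
    waitpid($_, 0) for @pids;                   # wait for every child to finish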

    The decrypting subroutine is currently being developed by a separate
    team. I'm not sure how efficient it will be. What I wanted was to make
    my script efficient before even plugging that routine in.

    thanks to all of you for your inputs
    Ash
    Ash, Jan 16, 2006
    #8
  9. Guest

    "Ash" <> wrote:
    > Hi Xho,
    > I did some modifications based on your suggestions and some on my own.


    Hi Ash,

    I don't think you understood my suggestions. Those suggestions were on
    ways to *diagnose* the problems more accurately, not ways to fix them. If
    it gets much faster when you comment out the print, then you know the print
    is the problem. If it gets much faster when you comment out the decrypt,
    you know that that is the problem. etc.

    > i tried reading the entire file into a scalar context and removed the
    > regex. All I did was to substitute the space with a comma
    > --------------------------------------------------------------
    > undef $/;
    > $_ = <INFO>;
    >
    > # Replace every space with a comma
    > s/ /,/g;
    > ---------------------------------------------------------------
    > The time came down to 45 seconds for a file.


    Just one file, or all 8 in parallel? If the latter, then we now know that
    reading the files from disk (or at least slurping it) is not the
    bottleneck, but we don't know much more than that.

    > However, doing it this
    > way, I'll not be able to apply the decrypting subroutine.


    Right. There is no point in testing the performance of s/ /,/g as that
    doesn't do what needs to be done. I wanted you to take out the regex
    entirely. Read the line, and then throw it away and go read the next line.
    If that is much faster than reading the line, doing the regex, throwing
    away the result of the regex and just going to the next line, then you know
    the regex is the bottleneck. Then you will know where to focus your
    efforts.

    > Moreover,
    > though it didn't occur, there is a possibility of memory problems with
    > this approach.


    Absolutely. You want to test the line-by-line approach, there is no point
    in testing the slurping approach as that is not a viable alternative.

    > The other approach of moving into an array and using the
    > same regex didn't improve the performance much (just about 5 minute
    > gain)
    >
    > I did not understand your questions regarding pegging. Is there a way I
    > can peg the cpu and the bandwidths?


    By "pegging" I mean using all of the resource which is available, so that
    that resource becomes the bottleneck.

    > How can I make sure that my script
    > uses all the available CPUs?


    You use OS-specific tools to do that. On a unix-like system, "top" is a
    good one. However, how you interpret the results of "top" is OS-dependent.
    On Solaris, I think it should list each of the 8 processes as getting
    nearly 12.5% of the CPU. If not, then it is probably not CPU bound, but
    rather IO bound.


    >
    > This script runs in parallel for all 8 files.


    And it takes 50 minutes for the last of the 8 to finish? How long does
    it take if you only process 4 in parallel? (If it still takes 50 minutes,
    that suggests you are CPU bound. If it takes substantially less, that
    suggests you are IO bound.)

    >
    > > The decrypting subroutine is currently being developed by a separate
    > > team. I'm not sure how efficient it will be. What I wanted was to make
    > > my script efficient before even plugging that routine in.


    So what are you currently using, just an empty dummy subroutine?

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
    , Jan 17, 2006
    #9
  10. Ash Guest

    Hi Xho

    > I don't think you understood my suggestions. Those suggestions were on
    > ways to *diagnose* the problems more accurately, not ways to fix them. If
    > it gets much faster when you comment out the print, then you know the print
    > is the problem. If it gets much faster when you comment out the decrypt,
    > you know that that is the problem. etc.


    Going by the methodical approach you suggested, I figured out that it
    is the regex that's the bottleneck. A straightforward line-by-line read
    followed by a write for 8 files running in parallel took a little over
    2 minutes. Putting the regex back took the time back to about 50 mins.

    > Just one file, or all 8 in parallel? If the latter, then we now know that
    > reading the files from disk (or at least slurping it) is not the
    > bottleneck, but we don't know much more than that.


    Reading/writing is not the bottleneck

    > > However, doing it this
    > > way, I'll not be able to apply the decrypting subroutine.

    >
    > Right. There is no point in testing the performance of s/ /,/g as that
    > doesn't do what needs to be done. I wanted you to take out the regex
    > entirely. Read the line, and then throw it away and go read the next line.
    > If that is much faster than reading the line, doing the regex, throwing
    > away the result of the regex and just going to the next line, then you know
    > the regex is the bottleneck. Then you will know where to focus your
    > efforts.
    >
    > > Moreover,
    > > though it didn't occur, there is a possibility of memory problems with
    > > this approach.

    >
    > Absolutely. You want to test the line-by-line approach, there is no point
    > in testing the slurping approach as that is not a viable alternative.
    >
    > > The other approach of moving into an array and using the
    > > same regex didn't improve the performance much (just about 5 minute
    > > gain)
    > >
    > > I did not understand your questions regarding pegging. Is there a way I
    > > can peg the cpu and the bandwidths?

    >
    > By "pegging" I mean using all of the resource which is available, so that
    > that resource becomes the bottleneck.
    >
    > > How can I make sure that my script
    > > uses all the available CPUs?

    >
    > You use OS-specific tools to do that. On a unix-like system, "top" is a
    > good one. However, how you interpret the results of "top" is
    > OS-dependent. On Solaris, I think it should list each of the 8 processes
    > as getting nearly 12.5% of the CPU. If not, then it is probably not CPU
    > bound, but rather IO bound.



    >
    > >
    > > This script runs in parallel for all 8 files.

    >
    > And it takes 50 minutes for the last of the 8 to finish? How long does
    > it take if you only process 4 in parallel? (If it still takes 50 minutes,
    > that suggests you are CPU bound. If it takes substantially less, that
    > suggests you are IO bound.)


    The process (with regex) takes 47 minutes for a single file and about
    50 minutes for 8 parallel files. So it's CPU bound.

    > >
    > > The decrypting subroutine is currently being developed by a separate
    > > team. I'm not sure how efficient it will be. What I wanted was to make
    > > my script efficient before even plugging that routine in.

    >
    > So what are you currently using, just an empty dummy subroutine?

    Right now I don't even have a dummy routine there. I'm just throwing the
    encrypted result into the output file.

    This is a sample log entry
    -------------------------------------------------------------------------------
    151.205.97.52 - - [23/Aug/2005:11:56:31 +0000] "GET
    /liveupdate-aka.symantec.com/common$20client$20core_103.0.3_english_livetri.zip
    HTTP/1.1" 304 162 "-" "Symantec LiveUpdate" "-"
    -------------------------------------------------------------------------------

    The URL /liveupdate-aka.s.....
    will be encrypted. I tried a few other regexes to parse the line but
    didn't get anything that would improve performance.
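
    One variant that might be worth timing (my own untested sketch, not
    something suggested in the thread): capture the whole quoted request with
    "([^"]*)" instead of "(\S+) (.+?) (\S+)", since the lazy .+? can backtrack
    heavily, and split the request afterwards. Note that the sample line above
    also carries a trailing "-" field, so the tail of either pattern may need
    adjusting.

    my ($host, $ident_user, $auth_user, $date, $time, $time_zone,
        $request, $status, $bytes, $referer, $agent) =
        /^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "([^"]*)" (\S+) (\S+) "([^"]*)" "([^"]*)"$/
        or next;
    my ($method, $url, $protocol) = split ' ', $request;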

    thanks
    Ash
    Ash, Jan 17, 2006
    #10
  11. Guest

    "Ash" <> wrote:


    ....
    >
    > The url /liveupdate-aka.s.....
    > will be encrypted. I tried a few other regex to parse the line but
    > didn't get anything that would inprove performance.


    I'm afraid there will be no easy road for you here. At this point, I'd
    first consider throwing more hardware at the problem. If not that, then
    dropping from Perl into C to do the parsing. There may be existing Perl XS
    modules (i.e. Perl modules fronting for C code) which will help to do that,
    like Text::CSV_XS, but I'm not sure that even that is very promising.
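
    For the output side at least, Text::CSV_XS can take over the quoting and
    joining of each row; a minimal sketch, assuming the fields have already
    been parsed and $out_fh is a hypothetical output filehandle:

    use Text::CSV_XS;

    my $csv = Text::CSV_XS->new({ binary => 1, eol => "\n" });

    # print() combines the fields into one properly quoted CSV row
    # and writes it to the filehandle.
    $csv->print($out_fh, [ $host, $ident_user, $auth_user, $date, $time,
                           $time_zone, $method, $decrypt_url, $protocol,
                           $status, $bytes, $referer, $agent ])
        or die "CSV write failed";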

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
    , Jan 18, 2006
    #11
  12. MikeGee Guest

    it_says_BALLS_on_your forehead wrote:
    > MikeGee wrote:
    > > wrote:
    > > > [snip]

    > >
    > > Another approach to take is substituting commas for spaces in the
    > > string rather than capturing all the fields. If your fields never
    > > contain spaces then:
    > >
    > > tr/ /,/
    > >
    > > Couple that with sysread/syswrite, and you should get some big
    > > improvements.

    >
    > can you explain this more? why does this improve performance? isn't
    > this only for fixed-length unbuffered input?


    First, if you are simply replacing all occurrences of a space with a
    comma, tr/// is the way to go. tr/// does not use regular expressions
    and is therefore much faster.

    In the simple case where tr/ /,/ is sufficient:

    <untested>
    while (sysread($in_fh, $_, $read_size)) {   # read a fixed-size chunk
        tr/ /,/;                                # turn every space into a comma
        syswrite($out_fh, $_, length);          # write the chunk back out
    }
    </untested>

    I suggest sysread() & syswrite() because I read that someone
    implementing a slurp function got the best performance with them.

    You should experiment with $read_size to find the optimal value. I
    would start with 1k and go up in multiples of two.

    This results in many fewer IO operations, and no regexp. I bet it
    screams.
    MikeGee, Jan 18, 2006
    #12
  13. Dr.Ruud Guest

    Ash schreef:

    > The url /liveupdate-aka.s..... will be encrypted.


    By encrypted, do you mean that each $20 is changed into %20, or do you
    mean something more spectacular?

    --
    Affijn, Ruud

    "Gewoon is een tijger."
    Dr.Ruud, Jan 18, 2006
    #13
  14. Samwyse Guest

    wrote:
    > Hi
    > I'm a perl newbie. I've been given the task of parsing through very
    > large (500MB) web server log files in an efficient manner. I need to
    > parse about 8 such files in parallel and create corresponding csv files
    > as output. This needs to be done every hour. In other words, the entire
    > parsing of about 8 files should complete well within 30 minutes. The
    > remaining 30 minutes are required for other database related activities
    > that need to be performed on the csv files generated by the perl
    > script.


    Others have answered the Perl question, so I'll focus on your 30
    minutes. Actually, you have an hour in which to do things. Set up your
    parser to open $fname.log for reading and $fname.csv for writing, as you
    are probably doing. Have a separate process (possibly forked from the
    converter) run the database-related activities; arrange for that process
    to initially rename $fname.csv to, say, active_$fname.csv. This gives
    you an hour to convert and an hour to do other stuff.
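
    A rough sketch of that hand-off (the file names are placeholders, not
    from Samwyse's post): the loader claims the finished CSV by renaming it,
    so the next hourly parse can write a fresh $fname.csv without stepping on
    the load.

    my $csv    = "$fname.csv";              # produced by the hourly parse
    my $active = "active_$fname.csv";       # claimed by the database loader

    rename $csv, $active
        or die "could not rename '$csv' to '$active': $!";

    # ... run the database-related work against $active here ...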

    You can further arrange the file names so that they contain an embedded
    timestamp; this could allow these processes to run more or less
    indefinitely. That way, as long as you can process 24 hours worth of
    data within a 24 hour period, you can afford to fall behind during the
    busiest parts of the day.

    Note that in the latter case, you'll need enough disk to hold several
    hours' worth of data, and in either case you'll need enough CPU power to
    run the parsing and the database-related programs simultaneously.

    An eight-way processor is probably overkill, since these programs have
    to wait on I/O from time to time. For example, if one-eighth of the
    time is spent waiting, then seven CPUs could handle eight parallel
    processes.
    Samwyse, Jan 18, 2006
    #14
  15. J. Gleixner Guest

    wrote:
    > Hi
    > I'm a perl newbie. I've been given the task of parsing through very
    > large (500MB) web server log files in an efficient manner. I need to
    > parse about 8 such files in parallel and create corresponding csv files
    > as output. This needs to be done every hour. In other words, the entire
    > parsing of about 8 files should complete well within 30 minutes. The
    > remaining 30 minutes are required for other database related activities
    > that need to be performed on the csv files generated by the perl
    > script.


    In addition to the other suggestions, other approaches to speed things
    up would be:

    o If you're using Apache you could have it write the log information
    directly to your DB. Then you could encode the URL column, possibly as
    it's written, or once it's in your DB, or encode it when you write out
    your CSV file. http://www.outoforder.cc/projects/apache/mod_log_sql/

    o Have a daemon process implement a 'tail -f' (File::Tail) on the logs,
    parse the output, modify it, and write it to another log/CSV file or to
    the DB (a sketch of this follows the list).

    o If neither of those is an option and the end result is to have the
    data in a DB, then skip the CSV file and add the data directly to the DB
    as you're parsing the logs.
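
    A minimal sketch of the File::Tail idea from the second point above (the
    log path and parse_line() are placeholders, not from the post):

    use File::Tail;

    my $tail = File::Tail->new(name => "/var/log/httpd/access_log");
    while (defined(my $line = $tail->read)) {   # blocks until a new line arrives
        my @fields = parse_line($line);         # hypothetical parsing routine
        # ... append @fields to the CSV file, or insert them into the DB ...
    }
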
    J. Gleixner, Jan 19, 2006
    #15
