HowTo parse huge Files

Discussion in 'Perl Misc' started by cadetg@googlemail.com, Mar 29, 2007.

  1. Guest

    Dear Perl Monks, I am currently developing a script that has to
    parse 20 GB files. The files I have to parse are logfiles. My
    problem is that parsing them takes ages. I am doing something
    like this:

    my %lookingFor;
    # keys   => different name of one subset
    # values => array of one subset

    my $fh = new FileHandle "< largeLogFile.log";
    while (<$fh>) {
        foreach my $subset (keys %lookingFor) {
            foreach my $item (@{$subset}) {
                if (<$fh> =~ m/$item/) {
                    my $writeFh = new FileHandle ">> myout.log";
                    print $writeFh <$fh>;
                }
            }
        }

    I've already tried to speed it up by using the regexp /o flag, doing
    something like this:

    $isSubSet = buildRegexp(@allSubSets);
    while (<$fh>) {
        foreach my $subset (keys %lookingFor) {
            if (&$isSubSet(<$fh>)) {
                my $writeFh = new FileHandle ">> myout.log";
                print $writeFh <$fh>;
            }
        }
    }
    sub buildRegexp {
        my @R = @_;
        my $expr = join '||',
            map { "\$_[0] =~ m/\(To\|is\)\\:\\S+\\@\$R[$_]/io" } ( 0..$#R );
        my $matchsub = eval "sub { $expr }";
        if ($@) {
            $logger->error("Failed in building regex @R: $@");
            return ERROR;
        }
        $matchsub;
    }

    I don't know how to optimize this further. Maybe it would be possible
    to do something with "map"? I don't think the "o" flag sped it up at
    all. I've also tried splitting the one big file into a few small ones
    and forking child processes to parse each of the small ones, but that
    didn't help either.

    Thanks a lot for your help!

    Cheers
    -Marco
     
    , Mar 29, 2007
    #1

  2. Guest

    On Mar 29, 15:19, bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
    > wrote:
    > > Dear Perl Monks, I am developing at the moment a script which has to
    > > parse 20GB files. The files I have to parse are some logfiles. My
    > > problem is that it takes ages to parse the files. I am doing something
    > > like this:

    >
    > > my %lookingFor;
    > > # keys => different name of one subset
    > > # values => array of one subset

    >
    > > my $fh = new FileHandle "< largeLogFile.log";
    > > while (<$fh>) {
    > > foreach my $subset (keys %lookingFor) {
    > > foreach my $item (@{$subset}) {
    > > if (<$fh> =~ m/$item/) {
    > > my $writeFh = new FileHandle ">> myout.log"; print $writeFh <
    > > $fh>;
    > > }
    > > }
    > > }

    >
    > How many key-value pairs does %lookingFor (typically?) have?
    >
    > BugBear


    %lookingFor typically has just five key/value pairs, but each value
    is a reference to an array holding on average 20 items. So in total
    that makes about 100 items I have to look for.
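    With five subsets of ~20 items each, one option (not from the thread,
    just a sketch with made-up data) is to join each subset's items into a
    single alternation and compile it once with qr//, so each line is
    matched against five compiled regexes instead of ~100 strings:

```perl
use strict;
use warnings;

# Made-up subsets standing in for Marco's %lookingFor.
my %lookingFor = (
    errors => [ qw(timeout refused) ],
    hosts  => [ qw(alpha beta) ],
);

# One compiled regex per subset: join the items into an alternation,
# quotemeta-ing each in case an item contains regex metacharacters.
my %rx;
for my $subset (keys %lookingFor) {
    my $alt = join '|', map { quotemeta } @{ $lookingFor{$subset} };
    $rx{$subset} = qr/$alt/;
}

# Each line now costs one match per subset, not one per item.
for my $line ("connection refused by beta\n", "all quiet\n") {
    for my $subset (sort keys %rx) {
        print "[$subset] $line" if $line =~ $rx{$subset};
    }
}
```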

    Cheers
    -Marco
     
    , Mar 29, 2007
    #2

  3. Thomas J. Guest

    wrote:

    >
    > my %lookingFor;
    > # keys => different name of one subset
    > # values => array of one subset
    >
    > my $fh = new FileHandle "< largeLogFile.log";
    > while (<$fh>) {
    > foreach my $subset (keys %lookingFor) {
    > foreach my $item (@{$subset}) {
    > if (<$fh> =~ m/$item/) {
    > my $writeFh = new FileHandle ">> myout.log"; print $writeFh <
    > $fh>;
    > }
    > }
    > }
    >


    Your code will "copy" the lines following a successful pattern match
    from "largeLogFile.log" into "myout.log".
    This may be time-consuming... ...and possibly not what you expected?
    (Use $_ instead of <$fh> in your "if" and in your "print".)

    REs may lead to "endless matching". Please show some samples.

    study may or may not help before pattern matching. (perldoc -f study)
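    Applying that $_ fix, the loop would look something like this. The
    subset data is made up, and in-memory handles stand in for
    largeLogFile.log and myout.log so the sketch is self-contained:

```perl
# Thomas's fix applied: match and print $_, the line the while loop just
# read, instead of pulling fresh lines with <$fh>.
use strict;
use warnings;

my $log = "To: foo\@example.com\nunrelated noise\nis: bar\@example.org\n";
my $out = '';
open my $fh,      '<', \$log or die "read: $!";
open my $writeFh, '>', \$out or die "write: $!";

my %lookingFor = ( mail => [ qr/To:/, qr/is:/ ] );   # placeholder subsets

while (<$fh>) {                              # current line is in $_
    foreach my $subset (keys %lookingFor) {
        foreach my $item (@{ $lookingFor{$subset} }) {
            if (/$item/) {                   # test $_, not a fresh <$fh>
                print $writeFh $_;           # print the line that matched
                last;                        # one copy per line is enough
            }
        }
    }
}
close $writeFh;
print $out;    # the two matching lines
```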

    Thomas
     
    Thomas J., Mar 29, 2007
    #3
  4. Uri Guttman Guest

    >>>>> "cc" == cadetg@googlemail com <> writes:

    cc> Dear Perl Monks, I am developing at the moment a script which has to
    cc> parse 20GB files. The files I have to parse are some logfiles. My
    cc> problem is that it takes ages to parse the files. I am doing something
    cc> like this:

    this isn't perlmonks. that is a web site community. this is usenet.

    cc> my %lookingFor;
    cc> # keys => different name of one subset
    cc> # values => array of one subset

    cc> my $fh = new FileHandle "< largeLogFile.log";

    FileHandle is an old and deprecated module. where did you learn to use
    it? you can just do basic open as you don't need any OO here. also
    ALWAYS check for errors on opens.

    cc> while (<$fh>) {

    don't use plain while to read lines. the proper idiom is:

    while( my $line = <$fh> ) {



    cc> foreach my $subset (keys %lookingFor) {

    that makes no sense. keys are strings. they are not array refs. yet the
    line below dereferences the key into an array. which means you are not
    using strict and you are testing for wrong stuff.

    cc> foreach my $item (@{$subset}) {

    why the two loops? since you always look at all the items for each line,
    factor out the items list to before the loops:

    my @items = map @$_, values %lookingFor ;

    cc> if (<$fh> =~ m/$item/) {

    that is very wrong. it reads a new line in for each match in the
    subset. the line from the while is in $_ at the moment.

    cc> my $writeFh = new FileHandle ">> myout.log"; print $writeFh <
    cc> $fh>;

    why are you opening the output file EACH time you want to print? just
    open it before the loops and print to it. my previous advice on opens
    applies here too.

    cc> I've already tried to speed it up by using the regExp flag=>o by doing
    cc> something like this:

    forget speed for now. your code is broken. are you getting any useful
    output at all? i can't see how, as you are looking for stuff with
    broken code.

    uri

    --
    Uri Guttman ------ -------- http://www.stemsystems.com
    --Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
    Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
     
    Uri Guttman, Mar 29, 2007
    #4
  5. J. Gleixner Guest

    wrote:
    > Dear Perl Monks, I am developing at the moment a script which has to
    > parse 20GB files. The files I have to parse are some logfiles. My
    > problem is that it takes ages to parse the files. I am doing something
    > like this:


    You might be better off using a large egrep and/or simplifying your
    items. E.g. if your items contained 'abc' and 'abcd', you would only
    have to search for 'abc'.
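    The egrep approach might look like this: put all ~100 items in a
    pattern file, one per line, and let a single grep -E pass do the
    filtering (file names and contents here are illustrative):

```shell
# One pass over the big file, reading all patterns from a file.
printf 'abc\nxyz\n' > patterns.txt
printf 'has abc here\nnothing\nxyz too\n' > largeLogFile.log
grep -E -f patterns.txt largeLogFile.log >> myout.log
cat myout.log
```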

    >
    > my %lookingFor;
    > # keys => different name of one subset
    > # values => array of one subset
    >
    > my $fh = new FileHandle "< largeLogFile.log";
    > while (<$fh>) {
    > foreach my $subset (keys %lookingFor) {
    > foreach my $item (@{$subset}) {
    > if (<$fh> =~ m/$item/) {
    > my $writeFh = new FileHandle ">> myout.log"; print $writeFh <
    > $fh>;
    > }

    Open it once, before the while, and write $_, not <$fh>.

    print $writeFh $_ if /$item/;

    > }
    > }
     
    J. Gleixner, Mar 29, 2007
    #5
  6. Mumia W. Guest

    On 03/29/2007 07:24 AM, wrote:
    > Dear Perl Monks, I am developing at the moment a script which has to
    > parse 20GB files. The files I have to parse are some logfiles. My
    > problem is that it takes ages to parse the files. I am doing something
    > like this:
    >
    > my %lookingFor;
    > # keys => different name of one subset
    > # values => array of one subset
    >
    > my $fh = new FileHandle "< largeLogFile.log";
    > [1:] while (<$fh>) {
    > foreach my $subset (keys %lookingFor) {
    > foreach my $item (@{$subset}) {
    > [2:] if (<$fh> =~ m/$item/) {


    You are aware that line 2 reads in a new chunk from $fh, and the old
    chunk read on line 1 is forgotten, don't you?


    > my $writeFh = new FileHandle ">> myout.log"; print $writeFh <
    > $fh>;


    You can open the write filehandle once and keep it open til you are done.


    > }
    > }
    > }
    >
    > I've already tried to speed it up by using the regExp flag=>o by doing
    > something like this:
    >
    > $isSubSet=buildRegexp(@allSubSets);
    > while (<$fh>) {
    > foreach my $subset (keys %lookingFor) {
    > if (&$isSubSet(<$fh>)) {
    > my $writeFh = new FileHandle ">> myout.log";
    > print $writeFh <$fh>;
    > }
    > }
    > }
    > sub buildRegexp {
    > my @R = @_; my $expr = join '||', map { "\$_[0] =~ m/\(To\|is\)\\:\\S
    > \+\\@\$R[$_ +]/io" } ( 0..$#R );
    > my $matchsub = eval "sub { $expr }";
    > if ($@) { $logger->error("Failed in building regex @R: $@"); return
    > ERROR; }
    > $matchsub;
    > }
    >
    > I don't know how to optimize this more. Maybe it would be possible to
    > do something with "map"? I think the "o" flag didn't speed it up at
    > all. Also I've tried to split the one big file into a few small ones
    > and use some forks childs to parse each of the small ones. Also this
    > didn't help.
    >
    > Thanks a lot for your help!
    >
    > Cheers
    > -Marco
    >


    It might not be possible to get much faster with such large files, but
    try this out:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use FileHandle;
    use Data::Dumper;
    use Alias;

    my %lookingFor = (
        houseware => [qw(wallpaper hangers doorknobs)],
    );

    my %lookingForRx = lookingForRx(%lookingFor);


    my $fh = new FileHandle '< largeLogFile.log';
    my $writeFh = new FileHandle '> myout.log';

    while (my $line = <$fh>) {
        foreach my $subset (keys %lookingForRx) {
            if ($line =~ /$lookingForRx{$subset}/) {
                print $writeFh $line;
            }
        }
    }


    $writeFh->close;
    $fh->close;

    #####################################

    sub lookingForRx {
        our (%oldHash, @oldArray);
        local %oldHash = @_;
        local @oldArray;

        my %hash;
        foreach my $subset (keys %oldHash) {
            alias oldArray => $oldHash{$subset};
            my $rx = do { local $" = '|'; "(@oldArray)" };
            $hash{$subset} = qr/$rx/;
        }
        %hash;
    }


    __END__

    I haven't really tested this other than to make sure it compiles.
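    For what it's worth, the Alias/local machinery in lookingForRx can
    probably be avoided by joining each subset's items directly; a sketch
    with the same sample data:

```perl
use strict;
use warnings;

my %lookingFor = (
    houseware => [ qw(wallpaper hangers doorknobs) ],
);

# Builds the same kind of %lookingForRx without Alias: join each
# subset's items into one alternation and compile it once with qr//.
my %lookingForRx;
for my $subset (keys %lookingFor) {
    my $alt = join '|', @{ $lookingFor{$subset} };
    $lookingForRx{$subset} = qr/($alt)/;
}

print "hangers on sale\n" =~ $lookingForRx{houseware}
    ? "match\n" : "no match\n";    # prints "match"
```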
     
    Mumia W., Mar 29, 2007
    #6
  7. Mumia W. wrote:
    > On 03/29/2007 07:24 AM, wrote:
    >> Dear Perl Monks, I am developing at the moment a script which has to
    >> parse 20GB files. The files I have to parse are some logfiles. My
    >> problem is that it takes ages to parse the files. I am doing something
    >> like this:
    >>
    >> my %lookingFor;
    >> # keys => different name of one subset
    >> # values => array of one subset
    >>
    >> my $fh = new FileHandle "< largeLogFile.log";
    >> [1:] while (<$fh>) {
    >> foreach my $subset (keys %lookingFor) {
    >> foreach my $item (@{$subset}) {
    >> [2:] if (<$fh> =~ m/$item/) {

    >
    > You are aware that line 2 reads in a new chunk from $fh, and the old
    > chunk read on line 1 is forgotten, don't you?


    It is the other way around. "while (<$fh>) {" reads a line and stores it in
    $_, so it is still around, while "if (<$fh> =~ m/$item/) {" reads another
    line, binds it to the regular expression, and then discards it.


    >> my $writeFh = new FileHandle ">> myout.log"; print $writeFh <
    >> $fh>;


    And a third line is read from the file and printed out.




    John
    --
    Perl isn't a toolbox, but a small machine shop where you can special-order
    certain sorts of tools at low cost and in short order. -- Larry Wall
     
    John W. Krahn, Mar 29, 2007
    #7
  8. Mumia W. Guest

    On 03/29/2007 02:02 PM, John W. Krahn wrote:
    >
    > It is the other way around. "while (<$fh>) {" read a line and stores it in $_
    > so it is still around, while "if (<$fh> =~ m/$item/) {" reads another line and
    > binds it to the regular expression and then discards it.
    >
    > [...]


    Oops, you're right. The "while (<$fh>)" magic confused me.
     
    Mumia W., Mar 29, 2007
    #8
