HowTo parse huge Files

cadetg · Mar 29, 2007

Dear Perl Monks, I am developing at the moment a script which has to
parse 20GB files. The files I have to parse are some logfiles. My
problem is that it takes ages to parse the files. I am doing something
like this:

my %lookingFor;
# keys => different name of one subset
# values => array of one subset

my $fh = new FileHandle "< largeLogFile.log";
while (<$fh>) {
    foreach my $subset (keys %lookingFor) {
        foreach my $item (@{$subset}) {
            if (<$fh> =~ m/$item/) {
                my $writeFh = new FileHandle ">> myout.log";
                print $writeFh <$fh>;
            }
        }
    }
}

I've already tried to speed it up by using the regExp flag=>o by doing
something like this:

$isSubSet=buildRegexp(@allSubSets);
while (<$fh>) {
    foreach my $subset (keys %lookingFor) {
        if (&$isSubSet(<$fh>)) {
            my $writeFh = new FileHandle ">> myout.log";
            print $writeFh <$fh>;
        }
    }
}
sub buildRegexp {
    my @R = @_;
    my $expr = join '||', map { "\$_[0] =~ m/$To\|is$\\:\\S
\+\\@\$R[$_ +]/io" } ( 0..$#R );
    my $matchsub = eval "sub { $expr }";
    if ($@) { $logger->error("Failed in building regex @R: $@"); return ERROR; }
    $matchsub;
}

I don't know how to optimize this more. Maybe it would be possible to
do something with "map"? I think the "o" flag didn't speed it up at
all. Also I've tried to split the one big file into a few small ones
and use some forks childs to parse each of the small ones. Also this
didn't help.

Thanks a lot for your help!

Cheers
-Marco

cadetg · Mar 29, 2007

BugBear wrote:
> How many key-value pairs does %lookingFor (typically?) have?

%lookingFor has maybe just five key-value pairs, but each value is a
reference to an array holding 20 items on average. So in total that
makes maybe 100 items that I have to look for.

Cheers
-Marco

Thomas J. · Mar 29, 2007

my %lookingFor;
# keys => different name of one subset
# values => array of one subset

my $fh = new FileHandle "< largeLogFile.log";
while (<$fh>) {
    foreach my $subset (keys %lookingFor) {
        foreach my $item (@{$subset}) {
            if (<$fh> =~ m/$item/) {
                my $writeFh = new FileHandle ">> myout.log";
                print $writeFh <$fh>;
            }
        }
    }
}

Your code will "copy" the following lines from "largeLogFile.log"
into "myout.log" after a successful pattern match.
This may be time-consuming... and possibly not what you expected?
(Use $_ instead of <$fh> in your "if" and in your "print".)

REs may lead to "endless matching". Please show some samples.

study may or may not help before pattern matching (perldoc -f study).

Thomas

Uri Guttman · Mar 29, 2007

cc> Dear Perl Monks, I am developing at the moment a script which has to
cc> parse 20GB files. The files I have to parse are some logfiles. My
cc> problem is that it takes ages to parse the files. I am doing something
cc> like this:

this isn't perlmonks. that is a web site community. this is usenet.

cc> my %lookingFor;
cc> # keys => different name of one subset
cc> # values => array of one subset

cc> my $fh = new FileHandle "< largeLogFile.log";

FileHandle is an old and deprecated module. where did you learn to use
it? you can just do basic open as you don't need any OO here. also
ALWAYS check for errors on opens.

cc> while (<$fh>) {

don't use plain while to read lines. the proper idiom is:

while( my $line = <$fh> ) {

cc> foreach my $subset (keys %lookingFor) {

that makes no sense. keys are strings. they are not array refs. yet the
line below dereferences the key into an array. which means you are not
using strict and you are testing for wrong stuff.

cc> foreach my $item (@{$subset}) {

why the two loops? since you always look at all the items for each line,
factor out the items list to before the loops:

my @items = map @$_, values %lookingFor ;

cc> if (<$fh> =~ m/$item/) {

that is very wrong. it reads a new line in for each match in the
subset. the line from the while is in $_ at the moment.

cc> my $writeFh = new FileHandle ">> myout.log"; print $writeFh <$fh>;

why are you opening the output file EACH time you want to print? just
open it before the loops and print to it. my previous advice on opens
applies here too.

cc> I've already tried to speed it up by using the regExp flag=>o by doing
cc> something like this:

forget speed for now. your code is broken. are you getting any useful
output at all? i can't see how as you are looking for stuff with broken
code.

uri
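
The advice above can be pulled together into one loop. This is only a minimal sketch, not the poster's actual script: the contents of %lookingFor and the sample log text are made up, filter_lines is a hypothetical helper name, and in-memory filehandles stand in for largeLogFile.log and myout.log so the example runs on its own. It uses the three-argument open with error checks, reads each line into a lexical, flattens the subsets once before the loop, and opens the output once.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data; the real subsets were never posted.
my %lookingFor = (
    errors => [ 'timeout', 'refused' ],
    users  => [ 'alice' ],
);

# Flatten every subset into one list of patterns, once, before the loop.
my @items = map { @$_ } values %lookingFor;

# Copy each line of $in that matches at least one pattern to $out.
# Returns the number of lines written.
sub filter_lines {
    my ($in, $out, @patterns) = @_;
    my $count = 0;
    while (my $line = <$in>) {
        for my $item (@patterns) {
            if ($line =~ /$item/) {
                print {$out} $line;
                $count++;
                last;    # assumption: one copy per line, even if several items match
            }
        }
    }
    return $count;
}

# In-memory handles stand in for largeLogFile.log and myout.log.
my $log = "connection refused\nall good\nalice logged in\n";
open my $in, '<', \$log or die "Cannot open input: $!";
my $result = '';
open my $out, '>', \$result or die "Cannot open output: $!";

my $written = filter_lines($in, $out, @items);
close $out or die "Cannot close output: $!";
close $in;

print "wrote $written lines\n";
```

With real files, the two open calls would take file names instead of scalar references; the rest stays the same. The "last" is a design choice that the original code did not make: without it, a line is written once per matching item.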

J. Gleixner · Mar 29, 2007

Dear Perl Monks, I am developing at the moment a script which has to
parse 20GB files. The files I have to parse are some logfiles. My
problem is that it takes ages to parse the files. I am doing something
like this:

You might be better off using a large egrep and/or simplifying your
items. E.g. if your items contained 'abc' and 'abcd', you would only
have to search for 'abc'.

my %lookingFor;
# keys => different name of one subset
# values => array of one subset

my $fh = new FileHandle "< largeLogFile.log";
while (<$fh>) {
    foreach my $subset (keys %lookingFor) {
        foreach my $item (@{$subset}) {
            if (<$fh> =~ m/$item/) {
                my $writeFh = new FileHandle ">> myout.log";
                print $writeFh <$fh>;
            }

Open it once, before the while, and write $_, not <$fh>.

print $writeFh $_ if /$item/;
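
The "simplify your items" idea can be sketched as a small pruning step. This assumes the items are plain literal strings rather than regular expressions, and that it only matters whether a line matches any item at all; prune_items is a hypothetical helper name and the sample items are made up.

```perl
use strict;
use warnings;

# Drop every item that contains another kept item as a literal substring:
# any line matching the longer item also matches the shorter one.
# Assumes plain strings, not regular expressions.
sub prune_items {
    my @sorted = sort { length $a <=> length $b or $a cmp $b } @_;
    my @kept;
    ITEM: for my $item (@sorted) {
        for my $shorter (@kept) {
            next ITEM if index($item, $shorter) >= 0;
        }
        push @kept, $item;
    }
    return @kept;
}

my @items  = ('abcd', 'abc', 'xyz', 'wxyz', 'abc');
my @pruned = prune_items(@items);
print "@pruned\n";
```

For the sample list this keeps only abc and xyz, since the other entries either contain one of them or are duplicates. With roughly 100 items the pruning cost is negligible next to scanning a 20GB file.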

Mumia W. · Mar 29, 2007

Dear Perl Monks, I am developing at the moment a script which has to
parse 20GB files. The files I have to parse are some logfiles. My
problem is that it takes ages to parse the files. I am doing something
like this:

my %lookingFor;
# keys => different name of one subset
# values => array of one subset

my $fh = new FileHandle "< largeLogFile.log";
[1:] while (<$fh>) {
foreach my $subset (keys %lookingFor) {
foreach my $item (@{$subset}) {
[2:] if (<$fh> =~ m/$item/) {

You are aware that line 2 reads in a new chunk from $fh, and the old
chunk read on line 1 is forgotten, aren't you?

my $writeFh = new FileHandle ">> myout.log"; print $writeFh <$fh>;

You can open the write filehandle once and keep it open til you are done.

}
}
}

I've already tried to speed it up by using the regExp flag=>o by doing
something like this:

$isSubSet=buildRegexp(@allSubSets);
while (<$fh>) {
    foreach my $subset (keys %lookingFor) {
        if (&$isSubSet(<$fh>)) {
            my $writeFh = new FileHandle ">> myout.log";
            print $writeFh <$fh>;
        }
    }
}
sub buildRegexp {
    my @R = @_;
    my $expr = join '||', map { "\$_[0] =~ m/$To\|is$\\:\\S
\+\\@\$R[$_ +]/io" } ( 0..$#R );
    my $matchsub = eval "sub { $expr }";
    if ($@) { $logger->error("Failed in building regex @R: $@"); return ERROR; }
    $matchsub;
}

I don't know how to optimize this more. Maybe it would be possible to
do something with "map"? I think the "o" flag didn't speed it up at
all. Also I've tried to split the one big file into a few small ones
and use some forks childs to parse each of the small ones. Also this
didn't help.

Thanks a lot for your help!

Cheers
-Marco

It might not be possible to get much faster with such large files, but
try this out:

#!/usr/bin/perl
use strict;
use warnings;
use FileHandle;
use Data::Dumper;
use Alias;

my %lookingFor = (
houseware => [qw(wallpaper hangers doorknobs)],
);

my %lookingForRx = lookingForRx(%lookingFor);

my $fh = new FileHandle '< largeLogFile.log';
my $writeFh = new FileHandle '> myout.log';

while (my $line = <$fh>) {
    foreach my $subset (keys %lookingForRx) {
        if ($line =~ /$lookingForRx{$subset}/) {
            print $writeFh $line;
        }
    }
}

$writeFh->close;
$fh->close;

#####################################

sub lookingForRx {
    our (%oldHash, @oldArray);
    local %oldHash = @_;
    local @oldArray;

    my %hash;
    foreach my $subset (keys %oldHash) {
        alias oldArray => $oldHash{$subset};
        my $rx = do { local $" = '|'; "(@oldArray)" };
        $hash{$subset} = qr/$rx/;
    }
    %hash;
}

__END__

I haven't really tested this other than to make sure it compiles.
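
The same idea, one precompiled alternation instead of a loop over items, can also be sketched without the Alias module. This is an illustration under assumptions, not a tested replacement for the script above: the second subset and the sample lines are made up, and quotemeta is applied on the assumption that the items are literal strings (the script above interpolates them as regex source instead).

```perl
use strict;
use warnings;

# Hypothetical subsets; "garden" is invented for the example.
my %lookingFor = (
    houseware => [qw(wallpaper hangers doorknobs)],
    garden    => [qw(hose rake)],
);

# One alternation covering every item of every subset.
# quotemeta makes each item match literally.
my $alternation = join '|', map { quotemeta($_) } map { @$_ } values %lookingFor;
my $anyItem     = qr/$alternation/;

# Sample lines standing in for the log file.
my @lines = ("new doorknobs arrived\n", "nothing here\n", "rake sold\n");
my @hits  = grep { $_ =~ $anyItem } @lines;
print @hits;
```

Because every line is tested against a single compiled pattern, a line that matches items from several subsets is still reported only once. Whether this is faster than the per-subset loop on a 20GB file would need to be measured.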

John W. Krahn · Mar 29, 2007

Mumia said:
Dear Perl Monks, I am developing at the moment a script which has to
parse 20GB files. The files I have to parse are some logfiles. My
problem is that it takes ages to parse the files. I am doing something
like this:

my %lookingFor;
# keys => different name of one subset
# values => array of one subset

my $fh = new FileHandle "< largeLogFile.log";
[1:] while (<$fh>) {
foreach my $subset (keys %lookingFor) {
foreach my $item (@{$subset}) {
[2:] if (<$fh> =~ m/$item/) {

You are aware that line 2 reads in a new chunk from $fh, and the old
chunk read on line 1 is forgotten, aren't you?

It is the other way around. "while (<$fh>) {" reads a line and stores it in $_
so it is still around, while "if (<$fh> =~ m/$item/) {" reads another line,
binds it to the regular expression and then discards it.

And a third line is read from the file and printed out.

John
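
The sequence of reads described above can be checked with a small experiment. This sketch uses an in-memory file of five made-up lines and records what each of the three <$fh> expressions consumes; the array names are invented for the illustration. One detail worth noting: print supplies list context to its arguments, so the <$fh> inside the print reads all remaining lines, not just one.

```perl
use strict;
use warnings;

# Five sample lines in an in-memory file.
my $data = join '', map { "line $_\n" } 1 .. 5;
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my (@in_topic, @tested, @printed);
while (<$fh>) {                 # read 1: the line is stored in $_
    push @in_topic, $_;
    my $tested = <$fh>;         # read 2: scalar context, as in "<$fh> =~ m/$item/"
    push @tested, $tested;
    push @printed, <$fh>;       # read 3: list context, as in "print $writeFh <$fh>"
}
close $fh;

printf "%d in \$_, %d tested, %d printed\n",
    scalar @in_topic, scalar @tested, scalar @printed;
```

On this input the loop body runs once: line 1 ends up in $_, line 2 is the one that would be tested against the pattern, and lines 3 to 5 are what the print would write, after which the file is exhausted.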

Mumia W. · Mar 29, 2007

It is the other way around. "while (<$fh>) {" reads a line and stores it in $_
so it is still around, while "if (<$fh> =~ m/$item/) {" reads another line,
binds it to the regular expression and then discards it.

[...]

Oops, you're right. The "while (<$fh>)" magic confused me.
