Searching large files with a regex and a list

Channing · May 31, 2006

Hello All -

I would like some (constructive) suggestions on some code I'm writing.
My Perl is rusty, and that's reflected in the sample I'm posting. Here
is what I have to tackle: I have gigabyte-sized files to parse for two
different regexes. Both regexes interpolate a variable that holds a list
of 18,000+ numbers. I'm looking for suggestions on how to speed things
up, or at least make things better.

Thanks in advance for your time.

------- Code Begin ---------
#!/usr/bin/perl

my $match=0;
my $nonMatch=0;

open(DN_LIST, "<","big_list");
my @list = <DN_LIST>;
@list=sort(@list);
close(DN_LIST);
foreach (@list)
{
chomp;
s/ //g;
}
@list = join('|',@list);

while (<>)
{
if ( /^123456\d{8}($list[0])/o or /^9876(91|92)\d{24}($list[0])/o )
{
$match++;
}
else
{
$nonMatch++;
}
}

print "Match Count:" . ${match} . "\n";
print "Non-Match Count:" . ${nonMatch} . "\n";

------- Code End ---------

Bob Walton · May 31, 2006

Channing wrote:
....

------- Code Begin ---------
#!/usr/bin/perl

Here you're missing:

use warnings;
use strict;

Both should be in place during development at least.

my $match=0;
my $nonMatch=0;

open(DN_LIST, "<","big_list");

Always check the results of open() for success. Something like:

open DN_LIST,"<","big_list" or
die "Oops, big_list open failed, $!";

my @list = <DN_LIST>;
@list=sort(@list);
close(DN_LIST);
foreach (@list)
{
chomp;
s/ //g;
}
@list = join('|',@list);

While this does what you mean, it would be better and clearer as:

my $list = join '|',@list;

The result of join() is a scalar, not an array. Change
references to $list[0] below to just $list.

while (<>)
{
if ( /^123456\d{8}($list[0])/o or /^9876(91|92)\d{24}($list[0])/o )

This [untested] might (or might not) go faster with the two leading
parts combined into a single alternation, as in:

if ( /^((123456\d{8})|(9876(91|92)\d{24}))($list)/o )

Since you're not using the parenthesized groups to set the numbered
capture variables ($1, $2, ...), this [untested] version with
non-capturing groups might be better still:

if ( /^(?:(?:123456\d{8})|(?:9876(?:91|92)\d{24}))(?:$list)/o )

Beyond that, if the nature of your data is such that the \d{8}
and \d{24} bits will always match (that is, you always have that
many digits present at those spots in the data, never anything
else), then you might consider using substr and eq to test parts
of your strings for matches, since your regex then boils down to
character by character string matches. Would that be faster? I
don't know in your case, but it usually is.
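
To make the substr/eq idea concrete, here is a minimal sketch. It assumes the digit positions really do always hold digits, so only the fixed prefixes need checking. The helper name number_offset is made up for illustration, and the offsets 14 and 30 are simply the prefix lengths from the regexes in the original post (6 + 8 and 4 + 2 + 24).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the offset at which the list number would
# start, or -1 if the line has neither of the two fixed prefixes.
# Assumption: the \d{8} and \d{24} positions always contain digits,
# so they are skipped rather than checked.
sub number_offset {
    my ($line) = @_;
    my $head = substr($line, 0, 6);
    return 14 if $head eq '123456';                        # 6 + 8 digits
    return 30 if $head eq '987691' or $head eq '987692';   # 4 + 2 + 24 digits
    return -1;
}

print number_offset('12345612345678345'), "\n";                   # prints 14
print number_offset('9876911234567890123456789012346789'), "\n";  # prints 30
print number_offset('00000000000000000'), "\n";                   # prints -1
```

Whether this beats the regex on real data is something only a timing run can tell.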

Another possibility is to use the strings in @list as keys to a
hash. Then, instead of testing your data string against 18000
possible strings, take the possible strings and see if they are
present as keys in the hash. One would have to keep track of and
test the possible lengths of strings, but even with that
overhead, this approach should be a big winner time-wise -- a few
hash lookups instead of 18000 string comparisons.
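
A rough sketch of the hash approach, including the length bookkeeping, might look like this. The sample numbers stand in for the contents of big_list, and the helper name listed_at is an assumption made up for this example. The offset passed in is where the list number is expected to start.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample values only; in practice these would be read from big_list.
my @numbers = qw(12 345 6789);

# Lookup table of wanted numbers, plus the distinct key lengths that occur.
my %wanted = map { $_ => 1 } @numbers;
my %seen_len;
$seen_len{ length $_ } = 1 for @numbers;
my @lengths = sort { $a <=> $b } keys %seen_len;

# Hypothetical helper: true if one of the listed numbers starts at $offset.
# It does one hash lookup per distinct key length, not one per list entry.
sub listed_at {
    my ($line, $offset) = @_;
    for my $len (@lengths) {
        next if $offset + $len > length $line;   # not enough characters left
        return 1 if exists $wanted{ substr($line, $offset, $len) };
    }
    return 0;
}

print listed_at('12345612345678345', 14) ? "match\n" : "no match\n";   # match
print listed_at('12345612345678000', 14) ? "match\n" : "no match\n";   # no match
```

If every number in the list has the same length, @lengths holds a single value and each line costs exactly one lookup.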

John W. Krahn · May 31, 2006

According to the FAQ:

perldoc -q "How do I efficiently match many regular expressions at once"

You need to do something like this (UNTESTED):

#!/usr/bin/perl
use warnings;
use strict;

my $match = 0;
my $nonMatch = 0;

open DN_LIST, '<', 'big_list' or die "Cannot open 'big_list' $!";

my @list = map {
chomp;
tr/ //d;
qr/^(?:123456\d{8}|98769[12]\d{24})$_/;
} <DN_LIST>;

close DN_LIST;

LINE:
while ( my $line = <> ) {
for my $regex ( @list ) {
if ( $line =~ /$regex/ ) {
$match++;
next LINE;
}
}
$nonMatch++;
}

print "Match Count:$match\n";
print "Non-Match Count:$nonMatch\n";

__END__

John

attn.steven.kuo · May 31, 2006

You may want to avoid alternation in
the regular expression and just check for
matches against a hash:

use Inline::Files -backup;

my %wanted;
while (<DNLIST>)
{
chomp;
$wanted{$_} = 1;
}

while (<DATA>)
{
my $found_match = 0;
chomp;
if (/^123456\d{8}/gc || /^9876(91|92)\d{24}/gc)
{
our $digits = '';
while (/\G(\d)(?{ $digits .= $1})/g)
{
if (exists $wanted{$digits})
{
$found_match = 1;
print $_, " Matched\n";
last;
}
}
}
unless ($found_match)
{
print $_, " Not Matched\n";
}
}

__DNLIST__
12
345
6789
__DATA__
12345612345678345
00000000000000000
98769212345678901234567890123412
9876911234567890123456789012346789
9876911234567890123456789012340000

Brian McCauley · May 31, 2006

Channing said:
@list = join('|',@list);

Joining multiple regexes into one like this is _less_ efficient than
simply looping over @list, which is why the answer given in the FAQ
(yes, your question is a FAQ) does not suggest doing so. It does
suggest using qr// to precompile each regex, though:

$_=qr/$_/; # Inside your loop

Channing · May 31, 2006

Brian said:
Joining multiple regexes into one like this is _less_ efficient than
simply looping over @list, which is why the answer given in the FAQ
(yes, your question is a FAQ) does not suggest doing so. It does
suggest using qr// to precompile each regex, though:

$_=qr/$_/; # Inside your loop

Well, I tried a number of the suggestions. The best combination (of
what I tried) is posted below. This took the runtime from 2 hours to
1.5 minutes! In a nutshell, the suggestion to use a hash in place of
the regex was the breakthrough. Thanks to all for their time and
contributions to the list!

Regards,

Channing

----- Code Begin -----

#!/usr/bin/perl

my $nonMatched=0;
my $matched=0;
my %dnList;
my $dnFile = "big_list";

open(DN_LIST, "<","${dnFile}") or die "Cannot open ${dnFile} $!";
my @list = <DN_LIST>;
close(DN_LIST);
foreach (@list)
{
chomp;
s/ //g;
$dnList{$_} = 1;
}

while (<>)
{
if ( ( /^123456/o and (exists $dnList{substr($_,14,10)})) or
( /^9876(21|99)/o and (exists $dnList{substr($_,29,10)})) )
{
$matched++;
}
else
{
$nonMatched++;
}
}

print "Matched:" . ${matched} . "\n";
print "Non-Matched:" . ${nonMatched} . "\n";

----- Code Ends -----
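
For anyone who wants to compare the two approaches on their own machine, here is a small sketch using the core Benchmark module. Everything in it is synthetic: the 2,000 ten-digit numbers, the 1,000 input lines, and the sub names count_regex and count_hash are assumptions made up for the test, not my real data. The timings depend heavily on the Perl version and on the data. Perl 5.10 and later optimize alternations of literal strings, so the gap may look quite different from the one reported above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic stand-ins for big_list and the input file (assumed shapes).
my @numbers = map { sprintf '%010d', $_ * 7919 } 1 .. 2000;
my @lines   = map { '123456' . '00000000' . sprintf('%010d', $_ * 7919 * 3) }
              1 .. 1000;

# Approach 1: one big alternation, compiled once.
my $alt   = join '|', @numbers;
my $regex = qr/^123456\d{8}(?:$alt)/;

# Approach 2: cheap prefix test plus a hash lookup on a fixed-width field.
my %wanted = map { $_ => 1 } @numbers;

sub count_regex {
    my $n = 0;
    for my $line (@lines) { $n++ if $line =~ $regex }
    return $n;
}

sub count_hash {
    my $n = 0;
    for my $line (@lines) {
        $n++ if $line =~ /^123456/ and exists $wanted{ substr($line, 14, 10) };
    }
    return $n;
}

# Run each sub for at least one CPU second and print a comparison table.
cmpthese(-1, { regex => \&count_regex, hash => \&count_hash });
```

Both subs should report the same number of matches on this data, so it is worth checking that before trusting the timings.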
