Searching large files with a regex and a list

Channing

Hello All -

I would like some (constructive) suggestions on some code I'm writing.
My Perl is rusty, and that's reflected in the sample I'm posting. Here
is what I have to tackle: I have gigabyte-sized files to scan with two
different regexes. Each regex interpolates a variable holding a list of
18,000+ numbers. I'm looking for suggestions on how to speed things up,
or at least improve the code.

Thanks in advance for your time.

------- Code Begin ---------
#!/usr/bin/perl

my $match=0;
my $nonMatch=0;

open(DN_LIST, "<","big_list");
my @list = <DN_LIST>;
@list=sort(@list);
close(DN_LIST);
foreach (@list)
{
chomp;
s/ //g;
}
@list = join('|',@list);


while (<>)
{
if ( /^123456\d{8}($list[0])/o or /^9876(91|92)\d{24}($list[0])/o )
{
$match++;
}
else
{
$nonMatch++;
}
}

print "Match Count:" . ${match} . "\n";
print "Non-Match Count:" . ${nonMatch} . "\n";

------- Code End ---------
 
Bob Walton

Channing wrote:
....
------- Code Begin ---------
#!/usr/bin/perl

Here you're missing:

use warnings;
use strict;

Both should be in place during development at least.

my $match=0;
my $nonMatch=0;

open(DN_LIST, "<","big_list");

Always check the results of open() for success. Something like:

open DN_LIST,"<","big_list" or
    die "Oops, big_list open failed, $!";

my @list = <DN_LIST>;
@list=sort(@list);
close(DN_LIST);
foreach (@list)
{
chomp;
s/ //g;
}
@list = join('|',@list);

While this DWYM, it would be better and clearer as:

my $list = join '|',@list;

The result of join() is a scalar, not an array. Change
references to $list[0] below to just $list.

while (<>)
{
if ( /^123456\d{8}($list[0])/o or /^9876(91|92)\d{24}($list[0])/o )

This [untested] might (or might not) go faster with the leading
part alternated, as in:

if ( /^((123456\d{8})|(9876(91|92)\d{24}))($list)/o )

Since you're not using the parenthesized groups to set the numbered
capture variables ($1, $2, ...), this [untested] might be better still:

if ( /^(?:(?:123456\d{8})|(?:9876(?:91|92)\d{24}))(?:$list)/o )
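
A minimal sketch of the same non-capturing pattern, compiled once with qr// rather than relying on /o. The sample numbers are made up, and the quotemeta call is an added assumption that every list entry should be matched literally:

```perl
use strict;
use warnings;

# Made-up sample numbers standing in for the 18,000-entry list.
my @numbers = qw(12 345 6789);

# quotemeta keeps each entry literal; qr// compiles the combined pattern once.
my $alternatives = join '|', map { quotemeta($_) } @numbers;
my $re = qr/^(?:123456\d{8}|9876(?:91|92)\d{24})(?:$alternatives)/;

for my $line ('12345612345678345', '00000000000000000') {
    print "$line: ", ( $line =~ $re ? "Matched" : "Not Matched" ), "\n";
}
```

Note that an empty list would leave an empty alternation, which matches anything, so a real script would probably want to check that the list is non-empty first.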

Beyond that, if the nature of your data is such that the \d{8}
and \d{24} bits will always match (that is, you always have that
many digits present at those spots in the data, never anything
else), then you might consider using substr and eq to test parts
of your strings for matches, since your regex then boils down to
character by character string matches. Would that be faster? I
don't know in your case, but it usually is.
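
As a rough illustration of the substr/eq idea, here is a sketch that skips the fixed-width header by position. The helper name tail_after_header is hypothetical, and it assumes the \d{8} and \d{24} fields always hold digits, so they are not validated:

```perl
use strict;
use warnings;

# Hypothetical helper: return the text that follows the fixed-width header,
# or nothing if the line does not start with one of the two known prefixes.
# The digit fields are skipped by position, not checked.
sub tail_after_header {
    my ($line) = @_;

    # 6-character prefix plus 8 digits = 14 characters of header.
    return substr($line, 14)
        if length($line) >= 14 && substr($line, 0, 6) eq '123456';

    # 4-character prefix, 2-character type, 24 digits = 30 characters of header.
    if (length($line) >= 30 && substr($line, 0, 4) eq '9876') {
        my $type = substr($line, 4, 2);
        return substr($line, 30) if $type eq '91' || $type eq '92';
    }
    return;
}

my $tail = tail_after_header('12345612345678345');
print defined $tail ? "tail: $tail\n" : "no header\n";
```

Whether this beats the regex on real data would need to be measured; the sketch only shows the shape of the position-based test.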

Another possibility is to use the strings in @list as keys to a
hash. Then, instead of testing your data string against 18000
possible strings, take the possible strings and see if they are
present as keys in the hash. One would have to keep track of and
test the possible lengths of strings, but even with that
overhead, this approach should be a big winner time-wise -- a few
hash lookups instead of 18000 string comparisons.
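
A small sketch of the hash approach with length tracking, under the assumption that the header has already been stripped from the line. The sample numbers and the helper name starts_with_listed_number are made up for illustration:

```perl
use strict;
use warnings;

# Made-up sample numbers standing in for the real list.
my @numbers = qw(12 345 6789);

# Keys of %wanted are the listed numbers; @lengths holds each distinct key length.
my %wanted = map { $_ => 1 } @numbers;
my %seen_len;
$seen_len{ length $_ } = 1 for @numbers;
my @lengths = sort { $a <=> $b } keys %seen_len;

# Hypothetical helper: true if $tail begins with any listed number.
# One hash lookup is made per distinct key length.
sub starts_with_listed_number {
    my ($tail) = @_;
    for my $len (@lengths) {
        last if $len > length $tail;
        return 1 if exists $wanted{ substr($tail, 0, $len) };
    }
    return 0;
}

print starts_with_listed_number('345')  ? "Matched\n" : "Not Matched\n";
print starts_with_listed_number('0000') ? "Matched\n" : "Not Matched\n";
```

If every number in the list has the same length, @lengths holds a single entry and the test reduces to one lookup per line.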
 
John W. Krahn

According to the FAQ:

perldoc -q "How do I efficiently match many regular expressions at once"

You need to do something like this (UNTESTED):

#!/usr/bin/perl
use warnings;
use strict;

my $match = 0;
my $nonMatch = 0;

open DN_LIST, '<', 'big_list' or die "Cannot open 'big_list' $!";

my @list = map {
chomp;
tr/ //d;
qr/^(?:123456\d{8}|98769[12]\d{24})$_/;
} <DN_LIST>;

close DN_LIST;

LINE:
while ( my $line = <> ) {
for my $regex ( @list ) {
if ( $line =~ /$regex/ ) {
$match++;
next LINE;
}
}
$nonMatch++;
}

print "Match Count:$match\n";
print "Non-Match Count:$nonMatch\n";

__END__



John
 
attn.steven.kuo

You may want to avoid alternation in
the regular expression and just check for
matches against a hash:


use Inline::Files -backup;

my %wanted;
while (<DNLIST>)
{
chomp;
$wanted{$_} = 1;
}


while (<DATA>)
{
my $found_match = 0;
chomp;
if (/^123456\d{8}/gc || /^9876(91|92)\d{24}/gc)
{
our $digits = '';
while (/\G(\d)(?{ $digits .= $1})/g)
{
if (exists $wanted{$digits})
{
$found_match = 1;
print $_, " Matched\n";
last;
}
}
}
unless ($found_match)
{
print $_, " Not Matched\n";
}
}

__DNLIST__
12
345
6789
__DATA__
12345612345678345
00000000000000000
98769212345678901234567890123412
9876911234567890123456789012346789
9876911234567890123456789012340000
 
Brian McCauley

Channing said:
@list = join('|',@list);

Joining multiple regexes into one like this is _less_ efficient than
simply looping over @list, which is why the answer given in the FAQ
(yes, your question is a FAQ) does not suggest doing so. It does
suggest using qr// to precompile each regex, though:

$_=qr/$_/; # Inside your loop
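
A minimal sketch of how that line might sit inside the cleanup loop, followed by a loop over the precompiled patterns. The sample entries and the helper name matches_any are made up for illustration:

```perl
use strict;
use warnings;

# Made-up sample entries standing in for the lines read from big_list.
my @list = ("12 \n", "3 45\n", "6789\n");

for (@list) {
    chomp;
    s/ //g;
    $_ = qr/$_/;    # compile each pattern once, before the main loop
}

# Hypothetical helper: true if any precompiled pattern matches $text.
sub matches_any {
    my ($text) = @_;
    for my $re (@list) {
        return 1 if $text =~ $re;
    }
    return 0;
}

print matches_any('xx345') ? "Matched\n" : "Not Matched\n";
```

If the entries could contain regex metacharacters, wrapping them as qr/\Q$_\E/ would keep them literal; that is an assumption about the data, not something stated in the thread.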
 
Channing

Brian said:
Joining multiple regexes into one like this is _less_ efficient than
simply looping over @list, which is why the answer given in the FAQ
(yes, your question is a FAQ) does not suggest doing so. It does
suggest using qr// to precompile each regex, though:

$_=qr/$_/; # Inside your loop

Well, I tried a number of the suggestions. The best combination (of
what I tried) is posted below. This took the runtime from 2 hours to
1.5 minutes! In a nutshell, the suggestion to use a hash in place of
the regex was the breakthrough. Thanks to all for their time and
contributions to the list!

Regards,

Channing

----- Code Begin -----


#!/usr/bin/perl

my $nonMatched = 0;
my $matched    = 0;
my %dnList;
my $dnFile = "big_list";

# Load the number list as hash keys so each test is a single lookup.
open(DN_LIST, "<", $dnFile) or die "Cannot open $dnFile $!";
my @list = <DN_LIST>;
close(DN_LIST);
foreach (@list)
{
    chomp;
    s/ //g;
    $dnList{$_} = 1;
}

while (<>)
{
    # Check the fixed prefix, then look up the 10-character field
    # at its known offset in the hash.
    if ( ( /^123456/ and exists $dnList{substr($_,14,10)} ) or
         ( /^9876(21|99)/ and exists $dnList{substr($_,29,10)} ) )
    {
        $matched++;
    }
    else
    {
        $nonMatched++;
    }
}

print "Matched:$matched\n";
print "Non-Matched:$nonMatched\n";

----- Code Ends -----
 
