perl vs Unix grep


Al Belden

Hi all,
I've been working on a problem that I thought might be of interest: I'm
trying to replace some korn shell scripts that search source code files with
perl scripts to gain certain features such as:

More powerful regular expressions available in perl
Ability to print out lines before and after matches (GNU grep supports this
but is not available on our Digital Unix and AIX platforms)
Make searches case insensitive by default (yes, I know this can be done with
grep but the shell scripts that use grep don't do this)

We're talking about approx. 5000 files spread over 15 directories. To date
it has proven quite difficult (for me) to match the performance of the Korn
shell scripts using perl scripts and still obtain the line number and
context information needed. The crux of the problem is that I have seen the
best performance from perl when I match with the /g option on a string that
represents the current slurped file:

local $/;    # undefine the record separator so <FH> slurps the whole file
my $curStr = <FH>;
my $compiledRegex = qr/$srchStr/;
while ($curStr =~ /$compiledRegex/g)
{
    # write matches to file for eventual paging
}

This works well except that when each match is found I need the line number
the match has been found in. As far as I can tell from reading and research
there is no variable that holds this information as I am not reading from
the file at this point. I can get the information in other ways such as:

1. Reading each file a line at a time, testing for a match and keeping a
line counter or using $NR or $..
2. Reading the file into an array and processing a line at a time
3. Creating index files for the source files that store line offsets and
using them with the slurp method in the paragraph above
4. Creating an in-memory index for each file that contains a match and using
it for subsequent matches in that file

1, 2 and 4 above suffer performance degradation relative to unix grep. #3
provides good performance and is the method I am currently using but it
requires creating and maintaining index files. I was wondering if I could
tie a scalar to a file and use the slurping loop above. Then perhaps $NR and
$. would contain the current line number as the file would be read as the
loop is traversed. Any other ideas would be welcome.
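
One way to stay inside the slurp loop and still get a line number is to count only the newlines between the previous match and the current one, using the match offset in $-[0]. The following is a minimal sketch under that assumption, not something measured against grep; the helper name matches_with_line_numbers is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [line_number, matched_text] pairs for every
# match of $regex in $text, where $text holds a whole slurped file.
sub matches_with_line_numbers {
    my ($text, $regex) = @_;
    my @hits;
    my $line = 1;    # line number at offset $last
    my $last = 0;    # offset up to which newlines have been counted
    while ($text =~ /$regex/g) {
        my ($start, $end) = ($-[0], $+[0]);    # offsets of the whole match
        # Count only the newlines between the previous match and this one.
        my $gap = substr($text, $last, $start - $last);
        $line += ($gap =~ tr/\n//);
        $last = $start;
        push @hits, [ $line, substr($text, $start, $end - $start) ];
    }
    return @hits;
}

# Example: print "line:match" for each hit in a small in-memory string.
for my $hit (matches_with_line_numbers("foo\nbar foo\n\nfoo", qr/foo/)) {
    print "$hit->[0]:$hit->[1]\n";
}
```

Each byte of the file is scanned for newlines at most once, so no index file is needed. Whether this keeps up with the index-file approach would have to be benchmarked on the real source tree.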
 

Anno Siegel

Al Belden said:
Hi all,
I've been working on a problem that I thought might be of interest: I'm
trying to replace some korn shell scripts that search source code files with
perl scripts ...
[...]

1. Reading each file a line at a time, testing for a match and keeping a
line counter or using $NR or $..
2. Reading the file into an array and processing a line at a time
3. Creating index files for the source files that store line offsets and
using them with the slurp method in the paragraph above
4. Creating an in-memory index for each file that contains a match and using
it for subsequent matches in that file

1, 2 and 4 above suffer performance degradation relative to unix grep. #3
provides good performance and is the method I am currently using but it
requires creating and maintaining index files. I was wondering if I could
tie a scalar to a file and use the slurping loop above. Then perhaps $NR and
$. would contain the current line number as the file would be read as the
loop is traversed. Any other ideas would be welcome

The Tie::File module (a standard module) ties a file not to a scalar
but an array of lines. Line numbers are then represented as the
array index (starting at 0). The behavior of grep -n can be simulated
like this:

use Tie::File;

my @line;
tie @line, 'Tie::File', $_ or
    die "can't tie file '$_': $!" for '/tmp/x';

print "$_:$line[ $_ - 1]\n" for
    grep $line[ $_ - 1] =~ /aaa/, 1 .. @line;

No idea about performance, but I'd give it a try.
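
Because an array of lines gives access by index, the lines before and after a match can be picked out by index as well. Here is a minimal sketch of that idea; the helper name matches_with_context and its arguments are hypothetical, and it takes a reference to any array of lines, whether plain or tied with Tie::File. Speed is untested.

```perl
use strict;
use warnings;

# Hypothetical helper: for each line matching $regex, return a hash holding
# the 1-based line number and the surrounding context lines, each context
# entry being [line_number, text].
sub matches_with_context {
    my ($lines, $regex, $before, $after) = @_;
    my @result;
    my $max = $#{$lines};
    for my $i (0 .. $max) {
        next unless $lines->[$i] =~ $regex;
        # Clamp the context window to the bounds of the file.
        my $from = $i - $before;
        $from = 0 if $from < 0;
        my $to = $i + $after;
        $to = $max if $to > $max;
        push @result, {
            line    => $i + 1,
            context => [ map { [ $_ + 1, $lines->[$_] ] } $from .. $to ],
        };
    }
    return @result;
}

# Example: one line of context either side, case-insensitive match.
my @sample = ("one", "two AAA", "three", "four", "aaa five");
for my $m (matches_with_context(\@sample, qr/aaa/i, 1, 1)) {
    print "$_->[0]:$_->[1]\n" for @{ $m->{context} };
    print "--\n";
}
```

Overlapping context windows are reported once per match here, so adjacent matches repeat lines; merging them is left out to keep the sketch short.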

Anno
 

Brad Baxter

Al Belden said:
Hi all,
I've been working on a problem that I thought might be of interest: I'm
trying to replace some korn shell scripts that search source code files with
perl scripts to gain certain features such as:

More powerful regular expressions available in perl
Ability to print out lines before and after matches (GNU grep supports this
but is not available on our Digital Unix and AIX platforms)
Make searches case insensitive by default (yes, I know this can be done with
grep but the shell scripts that use grep don't do this)

Some other discussions that might help:

http://groups.google.com/[email protected]

http://groups.google.com/[email protected]


Regards,

Brad
 

ctcgag

Al Belden said:
Hi all,
I've been working on a problem that I thought might be of interest:
I'm trying to replace some korn shell scripts that search source code
files with perl scripts to gain certain features such as:

More powerful regular expressions available in perl

If the scripts you are seeking to replace are currently working, then
the regexes available to the shell/grep are evidently powerful enough.
Are you sure that this is a true reason and not simply a rationalization?
Do the scripts undergo extensive maintenance during which you really pine
for the power of Perl?

Ability to print out lines before and after matches (GNU grep supports
this but is not available on our Digital Unix and AIX platforms)

I'm pretty sure you can compile GNU grep for both of those platforms.

Make searches case insensitive by default (yes, I know this can be done
with grep but the shell scripts that use grep don't do this)

I think that, before trying to rewrite those scripts into Perl, I would try
to rewrite them using Perl. In other words, use Perl to edit the scripts
to add an -i to each grep.
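
A minimal sketch of that kind of edit, assuming the scripts call grep by its bare name and that checking only the first option group for an existing i is good enough. The helper name add_ignore_case is hypothetical, and the substitution would also touch the word grep in comments or strings, so the result needs review before use.

```perl
use strict;
use warnings;

# Hypothetical helper: add -i to grep invocations on one line of a shell
# script, unless the option group right after grep already contains i.
# \b keeps egrep and fgrep from matching.
sub add_ignore_case {
    my ($line) = @_;
    $line =~ s/\bgrep\b(?!\s+-[A-Za-z]*i)/grep -i/g;
    return $line;
}

# Example: filter a script from standard input to standard output.
# Usage (hypothetical file names): perl add_i.pl < search.ksh > search_new.ksh
unless (caller) {
    print add_ignore_case($_) while <STDIN>;
}
```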

We're talking about approx. 5000 files spread over 15 directories.

If you rewrite into perl, would you be able to reduce this to a much
smaller number of files that parameterize some of the variation now
represented as a multiplicity of scripts?

To date it has proven quite difficult (for me) to match the performance of
the Korn shell scripts using perl scripts and still obtain the line
number and context information needed.

Is it very important that you match the performance of the Korn shell
scripts? If the Perl versions are twice as slow, will anyone notice?

The crux of the problem is that I
have seen the best performance from perl when I match with the /g option
on a string that represents the current slurped file:

local $/;
my $curStr = <FH>;
my $compiledRegex = qr/$srchStr/;
while ($curStr =~ /$compiledRegex/g)
{
# write matches to file for eventual paging
}

This works well except that when each match is found I need the line
number the match has been found in. As far as I can tell from reading and
research there is no variable that holds this information as I am not
reading from the file at this point. I can get the information in other
ways such as:

1. Reading each file a line at a time, testing for a match and keeping a
line counter or using $NR or $..
2. Reading the file into an array and processing a line at a time
3. Creating index files for the source files that store line offsets and
using them with the slurp method in the paragraph above

If you add the runtime of the matching program itself to the runtime of the
index generating program, is this still faster than 1 or 2?

4. Creating an in-memory index for each file that contains a match and
using it for subsequent matches in that file

1, 2 and 4 above suffer performance degradation relative to unix grep. #3
provides good performance and is the method I am currently using but it
requires creating and maintaining index files. I was wondering if I could
tie a scalar to a file and use the slurping loop above. Then perhaps $NR
and $. would contain the current line number as the file would be read as
the loop is traversed. Any other ideas would be welcome

If the question is performance, tying is almost never the answer.

Unix grep is very good at what it does. It will be very, very hard
to beat it at its own game without resorting to XS-type stuff. I wouldn't
even bother trying.

Xho
 
