perl vs Unix grep

Discussion in 'Perl Misc' started by Al Belden, Jul 3, 2004.

  1. Al Belden

    Al Belden Guest

    Hi all,
    I've been working on a problem that I thought might be of interest: I'm
    trying to replace some Korn shell scripts that search source code files with
    Perl scripts, in order to gain certain features such as:

    More powerful regular expressions available in Perl
    Ability to print out lines before and after matches (GNU grep supports this
    but is not available on our Digital Unix and AIX platforms)
    Make searches case-insensitive by default (yes, I know this can be done
    with grep, but the shell scripts that use grep don't do this); a sketch
    of what I mean follows
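
    Concretely: a line-by-line filter, case-insensitive by default, that
    prints line numbers and two lines of context. The pattern and context
    size below are just placeholders, and (as I note further down) reading
    line-at-a-time has not been fast enough for me, but it shows the
    intended behavior:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $pat = shift or die "usage: $0 pattern file ...\n";
    my $context = 2;              # lines of context before and after
    my $re = qr/$pat/i;           # case-insensitive by default

    my @before;                   # ring buffer of preceding context lines
    my $after = 0;                # trailing context lines still to print
    while (my $line = <>) {
        if ($line =~ $re) {
            print @before;        # flush saved before-context
            @before = ();
            print "$.:$line";     # matching line, numbered like grep -n
            $after = $context;
        }
        elsif ($after > 0) {
            print "$.-$line";     # after-context, grep-style '-' separator
            $after--;
        }
        else {
            push @before, "$.-$line";
            shift @before if @before > $context;
        }
    }

    (Note that $. keeps counting across files unless ARGV is closed between
    them, so treat this as a sketch only.)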

    We're talking about approx. 5000 files spread over 15 directories. To date
    it has proven quite difficult (for me) to match the performance of the Korn
    shell scripts using Perl scripts and still obtain the line number and
    context information needed. The crux of the problem is that I have seen the
    best performance from Perl when I match with the /g modifier on a string
    holding the current slurped file:

    local $/;                             # undef $/ = slurp the whole file
    my $curStr = <FH>;                    # entire file as one string
    my $compiledRegex = qr/$srchStr/;
    while ($curStr =~ /$compiledRegex/g) {
        # write matches to file for eventual paging
    }

    This works well except that when each match is found I need the line number
    it was found on. As far as I can tell from reading and research, there is
    no variable that holds this information, since I am not reading from the
    file at this point. I can get the information in other ways, such as:

    1. Reading each file a line at a time, testing for a match, and keeping a
    line counter or using $. (or $NR with the English module)
    2. Reading the file into an array and processing a line at a time
    3. Creating index files for the source files that store line offsets and
    using them with the slurp method in the paragraph above
    4. Creating an in-memory index for each file that contains a match and using
    it for subsequent matches in that file

    1, 2 and 4 above suffer performance degradation relative to Unix grep. #3
    provides good performance and is the method I am currently using, but it
    requires creating and maintaining index files. I was wondering if I could
    tie a scalar to a file and use the slurping loop above; then perhaps $NR
    and $. would contain the current line number, as the file would be read
    while the loop runs. Any other ideas would be welcome.
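
    For what it's worth, one more variant I have been experimenting with
    (not benchmarked carefully): keep the slurped /g loop, but count
    newlines up to each match offset with tr///, tracking the previous
    offset so each stretch of the file is scanned only once. $srchStr and
    FH are the same as in the snippet above:

    local $/;                            # slurp mode, as above
    my $curStr  = <FH>;
    my $re      = qr/$srchStr/;
    my $lineNum = 1;
    my $lastPos = 0;
    while ($curStr =~ /$re/g) {
        my $start = $-[0];               # offset where this match begins
        # count newlines between the previous match and this one
        my $chunk = substr($curStr, $lastPos, $start - $lastPos);
        $lineNum += ($chunk =~ tr/\n//);
        $lastPos = $start;
        print "$lineNum: $&\n";          # or write to the paging file as before
    }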
     
    Al Belden, Jul 3, 2004
    #1

  2. Anno Siegel

    Anno Siegel Guest

    Al Belden <> wrote in comp.lang.perl.misc:
    > Hi all,
    > I've been working on a problem that I thought might be of interest: I'm
    > trying to replace some korn shell scripts that search source code files with
    > perl scripts ...


    [...]

    > 1. Reading each file a line at a time, testing for a match, and keeping a
    > line counter or using $. (or $NR with the English module)
    > 2. Reading the file into an array and processing a line at a time
    > 3. Creating index files for the source files that store line offsets and
    > using them with the slurp method in the paragraph above
    > 4. Creating an in-memory index for each file that contains a match and using
    > it for subsequent matches in that file
    >
    > 1, 2 and 4 above suffer performance degradation relative to Unix grep. #3
    > provides good performance and is the method I am currently using, but it
    > requires creating and maintaining index files. I was wondering if I could
    > tie a scalar to a file and use the slurping loop above; then perhaps $NR
    > and $. would contain the current line number, as the file would be read
    > while the loop runs. Any other ideas would be welcome.


    The Tie::File module (a standard module) ties a file not to a scalar
    but to an array of lines. Line numbers then map to array indices
    (line N is index N - 1, since indices start at 0). The behavior of
    grep -n can be simulated like this:

    use Tie::File;

    my $file = '/tmp/x';
    my @line;
    tie @line, 'Tie::File', $file
        or die "can't tie file '$file': $!";

    # line N of the file is $line[N - 1]
    print "$_:$line[$_ - 1]\n"
        for grep $line[$_ - 1] =~ /aaa/, 1 .. @line;

    No idea about performance, but I'd give it a try.

    Anno
     
    Anno Siegel, Jul 4, 2004
    #2

  3. Brad Baxter

    Brad Baxter Guest

    On Sat, 3 Jul 2004, Al Belden wrote:

    > Hi all,
    > I've been working on a problem that I thought might be of interest: I'm
    > trying to replace some korn shell scripts that search source code files with
    > perl scripts to gain certain features such as:
    >
    > More powerful regular expressions available in Perl
    > Ability to print out lines before and after matches (GNU grep supports this
    > but is not available on our Digital Unix and AIX platforms)
    > Make searches case-insensitive by default (yes, I know this can be done
    > with grep, but the shell scripts that use grep don't do this)


    Some other discussions that might help:

    http://groups.google.com/groups?selm=

    http://groups.google.com/groups?selm=


    Regards,

    Brad
     
    Brad Baxter, Jul 6, 2004
    #3
  4. Xho

    Guest

    "Al Belden" <> wrote:
    > Hi all,
    > I've been working on a problem that I thought might be of interest:
    > I'm trying to replace some korn shell scripts that search source code
    > files with perl scripts to gain certain features such as:
    >
    > More powerful regular expressions available in perl


    If the scripts you are seeking to replace are currently working, then
    the regexes available to the shell/grep are evidently powerful enough.
    Are you sure that this is a true reason and not simply a rationalization?
    Do the scripts undergo extensive maintenance during which you really pine
    for the power of Perl?

    > Ability to print out lines before and after matches (GNU grep supports
    > this but is not available on our Digital Unix and AIX platforms)


    I'm pretty sure you can compile gnu grep for both of those platforms.

    > Make searches case-insensitive by default (yes, I know this can be done
    > with grep, but the shell scripts that use grep don't do this)


    I think that, before trying to rewrite those scripts into Perl, I would try
    to rewrite them using Perl. In other words, use Perl to edit the scripts
    to add an -i to each grep.
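
    Something like this hypothetical one-liner, assuming the scripts are
    *.ksh files that call plain grep. It is naive (it would also hit "grep"
    inside strings and comments), but -i.bak keeps backups of each script:

    perl -pi.bak -e 's/\bgrep\b/grep -i/g' *.ksh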

    >
    > We're talking about approx. 5000 files spread over 15 directories.


    If you rewrite into perl, would you be able to reduce this to a much
    smaller number of files that parameterize some of the variation now
    represented as a multiplicity of scripts?

    > To
    > date it has proven quite difficult (for me) to match the performance of
    > the Korn shell scripts using perl scripts and still obtain the line
    > number and context information needed.


    Is it very important that you match the performance of the Korn shell
    scripts? If they are twice as slow, will anyone notice?


    > The crux of the problem is that I
    > have seen the best performance from perl when I match with the /g option
    > on a string that represents the current slurped file:
    >
    > local $/;
    > my $curStr = <FH>;
    > my $compiledRegex = qr/$srchStr/;
    > while ($curStr =~ /$compiledRegex/g)
    > {
    > # write matches to file for eventual paging
    > }
    >
    > This works well except that when each match is found I need the line
    > number the match has been found in. As far as I can tell from reading and
    > research there is no variable that holds this information as I am not
    > reading from the file at this point. I can get the information in other
    > ways such as:
    >
    > 1. Reading each file a line at a time, testing for a match and keeping a
    > line counter or using $NR or $..
    > 2. Reading the file into an array and processing a line at a time
    > 3. Creating index files for the source files that store line offsets and
    > using them with the slurp method in the paragraph above


    If you add the runtime of the matching program itself to the runtime of the
    index generating program, is this still faster than 1 or 2?

    > 4. Creating an in-memory index for each file that contains a match and
    > using it for subsequent matches in that file
    >
    > 1, 2 and 4 above suffer performance degradation relative to Unix grep. #3
    > provides good performance and is the method I am currently using, but it
    > requires creating and maintaining index files. I was wondering if I could
    > tie a scalar to a file and use the slurping loop above; then perhaps $NR
    > and $. would contain the current line number, as the file would be read
    > while the loop runs. Any other ideas would be welcome.


    If the question is performance, tie'ing is almost never the answer.

    Unix grep is very good at what it does. It will be very, very hard
    to beat it at its own game without resorting to XS-type stuff. I wouldn't
    even bother trying.

    Xho

     
    Xho, Jul 9, 2004
    #4
