Efficiently searching multiple files

Jorge · May 20, 2010

I have a Linux directory full of shell scripts (~100 scripts with
average
size of ~60 lines) with many of them calling other scripts as
depencencies.

Not being the author of the scripts and with the immediate need to
become familiar with
the overall scope of this script directory, I needed to create a cross-
reference chart
that shows what script depends on what other script.

My Perl script (below) works in terms of doing what I need, however,
IMO, it appears to run a bit slow. I timed it at ~3.5 sec. Possibly
that is it's max performance but something tells me (from other Perl
script experiences) this one just isn't up to speed.

I would appreciate any ideas or comments.

Thanks, Jorge

#!/usr/bin/perl -l

use strict;
use warnings;

&scripts_dir('/some/dir/that/holds/scripts');

sub scripts_dir {

# set some vars
my $dir = shift;
my(@paths, $paths, @scripts, $scripts, $string);
my($lines, $leaf, $header, $header2);
local($_);

# check dir can be opened for read
unless (opendir(DIR, $dir)) {
die "can't open $dir $!\n";
closedir(DIR);
return;
}

#
# read dir, skip over system files and bak files
# build array of script names
# build array of full-path script names
#
foreach (readdir(DIR)) {
next if $_ eq '.' || $_ eq '..' || $_ =~ /\.bak$/;
push(@scripts, $_);
$paths = $dir."/".$_;
push(@paths, $paths);
}
closedir(DIR);

# open ouput file, format and print headers
open OUT, ">output" or die "cannot open output for writing $! ...
\n";
$header = sprintf("%-25s%-25s%s", "SCRIPT NAME", "FOUND IN",
"USAGE");
$header2 = sprintf("%-25s%-25s%s", "===========", "========",
"=====");
print OUT $header;
print OUT $header2;

# loop through each script name
foreach $scripts(@scripts){

# loop through each script in directory
foreach my $paths(@paths){

# get last leaf of script being searched --
# if it matches itself; skip
$leaf = get_leaf($paths);
if($scripts eq $leaf) { next;}

# open each script for searching
open F, "$paths" or die "cannot open $paths for reading
$! ...\n";

while(my $lines = <F>) {

# -l switch in place
chomp($lines);

# search for matches to the commonly-used command
syntax
if($lines =~ /\$\($scripts / || $lines =~ /\`
$scripts /){

# format to line up with headers
$string = sprintf("%-25s%-25s%s", $scripts, $leaf,
$lines);

# print to file
print OUT $string;
}
}
}
}
# close I/O streams
close(F);
close(OUT);
}

#======================
# subroutine get_leaf
#======================
sub get_leaf
{
# get arg(s)
my($pathname) = @_;

# split on leafs
my @split_pathname = split( /[\\\/]/, $pathname);

# grab last leaf
my $leaf = pop( @split_pathname );

# return bare script name (leaf)
return( $leaf );
}

__END__

Uri Guttman · May 20, 2010

J> #!/usr/bin/perl -l

J> use strict;
J> use warnings;

good

J> &scripts_dir('/some/dir/that/holds/scripts');

don't call subs with & as it isn't needed with () and it is perl4
style. there are other subtle issues that can bite you.

J> sub scripts_dir {

J> # set some vars
J> my $dir = shift;
J> my(@paths, $paths, @scripts, $scripts, $string);
J> my($lines, $leaf, $header, $header2);

don't declare vars before you need them. in many cases you can declare
then when first used.

J> local($_);

why? actually i recommend against using $_ as much as possible (it does
have its places) so you can used named vars which are easier to read and
make the code better

J> # check dir can be opened for read
J> unless (opendir(DIR, $dir)) {

use a lexical handle

unless (opendir(my $dirh, $dir)) {

J> die "can't open $dir $!\n";
J> closedir(DIR);

why close the handle if it never opened?

J> return;

you can just exit here instead.

J> }

better yet, use File::Slurp's read_dir.

J> #
J> # read dir, skip over system files and bak files
J> # build array of script names
J> # build array of full-path script names
J> #
J> foreach (readdir(DIR)) {
J> next if $_ eq '.' || $_ eq '..' || $_ =~ /\.bak$/;

read_dir skips . and .. for you.

J> push(@scripts, $_);
J> $paths = $dir."/".$_;
J> push(@paths, $paths);
J> }
J> closedir(DIR);

that can all be done so much simpler like this (untested):

my @paths = map "$dir/$_", grep !/\.bak$/, read_dir $dir ;

J> # open ouput file, format and print headers
J> open OUT, ">output" or die "cannot open output for writing $!
J> ...

ditto for lexical handles

J> \n";
J> $header = sprintf("%-25s%-25s%s", "SCRIPT NAME", "FOUND IN",
J> "USAGE");
J> $header2 = sprintf("%-25s%-25s%s", "===========", "========",
J> "=====");

you use the same sprintf format three times. put that into a variable
above so you can change it in one place.

J> print OUT $header;
J> print OUT $header2;

no need for two print calls.

print OUT $header, $header2 ;

J> # loop through each script name
J> foreach $scripts(@scripts){

foreach my $scripts (@scripts) {

that is how you can declare vars locally. and use more horizontal
whitespace. others need to read your code so think about them when you
write it. and YOU are an other too!

i noticed this below but the point is true here. don't use plural names
for singular things. $scripts has a single script name

J> # loop through each script in directory
J> foreach my $paths(@paths){

you used my there. why not in the previous loop?

J> # get last leaf of script being searched --
J> # if it matches itself; skip
J> $leaf = get_leaf($paths);
J> if($scripts eq $leaf) { next;}

slightly faster as you don't need a block entry on next. also a better
style for simple flow control like this.

next if $scripts eq $leaf ;

J> # open each script for searching
J> open F, "$paths" or die "cannot open $paths for reading

don't quote scalar vars as it can lead to subtle bugs. it is not needed here.

J> $! ...\n";

J> while(my $lines = <F>) {

$lines is a single line so make that singular. using a plural name
implies an array or array ref

J> # -l switch in place

that only affects one liners using -p or -n. no effect on regular code

J> chomp($lines);

J> # search for matches to the commonly-used command
J> syntax
J> if($lines =~ /\$\($scripts / || $lines =~ /\`
J> $scripts /){

this doesn't make sense. are you checking if the current script refers
to itself??

J> # format to line up with headers
J> $string = sprintf("%-25s%-25s%s", $scripts, $leaf,
J> $lines);

J> # print to file
J> print OUT $string;

since you just print the string, you can use printf directly.

J> }
J> }
J> }
J> }
J> # close I/O streams
J> close(F);
J> close(OUT);
J> }

J> #======================
J> # subroutine get_leaf
J> #======================
J> sub get_leaf
J> {
J> # get arg(s)
J> my($pathname) = @_;

J> # split on leafs
J> my @split_pathname = split( /[\\\/]/, $pathname);

gack!!

File::Basename does this for you and simpler.

J> # grab last leaf
J> my $leaf = pop( @split_pathname );

you could just return the popped value directly. hell, you can do that
in the split line too:

return( (split( /[\\\/]/, $pathname)[-1] );

as for speedups, i can't help much since i don't have your input
data. nothing seems oddly slow looking but your core loops are deep and
can be done better. slurping in the entire file (File::Slurp) and
scanning for those lines in one regex would be noticeably faster. and
your core logic is suspect as it doesn't seem to check for calling other
scripts from a given one.

uri

J. Gleixner · May 20, 2010

Jorge said:
I have a Linux directory full of shell scripts (~100 scripts with
average
size of ~60 lines) with many of them calling other scripts as
depencencies.

Not being the author of the scripts and with the immediate need to
become familiar with
the overall scope of this script directory, I needed to create a cross-
reference chart
that shows what script depends on what other script.

My Perl script (below) works in terms of doing what I need, however,
IMO, it appears to run a bit slow. I timed it at ~3.5 sec. Possibly
that is it's max performance but something tells me (from other Perl
script experiences) this one just isn't up to speed.

3 seconds isn't fast enough??.. just write it, run it, and move on..
sheesh..

I would appreciate any ideas or comments.

Thanks, Jorge

#!/usr/bin/perl -l

use strict;
use warnings;

&scripts_dir('/some/dir/that/holds/scripts'); Remove the '&' there.

sub scripts_dir {

# set some vars
my $dir = shift;
my(@paths, $paths, @scripts, $scripts, $string);
my($lines, $leaf, $header, $header2);
local($_);

Declare your vars in the smallest possible scope.

# check dir can be opened for read
unless (opendir(DIR, $dir)) {
die "can't open $dir $!\n";

After die is called, nothing after this point is called..

closedir(DIR);
return;
}

#
# read dir, skip over system files and bak files
# build array of script names
# build array of full-path script names
#
foreach (readdir(DIR)) {
next if $_ eq '.' || $_ eq '..' || $_ =~ /\.bak$/;
push(@scripts, $_);
$paths = $dir."/".$_;
push(@paths, $paths);
}
closedir(DIR);

# open ouput file, format and print headers
open OUT, ">output" or die "cannot open output for writing $! ...
\n";

use 3 argument open.

open( my $out, '>', 'output' ) || die "can't open output: $!";

$header = sprintf("%-25s%-25s%s", "SCRIPT NAME", "FOUND IN",
"USAGE");
$header2 = sprintf("%-25s%-25s%s", "===========", "========",
"=====");
print OUT $header;
print OUT $header2;

Just use printf.

printf OUT "%-25s%-25s%s\n", "SCRIPT NAME", "FOUND IN", "USAGE";

or printf $out...

# loop through each script name
foreach $scripts(@scripts){ for my $script( @scripts )

# loop through each script in directory
foreach my $paths(@paths){ $path seems better than $paths

# get last leaf of script being searched --
# if it matches itself; skip
$leaf = get_leaf($paths);

my $leaf = ( split( /\//, $paths ) )[-1];

if($scripts eq $leaf) { next;} next if $scripts eq $leaf;

# open each script for searching
open F, "$paths" or die "cannot open $paths for reading
$! ...\n";

open( my $file, '<', $path ) || die "can't open $path: $!;

while(my $lines = <F>) { my $line instead of my $lines.

# -l switch in place
chomp($lines);

# search for matches to the commonly-used command
syntax
if($lines =~ /\$\($scripts / || $lines =~ /\`
$scripts /){

# format to line up with headers
$string = sprintf("%-25s%-25s%s", $scripts, $leaf,
$lines);

# print to file
print OUT $string; again printf
}
}

Should close F or $file here.

}
}
# close I/O streams
close(F);

^^^ should be just after the end of the while.. above.

close(OUT);
}

#======================
# subroutine get_leaf
#======================
sub get_leaf
{
# get arg(s)
my($pathname) = @_;

# split on leafs
my @split_pathname = split( /[\\\/]/, $pathname);

# grab last leaf
my $leaf = pop( @split_pathname );

# return bare script name (leaf)
return( $leaf );
}

__END__

Jorge · May 20, 2010

I wasn't expecting a full critique but I'll take it -- learning is
good so thanks for your effort and time.

Most of your points are obvious (once read) and some I will have to
digest.

Thanks again

J> #!/usr/bin/perl -l

J> use strict;
J> use warnings;

good

J> &scripts_dir('/some/dir/that/holds/scripts');

don't call subs with & as it isn't needed with () and it is perl4
style. there are other subtle issues that can bite you.

J> sub scripts_dir {

J> # set some vars
J> my $dir = shift;
J> my(@paths, $paths, @scripts, $scripts, $string);
J> my($lines, $leaf, $header, $header2);

don't declare vars before you need them. in many cases you can declare
then when first used.

J> local($_);

why? actually i recommend against using $_ as much as possible (it does
have its places) so you can used named vars which are easier to read and
make the code better

J> # check dir can be opened for read
J> unless (opendir(DIR, $dir)) {

use a lexical handle

unless (opendir(my $dirh, $dir)) {

J> die "can't open $dir $!\n";
J> closedir(DIR);

why close the handle if it never opened?

J> return;

you can just exit here instead.

J> }

better yet, use File::Slurp's read_dir.

J> #
J> # read dir, skip over system files and bak files
J> # build array of script names
J> # build array of full-path script names
J> #
J> foreach (readdir(DIR)) {
J> next if $_ eq '.' || $_ eq '..' || $_ =~ /\.bak$/;

read_dir skips . and .. for you.

J> push(@scripts, $_);
J> $paths = $dir."/".$_;
J> push(@paths, $paths);
J> }
J> closedir(DIR);

that can all be done so much simpler like this (untested):

my @paths = map "$dir/$_", grep !/\.bak$/, read_dir $dir ;

J> # open ouput file, format and print headers
J> open OUT, ">output" or die "cannot open output for writing$!
J> ...

ditto for lexical handles

J> \n";
J> $header = sprintf("%-25s%-25s%s", "SCRIPT NAME", "FOUND IN",
J> "USAGE");
J> $header2 = sprintf("%-25s%-25s%s", "===========", "========",
J> "=====");

you use the same sprintf format three times. put that into a variable
above so you can change it in one place.

J> print OUT $header;
J> print OUT $header2;

no need for two print calls.

print OUT $header, $header2 ;

J> # loop through each script name
J> foreach $scripts(@scripts){

foreach my $scripts (@scripts) {

that is how you can declare vars locally. and use more horizontal
whitespace. others need to read your code so think about them when you
write it. and YOU are an other too!

i noticed this below but the point is true here. don't use plural names
for singular things. $scripts has a single script name

J> # loop through each script in directory
J> foreach my $paths(@paths){

you used my there. why not in the previous loop?

J> # get last leaf of script being searched --
J> # if it matches itself; skip
J> $leaf = get_leaf($paths);
J> if($scripts eq $leaf) { next;}

slightly faster as you don't need a block entry on next. also a better
style for simple flow control like this.

next if $scripts eq $leaf ;

J> # open each script for searching
J> open F, "$paths" or die "cannot open $paths for reading

don't quote scalar vars as it can lead to subtle bugs. it is not needed here.

J> $! ...\n";

J> while(my $lines = <F>) {

$lines is a single line so make that singular. using a plural name
implies an array or array ref

J> # -l switch in place

that only affects one liners using -p or -n. no effect on regular code

J> chomp($lines);

J> # search for matches to the commonly-used command
J> syntax
J> if($lines =~ /\$\($scripts / || $lines =~ /\`
J> $scripts /){

this doesn't make sense. are you checking if the current script refers
to itself??

J> # format to line up with headers
J> $string = sprintf("%-25s%-25s%s", $scripts, $leaf,
J> $lines);

J> # print to file
J> print OUT $string;

since you just print the string, you can use printf directly.

J> }
J> }
J> }
J> }
J> # close I/O streams
J> close(F);
J> close(OUT);
J> }

J> #======================
J> # subroutine get_leaf
J> #======================
J> sub get_leaf
J> {
J> # get arg(s)
J> my($pathname) = @_;

J> # split on leafs
J> my @split_pathname = split( /[\\\/]/, $pathname);

gack!!

File::Basename does this for you and simpler.

J> # grab last leaf
J> my $leaf = pop( @split_pathname );

you could just return the popped value directly. hell, you can do that
in the split line too:

return( (split( /[\\\/]/, $pathname)[-1] );

as for speedups, i can't help much since i don't have your input
data. nothing seems oddly slow looking but your core loops are deep and
can be done better. slurping in the entire file (File::Slurp) and
scanning for those lines in one regex would be noticeably faster. and
your core logic is suspect as it doesn't seem to check for calling other
scripts from a given one.

uri

Jim Gibson · May 20, 2010

Jorge said:
I have a Linux directory full of shell scripts (~100 scripts with
average
size of ~60 lines) with many of them calling other scripts as
depencencies.

Not being the author of the scripts and with the immediate need to
become familiar with
the overall scope of this script directory, I needed to create a cross-
reference chart
that shows what script depends on what other script.

My Perl script (below) works in terms of doing what I need, however,
IMO, it appears to run a bit slow. I timed it at ~3.5 sec. Possibly
that is it's max performance but something tells me (from other Perl
script experiences) this one just isn't up to speed.

I would appreciate any ideas or comments.

Thanks, Jorge

Uri and J. have given you thorough critiques of your posted program. I
can only add a few additional suggestions.

Concerning speedup, you have implemented an O(N**2) algorithm. You are
reading each of N files N-1 times, looking for a different script name
each time you read each file. You should read each file only once, and
look for all of your script names in each line. You shouldn't exclude
the script your are reading from this search, since scripts can call
themselves. I would also not look in comments, since it is very common
to put script names in comments, and you don't want those.

For how to search for many things at once, see the advice in 'perldoc
-q many' "How do I efficiently match many regular expressions at once?"

You are concatenating the directory name and file name into a path,
then taking many steps to strip the directory name from the path to
retrieve the file name. Don't bother, since you are only working in one
directory and the $dir variable only contains one value. Add $dir to
the open statement or a temp variable:

my $filename = "$dir/$scripts";
open( my $f, '<', $filename ) ...

Then you don't need the @paths array at all.

[program snipped]

Jorge · May 20, 2010

Yeah Uri and J gave the program a good going-over so I have plenty to
look at there and now I can add your pointers to the list.

You mention one very good point that should have been obvious yet it
completely got past me ... that beinging I am searching every file for
each single pattern where I should be looking at each file only once
for all patterns.

Ihanks again, Jorge

Jorge said:
Jorge said:

I have a Linux directory full of shell scripts (~100 scripts with
average
size of ~60 lines) with many of them calling other scripts as
depencencies.

Click to expand...

Not being the author of the scripts and with the immediate need to
become familiar with
the overall scope of this script directory, I needed to create a cross-
reference chart
that shows what script depends on what other script.

Click to expand...

My Perl script (below) works in terms of doing what I need, however,
IMO, it appears to run a bit slow. I timed it at ~3.5 sec. Possibly
that is it's max performance but something tells me (from other Perl
script experiences) this one just isn't up to speed.

Click to expand...

I would appreciate any ideas or comments.

Click to expand...

Thanks, Jorge

Click to expand...

Uri and J. have given you thorough critiques of your posted program. I
can only add a few additional suggestions.

Concerning speedup, you have implemented an O(N**2) algorithm. You are
reading each of N files N-1 times, looking for a different script name
each time you read each file. You should read each file only once, and
look for all of your script names in each line. You shouldn't exclude
the script your are reading from this search, since scripts can call
themselves. I would also not look in comments, since it is very common
to put script names in comments, and you don't want those.

For how to search for many things at once, see the advice in 'perldoc
-q many' "How do I efficiently match many regular expressions at once?"

You are concatenating the directory name and file name into a path,
then taking many steps to strip the directory name from the path to
retrieve the file name. Don't bother, since you are only working in one
directory and the $dir variable only contains one value. Add $dir to
the open statement or a temp variable:

my $filename = "$dir/$scripts";
open( my $f, '<', $filename ) ...

Then you don't need the @paths array at all.

[program snipped]

Jorge · May 21, 2010

Jorge said:
Jorge said:

Yeah Uri and J gave the program a good going-over so I have plenty to
look at there and now I can add your pointers to the list.

Click to expand...

[ snip TOFU. Please stop doing that! ]

TOFU is "Top Over Fullquote Under" also known as "top posting".

You were asked nicely (twice!) to stop doing that years ago,
yet you continue.

I hereby banish you to perpetual invisibility!

*plonk*

--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.liamg\100cm.j.dat/"
The above message is a Usenet post.
I don't recall having given anyone permission to use it on a Web site.

Yes I was warned (and IMO not in a nice way) but it has been 3 years
since I posted and simply had forgotten about that incident. Sometimes
sh*t happens.

I will make every effort not to do it again.

Jorge · May 21, 2010

This is a run-once script. Any time spent optimising
it (e.g. 10 minutes) is a net loss to you of ten minutes,
bar what ever time your optimisation saves from 3.5 seconds.

10 minutes then.

BugBear- Hide quoted text -

- Show quoted text -

Point taken. I probably should have pointed out that the focus of my
interest in the program execution speed is not for just this one-time
script. Certainly you (and others) are spot on that 3 seconds is of
little consequence. However, for future use, I.E., when the day comes
that I have to deal with data 100x the size of this data I want to be
using the most efficient method(s).

Thanks, Jorge

Jorge · May 21, 2010

Jorge said:
[ Thank you for not top-posting ]

Jorge said:

Yeah Uri and J gave the program a good going-over so I have plenty to
look at there and now I can add your pointers to the list.
[ snip TOFU. Please stop doing that! ]
TOFU is "Top Over Fullquote Under" also known as "top posting".
You were asked nicely (twice!) to stop doing that years ago,
yet you continue.
I hereby banish you to perpetual invisibility!
*plonk*
--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.liamg\100cm.j.dat/"
The above message is a Usenet post.
I don't recall having given anyone permission to use it on a Web site.

Click to expand...

Click to expand...

It is bad manners to quote .sigs too...

You really should learn how to post to Usenet if you plan
to post to Usenet...

http://web.presby.edu/~nnqadmin/nnq/nquote.html

Have you seen the Posting Guidelines that are posted here frequently?

Their purpose is to help avoid this sort of thing...

Yes I was warned (and IMO not in a nice way)

Click to expand...

Here's what I said:

[ snip TOFU. Please stop doing that! ]

[ snip yet more TOFU. Please stop doing that before it is too late! ]

What was not nice about it?

I was trying to save you from getting killfiled.

I failed.

but it has been 3 years
since I posted and simply had forgotten about that incident.

Click to expand...

Not top-posting is standard netiquette, the incident would never
have happened if you just deigned to post in the accepted manner
from the get-go.

Sometimes
sh*t happens.

Click to expand...

Sometimes people are banished to eternal invisibility.

--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.liamg\100cm.j.dat/"
The above message is a Usenet post.
I don't recall having given anyone permission to use it on a Web site.- Hide quoted text -

- Show quoted text -

I'm not so thinned-skinned that I can't take criticism. On the other
hand, the one comment years ago "before it's too late", came across as
a threat and an ambiguous one at that. If it was intended to
communicate the fact that I had violated the TOS (more than once) and
I was in jeopardy of being tombstoned ... then it should have been
worded as such and not left as an open-ended statement that left me
guessing what "or else" means.

Fast forward to today ... I have pled guilty and right here and now,
apologize to the forum for the disruption and promise to not top-post
again. There's not much more I can say or do.

Before banishing me to eternal invisibility, I should at least have
the chance to face the wheel.

sln · May 21, 2010

Before banishing me to eternal invisibility, I should at least have
the chance to face the wheel.

The Pit, The Pendulem

John Bokma · May 22, 2010

Tad McClellan said:
Here's what I said:

[ snip TOFU. Please stop doing that! ]

[ snip yet more TOFU. Please stop doing that before it is too late! ]

What was not nice about it?

Maybe you should stop addressing people like they are 3 year old? I've
mentioned this before but you really DO NOT HAVE TO POST.

Ah, well, there was a time I was a bit like you. I have grown up since,
so there is hope.

Merge files	1	Aug 7, 2013
Perl Format the Output in table,by removing duplicate entries	0	Oct 11, 2012
Very Sluggish Code	4	Jun 11, 2012
Flushing and multiple pipes	2	Jul 20, 2012
Help with Hash	2	Oct 14, 2010
How to avoid searching this folder?	11	Mar 25, 2011
Searching all instances of a pattern across multi-lines	3	Dec 13, 2009
Toggle between hot filehandles question	3	Apr 11, 2007

Efficiently searching multiple files

Jorge

Uri Guttman

J. Gleixner

Jorge

Jim Gibson

Jorge

Jorge

Jorge

Jorge

sln

John Bokma

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads