Walking a tree and extracting info... Problems

jim.goodman

I am new to the Perl thing and i am trying to extract some data from
some web pages and am having problems.... can someone please tell me
what i am doing wrong... i think i have become a charter member of the
"idiots 'r' us" club... :o)!

this is my script... pretty simple so far, i am just trying to get one
piece of info working to start. i can traverse the directory and print
the filenames, but it only seems to get the data and do the pattern
matching from the first file in the directory....

any hints would be appreciated!

#!/usr/bin/perl
$dir="/Users/test/";
opendir(DIRECTORY, $dir) || die("Cannot open directory");
@thefiles= readdir(DIRECTORY);
closedir(DIRECTORY);

foreach $file (@thefiles) {
    unless ( ($file eq ".") || ($file eq "..") || ($file eq ".DS_Store") ) {
        open FILE, "$dir/$file" or die "Can't open $file : $!";
        while( <FILE> ) {
            s/\t//; # ignore tabs by erasing them
            next if /^(\s)*$/; # skip blank lines
            chomp; # remove trailing newline characters
            push @lines, $_; # push the data line onto the array
        }
        close FILE;
        $string = "@lines";
        $n++;
        print "$n:$file:";
        $string =~ /<span class=searchtitle><B> (.*?)<\/B><\/span><BR>/is;
        print "$1\n"; # print html page title
    }
}
 
Henry Law

I am new to the Perl thing and i am trying to extract some data from
some web pages and am having problems.... can someone please tell me
what i am doing wrong... i think i have become a charter member of the
"idiots 'r' us" club... :o)!

Nope; Perl is IMO harder to learn than some other languages. You're not
helping yourself enough though. I'll get to your problem in a moment,
but first some things you should do (a) to help you find your problems
before posting here, and (b) to get better and quicker help here.

1. Always code "use strict;" and "use warnings;". Had you done so you
might have picked up the logic problem in your code, and it will
certainly ensure that you pick up many others.
2. Code not only a test program (well done for doing that) but also
some suitable data. I had to make some in order to do the testing.
3. Learn to use the Perl debugger (perl -d yourprog.pl) and to use the
breakpoint and examine commands. Doing that I found your problem in
one pass through the program.

this is my script... pretty simple so far, i am just trying to get one
piece of info working to start. i can traverse the directory and print
the filenames, but it only seems to get the data and do the pattern
matching from the first file in the directory....

What you mean is that once it has found a file with a match it then
finds that match in all subsequent files even if they themselves don't
have it. I recommend you try to be very precise about your problem.
Actually, showing your incorrect output is very precise and saves extra
thought on your part!

#!/usr/bin/perl
$dir="/Users/test/";

If you code "use strict" you'll need to put "my $dir", and the same
elsewhere in the file.

opendir(DIRECTORY, $dir) || die("Cannot open directory");
@thefiles= readdir(DIRECTORY);

This is OK as far as it goes but assumes you have enough memory to read
in the whole directory listing. Better practice is to read the directory
entry by entry, as you've (partly) done with the file.
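
(A minimal sketch of reading a directory one entry at a time, calling
readdir() in scalar context inside a while loop. The helper name
each_plain_file and the callback argument are made up for illustration,
and skipping every dot file is an assumption that may not suit your
directory.)

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): call $callback
# once for each plain, non-dot file in $dir, one entry at a time.
sub each_plain_file {
    my ( $dir, $callback ) = @_;
    opendir( my $dh, $dir ) or die "Cannot open directory $dir: $!";
    # readdir() in scalar context returns one entry per call and undef
    # when the directory is exhausted; defined() keeps a file named "0"
    # from ending the loop early.
    while ( defined( my $entry = readdir($dh) ) ) {
        next if $entry =~ /^\./;         # skip ".", ".." and other dot files
        next unless -f "$dir/$entry";    # skip subdirectories and the like
        $callback->($entry);
    }
    closedir($dh);
    return;
}

# Print the plain files in the directory given on the command line,
# or in the current directory if none is given.
each_plain_file( $ARGV[0] // '.', sub { print "$_[0]\n" } );
```

Note that the -f test needs the directory glued back onto the name,
because readdir() returns bare names.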
closedir(DIRECTORY);

foreach $file (@thefiles) {
unless ( ($file eq ".") || ($file eq "..") || ($file eq ".DS_Store")
) {

A regex could do this (untested)

unless ( $file =~ /^\.{1,2}$|^\.DS_Store$/ ) {

... but if you could just reject all "dot" files it would be even easier

unless ( $file =~ /^\./ ) {

open FILE, "$dir/$file" or die "Can't open $file : $!";

Well done for checking the file open result. Lots of beginners don't.

while( <FILE> ) {
s/\t//; # ignore tabs by erasing them
next if /^(\s)*$/; # skip blank lines
chomp; # remove trailing newline characters
push @lines, $_; # push the data line onto the array

Again, you're assuming that you always have enough memory for the whole
file.

Your problem is here. Because you didn't code "use strict" you aren't
forcing yourself to take control of the scope of your variables. Perl
has allocated "@lines" once for the whole program; when you process the
next file in the directory you push its lines onto the end; the match
for the HTML title then fires every time. If you'd coded "my @lines"
just before the "while( <FILE> )" line then you'd have got a new "@lines"
each time and your program would have worked as you wanted it to.

}
close FILE;
$string = "@lines";

This is ugly, and produces a slap on the wrist from Perl when you code
"use strict; use warnings". Not that it doesn't give you what you want,
though ... it's up to you as to whether you want to write with good style.

$n++;

When "strict" forces you to code "my $n" then you'll have to put it
outside the directory-read loop.

print "$n:$file:";
$string =~ /<span class=searchtitle><B> (.*?)<\/B><\/span><BR>/is;
print "$1\n"; # print html page title

Always check the extracted text. When I fixed your program so it only
examined the text of the current file I got errors from this statement
every time it failed to find a match.

Here's a minimally-fixed version of your program which "works", in the
sense that it finds the HTML titles. It still needs quite a lot of
cleaning up and more Perlish idiom.

#!/usr/bin/perl
# Jim Goodman's problem April 9

use strict; use warnings; # I added this

#$dir="/Users/test/";
my $dir="F:/scratch"; # My directory instead of his

opendir(DIRECTORY, $dir) || die("Cannot open directory");
my @thefiles= readdir(DIRECTORY);
closedir(DIRECTORY);

my $n;
foreach my $file (@thefiles) {
    unless ( ($file eq ".") || ($file eq "..") || ($file eq ".DS_Store") ) {
        open FILE, "$dir/$file" or die "Can't open $file : $!";
        my @lines = ();
        while( <FILE> ) {
            s/\t//; # ignore tabs by erasing them
            next if /^(\s)*$/; # skip blank lines
            chomp; # remove trailing newline characters
            push @lines, $_; # push the data line onto the array
        }
        close FILE;
        my $string = "@lines";
        $n++;
        print "$n:$file:";
        $string =~ /<span class=searchtitle><B> (.*?)<\/B><\/span><BR>/is;
        print "$1\n" if $1; # print html page title
    }
}

But I think I'd feel inclined to use "grep" to find the files that had the
relevant string in them, and pipe the output into a much smaller Perl
program to find the HTML titles and print them out. You'd lose the
incrementing count of the files, though.
 
jim.goodman

thanks a million.... i want you to know that although the wanted result
was a bit different than what you suggested, your suggestions still
solved my problem. You should also know that i have taken your
suggestions into account and have cleaned up my code, and next time i
will include a sample input file and the output... i wanted to attach
it all and had prepared a nice little archive but... :o).

again, thanks a million on resolving what was such a simple issue, i
was just not catching it :o).... and if you think i should be learning
something other than perl, please speak up....!
 
Henry Law

again, thanks a million on resolving what was such a simple issue, i
was just not catching it :o).... and if you think i should be learning
something other than perl, please speak up....!

Absolutely not; once you're familiar with it Perl is easy and powerful.
It's just that for some reason (that I can't explain) it seems to me to
be harder to move from "writing random Perl code" to "writing good,
neat, compact-yet-understandable Perl" than it is to make the same
transition for other languages. Keep posting here - in a way that
helps you and helps us - and you'll get the hang of it.
 
Henry Law

suggestions into account and have cleaned up my code, and next time i

By the way, walking a directory tree is _exactly_ what the File::Find
module does, and for many applications it's better. Have a look at it.
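
(A minimal sketch of what that could look like with File::Find, which
ships with Perl. The helper name files_under is made up for
illustration, and skipping files whose names start with a dot is an
assumption carried over from the .DS_Store test in the script.)

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: return the path of every plain file below $top
# whose name does not start with a dot.
sub files_under {
    my ($top) = @_;
    my @found;
    find(
        sub {
            # Inside the callback $_ holds the bare file name and the
            # current directory is the one containing it.
            return if /^\./;    # skip dot files (does not prune dot directories)
            return unless -f;   # plain files only
            push @found, $File::Find::name;    # path including $top
        },
        $top
    );
    return @found;
}

# List the files below the directory given on the command line,
# or below the current directory if none is given.
print "$_\n" for files_under( $ARGV[0] // '.' );
```

Unlike a single readdir() pass, this descends into subdirectories, so
it only fits if you really do want the whole tree.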
 
Tad McClellan

I am new to the perl thing


You should have a look at the Posting Guidelines that are
posted here frequently (even though you have composed a
very good first post).

and i am trying to extract some data from
some web pages and am having problems.... can someone please tell me
what i am doing wrong...


Putting the lines from the 1st file into @lines, then tacking
on the lines from the 2nd file, then the 3rd ...

i can traverse the directory and print
the filenames, but it only seems to get the data and do the pattern
matching from the first file in the directory....

any hints would be appreciated!

#!/usr/bin/perl

#!/usr/bin/perl
use warnings;
use strict;

$dir="/Users/test/";

my $dir = '/Users/test/';

foreach $file (@thefiles) {


Since you want a new @lines array for every iteration of this loop, and
since you will now be using "strict" forevermore <g>, put a declaration
here so that you will get a new @lines array each time through
the foreach loop:

my @lines;
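
(A small self-contained sketch of why the placement of that declaration
matters. The file names and contents are made-up stand-ins for real
files, and the helper matching_files is hypothetical; only the regex is
taken from the original script.)

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up stand-ins for three files; only the first has the title markup.
my @files = (
    [ 'one.html',   '<span class=searchtitle><B> First</B></span><BR>' ],
    [ 'two.html',   'no title here' ],
    [ 'three.html', 'nor here' ],
);

# Hypothetical helper: return the names of the files that appear to match.
# With $shared true, one array is reused for every file (the bug);
# otherwise each file gets a fresh array (the fix).
sub matching_files {
    my ($shared) = @_;
    my ( @all_lines, @hits );
    for my $file (@files) {
        my ( $name, @content ) = @$file;
        my @lines;    # declared inside the loop: empty on every iteration
        my $target = $shared ? \@all_lines : \@lines;
        push @$target, @content;
        my $string = "@$target";
        push @hits, $name
            if $string =~ /<span class=searchtitle><B> (.*?)<\/B><\/span><BR>/is;
    }
    return @hits;
}

print "shared \@lines:   @{[ matching_files(1) ]}\n";
print "per-file \@lines: @{[ matching_files(0) ]}\n";
```

With the shared array every file after one.html still "matches",
because the first file's title line is still sitting in the array.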

unless ( ($file eq ".") || ($file eq "..") || ($file eq ".DS_Store")


If you do it this way instead:

next if ($file eq ".") || ($file eq "..") || ($file eq ".DS_Store");

then you can save a level of indent.

while( <FILE> ) {
s/\t//; # ignore tabs by erasing them
next if /^(\s)*$/; # skip blank lines
chomp; # remove trailing newline characters
push @lines, $_; # push the data line onto the array
}


You eventually push() all of the lines from all of the files into @lines.

(the matching line from file 1 is in there every time.)

$string = "@lines";


This adds a space character between each line. Is that what you wanted?
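
(A short sketch of the difference, assuming a title that happens to be
split across two lines. The sample data is made up. Interpolating an
array in double quotes joins its elements with the list separator,
which is a single space by default.)

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample: one title split over two input lines.
my @lines = ( '<span class=searchtitle><B> My', 'Title</B></span><BR>' );

my $interpolated = "@lines";            # elements joined with a space
my $explicit     = join ' ', @lines;    # the same thing, spelled out
my $no_gap       = join '',  @lines;    # no separator at all

print "$interpolated\n";
print "$no_gap\n";
```

Spelling it out with join() makes the choice of separator visible to
whoever reads the code next.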

$string =~ /<span class=searchtitle><B> (.*?)<\/B><\/span><BR>/is;
print "$1\n"; # print html page title


You should never use the dollar-digit variables unless you have
first tested to ensure that the match _succeeded_:


if ( $string =~ /<span class=searchtitle><B> (.*?)<\/B><\/span><BR>/is ){
    print "$1\n"; # print html page title
}

$string =~ /<span class=searchtitle><B> (.*?)<\/B><\/span><BR>/is;
                                       ^
                                       ^
                                       ^
Is there really a space there in the string you are matching against?
 
Tad McClellan

to get better and quicker help here.

1. Always code "use strict;" and "use warnings"; had you done so you
might have picked up the logic problem in your code, but it will
certainly ensure that you pick up many others.
2. Code not only a test program (well done for doing that) but also
some suitable data. I had to make some in order to do the testing.


Amen brother!

3. Learn to use the Perl debugger (perl -d yourprog.pl) and to use the
breakpoint and examine commands. Doing that I found your problem in
one pass through the program.


I've needed to use the Perl debugger about a dozen times in
over 10 years of daily Perl coding.

Carefully placed print() statements usually do it for me (warn()
statements actually, because STDERR is not buffered).

I'd not spend a lot of my limited time on the debugger for a while.

A regex could do this (untested)

unless ( $file =~ /^\.{1,2}$|^\.DS_Store$/ ) {


The code went from being easy to figure out, to requiring a bit
of analysis.

I would never use your regex alternative in a case like this.

Well done for checking the file open result. Lots of beginners don't.


And even more well done for remembering to glue the directory
part back onto the filename from readdir().
 
