Walking a tree and extracting info... Problems

Discussion in 'Perl Misc' started by jim.goodman@gmail.com, Apr 9, 2006.

  1. Guest

    I am new to the perl thing and i am trying to extract some data from
    some web pages and am having problems.... can someone please tell me
    what i am doing wrong... i think i have become a charter member of the
    "idiots 'r' us" club... :eek:)!

    this is my script... pretty simple so far, i am just trying to get one
    piece of info working to start. i can traverse the directory and print
    the filenames, but it only seems to get the data and do the pattern
    matching from the first file in the directory....

    any hints would be appreciated!

    #!/usr/bin/perl
    $dir="/Users/test/";
    opendir(DIRECTORY, $dir) || die("Cannot open directory");
    @thefiles= readdir(DIRECTORY);
    closedir(DIRECTORY);

    foreach $file (@thefiles) {
        unless ( ($file eq ".") || ($file eq "..") || ($file eq ".DS_Store") ) {
            open FILE, "$dir/$file" or die "Can't open $file : $!";
            while( <FILE> ) {
                s/\t//; # ignore tabs by erasing them
                next if /^(\s)*$/; # skip blank lines
                chomp; # remove trailing newline characters
                push @lines, $_; # push the data line onto the array
            }
            close FILE;
            $string = "@lines";
            $n++;
            print "$n:$file:";
            $string =~ /<span class=searchtitle><B> (.*?)<\/B><\/span><BR>/is;
            print "$1\n"; # print html page title
        }
    }
    , Apr 9, 2006
    #1

  2. Henry Law Guest

    wrote:
    > I am new to the perl thing and i am trying to extract some data from
    > some web pages and am having problems.... can someone please tell me
    > what i am doing wrong... i think i have become a charter member of the
    > "idiots 'r' us" club... :eek:)!


    Nope; Perl is IMO harder to learn than some other languages. You're not
    helping yourself enough though. I'll get to your problem in a moment,
    but first some things you should do (a) to help you find your problems
    before posting here, and (b) to get better and quicker help here.

    1. Always code "use strict;" and "use warnings"; had you done so you
    might have picked up the logic problem in your code, but it will
    certainly ensure that you pick up many others.
    2. Code not only a test program (well done for doing that) but also
    some suitable data. I had to make some in order to do the testing.
    3. Learn to use the Perl debugger (perl -d yourprog.pl) and to use the
    breakpoint and examine commands. Doing that I found your problem in
    one pass through the program.

    > this is my script... pretty simple so far, i am just trying to get one
    > piece of info working to start. i can traverse the directory and print
    > the filenames, but it only seems to get the data and do the pattern
    > matching from the first file in the directory....


    What you mean is that once it has found a file with a match it then
    finds that match in all subsequent files even if they themselves don't
    have it. I recommend you try to be very precise about your problem.
    Actually, showing your incorrect output is very precise and saves extra
    thought on your part!

    > #!/usr/bin/perl
    > $dir="/Users/test/";


    If you code "use strict" you'll need to put "my $dir", and the same
    elsewhere in the file.

    > opendir(DIRECTORY, $dir) || die("Cannot open directory");
    > @thefiles= readdir(DIRECTORY);


    This is OK as far as it goes but assumes you have enough memory to read
    in the whole directory listing. Better practice is to read the directory
    entry by entry, as you've (partly) done with the file.
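
    A minimal sketch of reading a directory one entry at a time, with the
    "reject all dot files" test folded in. The helper name
    each_plain_file() and its callback argument are made up for
    illustration; only opendir, readdir and closedir are real Perl
    built-ins here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Call $callback once for every plain, non-dot file in $dir.
# each_plain_file() is a hypothetical helper, not part of any module.
sub each_plain_file {
    my ( $dir, $callback ) = @_;
    opendir( my $dh, $dir ) or die "Cannot open directory $dir: $!";

    # In scalar context readdir returns one entry per call, so the whole
    # listing is never built up in memory. The defined() test matters
    # because a file could legitimately be named "0".
    while ( defined( my $entry = readdir $dh ) ) {
        next if $entry =~ /^\./;         # skips ".", ".." and .DS_Store
        next unless -f "$dir/$entry";    # skips subdirectories
        $callback->($entry);
    }
    closedir $dh;
    return;
}

# Usage: print the plain files of the directory given on the command
# line, or of the current directory if none was given.
each_plain_file( $ARGV[0] // '.', sub { print "$_[0]\n" } );
```

    Passing a callback is only one way to structure this; putting the
    per-file work straight into the while loop would do just as well.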

    > closedir(DIRECTORY);
    >
    > foreach $file (@thefiles) {
    > unless ( ($file eq ".") || ($file eq "..") || ($file eq ".DS_Store")
    > ) {


    A regex could do this (untested)

    unless ( $file =~ /^\.{1,2}$|^\.DS_Store$/ ) {

    .... but if you could just reject all "dot" files it would be even easier

    unless ( $file =~ /^\./ )

    > open FILE, "$dir/$file" or die "Can't open $file : $!";


    Well done for checking the file open result. Lots of beginners don't.

    > while( <FILE> ) {
    > s/\t//; # ignore tabs by erasing them
    > next if /^(\s)*$/; # skip blank lines
    > chomp; # remove trailing newline characters
    > push @lines, $_; # push the data line onto the array


    Again, you're assuming that you always have enough memory for the whole
    file.

    Your problem is here. Because you didn't code "use strict" you aren't
    forcing yourself to take control of the scope of your variables. Perl
    has allocated "@lines" once for the whole program; when you process the
    next file in the directory you push the lines on the bottom; the match
    for the HTML title then fires every time. If you'd coded "my @lines"
    just before the "while (<FILE>)" line then you'd have got a new "@lines"
    each time and your program would have worked as you wanted it to.
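
    The difference can be seen without touching any real files. This is
    only a sketch: the hash %pages and both sub names are invented stand-ins
    for the directory and the read loop.

```perl
use strict;
use warnings;

# Two made-up "files", each represented as a list of lines.
my %pages = (
    'one.html' => [ '<B> First title</B>', 'body' ],
    'two.html' => ['no title here'],
);

# Buggy shape: one array is shared by every iteration and keeps growing,
# so lines from earlier files are still there when later files are examined.
sub count_lines_shared {
    my @lines;    # declared once, outside the loop
    my @counts;
    for my $name ( sort keys %pages ) {
        push @lines, @{ $pages{$name} };
        push @counts, scalar @lines;
    }
    return @counts;
}

# Fixed shape: "my" inside the loop gives a fresh, empty array each time.
sub count_lines_fresh {
    my @counts;
    for my $name ( sort keys %pages ) {
        my @lines;
        push @lines, @{ $pages{$name} };
        push @counts, scalar @lines;
    }
    return @counts;
}

print "shared: @{[ count_lines_shared() ]}\n";    # shared: 2 3
print "fresh:  @{[ count_lines_fresh() ]}\n";     # fresh:  2 1
```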

    > }
    > close FILE;
    > $string = "@lines";


    This is ugly, and produces a slap on the wrist from Perl when you code
    "use strict; use warnings". Not that it doesn't give you what you want,
    though ... it's up to you as to whether you want to write with good style.

    > $n++;


    When "strict" forces you to code "my $n" then you'll have to put it
    outside the directory-read loop.

    > print "$n:$file:";
    > $string =~ /<span class=searchtitle><B> (.*?)<\/B><\/span><BR>/is;
    > print "$1\n"; # print html page title


    Always check the extracted text. When I fixed your program so it only
    examined the text of the current file I got errors from this statement
    every time it failed to find a match.
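
    One way to check, sketched under assumptions: extract_title() is a
    made-up helper name and the two sample strings are invented test data;
    the pattern itself is the one from the script.

```perl
use strict;
use warnings;

# Return the captured title, or nothing at all when the pattern fails.
# extract_title() is a hypothetical helper used only for illustration.
sub extract_title {
    my ($html) = @_;
    if ( $html =~ /<span class=searchtitle><B> (.*?)<\/B><\/span><BR>/is ) {
        return $1;    # only read $1 after a successful match
    }
    return;           # undef in scalar context
}

for my $html (
    '<span class=searchtitle><B> Widgets</B></span><BR>',
    '<p>nothing to see</p>',
  )
{
    my $title = extract_title($html);
    print( defined($title) ? "title: $title\n" : "no title found\n" );
}
```

    Testing the result with defined() rather than plain truth means a
    title of "0" would still be printed.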

    Here's a minimally-fixed version of your program which "works", in the
    sense that it finds the HTML titles. It still needs quite a lot of
    cleaning up and more Perlish idiom.

    #!/usr/bin/perl
    # Jim Goodman's problem April 9

    use strict; use warnings; # I added this

    #$dir="/Users/test/";
    my $dir="F:/scratch"; # My directory instead of his

    opendir(DIRECTORY, $dir) || die("Cannot open directory");
    my @thefiles= readdir(DIRECTORY);
    closedir(DIRECTORY);

    my $n;
    foreach my $file (@thefiles) {
        unless ( ($file eq ".") || ($file eq "..") || ($file eq ".DS_Store") ) {
            open FILE, "$dir/$file" or die "Can't open $file : $!";
            my @lines = ();
            while( <FILE> ) {
                s/\t//; # ignore tabs by erasing them
                next if /^(\s)*$/; # skip blank lines
                chomp; # remove trailing newline characters
                push @lines, $_; # push the data line onto the array
            }
            close FILE;
            my $string = "@lines";
            $n++;
            print "$n:$file:";
            $string =~ /<span class=searchtitle><B> (.*?)<\/B><\/span><BR>/is;
            print "$1\n" if $1; # print html page title
        }
    }

    But I think I'd feel inclined to use "grep" to find the files that had the
    relevant string in them, and pipe the output into a much smaller Perl
    program to find the HTML titles and print them out. You'd lose the
    incrementing count of the files, though.

    --

    Henry Law <>< Manchester, England
    Henry Law, Apr 9, 2006
    #2

  3. Guest

    thanks a million.... i want you to know that although the wanted result
    was a bit different than what you suggested, your suggestions still
    solved my problem. You should also know that i have taken your
    suggestions into account and have cleaned up my code, and next time i
    will include a sample input file and the output... i wanted to attach
    it all and had prepared a nice little archive but... :eek:).

    again, thanks a million on resolving what was such a simple issue, i
    just wasn't catching it :eek:).... and if you think i should be learning
    something other than perl, please speak up....!
    , Apr 9, 2006
    #3
  4. Henry Law Guest

    wrote:

    > again, thanks a million on resolving what was such a simple issue, i
    > just wasn't catching it :eek:).... and if you think i should be learning
    > something other than perl, please speak up....!


    Absolutely not; once you're familiar with it Perl is easy and powerful.
    It's just that for some reason (that I can't explain) it seems to me to
    be harder to move from "writing random Perl code" to "writing good,
    neat, compact-yet-understandable Perl" than it is to make the same
    transition for other languages. Keep posting here - in a way that
    helps you and helps us - and you'll get the hang of it.

    --

    Henry Law <>< Manchester, England
    Henry Law, Apr 9, 2006
    #4
  5. Henry Law Guest

    wrote:
    > suggestions into account and have cleaned up my code, and next time i


    By the way, walking a directory tree is _exactly_ what the File::Find
    module does, and for many applications it's better. Have a look at it.
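
    A small sketch of what that might look like. File::Find and its
    find() function are real (the module ships with Perl); the wrapper
    name find_html_files() and the choice to keep only .htm/.html files
    are assumptions made for the example.

```perl
use strict;
use warnings;
use File::Find;

# Collect the paths of all .htm/.html files below $top, at any depth.
# find_html_files() is an illustrative name; File::Find provides find().
sub find_html_files {
    my ($top) = @_;
    my @found;
    find(
        sub {
            # Inside the callback $_ holds the bare filename and
            # $File::Find::name holds the path including directories.
            return unless -f $_;
            return unless /\.html?\z/i;
            push @found, $File::Find::name;
        },
        $top,
    );
    @found = sort @found;
    return @found;
}

# Usage: list HTML files under the directory given on the command
# line, or under the current directory if none was given.
print "$_\n" for find_html_files( $ARGV[0] // '.' );
```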

    --

    Henry Law <>< Manchester, England
    Henry Law, Apr 9, 2006
    #5
  6. Tad McClellan Guest

    wrote:

    > I am new to the perl thing



    You should have a look at the Posting Guidelines that are
    posted here frequently (even though you have composed a
    very good first post).


    > and i am trying to extract some data from
    > some web pages and am having problems.... can someone please tell me
    > what i am doing wrong...



    Putting the lines from the 1st file into @lines, then tacking
    on the lines from the 2nd file, then the 3rd ...


    > i can traverse the directory and print
    > the filenames, but it only seems to get the data and do the pattern
    > matching from the first file in the directory....
    >
    > any hints would be appreciated!
    >
    > #!/usr/bin/perl


    #!/usr/bin/perl
    use warnings;
    use strict;


    > $dir="/Users/test/";


    my $dir = '/Users/test/';

    > foreach $file (@thefiles) {



    Since you want a new @lines array for every iteration of this loop, and
    since you will now be using "strict" forevermore <g>, put a declaration
    here so that you will get a new @lines array each time through
    the foreach loop:

    my @lines;


    > unless ( ($file eq ".") || ($file eq "..") || ($file eq ".DS_Store")



    If you do it this way instead:

    next if ($file eq ".") || ($file eq "..") || ($file eq ".DS_Store");

    then you can save a level of indent.


    > while( <FILE> ) {
    > s/\t//; # ignore tabs by erasing them
    > next if /^(\s)*$/; # skip blank lines
    > chomp; # remove trailing newline characters
    > push @lines, $_; # push the data line onto the array
    > }



    You eventually push() all of the lines from all of the files into @lines.

    (the matching line from file 1 is in there every time.)


    > $string = "@lines";



    This adds space characters between each line. Is that what you wanted?
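
    A quick illustration with two invented lines; join() lets you pick
    the separator yourself instead of relying on interpolation.

```perl
use strict;
use warnings;

# Two made-up lines, as they might sit in the array after chomp.
my @lines = ( '<B> Page', 'title</B>' );

my $spaced = "@lines";             # elements joined with $", a space by default
my $joined = join '', @lines;      # no separator at all
my $kept   = join "\n", @lines;    # line boundaries preserved

print "[$spaced]\n";    # [<B> Page title</B>]
print "[$joined]\n";    # [<B> Pagetitle</B>]
```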


    > $string =~ /<span class=searchtitle><B> (.*?)<\/B><\/span><BR>/is;
    > print "$1\n"; # print html page title



    You should never use the dollar-digit variables unless you have
    first tested to ensure that the match _succeeded_


    if ( $string =~ /<span class=searchtitle><B> (.*?)<\/B><\/span><BR>/is ){
    print "$1\n"; # print html page title
    }

    > $string =~ /<span class=searchtitle><B> (.*?)<\/B><\/span><BR>/is;

                                             ^
                                             ^
                                             ^
    Is there really a space there in the string you are matching against?


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Apr 9, 2006
    #6
  7. Tad McClellan Guest

    Henry Law wrote:
    > wrote:
    >> I am new to the perl thing and i am trying to extract some data from
    >> some web pages and am having problems.... can someone please tell me
    >> what i am doing wrong... i think i have become a charter member of the
    >> "idiots 'r' us" club... :eek:)!


    > to get better and quicker help here.
    >
    > 1. Always code "use strict;" and "use warnings"; had you done so you
    > might have picked up the logic problem in your code, but it will
    > certainly ensure that you pick up many others.
    > 2. Code not only a test program (well done for doing that) but also
    > some suitable data. I had to make some in order to do the testing.



    Amen brother!


    > 3. Learn to use the Perl debugger (perl -d yourprog.pl) and to use the
    > breakpoint and examine commands. Doing that I found your problem in
    > one pass through the program.



    I've needed to use the Perl debugger about a dozen times in
    over 10 years of daily Perl coding.

    Carefully placed print() statements usually do it for me (warn()
    statements actually, because STDERR is not buffered).

    I'd not spend a lot of my limited time on the debugger for a while.
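
    For this particular bug, one warn() in the right place would
    probably have been enough. A sketch with invented file names, where a
    string stands in for the real read loop:

```perl
use strict;
use warnings;

my @lines;    # deliberately declared outside the loop, as in the original
for my $file (qw(one.html two.html)) {
    push @lines, "line from $file";    # stand-in for reading the file

    # warn() writes to STDERR, so the message shows up immediately even
    # when STDOUT is being buffered or redirected.
    warn "DEBUG: after $file, \@lines holds ", scalar(@lines), " element(s)\n";
}
```

    Seeing the element count climb from file to file points straight at
    the array that is never emptied.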


    >> foreach $file (@thefiles) {
    >> unless ( ($file eq ".") || ($file eq "..") || ($file eq ".DS_Store")
    >> ) {

    >
    > A regex could do this (untested)
    >
    > unless ( $file =~ /^\.{1,2}$|^\.DS_Store$/ ) {



    The code went from being easy to figure out, to requiring a bit
    of analysis.

    I would never use your regex alternative in a case like this.


    >> open FILE, "$dir/$file" or die "Can't open $file : $!";

    >
    > Well done for checking the file open result. Lots of beginners don't.



    And even more well done for remembering to glue the directory
    part back onto the filename from readdir().


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Apr 10, 2006
    #7