Q: Perl & LWP - HTML Processing with Regular Expressions

Discussion in 'Perl Misc' started by Voitec, Nov 9, 2003.

  1. Voitec

    Voitec Guest

    Hi,

    The following refers to this URL:
    http://www.homepriceguide.com.au/snapshot/price/index.cfm?action=view&suburbORpostcode=2040

    where the last 4 digits in the link are the rotating postcode.

    I'd like to get data from this site and form trendlines.

    Here's the code:
    ************
    #!/usr/bin/perl -w
    # Real estate price movement by suburb

    use strict;
    use LWP::Simple;
    my $Postcode;

    for ($Postcode = 2040; $Postcode < 2042; $Postcode++) {
    my $html = get("
    http://www.homepriceguide.com.au/snapshot/price/index.cfm?action=view&suburbORpostcode=$Postcode")
    or die "Couldn't fetch the Suburb page.";

    $html =~ m{<td align=\"center\" class=\"tbody\">(\$[\d,]+)</td>}g;
    my $House_Suburb_Avg = $1;
    my $House_Region_Avg = $1;
    my $House_Suburb_Median = $1;
    my $House_Region_Median = $1;

    $html =~ m{<td align=\"center\" class=\"tbody\">([+|-][\d]+%)</td>}g;
    my $House_Suburb_Median_Change = $1;
    my $House_Region_Median_Change = $1;

    $html =~ m{<td align=\"center\" class=\"tbody\">(\$[\d,]+)</td>}g;
    my $Unit_Suburb_Avg = $1;
    my $Unit_Region_Avg = $1;
    my $Unit_Suburb_Median = $1;
    my $Unit_Region_Median= $1;

    $html =~ m{<td align=\"center\" class=\"tbody\">([+|-][\d]+%)</td>}g;
    my $Unit_Suburb_Median_Change = $1;
    my $Unit_Region_Median_Change = $1;

    print "Here are 2002/2003 Prices for: $Postcode. \n";
    printf "Average House Price: $House_Suburb_Avg - $House_Region_Avg\n";
    printf "Median Price: $House_Suburb_Median - $House_Region_Median\n";
    printf "Median change over last 12 months: $House_Suburb_Median_Change -
    $House_Region_Median_Change\n";
    printf "Average Unit Price: $Unit_Suburb_Avg - $Unit_Region_Avg\n";
    printf "Median Price: $Unit_Suburb_Median - $Unit_Region_Median\n";
    printf "Median change over last 12 months: $Unit_Suburb_Median_Change -
    $Unit_Region_Median_Change\n";
    print "\n";
    }
    ************

    My problem is that $1 stays the same throughout as $650,682 for Postcode
    2040 & it stays as $1,040,070 for Postcode 2041.

    I'm sure I'm doing something surprisingly silly. Any help would be
    appreciated.

    Thanks,
    Voitec
    Voitec, Nov 9, 2003
    #1

  2. On Sun, 09 Nov 2003 15:05:46 GMT
    "Voitec" <> wrote:
    <snip>
    > My problem is that $1 stays the same throughout as $650,682 for
    > Postcode 2040 & it stays as $1,040,070 for Postcode 2041.
    >
    > I'm sure I'm doing something surprisingly silly. Any help would be
    > appreciated.


    'perldoc perlre' - pay close attention to the examples in the
    document.

    *Everything* you're matching is '$1' - which is not what I think you
    want to do.

    HTH

    --
    Jim

    Copyright notice: all code written by the author in this post is
    released under the GPL. http://www.gnu.org/licenses/gpl.txt
    for more information.

    a fortune quote ...
    "She is descended from a long line that her mother listened to."
    -- Gypsy Rose Lee
    James Willmore, Nov 9, 2003
    #2

  3. On Sun, 09 Nov 2003 15:58:33 GMT
    James Willmore <> wrote:

    > On Sun, 09 Nov 2003 15:05:46 GMT
    > "Voitec" <> wrote:
    > <snip>
    > > My problem is that $1 stays the same throughout as $650,682 for
    > > Postcode 2040 & it stays as $1,040,070 for Postcode 2041.
    > >
    > > I'm sure I'm doing something surprisingly silly. Any help would be
    > > appreciated.

    >
    > 'perldoc perlre' - pay close attention to the examples in the
    > document.
    >
    > *Everything* you're matching is '$1' - which is not what I think you
    > want to do.


    Let me re-phrase. When you try matching the _same_ regex and putting
    the match into different variables, you're going to wind up with the
    same value in _all_ the variables. Yes, you changed the matches in
    small ways, but are they different enough to get _exactly_ what you
    want? At first glance, it appears this may be where your trouble is.
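
A minimal sketch of the point above, using a made-up two-cell fragment in place of the real page. The markup and the names $first, $second and $stale are assumptions for illustration only. It shows that copying $1 twice after one match yields the same value twice, and that a failed match leaves the old $1 in place:

```perl
use strict;
use warnings;

# Hypothetical fragment standing in for the fetched page (not the real site markup).
my $html = join "\n",
    '<td align="center" class="tbody">$650,682</td>',
    '<td align="center" class="tbody">$601,250</td>';

# One successful match sets $1 exactly once.
my $found  = $html =~ m{<td align="center" class="tbody">(\$[\d,]+)</td>};
my $first  = $1;
my $second = $1;    # the same capture again, not the next cell
print "copied twice: $first / $second\n";

# This pattern cannot match, and a failed match does not clear $1.
my $found_again = $html =~ m{<td class="missing">(\d+)</td>};
my $stale       = $1;    # still the value captured by the earlier match
print "after a failed match: $stale\n";
```

Both print statements show $650,682, which is why the result of each match should be tested before $1 is read.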

    You should, after giving it _some_ thought, use an HTML parsing
    module to do the task. People went to a lot of trouble to produce
    modules to do this task. They may go hungry if you don't use them :)

    HTH

    --
    Jim

    Copyright notice: all code written by the author in this post is
    released under the GPL. http://www.gnu.org/licenses/gpl.txt
    for more information.

    a fortune quote ...
    It's not that I'm afraid to die. I just don't want to be there
    when it happens. -- Woody Allen
    James Willmore, Nov 9, 2003
    #3
  4. Voitec

    Voitec Guest

    Thanks very much James and Tad. Especially Tad for your exhaustive and quite
    simple explanations.
    I have retreated redfaced to my desk and fixed up the glitches.

    The script now does, in a roundabout way, what I was after. It's getting a
    few "uninitialized value in concatenation" warnings, but that's due to no
    error checking at this stage, i.e. it warns whenever it comes across a
    non-existent value or a string instead of a digit.
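
One possible way to silence those warnings, sketched under assumptions: the fragment below is invented (it deliberately has only one percentage cell), and pad_to is a hypothetical helper, not something from the thread. It collects captures with m//g in list context and pads short results with a placeholder so that nothing undefined reaches the output:

```perl
use strict;
use warnings;

# Hypothetical fragment: two price cells but only one percentage cell,
# mimicking a page where a value is missing.
my $html = join "\n",
    '<td align="center" class="tbody">$650,682</td>',
    '<td align="center" class="tbody">$601,250</td>',
    '<td align="center" class="tbody">+12%</td>',
    '<td align="center" class="tbody">n/a</td>';

# m//g in list context returns every capture at once.
my @prices  = $html =~ m{<td align="center" class="tbody">(\$[\d,]+)</td>}g;
my @changes = $html =~ m{<td align="center" class="tbody">([+-]\d+%)</td>}g;

# Hypothetical helper: pad a short list with 'n/a' and return exactly $want items.
sub pad_to {
    my ( $want, @found ) = @_;
    push @found, 'n/a' while @found < $want;
    return @found[ 0 .. $want - 1 ];
}

my ( $suburb_avg,    $region_avg )    = pad_to( 2, @prices );
my ( $suburb_change, $region_change ) = pad_to( 2, @changes );

print "Average House Price: $suburb_avg - $region_avg\n";
print "Median change over last 12 months: $suburb_change - $region_change\n";
```

Padding at the end assumes the missing value is the last one. If a cell in the middle is absent, the remaining values shift left, which is one more reason to prefer a real HTML parser.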

    Like you said Tad, this can break easily :)
    So I'll be off to CPAN later today for a browse.

    Last night, I said to one of my friends that I'm starting to like Perl. I'll
    be getting straight into "Perl & LWP" by O'Reilly to explore its web
    capabilities.

    Voitec


    "Tad McClellan" <> wrote in message
    news:...
    > Voitec <> wrote:
    >
    > > my $Postcode;
    > > for ($Postcode = 2040; $Postcode < 2042; $Postcode++) {

    >
    >
    > This is a less error-prone way to do the same thing, and
    > it is easier to read/understand as well:
    >
    > foreach my $Postcode ( 2040 .. 2041 ) {
    >
    > and has the added bonus of not advertising that you have
    > done too much C programming. :)
    >
    >
    > > or die "Couldn't fetch the Suburb page.";

    >
    >
    > You should include the value of the $! variable in diagnostic messages.
    >
    >
    > > $html =~ m{<td align=\"center\" class=\"tbody\">(\$[\d,]+)</td>}g;
    >                        ^       ^        ^      ^
    >                        ^       ^        ^      ^
    >
    > Double quotes are not "meta" in a regular expression so you
    > do not need any of those backslashes.
    >
    >
    > > my $House_Suburb_Avg = $1;
    > > my $House_Region_Avg = $1;
    > > my $House_Suburb_Median = $1;
    > > my $House_Region_Median = $1;
    > >
    > > $html =~ m{<td align=\"center\" class=\"tbody\">([+|-][\d]+%)</td>}g;
    >                                                      ^
    >                                                      ^
    > I don't think that does what you think it does.
    >
    > It allows a vertical bar character to match, eg: |22%
    >
    >
    > > my $House_Suburb_Median_Change = $1;
    > > my $House_Region_Median_Change = $1;
    > >
    > > $html =~ m{<td align=\"center\" class=\"tbody\">(\$[\d,]+)</td>}g;
    > > my $Unit_Suburb_Avg = $1;
    > > my $Unit_Region_Avg = $1;
    > > my $Unit_Suburb_Median = $1;
    > > my $Unit_Region_Median= $1;
    > >
    > > $html =~ m{<td align=\"center\" class=\"tbody\">([+|-][\d]+%)</td>}g;

    >
    >
    > backslash-d (\d) already _is_ a character class, no need
    > for the square brackets either.
    >
    >
    > > my $Unit_Suburb_Median_Change = $1;
    > > my $Unit_Region_Median_Change = $1;

    >
    >
    > > I'm sure I'm doing something surprisingly silly. Any help would be
    >                      ^^^^^^^^^
    >                      ^^^^^^^^^ make that plural :)
    > > appreciated.

    >
    >
    > 1) You should use a module that understands HTML for processing
    > of HTML data. The HTML::TableExtract module would be helpful
    > when you want to process <table> data.
    >
    > 2) You should never use the dollar-digit variables unless you
    > have first tested to see if the match _succeeded_.
    >
    >    if ( $html =~ /some(.*)thing/ ) {  # or: while (m//g)
    >        # safe to use $1 here
    >    }
    >
    > 3) The reason the values are the same is because you are copying
    > the values from the same place ($1). If that isn't what you
    > want, then don't do that. :)
    >
    > 4) The first group of four and the second group of four match
    > the same things. If you do them all together in list context,
    > you'll get the first 8 matches. If you do them separately
    > as above, you'll get the first 4 matches twice. Same for
    > the first group of two and the second group of two:
    >
    > # m//g in list context, get first 4, discard the rest (untested)
    > my( $House_Suburb_Median_Change, $House_Region_Median_Change,
    > $Unit_Suburb_Median_Change, $Unit_Region_Median_Change ) =
    > $html =~ m{<td align="center" class="tbody">([+-]\d+%)</td>}g;
    >
    > 5) Your program is very fragile and will break easily. If the site
    > does something as simple as change to using single quotes then
    > you get the opportunity to revisit this forgotten code and figure
    > out what it does so that you can fix it. Getting HTML parsing
    > correct is very hard to do.
    >
    > 6) Note that if you do #1 above, then you don't have to deal with
    > any of the other points made above!
    >
    >
    > You are doing it the hard way. The easy way is, well, easier:
    >
    > http://search.cpan.org/~msisk/HTML-TableExtract-1.08/
    >
    >
    > --
    > Tad McClellan SGML consulting
    > Perl programming
    > Fort Worth, Texas
    Voitec, Nov 10, 2003
    #4
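
For readers following up on the HTML::TableExtract suggestion, here is a rough sketch of what that approach could look like. The table, its column headings (Measure, Suburb, Region) and the %price hash are assumptions made up for illustration; the real page layout is not shown in the thread. HTML::TableExtract is a CPAN module rather than part of core Perl, so the sketch exits quietly if it is not installed:

```perl
use strict;
use warnings;

# HTML::TableExtract comes from CPAN; stop politely if it is not available.
unless ( eval { require HTML::TableExtract; 1 } ) {
    print "HTML::TableExtract is not installed; install it from CPAN to run this sketch\n";
    exit 0;
}

# Hypothetical table. Note the mixed quoting styles on the attributes,
# which a hand-written regex would have to anticipate.
my $html = join "\n",
    q{<table>},
    q{<tr><th>Measure</th><th>Suburb</th><th>Region</th></tr>},
    q{<tr><td>Average</td><td align="center" class="tbody">$650,682</td><td align='center' class='tbody'>$601,250</td></tr>},
    q{<tr><td>Median</td><td align="center" class="tbody">$610,000</td><td align='center' class='tbody'>$580,000</td></tr>},
    q{</table>};

# Select the table by its column headings instead of by exact tag text.
my $te = HTML::TableExtract->new( headers => [ 'Measure', 'Suburb', 'Region' ] );
$te->parse($html);

# Each row comes back as an array reference, with columns in header order.
my %price;
for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        my ( $measure, $suburb, $region ) = @$row;
        $price{$measure} = { suburb => $suburb, region => $region };
    }
}

for my $measure ( sort keys %price ) {
    print "$measure: $price{$measure}{suburb} - $price{$measure}{region}\n";
}
```

Because the table is located by its headings, a switch from double to single quotes in the markup (the fragility raised in point 5) should not affect the result.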