Q: Perl & LWP - HTML Processing with Regular Expressions

Voitec

Hi,

The following refers to this URL:
http://www.homepriceguide.com.au/snapshot/price/index.cfm?action=view&suburbORpostcode=2040

where the last 4 digits of the link are the postcode, which changes from request to request.

I'd like to get data from this site and form trendlines.

Here's the code:
************
#!/usr/bin/perl -w
# Real estate price movement by suburb

use strict;
use LWP::Simple;
my $Postcode;

for ($Postcode = 2040; $Postcode < 2042; $Postcode++) {
my $html = get("http://www.homepriceguide.com.au/snapshot/price/index.cfm?action=view&suburbORpostcode=$Postcode")
or die "Couldn't fetch the Suburb page.";

$html =~ m{<td align=\"center\" class=\"tbody\">(\$[\d,]+)</td>}g;
my $House_Suburb_Avg = $1;
my $House_Region_Avg = $1;
my $House_Suburb_Median = $1;
my $House_Region_Median = $1;

$html =~ m{<td align=\"center\" class=\"tbody\">([+|-][\d]+%)</td>}g;
my $House_Suburb_Median_Change = $1;
my $House_Region_Median_Change = $1;

$html =~ m{<td align=\"center\" class=\"tbody\">(\$[\d,]+)</td>}g;
my $Unit_Suburb_Avg = $1;
my $Unit_Region_Avg = $1;
my $Unit_Suburb_Median = $1;
my $Unit_Region_Median= $1;

$html =~ m{<td align=\"center\" class=\"tbody\">([+|-][\d]+%)</td>}g;
my $Unit_Suburb_Median_Change = $1;
my $Unit_Region_Median_Change = $1;

print "Here are 2002/2003 Prices for: $Postcode. \n";
printf "Average House Price: $House_Suburb_Avg - $House_Region_Avg\n";
printf "Median Price: $House_Suburb_Median - $House_Region_Median\n";
printf "Median change over last 12 months: $House_Suburb_Median_Change - $House_Region_Median_Change\n";
printf "Average Unit Price: $Unit_Suburb_Avg - $Unit_Region_Avg\n";
printf "Median Price: $Unit_Suburb_Median - $Unit_Region_Median\n";
printf "Median change over last 12 months: $Unit_Suburb_Median_Change - $Unit_Region_Median_Change\n";
print "\n";
}
************

My problem is that $1 stays the same throughout as $650,682 for Postcode
2040 & it stays as $1,040,070 for Postcode 2041.

I'm sure I'm doing something surprisingly silly. Any help would be
appreciated.

Thanks,
Voitec
 
James Willmore

On Sun, 09 Nov 2003 15:05:46 GMT
Voitec said:
My problem is that $1 stays the same throughout as $650,682 for
Postcode 2040 & it stays as $1,040,070 for Postcode 2041.

I'm sure I'm doing something surprisingly silly. Any help would be
appreciated.

'perldoc perlre' - pay close attention to the examples in the
document.

*Everything* you're matching is '$1' - which is not what I think you
want to do.

HTH

--
Jim

Copyright notice: all code written by the author in this post is
released under the GPL. http://www.gnu.org/licenses/gpl.txt
for more information.

a fortune quote ...
"She is descended from a long line that her mother listened to."
-- Gypsy Rose Lee
 
James Willmore

On Sun, 09 Nov 2003 15:05:46 GMT


'perldoc perlre' - pay close attention to the examples in the
document.

*Everything* you're matching is '$1' - which is not what I think you
want to do.

Let me re-phrase. When you try matching the _same_ regex and putting
the match into different variables, you're going to wind up with the
same value in _all_ the variables. Yes, you changed the matches in
small ways, but are they different enough to get _exactly_ what you
want? At first glance, it appears this may be where your trouble is.

You should, after giving it _some_ thought, use an HTML parsing
module to do the task. People went to a lot of trouble to produce
modules to do this task. They may go hungry if you don't use them :)

HTH

--
Jim

Copyright notice: all code written by the author in this post is
released under the GPL. http://www.gnu.org/licenses/gpl.txt
for more information.

a fortune quote ...
It's not that I'm afraid to die. I just don't want to be there
when it happens. -- Woody Allen
 
Voitec

Thanks very much James and Tad. Especially Tad for your exhaustive and quite
simple explanations.
I have retreated redfaced to my desk and fixed up the glitches.

The script now does, in a roundabout way, what I was after. It's getting a
few "uninitialized value in concatenation" warnings, but that's due to no error
checking at this stage, i.e. it warns whenever it comes across a
nonexistent value or a string instead of a digit.

Like you said Tad, this can break easily :)
So I'll be off to CPAN later today for a browse.

Last night, I said to one of my friends that I'm starting to like Perl. I'll
be getting straight into "Perl & LWP" by O'Reilly to explore its web
capabilities.

Voitec


Tad McClellan said:
Voitec said:
my $Postcode;
for ($Postcode = 2040; $Postcode < 2042; $Postcode++) {


This is a less error-prone way to do the same thing, and
it is easier to read/understand as well:

foreach my $Postcode ( 2040 .. 2041 ) {

and has the added bonus of not advertising that you have
done too much C programming. :)

or die "Couldn't fetch the Suburb page.";


You should include the value of the $! variable in diagnostic messages.
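
A note on the diagnostic above: as far as I know, get() from LWP::Simple just returns undef on failure and is not documented to set $!, so there may be nothing useful to print. One way to get a reason is to switch to LWP::UserAgent and report the response's status line. A minimal sketch, where fetch_or_die is a hypothetical helper name and the 30 second timeout is an arbitrary choice:

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper (not from the original script): fetch a URL and
# die with the HTTP status line if the request did not succeed.
sub fetch_or_die {
    my ($url) = @_;

    my $ua       = LWP::UserAgent->new( timeout => 30 );   # timeout is an assumption
    my $response = $ua->get($url);

    die "Couldn't fetch $url: " . $response->status_line . "\n"
        unless $response->is_success;

    return $response->decoded_content;
}
```

Inside the loop this would replace the get(...) or die line with something like: my $html = fetch_or_die("http://www.homepriceguide.com.au/snapshot/price/index.cfm?action=view&suburbORpostcode=$Postcode");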

$html =~ m{<td align=\"center\" class=\"tbody\">(\$[\d,]+)</td>}g;
                     ^       ^        ^      ^
                     ^       ^        ^      ^

Double quotes are not "meta" in a regular expression so you
do not need any of those backslashes.

my $House_Suburb_Avg = $1;
my $House_Region_Avg = $1;
my $House_Suburb_Median = $1;
my $House_Region_Median = $1;

$html =~ m{<td align=\"center\" class=\"tbody\">([+|-][\d]+%)</td>}g;
                                                   ^
                                                   ^
I don't think that does what you think it does.

It allows a vertical bar character to match, eg: |22%

my $House_Suburb_Median_Change = $1;
my $House_Region_Median_Change = $1;

$html =~ m{<td align=\"center\" class=\"tbody\">(\$[\d,]+)</td>}g;
my $Unit_Suburb_Avg = $1;
my $Unit_Region_Avg = $1;
my $Unit_Suburb_Median = $1;
my $Unit_Region_Median= $1;

$html =~ m{<td align=\"center\" class=\"tbody\">([+|-][\d]+%)</td>}g;


backslash-d (\d) already _is_ a character class, no need
for the square brackets either.
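
To make the three regex remarks above concrete (unneeded backslashes, the stray vertical bar, the redundant brackets around \d), here is a small comparison of the posted pattern and a cleaned-up one. The table cells are made up for illustration and are not taken from the real page:

```perl
use strict;
use warnings;

# Pattern as posted: '|' is a literal inside [...], and \" and [\d] are redundant.
my $posted  = qr{<td align=\"center\" class=\"tbody\">([+|-][\d]+%)</td>};

# Cleaned up: only '+' or '-' may lead, quotes are unescaped, \d is used directly.
my $cleaned = qr{<td align="center" class="tbody">([+-]\d+%)</td>};

# Made-up cells; the last one should not count as a percentage change.
my @cells = (
    '<td align="center" class="tbody">+12%</td>',
    '<td align="center" class="tbody">-3%</td>',
    '<td align="center" class="tbody">|22%</td>',
);

for my $cell (@cells) {
    my $old = $cell =~ $posted  ? 'match' : 'no match';
    my $new = $cell =~ $cleaned ? 'match' : 'no match';
    print "$cell  posted: $old  cleaned: $new\n";
}
```

Both patterns accept the +12% and -3% cells; only the posted one also accepts |22%.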

my $Unit_Suburb_Median_Change = $1;
my $Unit_Region_Median_Change = $1;

I'm sure I'm doing something surprisingly silly. Any help would be
                   ^^^^^^^^^
                   ^^^^^^^^^ make that plural :)
appreciated.


1) You should use a module that understands HTML for processing
of HTML data. The HTML::TableExtract module would be helpful
when you want to process <table> data.

2) You should never use the dollar-digit variables unless you
have first tested to see if the match _succeeded_.

if ( $html =~ /some(.*)thing/ ) { # or: while (m//g)
# safe to use $1 here
}

3) The reason the values are the same is because you are copying
the values from the same place ($1). If that isn't what you
want, then don't do that. :)

4) The first group of four and the second group of four match
the same things. If you do them all together in list context,
you'll get the first 8 matches. If you do them separately
as above, you'll get the first 4 matches twice. Same for
the first group of two and the second group of two:

# m//g in list context, get first 4, discard the rest (untested)
my( $House_Suburb_Median_Change, $House_Region_Median_Change,
$Unit_Suburb_Median_Change, $Unit_Region_Median_Change ) =
$html =~ m{<td align="center" class="tbody">([+-]\d+%)</td>}g;

5) Your program is very fragile and will break easily. If the site
does something as simple as change to using single quotes then
you get the opportunity to revisit this forgotten code and figure
out what it does so that you can fix it. Getting HTML parsing
correct is very hard to do.

6) Note that if you do #1 above, then you don't have to deal with
any of the other points made above!
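
Points 2 and 4 can be sketched together for the price cells. This is only an illustration on a made-up HTML fragment (the four dollar values and their order are assumptions, not the real page), showing m//g in list context with a check on the number of matches, and the while (m//g) form where $1 is read only after a successful match:

```perl
use strict;
use warnings;

# Made-up HTML fragment standing in for the real page.
my $html = '';
$html .= qq{<td align="center" class="tbody">$_</td>\n}
    for '$650,682', '$601,250', '$590,000', '$550,000';

my $price_cell = qr{<td align="center" class="tbody">(\$[\d,]+)</td>};

# List context: m//g returns every capture at once.
my @prices = $html =~ /$price_cell/g;
die "Expected 4 prices, found " . scalar(@prices) . "\n" unless @prices == 4;
my ( $suburb_avg, $region_avg, $suburb_median, $region_median ) = @prices;

# Scalar context: each iteration moves on to the next match, and $1 is
# only read inside the loop, where the match is known to have succeeded.
my @seen;
while ( $html =~ /$price_cell/g ) {
    push @seen, $1;
}

print "Average House Price: $suburb_avg - $region_avg\n";
print "Median Price: $suburb_median - $region_median\n";
```

Either form gives each variable its own value instead of four copies of the same $1.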


You are doing it the hard way. The easy way is, well, easier:

http://search.cpan.org/~msisk/HTML-TableExtract-1.08/
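
For a rough idea of what the module route looks like, here is a minimal sketch with HTML::TableExtract. The table, its column headers (Measure, Suburb, Region) and the values are all made up, since the real page's layout is not shown in this thread, and the tables method assumes a 2.x release of the module rather than the 1.08 linked above:

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Made-up table; headers and values are assumptions for illustration only.
my $html = <<'HTML';
<table>
<tr><th>Measure</th><th>Suburb</th><th>Region</th></tr>
<tr><td align="center" class="tbody">Average</td><td>$650,682</td><td>$601,250</td></tr>
<tr><td>Median</td><td>$590,000</td><td>$550,000</td></tr>
</table>
HTML

# Select the table by its column headers rather than by tag attributes,
# so a change in quoting or class names does not break the extraction.
my $te = HTML::TableExtract->new( headers => [ 'Measure', 'Suburb', 'Region' ] );
$te->parse($html);

my %price;
for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        my ( $measure, $suburb, $region ) = @$row;
        $price{$measure} = { suburb => $suburb, region => $region };
    }
}

print "Average House Price: $price{Average}{suburb} - $price{Average}{region}\n";
print "Median Price: $price{Median}{suburb} - $price{Median}{region}\n";
```

Note that the first data row carries align and class attributes and the second does not, and both are picked up the same way.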
 
