Perl script to extract data from webpage? (knucklehead newbie).

Discussion in 'Perl' started by Ryan Haskell, Jun 23, 2004.

  1. Ryan Haskell

    Ryan Haskell Guest

    Hello folks. I regret to announce that my understanding of Perl is
    virtually nonexistent, and I'm looking for a little instruction. My
    goal is to utilize a Perl script to extract specific numeric data from
    various web pages, and then feed that data to MRTG for graphing
    purposes. I have this running now using a script I found elsewhere,
    and am using it to pull current temperature for my area from
    www.weather.com and create a graph. Now I want to use the same
    technique for other data elsewhere. Problem is, I can't figure out
    how to modify this perl script to find the data of interest in a given
    page, because I don't understand how the script actually locates the
    data. The script itself is available from

    http://howto.aphroland.de/HOWTO/MRTG/Scripts/weather4.pl

    and here is a short excerpt from it, where the script parses the html
    page from www.weather.com for the humidity data:

    if ( /\%/ && /obsInfo2/ && ! /WIDTH/ ) {
        if (/[0-9]{1,3}\%/) {
            if ( $debug == 1 ) {
                unless ( $& ) { die "Cannot determine the humidity!\n"; }
                $humidity = $&;
                chop ($humidity);
                print "Humidity: $humidity\n";



    And below is the relevant section of the html code from
    www.weather.com that is being parsed:


    <BR>
    <TABLE BORDER=0 CELLPADDING=0 WIDTH=100% CELLSPACING=0>
    <TR><TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo1 WIDTH=40%>UV Index:</TD>
    <TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>3&nbsp;Low</TD></TR>
    <TR><TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo1>Dew Point:</TD>
    <TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>51&deg;F</TD></TR>
    <TR><TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo1>Humidity:</TD>
    <TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>40%</TD></TR>
    <TR><TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo1>Visibility:</TD>
    <TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>10.0 miles</TD></TR>
    <TR><TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo1>Pressure:</TD>
    <TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>29.79 inches and
    rising</TD></TR>
    <TR><TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo1>Wind:</TD>
    <TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>From the North at 13 gusting
    to 18&nbsp;mph</TD></TR>


    I can see that "%" and "obsInfo2" are text strings found within the
    html page on either side of the desired value, but I'm not clear on
    how the perl script pulls the actual value (in this case 40) out of
    the data and assigns it to the $humidity variable. How would I modify
    the perl script if I wanted to get, for example, the pressure instead?
    (which is 29.79 in the html example above.) I think if I could
    understand how this variable matching/assignment is occurring, I could
    then use this script to fetch almost any number from any web page,
    right?

    For another example, let's say I wanted to pull the value for "Heat
    Index" off the NWS Weather page at:

    http://weather.noaa.gov/weather/current/KVDF.html

    What would I do?

    Thanks for any help!
    Ryan Haskell
     
    Ryan Haskell, Jun 23, 2004
    #1

  2. Gunnar Hjalmarsson, Jun 23, 2004
    #2

  3. Ryan Haskell

    Ryan Haskell Guest

    Gunnar Hjalmarsson wrote:
    > Ryan Haskell wrote:
    > > Hello folks. I regret to announce that my understanding of Perl is
    > > virtually nonexistant, and I'm looking for a little instruction.

    >
    > http://learn.perl.org/



    Been there already, Gunnar. I was hoping to get a little help from
    the community... what would take me 10 hours to figure out could be
    explained in less than 5 minutes by an experienced perl programmer.
    I'm all for RTFM, and have been doing so. Hopefully there are others
    out there a little more reminiscent of the days when they were first
    trying to learn perl, or anything else for that matter.

    Ryan
     
    Ryan Haskell, Jun 24, 2004
    #3
  4. Ryan Haskell wrote:
    > Gunnar Hjalmarsson wrote:
    >> Ryan Haskell wrote:
    >>> Hello folks. I regret to announce that my understanding of
    >>> Perl is virtually nonexistant, and I'm looking for a little
    >>> instruction.

    >>
    >> http://learn.perl.org/

    >
    > Been there already, Gunnar. I was hoping to get a little help from
    > the community...


    http://www.catb.org/~esr/faqs/smart-questions.html

    > what would take me 10 hours to figure out could be explained in
    > less than 5 minutes by an experienced perl programmer.


    You asked in your first post how to modify the script to get pressure
    instead. That would be easy:

    if ( /inches/ && /obsInfo2/ && ! /WIDTH/ ) {
        if (/\d\d\.\d\d/) {
            print "Pressure: $&\n";
        } else {
            die "Cannot determine the pressure!\n";
        }
    }

    Then you said: "I think if I could understand how this variable
    matching/assignment is occuring, I could then use this script to fetch
    almost any number from any web page, right?"

    That sentence reveals very unrealistic expectations. Either you spend
    quite some time learning Perl, or else you might be better off at

    http://jobs.perl.org/

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Jun 24, 2004
    #4
  5. Jim Gibson

    Jim Gibson Guest

    Ryan Haskell wrote:

    [description of problem snipped]

    >
    > http://howto.aphroland.de/HOWTO/MRTG/Scripts/weather4.pl
    >
    > and here is a short excerpt from it, where the script parses the html
    > page from www.weather.com for the humidity data:
    >
    > if ( /\%/ && /obsInfo2/ && ! /WIDTH/ ) {
    > if (/[0-9]{1,3}\%/) {
    > if ( $debug == 1 ) {
    > unless ( $& ) { die "Cannot determine the humidity!\n"; }
    > $humidity = $&;
    > chop ($humidity);
    > print "Humidity: $humidity\n";
    >
    >
    >
    > And below is the relevant section of the html code from
    > www.weather.com that is being parsed:
    >
    >
    > <BR>
    > <TABLE BORDER=0 CELLPADDING=0 WIDTH=100% CELLSPACING=0>
    > <TR><TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo1 WIDTH=40%>UV Index:</TD>
    > <TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>3&nbsp;Low</TD></TR>
    > <TR><TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo1>Dew Point:</TD>
    > <TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>51&deg;F</TD></TR>
    > <TR><TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo1>Humidity:</TD>
    > <TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>40%</TD></TR>
    > <TR><TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo1>Visibility:</TD>
    > <TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>10.0 miles</TD></TR>
    > <TR><TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo1>Pressure:</TD>
    > <TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>29.79 inches and
    > rising</TD></TR>
    > <TR><TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo1>Wind:</TD>
    > <TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>From the North at 13 gusting
    > to 18&nbsp;mph</TD></TR>
    >
    >
    > I can see that "&" and "obsInfo2" are text strings found within the
    > html page on either side of the desired value, but I'm not clear on
    > how the perl script pulls the actual value (in this case 40) out of
    > the data and assigns it to the $humidity variable. How would I modify
    > the perl script if I wanted to get, for example, the pressure instead?
    > (which is 29.97 in the html example above.) I think if I could
    > understand how this variable matching/assignment is occuring, I could
    > then use this script to fetch almost any number from any web page,
    > right?


    The extraction method shown above depends upon the fact that the
    humidity value contains a percent sign, appears on the same line as the
    string 'obsInfo2', and doesn't appear on the same line as the WIDTH
    parameter, which also contains a percent sign. If any of these
    restrictions changes in the future, this script will fail.

    It works by finding the line with the humidity on it using the above
    three rules: if ( /\%/ && /obsInfo2/ && ! /WIDTH/ )
    If it passes that test, it looks for 1 to 3 numerical digits followed
    by a percent sign: if (/[0-9]{1,3}\%/)
    If that matches, the matched text (here "40%") is placed in the special
    $& variable. The script copies $& into $humidity, and chop() removes
    the last character, the percent sign, leaving the number that is
    printed. Note that if the website ever adds a decimal point to the
    humidity reading, the pattern will match only the digits after the
    point and the script will report the wrong value. See 'perldoc perlre'
    for information about regular expressions and 'perldoc perlvar' about
    Perl's special variables.
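
    The steps described above can be sketched as a small stand-alone
    script. This is only an illustration, not the weather4.pl code: the
    helper name humidity_from_line and the @page sample lines are made up
    for the example, and the HTML is the excerpt quoted in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from weather4.pl): returns the humidity digits
# from one line of HTML, or undef if the line is not the humidity line.
sub humidity_from_line {
    my ($line) = @_;

    # Step 1: select the line. It must contain a "%" and "obsInfo2",
    # and must not contain "WIDTH" (WIDTH=100% also has a percent sign).
    return undef
        unless $line =~ /%/ && $line =~ /obsInfo2/ && $line !~ /WIDTH/;

    # Step 2: look for 1 to 3 digits followed by a percent sign.
    return undef unless $line =~ /[0-9]{1,3}%/;

    my $humidity = $&;    # $& holds the whole matched text, e.g. "40%"
    chop $humidity;       # chop removes the last character, the "%"
    return $humidity;
}

# Sample lines taken from the HTML quoted in the thread.
my @page = (
    '<TABLE BORDER=0 CELLPADDING=0 WIDTH=100% CELLSPACING=0>',
    '<TR><TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo1>Humidity:</TD>',
    '<TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>40%</TD></TR>',
);

for my $line (@page) {
    my $h = humidity_from_line($line);
    print "Humidity: $h\n" if defined $h;
}
```

    Only the third sample line passes both steps; the first is rejected
    because it contains WIDTH, the second because it has no percent sign.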

    If you want to do a good job extracting information from web pages, you
    should be using an HTML parser and not regular expressions. Check out
    HTML::Parser, or look on www.cpan.org for HTML Table modules. You
    should be looking at both columns in this table to figure out what
    values are being displayed on the page.
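
    As a rough sketch of the "look at both columns" idea, the following
    pairs each label cell with the value cell after it. It assumes the
    HTML::Parser module from CPAN is installed (it is not part of core
    Perl) and that labels and values carry the obsInfo1 and obsInfo2
    classes seen in the excerpt. The helper name table_values is invented
    for the example.

```perl
use strict;
use warnings;
use HTML::Parser;    # CPAN module; assumed to be installed

# Hypothetical helper: returns a hash mapping each label cell
# (CLASS=obsInfo1) to the text of the value cell (CLASS=obsInfo2) after it.
sub table_values {
    my ($html) = @_;
    my (%values, $class, $label);
    my $text = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # On <td>, remember the cell's class and start collecting text.
        start_h => [ sub {
            my ($tag, $attr) = @_;
            return unless $tag eq 'td';
            $class = lc($attr->{class} // '');
            $text  = '';
        }, 'tagname, attr' ],
        # Collect entity-decoded text while inside a cell.
        text_h => [ sub { $text .= $_[0] if defined $class }, 'dtext' ],
        # On </td>, store the text as a label or as that label's value.
        end_h => [ sub {
            my ($tag) = @_;
            return unless $tag eq 'td' && defined $class;
            (my $clean = $text) =~ s/\s+/ /g;    # join wrapped lines
            $clean =~ s/^ | $//g;                # trim the ends
            if ($class eq 'obsinfo1') {
                ($label = $clean) =~ s/:$//;     # "Humidity:" -> "Humidity"
            }
            elsif ($class eq 'obsinfo2' && defined $label) {
                $values{$label} = $clean;
                undef $label;
            }
            undef $class;
        }, 'tagname' ],
    );
    $parser->parse($html);
    $parser->eof;
    return %values;
}

my $sample = <<'HTML';
<TABLE BORDER=0 CELLPADDING=0 WIDTH=100% CELLSPACING=0>
<TR><TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo1>Humidity:</TD>
<TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>40%</TD></TR>
<TR><TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo1>Pressure:</TD>
<TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>29.79 inches and
rising</TD></TR>
</TABLE>
HTML

my %v = table_values($sample);
print "$_ => $v{$_}\n" for sort keys %v;
```

    Because the parser works on cells rather than on lines, a value that
    wraps onto a second line (like the pressure) is still picked up whole.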


    >
    > For another example, let's say I wanted to pull the value for "Heat
    > Index" off the NWS Weather page at:
    >
    > http://weather.noaa.gov/weather/current/KVDF.html
    >
    > What would I do?
    >
    > Thanks for any help!
    > Ryan Haskell
     
    Jim Gibson, Jun 24, 2004
    #5
  6. Your $& is a special perl variable that holds the string matched by
    the last successful pattern match, which in your example is
    /[0-9]{1,3}\%/, a pattern that says "match 1 to 3 digits followed by
    a '%' character".

    Maybe an easier way of writing that same section of code would be:

    # true if $_ contains "CLASS=obsInfo2>" followed by a 1-3 digit number
    # and a "%", concluded by a "</TD>"

    if ( /CLASS=obsInfo2>([0-9]{1,3}\%)<\/TD>/i ) {
        print "Humidity: $+\n" ;
    }

    # Note that I had to use \ to "quote" the / in </TD> or it would have
    # been interpreted as the end of the pattern.
    # Also used an "i" after the pattern to make the match case-insensitive.

    # "$+" is another special perl variable; it returns the text matched by
    # the last ( ) group of the last successful match (with a single group,
    # the same as $1)
    # "$&" returns the entire matched string
    # "$`" returns everything before the matched string
    # "$'" returns everything after the matched string
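
    To see what each of these variables holds, here is a small sketch run
    against the humidity line from the original post. The helper name
    match_parts is made up, and unlike the pattern above the "%" is kept
    outside the parentheses so that the captured group is just the digits.

```perl
use strict;
use warnings;

# Hypothetical helper: runs the match and returns the special variables
# by name, or an empty list if the line does not match.
sub match_parts {
    my ($line) = @_;
    return unless $line =~ /CLASS=obsInfo2>([0-9]{1,3})%<\/TD>/i;
    return (
        whole       => $&,    # the entire matched string
        before      => $`,    # everything before the match
        after       => $',    # everything after the match
        last_group  => $+,    # text of the last ( ) group
        first_group => $1,    # text of the first ( ) group
    );
}

my %p = match_parts('<TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>40%</TD></TR>');
printf "%-11s = %s\n", $_, $p{$_} for sort keys %p;
```

    With only one pair of parentheses in the pattern, last_group and
    first_group are the same text, so $1 could be used in place of $+.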

    To get pressure, you might add:

    # true if $_ contains the string "inches", and uses ".*" as a wildcard
    # match for the text we want to return
    # (this assumes the whole cell, through </TD>, is on one line)

    if ( /inches/i && /CLASS=obsInfo2>(.*)<\/TD>/i ) {
        print "Pressure: $+\n" ;
    }
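
    If the value cell is wrapped over two lines, as the pressure cell is
    in the quoted HTML, a line-by-line match will not see the closing
    </TD>. One possible workaround, sketched below, is to read the whole
    page into a single string and anchor on the label cell instead. The
    helper name number_after_label is invented, and the obsInfo2 layout is
    assumed from the excerpt; other pages would need their own pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: finds the cell labelled "$label:" and returns the
# first number in the value cell that follows it, or undef if not found.
sub number_after_label {
    my ($html, $label) = @_;

    # /s lets "." match newlines, so a wrapped cell is still matched;
    # ".*?" is non-greedy, so the match stops at the nearest value cell.
    if ($html =~ /\Q$label\E:<\/TD>.*?CLASS=obsInfo2>(.*?)<\/TD>/is) {
        my $cell = $1;
        return $1 if $cell =~ /(-?\d+(?:\.\d+)?)/;
    }
    return undef;
}

# Sample taken from the HTML quoted in the original post.
my $page = <<'HTML';
<TR><TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo1>Humidity:</TD>
<TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>40%</TD></TR>
<TR><TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo1>Pressure:</TD>
<TD ALIGN=LEFT VALIGN=TOP CLASS=obsInfo2>29.79 inches and
rising</TD></TR>
HTML

for my $label ('Humidity', 'Pressure') {
    my $n = number_after_label($page, $label);
    print "$label: ", (defined $n ? $n : 'not found'), "\n";
}
```

    Anchoring on the label text rather than on "inches" or "%" means the
    same helper can be reused for other rows of the table, as long as the
    page keeps the label and value cells in this order.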



     
    Gavin Williams, Jun 24, 2004
    #6
  7. Ryan Haskell

    Ryan Haskell Guest

    Gavin Williams wrote:
    > Your $& is a special perl variable that represents the string matched by
    > the last successful pattern match...which in the case of your example
    > happens to be /[0-9]{1,3}\%/.....a pattern match which basically says
    > "return a pattern that contains a number from 1 to 3 digits long followed by
    > a "%" character.
    >
    > Maybe an easier way of writing that same section of code would be:
    >
    > # true if $_ contains "CLASS=obsInfo2>" followed by a 1-3 digit number and a
    > "%", concluded by a "</TD>"
    >
    > if ( /CLASS=obsInfo2>([0-9]{1,3}\%)<\/TD>/i ) {
    > print "Humidity: $+\n" ;
    > }
    >

    <snip>

    Thanks for the help everyone. After much trial and error I've managed
    to produce a working script with which I've been successful in
    obtaining the info I need. It really wasn't that hard after I
    researched regular expressions for a while. I've come up with some
    other complications now that I know how to actually get the data, but
    I'll try to figure those out on my own...

    Ryan
     
    Ryan Haskell, Jun 25, 2004
    #7