Website scraper

Discussion in 'Perl Misc' started by DVH, Sep 24, 2005.

  1. DVH

    DVH Guest

    Hi,

    I've been working through a Perl tutorial on using HTML::TokeParser, and
    trying to adapt the example script it gives.
    http://www.perl.com/pub/a/2001/11/15/creatingrss.html

    The script is meant to scrape headlines from the BBC website and put them
    into an RSS feed. It looks for CSS tags, then extracts the text nearby. I've
    modified it because the tags in the example don't match the tags on the site
    any more, but the script is still sticking at a certain point.

    I *think* it's sticking here:

    $headline = $stream->get_trimmed_text('/b') \
    if ($tag->[1]{class} =~ /^h[12]$/);

    I don't understand what that backslash is doing at the end of the first
    line. And I don't see where the loop following the "if" in the second line
    actually begins - shouldn't it begin with a curly bracket?

    Any advice gratefully received.

    DVH.
    DVH, Sep 24, 2005
    #1

  2. DVH wrote:
    > I've been working through a Perl tutorial on using HTML::TokeParser, and
    > trying to adapt the example script it gives.
    > http://www.perl.com/pub/a/2001/11/15/creatingrss.html


    Note: that article was written in 2001. Screen-scrapers are notoriously
    fragile - they often break in response to even the slightest change in
    the target.

    > The script is meant to scrape headlines from the BBC website and put them
    > into an RSS feed. It looks for CSS tags, then extracts the text nearby. I've
    > modified it because the tags in the example don't match the tags on the site
    > any more, but the script is still sticking at a certain point.


    As above, IIRC the BBC has changed their news page since 2001.

    > I *think* it's sticking here:
    >
    > $headline = $stream->get_trimmed_text('/b') \
    > if ($tag->[1]{class} =~ /^h[12]$/);
    >
    > I don't understand what that backslash is doing at the end of the first
    > line.


    I think the author got mixed up between Perl and shell scripting - where
    '\' is used to continue across newlines. That line should be:

    $headline = $stream->get_trimmed_text('/b')
    if ($tag->[1]{class} =~ /^h[12]$/);

    > And I don't see where the loop following the "if" in the second line
    > actually begins - shouldn't it begin with a curly bracket?


    It's an example of Perl's "statement if (cond)" syntax. So, just as you
    can say:

    if (foo) { bar; }

    you can also say:

    bar if (foo);

    Consequently, the above scraper code is *exactly* the same as:

    if ($tag->[1]{class} =~ /^h[12]$/)
    {
    $headline = $stream->get_trimmed_text('/b');
    }

    It's just a matter of preference and readability.
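
    A minimal sketch of that equivalence, runnable on its own. The $tag
    value here is a hypothetical stand-in shaped like the array reference
    that HTML::TokeParser's get_tag returns for a start tag (tag name,
    then a hash of attributes); the variable names are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a start-tag token: [ tag name, { attributes } ]
my $tag = [ 'div', { class => 'h1' } ];

my ($modifier_form, $block_form);

# Statement-modifier form: the assignment runs only if the condition holds.
$modifier_form = 'matched' if ($tag->[1]{class} =~ /^h[12]$/);

# Block form: same condition, same effect.
if ($tag->[1]{class} =~ /^h[12]$/) {
    $block_form = 'matched';
}

print "$modifier_form / $block_form\n";   # matched / matched
```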

    HTH,
    Steve
    --
    Stephen Hildrey
    E-mail: / Tel: +442071931337
    Jabber: / MSN:
    Stephen Hildrey, Sep 24, 2005
    #2

  3. [ Newsgroups trimmed. I don't do the alt.* hierarchy ]


    Stephen Hildrey wrote:
    > DVH wrote:



    >> $headline = $stream->get_trimmed_text('/b') \
    >> if ($tag->[1]{class} =~ /^h[12]$/);
    >>
    >> I don't understand what that backslash is doing at the end of the first
    >> line.

    >
    > I think the author got mixed up between Perl and shell scripting - where
    > '\' is used to continue across newlines.



    So, the backslash at the end of the line is escaping the newline that
    follows it (but there is no need to escape that newline, so it does
    not do anything that is useful).


    > $headline = $stream->get_trimmed_text('/b')
    > if ($tag->[1]{class} =~ /^h[12]$/);
    >
    > > And I don't see where the loop following the "if" in the second line

                                  ^^^^
    > > actually begins - shouldn't it begin with a curly bracket?

    >
    > It's an example of Perl's "statement if (cond)" syntax.



    Note that there is no "loop" in the code the OP showed.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Sep 24, 2005
    #3
  4. Tad McClellan wrote:
    > Stephen Hildrey wrote:
    >>DVH wrote:
    >>>$headline = $stream->get_trimmed_text('/b') \
    >>> if ($tag->[1]{class} =~ /^h[12]$/);
    >>>
    >>>I don't understand what that backslash is doing at the end of the first
    >>>line.

    >>
    >>I think the author got mixed up between Perl and shell scripting - where
    >>'\' is used to continue across newlines.

    >
    > So, the backslash at the end of the line is escaping the newline that
    > follows it (but there is no need to escape that newline, so it does
    > not do anything that is useful).


    No. This is Perl - the backslash is a syntax error:

    $ cat > backslash.pl << _EOF && perl backslash.pl
    > use strict;
    > use warnings;
    > my $foo = "foo" \
    > if (1);
    > _EOF

    syntax error at backslash.pl line 3, near "my ="
    Execution of backslash.pl aborted due to compilation errors.

    >>$headline = $stream->get_trimmed_text('/b')
    >> if ($tag->[1]{class} =~ /^h[12]$/);
    >>
    >>>And I don't see where the loop following the "if" in the second line

    >
    > ^^^^^^^^
    >
    >>>actually begins - shouldn't it begin with a curly bracket?

    >>
    >>It's an example of Perl's "statement if (cond)" syntax.

    >
    > Note that there is no "loop" in the code the OP showed.


    Good spot. I assume he meant "block".

    OP: if you are still experiencing difficulties with the code, do post
    back - I'm sure we'll be able to help :)

    Steve
    Stephen Hildrey, Sep 24, 2005
    #4
  5. A. Sinan Unur, Sep 24, 2005
    #5
  6. A. Sinan Unur wrote:
    > "DVH" wrote in
    > news:dh34hg$d0i$-infra.bt.com:


    >>The script is meant to scrape headlines from the BBC website and put
    >>them into an RSS feed.


    > http://news.bbc.co.uk/rss/newsonline_world_edition/americas/rss.xml


    A valid point, well made :)

    Still, I think scraping is a useful technique to be aware of - I read
    that same article myself, and since have found numerous uses for
    scraping [1]:

    1. Being paid to write news aggregators,
    2. Getting text-message notifications in response to various ebay
    events,
    3. Being able to enjoy a night in the pub, despite $airline having
    lost my luggage (scrape the lost-luggage tracking site, send SMS
    hourly :) )

    The possibilities are endless!

    Steve

    [1] - yes, this may be a bit of a "grey area" in some AUPs/ToSs - YMMV.

    Stephen Hildrey, Sep 24, 2005
    #6
  7. Stephen Hildrey wrote in news:1127576183.17734.0@damia.uk.clara.net:

    > A. Sinan Unur wrote:
    >> "DVH" wrote in
    >> news:dh34hg$d0i$-infra.bt.com:

    >
    >>>The script is meant to scrape headlines from the BBC website and put
    >>>them into an RSS feed.

    >
    >> http://news.bbc.co.uk/rss/newsonline_world_edition/americas/rss.xml

    >
    > A valid point, well made :)
    >
    > Still, I think scraping is a useful technique to be aware of - I read
    > that same article myself,


    Agreed. I was doing it before I knew it was called scraping. Even MS did
    it (in the form of being able to import data from HTML tables into Excel
    given a page URL).

    If the RSS feed exists in the first place, why not go ahead and use it
    without mucking about with the internals of some HTML code?

    Sinan

    --
    A. Sinan Unur

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
    A. Sinan Unur, Sep 24, 2005
    #7
  8. A. Sinan Unur wrote:
    > If the RSS feed exists in the first place, why not go ahead and use it
    > without mucking about with the internals of some HTML code?


    1. If the HTML exposes some information not present in the RSS.
    2. Note - there wasn't a BBC RSS feed at the time that article
    was written (November 2001).
    3. The OP says he wants to "adapt the example script", so I don't
    know that he is even going to use it to scrape the BBC.

    But I agree in principle with your point - in-house RSS feeds generated
    from back-end data sources are far more robust than a bespoke solution
    that is based on data acquired through an attempt to reverse a
    data-to-presentation transform.

    Steve
    Stephen Hildrey, Sep 24, 2005
    #8
  9. Matt Garrish

    Matt Garrish Guest

    "Stephen Hildrey" wrote in message
    news:...
    > Tad McClellan wrote:
    >> Stephen Hildrey wrote:
    >>>DVH wrote:
    >>>>$headline = $stream->get_trimmed_text('/b') \
    >>>> if ($tag->[1]{class} =~ /^h[12]$/);
    >>>>
    >>>>I don't understand what that backslash is doing at the end of the first
    >>>>line.
    >>>
    >>>I think the author got mixed up between Perl and shell scripting - where
    >>>'\' is used to continue across newlines.

    >>
    >> So, the backslash at the end of the line is escaping the newline that
    >> follows it (but there is no need to escape that newline, so it does
    >> not do anything that is useful).

    >
    > No. This is Perl - the backslash is a syntax error:
    >
    > $ cat > backslash.pl << _EOF && perl backslash.pl
    > > use strict;
    > > use warnings;
    > > my $foo = "foo" \
    > > if (1);
    > > _EOF

    > syntax error at backslash.pl line 3, near "my ="
    > Execution of backslash.pl aborted due to compilation errors.
    >


    You shouldn't make claims you can't substantiate...

    my $time = localtime(\
    time);
    print $time;

    The interpreter can usually understand what you're trying to do, but in your
    case you broke the line before the conditional "if" and it's not going to
    assume that's what you meant so it gives an error. You can, however, break
    it anywhere else:

    my $foo = \
    "foo" if (1);

    or

    my $foo = "foo" if
    (1);
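
    A short sketch of what the leading backslash does in that second
    form, since it affects the result: in front of a term, a backslash is
    Perl's reference-taking operator rather than a line continuation, and
    whitespace (including a newline) may follow it. The snippet therefore
    compiles, but under that reading $foo ends up holding a reference to
    the string, not the string itself.

```perl
use strict;
use warnings;

# The backslash here is the reference operator applied to "foo";
# the newline after it is ordinary whitespace.
my $foo = \
"foo" if (1);

print ref($foo), "\n";   # SCALAR
print $$foo, "\n";       # foo
```

    Dropping the backslash leaves a plain string assignment, which is what
    the scraper code needs.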

    Matt
    Matt Garrish, Sep 24, 2005
    #9
  10. Matt Garrish

    Matt Garrish Guest

    "Matt Garrish" wrote in message
    news:_2fZe.307$...
    >
    > my $foo = "foo" if


    my $foo = "foo" if \

    That's what I get for typing...

    Matt
    Matt Garrish, Sep 24, 2005
    #10
  11. Matt Garrish wrote:
    > "Stephen Hildrey" wrote in message
    > news:...
    >
    >>Tad McClellan wrote:
    >>
    >>>Stephen Hildrey wrote:
    >>>
    >>>>DVH wrote:
    >>>>
    >>>>>$headline = $stream->get_trimmed_text('/b') \
    >>>>> if ($tag->[1]{class} =~ /^h[12]$/);
    >>>>>
    >>>>>I don't understand what that backslash is doing at the end of the first
    >>>>>line.
    >>>>
    >>>>I think the author got mixed up between Perl and shell scripting - where
    >>>>'\' is used to continue across newlines.
    >>>
    >>>So, the backslash at the end of the line is escaping the newline that
    >>>follows it (but there is no need to escape that newline, so it does
    >>>not do anything that is useful).

    >>
    >>No. This is Perl - the backslash is a syntax error:
    >>
    >> $ cat > backslash.pl << _EOF && perl backslash.pl
    >> > use strict;
    >> > use warnings;
    >> > my $foo = "foo" \
    >> > if (1);
    >> > _EOF

    >> syntax error at backslash.pl line 3, near "my ="
    >> Execution of backslash.pl aborted due to compilation errors.
    >>

    >
    >
    > You shouldn't make claims you can't substantiate...
    >
    > my $time = localtime(\
    > time);
    > print $time;
    >
    > The interpreter can usually understand what you're trying to do, but in your
    > case you broke the line before the conditional "if" and it's not going to
    > assume that's what you meant so it gives an error. You can, however, break
    > it anywhere else:


    Sorry if I was ambiguous - I was trying to maintain the structure of the
    code in the OP's example, and not talking about the general case.

    Steve
    Stephen Hildrey, Sep 24, 2005
    #11
  12. Matt Garrish

    Matt Garrish Guest

    "Stephen Hildrey" wrote in message
    news:...
    > Matt Garrish wrote:
    >>
    >> The interpreter can usually understand what you're trying to do, but in
    >> your case you broke the line before the conditional "if" and it's not
    >> going to assume that's what you meant so it gives an error. You can,
    >> however, break it anywhere else:

    >
    > Sorry if I was ambiguous - I was trying to maintain the structure of the
    > code in the OP's example, and not talking about the general case.
    >


    Clarity is key. I read your comment as a reference to Perl syntax in
    general. The OP's code does cause a compilation error, as you were
    alluding to.

    Matt
    Matt Garrish, Sep 24, 2005
    #12
  13. DVH

    DVH Guest

    Stephen Hildrey wrote in message
    news:...

    >
    > OP: if you are still experiencing difficulties with the code, do post
    > back - I'm sure we'll be able to help :)


    Thanks Stephen.

    I removed the backslash, and tidied up a couple of other obvious bugs. My
    script now runs through the HTML and successfully creates a well-formatted
    RSS file. It's an empty file though, so I think the next stage is to look at
    the order of the tags and make sure the script can actually find what it's
    looking for.

    It isn't immediately obvious how to do this, so I may indeed come back...
    thanks for the offer.

    [I'm doing this because I want to scrape other sites which don't have an RSS
    feed - as you mention elsewhere in the thread, there are numerous uses for
    this sort of scraping. But it seemed logical to start with the technique
    described in the tutorial].
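
    One way to check whether the script can find what it is looking for is
    to print every start tag and its class attribute as the parser walks
    the page, then compare that listing with the pattern being matched.
    The sketch below assumes HTML::TokeParser is installed; the markup in
    $html is hypothetical and only stands in for a fetched page, and the
    class names h1/h2 and the '/b' end tag are carried over from the
    tutorial's pattern rather than taken from any current site.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical markup standing in for a fetched page.
my $html = '<div class="h1"><a href="/a"><b>First headline</b></a></div>'
         . '<p class="other">skip me</p>'
         . '<div class="h2"><b>Second headline</b></div>';

# Passing a scalar reference parses the string rather than a file.
my $stream = HTML::TokeParser->new(\$html) or die "Cannot create parser";

my @headlines;
while (my $tag = $stream->get_tag) {
    my ($name, $attr) = @$tag;
    next if $name =~ m{^/};    # end tags carry no attributes

    # Debug listing: shows which tags and classes the parser actually sees.
    my $class = $attr->{class};
    printf "<%s class=%s>\n", $name, defined $class ? $class : '(none)';

    next unless defined $class && $class =~ /^h[12]$/;

    # Collect the text up to the next </b>, as in the tutorial's script.
    push @headlines, $stream->get_trimmed_text('/b');
}

print "$_\n" for @headlines;
```

    If the debug listing never shows a class that the pattern accepts, the
    pattern (not the RSS-writing code) is the part to adjust.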
    DVH, Sep 24, 2005
    #13
  14. At 2005-09-24 11:11AM, Stephen Hildrey wrote:
    > No. This is Perl - the backslash is a syntax error:
    >
    > $ cat > backslash.pl << _EOF && perl backslash.pl
    > > use strict;
    > > use warnings;
    > > my $foo = "foo" \
    > > if (1);
    > > _EOF

    > syntax error at backslash.pl line 3, near "my ="
    > Execution of backslash.pl aborted due to compilation errors.


    No, your shell is doing variable substitution:

    $ foo=bar cat > foo.pl << _EOF
    > my $foo = `date`;
    > _EOF

    $ cat foo.pl
    my bar = Mon Sep 26 09:15:37 EDT 2005;

    That's the perl syntax error you're seeing.

    If you want to use shell here-docs to type perl programs, single-quote
    your delimiter:

    $ foo=bar cat > foo.pl << '_EOF'
    > my $foo = `date`;
    > _EOF

    $ cat foo.pl
    my $foo = `date`;
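
    A way to take the shell out of the picture entirely is to hand the
    snippet to Perl as a string and let string eval compile it. This is a
    sketch rather than a definitive test: the variable names and labels
    are illustrative, and the snippet is the two-line statement quoted
    earlier in the thread. A single-quoted heredoc keeps $foo from being
    interpolated before compilation.

```perl
use strict;
use warnings;

# The statement from the thread, protected from interpolation.
my $with_backslash = <<'CODE';
my $foo = "foo" \
if (1);
1;
CODE

# Same statement with the trailing backslash removed.
(my $without_backslash = $with_backslash) =~ s/ \\$//m;

my %compiles;
for my $case ([ 'with backslash',    $with_backslash ],
              [ 'without backslash', $without_backslash ]) {
    my ($label, $code) = @$case;
    # String eval compiles and runs the snippet; it returns undef
    # (and sets $@) if compilation fails.
    $compiles{$label} = eval($code) ? 1 : 0;
    printf "%-18s %s\n", $label,
        $compiles{$label} ? 'compiles' : 'does not compile';
}
```

    Run this way, the version with the backslash before "if" still fails
    to compile, while the version without it compiles. So the "my ="
    message in the transcript comes from shell substitution, but the
    stray backslash is a separate error in its own right.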

    --
    Glenn Jackman
    NCF Sysadmin
    Glenn Jackman, Sep 26, 2005
    #14