Q on regex of LWP::Simple data

Discussion in 'Perl Misc' started by Len Philpot, Mar 2, 2007.

  1. Len Philpot

    Len Philpot Guest

    I've read the FAQs (unless proven otherwise!) and examples, etc. but
    don't know why this doesn't work...


    #!perl # use your shebang of choice, this was on Windows

    use warnings;
    use strict;
    use LWP::Simple;

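    # the stray "\" line-continuation marker is removed here; left in place,
    # it would take a reference to get()'s return value
    my @cachepage = get('http://www.geocaching.com/seek/cache_details.aspx?wp=GC115K4');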

    # line in question (in @cachepage) looks like :
    # <p><span id="ShortDescription">Should be quick and easy.</span></p>

    foreach my $line (@cachepage)
    {
        if($line =~ /Should be quick/)
        {
            print("$line");
        }
    }


    Instead of printing only the line that contains "Should be quick", it
    prints every line. Breaking it down to a minimum, I tried :

    #!perl

    use warnings;
    use strict;

    my @a = qw(one two three four five fiver);

    foreach my $line (@a)
    {
        if($line =~ /five/)
        {
            print("$line\n");
        }
    }

    Which, of course, prints :

    five
    fiver

    ... as expected. What's different except maybe the input data? Are the
    tags throwing a wrench in things?

    My apologies in advance if this is a FAQ or simple logical error. I'm
    very much in learning mode with Perl these days.

    Thanks!
    --

    ---- Len Philpot -------- l e n @ p h i l p o t . o r g (no spaces)
    ------- ><> ------------- http://pages.suddenlink.net/lenphilpot/
     
    Len Philpot, Mar 2, 2007
    #1

  2. Len Philpot

    Len Philpot Guest

    On Fri, 02 Mar 2007 16:02:12 +1100, Iain Chalmers wrote:

    > In article <18p2wnf2hexfv$>,
    > Len Philpot <> wrote:
    >
    >> my @cachepage = \
    >> get('http://www.geocaching.com/seek/cache_details.aspx?wp=GC115K4');
    >>

    >
    > I don't think @cachepage contains what you think it contains...
    >
    > try adding:
    >
    > use Data::Dumper;
    > print Dumper \@cachepage;
    >
    > after that line.


    So, it's one long string now... @cachepage holds a single element ($#cachepage == 0)

    What's the best way to break it back up again? Maybe a pointer in the
    right direction?

    The get() example used a scalar instead of an array, but I wanted to
    iterate through it to find a number of specific strings. Maybe I need to
    come up with a regex to simply extract what I need all at once without
    iterating.
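
    A minimal sketch of why the array ends up with one element. fake_get()
    is a hypothetical stand-in for LWP::Simple::get(), assumed here only to
    return the whole page as one string, so the example runs without
    network access:

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical stand-in for LWP::Simple::get(): returns the page as ONE string.
sub fake_get {
    return "<html>\n"
         . "<p><span id=\"ShortDescription\">Should be quick and easy.</span></p>\n"
         . "</html>\n";
}

# Assigning a single scalar to an array yields a one-element array.
my @cachepage = fake_get();

my $count      = scalar @cachepage;   # number of elements
my $last_index = $#cachepage;         # index of the last element

print "elements: $count, last index: $last_index\n";
print Dumper(\@cachepage);
```

    Because the only element contains the search text, the foreach loop
    matches once and prints that element, which is the entire page.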

    Or am I looking at this wrong? My final objective, more or less, is to
    retrieve a file from a website and extract two or three specific strings
    from it, located via a couple of specific HTML tags and subsequently
    extracted using back references, but I'm not there yet.

    Perhaps I'm being dense... After all, it /has/ been a very long
    DST-fix-infested day :)

    Thanks.
    --

    ---- Len Philpot -------- l e n @ p h i l p o t . o r g (no spaces)
    ------- ><> ------------- http://pages.suddenlink.net/lenphilpot/
     
    Len Philpot, Mar 2, 2007
    #2

  3. Len Philpot

    Len Philpot Guest

    On Fri, 02 Mar 2007 17:39:51 +1100, Iain Chalmers wrote:

    > In article <epdxww5gfd0l.5psw3jcl6it4$>,
    > Len Philpot <> wrote:
    >> Or am I looking at this wrong?

    >
    > Yep. LWP::Simple::get doesn't return an array of lines no matter _how_
    > much you want it to.
    >
    > Either split the scalar you get into an array of lines yourself
    >
    > @cachepage=split(/\n/,$scalar_version_of_cachepage);
    >
    > or throw the whole scalar at an appropriate regex.
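
    Both options can be sketched as follows. $page is a hypothetical sample
    standing in for the string that get() returns, and the pattern in the
    second option is only an assumption about the markup, based on the
    sample line shown in the first post:

```perl
use strict;
use warnings;

# Hypothetical sample page standing in for the string get() returns.
my $page = join "\n",
    '<html><body>',
    '<p><span id="ShortDescription">Should be quick and easy.</span></p>',
    '<p>Some other line.</p>',
    '</body></html>';

# Option 1: split the scalar into lines, then test each line.
my @lines   = split /\n/, $page;
my @matches = grep { /Should be quick/ } @lines;
print "$_\n" for @matches;

# Option 2: apply one regex to the whole scalar and capture the payload.
my ($desc) = $page =~ /<span id="ShortDescription">(.*?)<\/span>/;
print "$desc\n" if defined $desc;
```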


    That's what I thought about after posting.


    > Unless the file you're getting is very well defined, the usual advice is
    > to parse HTML using an HTML parser. Regexes are not the right tool to
    > deal with arbitrary HTML (though your case might be far enough from
    > "arbitrary HTML" that regexes will work for you).


    At this point, I'm very low on the Perl learning cliff (oh, for the
    simplicity and clarity of C! :), so I'll probably take an
    incrementally-complex approach to parsing it. This whole exercise is for
    my own use and edification, anyway.

    Thanks.
    --

    ---- Len Philpot -------- l e n @ p h i l p o t . o r g (no spaces)
    ------- ><> ------------- http://pages.suddenlink.net/lenphilpot/
     
    Len Philpot, Mar 2, 2007
    #3
  4. Len Philpot

    gf Guest

    On Mar 2, 6:15 am, Len Philpot <> wrote:

    > At this point, I'm very low on the Perl learning cliff (oh, for the
    > simplicity and clarity of C! :), so I'll probably take an
    > incrementally-complex approach to parsing it. This whole exercise is for
    > my own use and edification, anyway.


    Ok. I think you meant "curve" instead of "cliff"...

    And "the simplicity and clarity of C"? Perl and C are so similar as
    far as their allowing the programmer to write terse and cryptic code,
    or very verbose code, and still maintain speed. It's the programmer's
    choice and not something enforced by the language. That said...

    The problem with finding strings or data in HTML pages is the
    variability of the format of the pages. HTML is unstructured and relies
    on the browser to turn the data into human-readable form. For our
    purposes as programmers it makes our job more difficult because we
    want to grab the easiest tool to do the job and regex seems to be the
    tool to handle finding data in lines that change.

    The problem is that HTML allows arbitrary line breaks in the file and
    the browser will gobble them then parse the page then format it for
    us. Perl doesn't do that. It's doing what you told it to (usually)
    and, in this case, what you told it to do is not nearly as complex as
    what the browser is doing.

    You can get closer to what the browser is doing by stripping all the
    line-end characters from the document, then applying your regex
    pattern repeatedly to the resulting single line, OR you can tell
    the regex engine to ignore line-ends for you. Check out the 'm' and
    's' options to regex. Combined with 'g' you should be homing in on the
    data you want. Usually.
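
    A small sketch of the 's' and 'g' options at work. The markup, including
    the "LongDescription" id, is a made-up sample in which the text of
    interest is broken across lines:

```perl
use strict;
use warnings;

# Hypothetical markup; the ids and text are illustrative only.
my $html = <<'END';
<p><span id="ShortDescription">Should be quick
and easy.</span></p>
<p><span id="LongDescription">Park at the
trailhead.</span></p>
END

# Without /s, '.' does not match a newline, so nothing is found.
my @without_s = $html =~ /<span id="\w+">(.*?)<\/span>/g;

# With /s, '.' also matches newlines; /g returns every capture in list context.
my @with_s = $html =~ /<span id="\w+">(.*?)<\/span>/gs;

printf "without /s: %d match(es)\n", scalar @without_s;
for my $text (@with_s) {
    (my $flat = $text) =~ s/\s+/ /g;   # collapse the embedded line break
    print "with /s: $flat\n";
}
```

    The 'm' option is separate: it changes where ^ and $ match, not what
    '.' matches.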

    Sometimes those are still going to fail so you have to dig out the big
    guns and parse the document like a browser. There's HTML::Parser and
    various derived modules. Of those I like HTML::TreeBuilder. Pass it
    HTML using

    my $t = HTML::TreeBuilder->new_from_content(get('your url'));

    and it will parse it and build a tree. It'll lock the tree and turn it
    into an HTML::Element object which you can search and extract info
    using the methods of that object. Of those I like the 'look_down()'
    method because it's so flexible. Give it the right parameters and
    it'll let you loop through the page and find whatever you want. Of
    course, as always you have to tell it correctly, and that can be a
    tough thing to determine, but that's a different subject for a
    different time and probably a different group.
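
    As a rough sketch of the look_down() approach, assuming
    HTML::TreeBuilder is installed from CPAN (it is not part of core Perl)
    and using a made-up page instead of a live fetch:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;   # CPAN module, assumed to be installed

# Hypothetical stand-in for the fetched page.
my $html = <<'END';
<html><body>
<p><span id="ShortDescription">Should be quick
and easy.</span></p>
</body></html>
END

my $tree = HTML::TreeBuilder->new_from_content($html);

# In scalar context look_down() returns the first element matching all criteria.
my $span = $tree->look_down( _tag => 'span', id => 'ShortDescription' );
my $text = defined $span ? $span->as_text : '';
$text =~ s/\s+/ /g;   # normalize any line break inside the element

print "$text\n";
$tree->delete;        # free the tree when finished with it
```

    The line break inside the span is not a problem here, because the
    parser works on elements rather than on lines.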

    Another way to attack the same problem is to use the various xpath
    implementations for HTML in Perl. Search on CPAN and you'll find some.
    xpath is a cool way of looking at HTML but, at least for me, it's not
    as intuitive as how TreeBuilder and the parsers do it.
     
    gf, Mar 2, 2007
    #4
  5. Len Philpot

    Len Philpot Guest

    On 2 Mar 2007 10:16:32 -0800, gf wrote:

    > On Mar 2, 6:15 am, Len Philpot <> wrote:
    >
    >> At this point, I'm very low on the Perl learning cliff (oh, for the
    >> simplicity and clarity of C! :), so I'll probably take an
    >> incrementally-complex approach to parsing it. This whole exercise is for
    >> my own use and edification, anyway.

    >
    > Ok. I think you meant "curve" instead of "cliff"...
    >
    > And "the simplicity and clarity of C"? Perl and C are so similar as
    > far as their allowing the programmer to write terse and cryptic code,
    > or very verbose code, and still maintain speed. It's the programmer's
    > choice and not something enforced by the language. That said...


    Actually, 'cliff' was intentional, as was the C reference - A weak
    attempt at humor, I guess. I'm just trying to come to terms with the
    looseness that Perl allows (although doesn't require). It's purely my
    preference : I like algorithmic flexibility, but with a tighter
    syntactic regimen, i.e., for me TIMTOWTDI gets in the way of learning
    "the best/right way to do X". However, I'm sure it's very different for
    others (as is obviously the case). I really like the way C is not as
    abstracted - "the machine prints through" - but once again that's my
    preference. Lots of very knowledgeable people feel differently. :)


    > The problem with finding strings or data in HTML pages is the
    > variability of the format of the pages. HTML is unstructured and relies
    > on the browser to turn the data into human-readable form. For our
    > purposes as programmers it makes our job more difficult because we
    > want to grab the easiest tool to do the job and regex seems to be the
    > tool to handle finding data in lines that change.


    Fortunately in this case, what I'm looking for is (AFAICT) uniquely
    labeled and fairly contained. However, newlines do occur and I'll have
    to deal with that.


    > Sometimes those are still going to fail so you have to dig out the big
    > guns and parse the document like a browser. There's HTML::Parser and
    > various derived modules. Of those I like HTML::TreeBuilder. Pass it
    > HTML using
    >
    > my $t = HTML::TreeBuilder->new_from_content(get('your url'));


    Thanks for the suggestions - I'll take a look at them.
    --

    ---- Len Philpot -------- l e n @ p h i l p o t . o r g (no spaces)
    ------- ><> ------------- http://pages.suddenlink.net/lenphilpot/
     
    Len Philpot, Mar 2, 2007
    #5
  6. Len Philpot

    Mirco Wahab Guest

    Len Philpot wrote:

    > # unwrap this line
    > my @cachepage = \
    > get('http://www.geocaching.com/seek/cache_details.aspx?wp=GC115K4');
    > # line in question (in @cachepage) looks like :
    > # <p><span id="ShortDescription">Should be quick and easy.</span></p>
    > foreach my $line (@cachepage)
    > {
    > if($line =~ /Should be quick/)
    > {
    > print("$line");
    > }
    > }
    >
    >
    > Instead of printing only the line that contains "Should be quick", it
    > prints every line.


    After reading all the really good advice
    given to you by others here, I'd like
    to point you in the direction mentioned
    by Iain.

    The minimum working solution for your
    question "w/appropriate regex" would
    therefore be:


    ...
    my $cachepage = get 'http://www.geocaching.com/seek/cache_details.aspx?wp=GC115K4';
    my $searchstr = 'Should be quick';

    if( $cachepage =~ /^(.*?$searchstr.*?)$/m ) {
    print "$1\n"
    }
    ...


    I read you are/have been a C programmer (as I am),
    I'd like to stress the idea you should *really* try
    to get somehow into the "regex metalanguage" because
    knowing it would have enabled you to spit out a solution
    after learning what "LWP::Simple::get" returns.

    The regex modifier /m (http://www.perl.com/doc/manual/html/pod/perlre.html)
    does exactly what you need here: it lets ^ and $ match at the start and
    end of every line within the string, so the expression in parentheses
    (.*?$searchstr.*?) is anchored to the one line that contains the match.

    The content of the (first and only) parentheses will then
    be available in the pattern match variable $1.
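
    The same idea as a self-contained sketch. $cachepage is a hypothetical
    multi-line string standing in for the fetched page, and \Q...\E is an
    addition not in the snippet above: it quotes the search string in case
    it ever contains regex metacharacters.

```perl
use strict;
use warnings;

# Hypothetical stand-in for what get() returns: one string, several lines.
my $cachepage = "<html>\n"
              . "<p><span id=\"ShortDescription\">Should be quick and easy.</span></p>\n"
              . "<p>Another line.</p>\n";
my $searchstr = 'Should be quick';

# /m lets ^ and $ match at every line boundary inside the string.
# \Q...\E treats the search string literally.
my $found;
if ( $cachepage =~ /^(.*\Q$searchstr\E.*)$/m ) {
    $found = $1;          # the whole line that contains the search string
    print "$found\n";
}
```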

    Regards

    Mirco
     
    Mirco Wahab, Mar 2, 2007
    #6
  7. Len Philpot

    Len Philpot Guest

    On Fri, 02 Mar 2007 20:52:15 +0100, Mirco Wahab wrote:

    > The minimum working solution for your
    > question "w/appropriate regex" would
    > therefore be:
    >
    > ...
    > my $cachepage = get 'http://www.geocaching.com/seek/cache_details.aspx?wp=GC115K4';
    > my $searchstr = 'Should be quick';
    >
    > if( $cachepage =~ /^(.*?$searchstr.*?)$/m ) {
    > print "$1\n"
    > }
    > ...
    >
    > I read you are/have been a C programmer (as I am),


    Let me clarify - I find C fascinating and have played with it off and on
    over the years. I hesitate to call myself a programmer in any language,
    much less C (and it's been a while since I spent any serious time with
    it), but I do find it very interesting. I'm not a programmer by
    profession... although in the strictest sense of the term, I /have/ been
    technically paid to write a couple of programs. :)


    > I'd like to stress the idea you should *really* try
    > to get somehow into the "regex metalanguage" because


    Absolutely. I'm a Solaris admin by day, so I use them here and again,
    although I need to make an effort to learn it beyond just what I use on
    the job.


    > The content of the (first and only) parentheses will then
    > be available in the pattern match variable $1.


    That's what I had in mind (and have done, temporarily): to use a back
    reference to grab what I need. The string I used above was a test case.
    Actually I look for a specific set of tags followed by a specific HTML
    ID value, which are hardwired in the regex, followed by the back
    referenced payload.

    Thanks.
    --

    ---- Len Philpot -------- l e n @ p h i l p o t . o r g (no spaces)
    ------- ><> ------------- http://pages.suddenlink.net/lenphilpot/
     
    Len Philpot, Mar 2, 2007
    #7
  8. Len Philpot

    -berlin.de Guest

    Len Philpot <> wrote in comp.lang.perl.misc:
    > On Fri, 02 Mar 2007 17:39:51 +1100, Iain Chalmers wrote:
    >
    > > In article <epdxww5gfd0l.5psw3jcl6it4$>,
    > > Len Philpot <> wrote:


    > At this point, I'm very low on the Perl learning cliff (oh, for the
    > simplicity and clarity of C! :),


    As in chasing macros and typedefs through header files? As in
    Duff's device? :)

    Nah, C is a fine programming language. It is *smaller* than Perl,
    in that Perl has more constructs and concepts to learn, but taken
    individually, Perl's constructs and concepts are no more difficult
    than C's.

    Anno
     
    -berlin.de, Mar 3, 2007
    #8
  9. Tad McClellan

    -berlin.de <-berlin.de> wrote:
    > Len Philpot <> wrote in comp.lang.perl.misc:
    >> On Fri, 02 Mar 2007 17:39:51 +1100, Iain Chalmers wrote:
    >>
    >> > In article <epdxww5gfd0l.5psw3jcl6it4$>,
    >> > Len Philpot <> wrote:

    >
    >> At this point, I'm very low on the Perl learning cliff (oh, for the
    >> simplicity and clarity of C! :),

    >
    > As in chasing macros and typedefs through header files? As in
    > Duff's device? :)
    >
    > Nah, C is a fine programming language. It is *smaller* than Perl,
    > in that Perl has more constructs and concepts to learn, but taken
    > individually, Perl's constructs and concepts are no more difficult
    > than C's.



    Except for the concept of scalar and list context. :)

    Did Larry borrow that concept from somewhere, or did it first
    show up in Perl?


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Mar 4, 2007
    #9
  10. Len Philpot

    -berlin.de Guest

    Tad McClellan <> wrote in comp.lang.perl.misc:
    > -berlin.de <-berlin.de> wrote:
    > > Len Philpot <> wrote in comp.lang.perl.misc:
    > >> On Fri, 02 Mar 2007 17:39:51 +1100, Iain Chalmers wrote:
    > >>
    > >> > In article <epdxww5gfd0l.5psw3jcl6it4$>,
    > >> > Len Philpot <> wrote:

    > >
    > >> At this point, I'm very low on the Perl learning cliff (oh, for the
    > >> simplicity and clarity of C! :),

    > >
    > > As in chasing macros and typedefs through header files? As in
    > > Duff's device? :)
    > >
    > > Nah, C is a fine programming language. It is *smaller* than Perl,
    > > in that Perl has more constructs and concepts to learn, but taken
    > > individually, Perl's constructs and concepts are no more difficult
    > > than C's.

    >
    >
    > Except for the concept of scalar and list context. :)
    >
    > Did Larry borrow that concept from somewhere, or did it first
    > show up in Perl?


    I'm pretty sure Perl is the first major language to implement anything
    similar. It's one of the few features that are original with Perl.

    If anything, interpretation and propagation of context is Perl's answer
    to the inflexible typing systems of other languages, but it goes far
    beyond that.
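
    For readers new to the idea, a short sketch of scalar versus list
    context. The sub name context() is made up for illustration; wantarray
    is the built-in that reports which context the caller supplied:

```perl
use strict;
use warnings;

my @words = qw(one two three four five fiver);

my @copy  = @words;   # list context: copies the elements
my $count = @words;   # scalar context: yields the number of elements

# Hypothetical helper: reports the context it was called in.
sub context {
    return wantarray ? 'list' : 'scalar';
}

my @in_list   = context();   # called in list context
my $in_scalar = context();   # called in scalar context

print "count: $count\n";
print "list assignment saw: $in_list[0]\n";
print "scalar assignment saw: $in_scalar\n";
```

    This is also why assigning the single return value of get() to an array
    raises no warning: the assignment simply builds a one-element list.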

    Anno
     
    -berlin.de, Mar 4, 2007
    #10
  11. Len Philpot

    Dr.Ruud Guest

    gf schreef:


    > You can get closer to what the browser is doing by stripping all the
    > line-end characters from the document,
    
    Better replace them by a space, or some things will run together.
    It can still do damage, like inside <pre> </pre>.
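
    A tiny sketch of the difference, using a made-up fragment with a line
    break between two words:

```perl
use strict;
use warnings;

# Hypothetical fragment; the line break falls between two words.
my $html = "<p>Should be quick\nand easy.</p>\n";

(my $deleted  = $html) =~ s/[\r\n]+//g;    # line ends removed: words run together
(my $replaced = $html) =~ s/[\r\n]+/ /g;   # line ends become a space: words stay apart

print "deleted:  $deleted\n";
print "replaced: $replaced\n";
```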
    
    --
    Affijn, Ruud
    
    "Gewoon is een tijger."
     
    Dr.Ruud, Mar 11, 2007
    #11
