Q on regex of LWP::Simple data

Len Philpot

I've read the FAQs (unless proven otherwise!) and examples, etc. but
don't know why this doesn't work...


#!perl # use your shebang of choice, this was on Windows

use warnings;
use strict;
use LWP::Simple;

# unwrap this line
my @cachepage = \
get('http://www.geocaching.com/seek/cache_details.aspx?wp=GC115K4');

# line in question (in @cachepage) looks like :
# <p><span id="ShortDescription">Should be quick and easy.</span></p>

foreach my $line (@cachepage)
{
    if($line =~ /Should be quick/)
    {
        print("$line");
    }
}


Instead of printing only the line that contains "Should be quick", it
prints every line. Breaking it down to a minimum, I tried :

#!perl

use warnings;
use strict;

my @a = qw(one two three four five fiver);

foreach my $line (@a)
{
    if($line =~ /five/)
    {
        print("$line\n");
    }
}

Which, of course, prints :

five
fiver

... as expected. What's different, except maybe the input data? Are the
tags throwing a wrench in things?

My apologies in advance if this is a FAQ or simple logical error. I'm
very much in learning mode with Perl these days.

Thanks!
 
Len Philpot

I don't think @cachepage contains what you think it contains...

try adding:

use Data::Dumper;
print Dumper \@cachepage;

after that line.

So, it's one long string now... scalar(@cachepage) == 1 (and $#cachepage == 0)
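
A minimal sketch of why the array ends up with a single element. The variable $page is a hypothetical stand-in for what get() returns, so nothing is fetched here:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the single string that LWP::Simple::get returns.
my $page = "<html>\n<p>Should be quick and easy.</p>\n</html>\n";

# Assigning one scalar to an array gives a one-element array.
my @cachepage = $page;

print "elements:   ", scalar(@cachepage), "\n";   # 1
print "last index: $#cachepage\n";                # 0
```

The loop in the original post therefore runs once, and the match succeeds against the whole page, which is why everything gets printed.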

What's the best way to break it back up again? Maybe a pointer in the
right direction?

The get() example used a scalar instead of an array, but I wanted to
iterate through it to find a number of specific strings. Maybe I need to
come up with a regex to simply extract what I need all at once without
iterating.

Or am I looking at this wrong? My final objective, more or less, is to
retrieve a file from a website and extract two or three specific strings
from it, located via a couple of specific HTML tags and subsequently
extracted using back references, but I'm not there yet.

Perhaps I'm being dense... After all, it /has/ been a very long
DST-fix-infested day :)

Thanks.
 
Len Philpot

Yep. LWP::Simple::get doesn't return an array of lines no matter _how_
much you want it to.

Either split the scalar you get into an array of lines yourself

@cachepage = split(/\n/, $scalar_version_of_cachepage);

or throw the whole scalar at an appropriate regex.
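
Both options can be sketched as follows. This is only an illustration: $page is a made-up stand-in for the string that get() would return, so the example runs without a network connection.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the string returned by get(); no network needed.
my $page = join "\n",
    '<html><body>',
    '<p><span id="ShortDescription">Should be quick and easy.</span></p>',
    '<p>Some other line.</p>',
    '</body></html>';

# Option 1: split the scalar into lines, then filter them.
my @lines   = split /\n/, $page;
my @matches = grep { /Should be quick/ } @lines;
print "$_\n" for @matches;

# Option 2: run one regex over the whole string (/m lets ^ and $ match per line).
my ($hit) = $page =~ /^(.*Should be quick.*)$/m;
print "$hit\n" if defined $hit;
```

Either way, only the one line containing the search text is printed.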

That's what I thought about after posting.

Unless the file you're getting is very well defined, the usual advice is
to parse HTML using an HTML parser. Regexes are not the right tool to
deal with arbitrary HTML (though your case might be far enough from
"arbitrary HTML" that regexes will work for you).

At this point, I'm very low on the Perl learning cliff (oh, for the
simplicity and clarity of C! :), so I'll probably take an
incrementally-complex approach to parsing it. This whole exercise is for
my own use and edification, anyway.

Thanks.
 
gf

At this point, I'm very low on the Perl learning cliff (oh, for the
simplicity and clarity of C! :), so I'll probably take an
incrementally-complex approach to parsing it. This whole exercise is for
my own use and edification, anyway.

Ok. I think you meant "curve" instead of "cliff"...

And "the simplicity and clarity of C"? Perl and C are so similar as
far as their allowing the programmer to write terse and cryptic code,
or very verbose code, and still maintain speed. It's the programmer's
choice and not something enforced by the language. That said...

The problem with finding strings or data in HTML pages is the
variability of the format of the pages. HTML is unstructured and relies
on the browser to turn the data into human-readable form. For our
purposes as programmers it makes our job more difficult because we
want to grab the easiest tool to do the job and regex seems to be the
tool to handle finding data in lines that change.

The problem is that HTML allows arbitrary line breaks in the file and
the browser will gobble them then parse the page then format it for
us. Perl doesn't do that. It's doing what you told it to (usually)
and, in this case, what you told it to do is not nearly as complex as
what the browser is doing.

You can get closer to what the browser is doing by stripping all the
line-end characters from the document, then applying your regex
pattern repeatedly to the resulting single line, OR you can tell
the regex engine how to treat line-ends for you. Check out the 'm' and
's' options to regex: 's' lets '.' match a newline, and 'm' lets '^' and
'$' match at every line boundary. Combined with 'g' you should be homing
in on the data you want. Usually.
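
A small sketch of what those three options change. The $html fragment is invented for illustration, with one tag deliberately split across a line break:

```perl
use strict;
use warnings;

# Hypothetical fragment in which one tag's content spans a line break.
my $html = qq{<span id="a">first\nvalue</span>\n<span id="b">second</span>\n};

# Without /s, "." stops at the newline, so only the one-line span matches.
my @plain = $html =~ /<span[^>]*>(.*?)<\/span>/g;

# With /s, "." also matches newlines; /g collects every match.
my @dotall = $html =~ /<span[^>]*>(.*?)<\/span>/sg;

# With /m, ^ matches at the start of every line, not just the string.
my @starts = $html =~ /^(<span|value)/mg;

print scalar(@plain),  " match(es) without /s\n";
print scalar(@dotall), " match(es) with /s\n";
print scalar(@starts), " line start(s) found with /m\n";
```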

Sometimes those are still going to fail so you have to dig out the big
guns and parse the document like a browser. There's HTML::Parser and
various derived modules. Of those I like HTML::TreeBuilder. Pass it
HTML using

my $t = HTML::TreeBuilder->new_from_content(get('your url'));

and it will parse it and build a tree. It'll lock the tree and turn it
into an HTML::Element object which you can search and extract info
using the methods of that object. Of those I like the 'look_down()'
method because it's so flexible. Give it the right parameters and
it'll let you loop through the page and find whatever you want. Of
course, as always you have to tell it correctly, and that can be a
tough thing to determine, but that's a different subject for a
different time and probably a different group.
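
For the page in this thread, a look_down() call might look like the sketch below. It assumes HTML::TreeBuilder is installed from CPAN, and the $html string is a hypothetical stand-in for the fetched page; the id value is the one from the original post.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;   # from CPAN (HTML-Tree distribution), not core Perl

# Hypothetical stand-in for the fetched page.
my $html = <<'HTML';
<html><body>
<p><span id="ShortDescription">Should be quick
and easy.</span></p>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# In scalar context look_down returns the first matching element (or undef).
my $span = $tree->look_down(_tag => 'span', id => 'ShortDescription');
my $text = defined $span ? $span->as_text : '';
print "$text\n";

$tree->delete;   # release the tree when finished with it
```

The line break inside the span does not matter here, because the parser works on elements rather than on lines.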

Another way to attack the same problem is to use the various XPath
implementations for HTML in Perl. Search on CPAN and you'll find some.
XPath is a cool way of looking at HTML but, at least for me, it's not
as intuitive as how TreeBuilder and the parsers do it.
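
As one possible illustration, the CPAN module HTML::TreeBuilder::XPath adds XPath queries on top of TreeBuilder. The sketch assumes that module is installed; the $html string is again a made-up stand-in for the real page.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;   # from CPAN, not core Perl

# Hypothetical stand-in for the fetched page.
my $html = '<html><body><p><span id="ShortDescription">'
         . 'Should be quick and easy.</span></p></body></html>';

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# findvalue returns the text value of whatever the XPath expression selects.
my $text = $tree->findvalue('//span[@id="ShortDescription"]');
print "$text\n";

$tree->delete;   # release the tree when finished with it
```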
 
Len Philpot

Ok. I think you meant "curve" instead of "cliff"...

And "the simplicity and clarity of C"? Perl and C are so similar as
far as their allowing the programmer to write terse and cryptic code,
or very verbose code, and still maintain speed. It's the programmer's
choice and not something enforced by the language. That said...

Actually, 'cliff' was intentional, as was the C reference - A weak
attempt at humor, I guess. I'm just trying to come to terms with the
looseness that Perl allows (although doesn't require). It's purely my
preference : I like algorithmic flexibility, but with a tighter
syntactic regimen, i.e., for me TIMTOWTDI gets in the way of learning
"the best/right way to do X". However, I'm sure it's very different for
others (as is obviously the case). I really like the way C is not as
abstracted - "the machine prints through" - but once again that's my
preference. Lots of very knowledgeable people feel differently. :)

The problem with finding strings or data in HTML pages is the
variability of the format of the pages. HTML is unstructured and relies
on the browser to turn the data into human-readable form. For our
purposes as programmers it makes our job more difficult because we
want to grab the easiest tool to do the job and regex seems to be the
tool to handle finding data in lines that change.

Fortunately in this case, what I'm looking for is (AFAICT) uniquely
labeled and fairly contained. However, newlines do occur and I'll have
to deal with that.

Sometimes those are still going to fail so you have to dig out the big
guns and parse the document like a browser. There's HTML::Parser and
various derived modules. Of those I like HTML::TreeBuilder. Pass it
HTML using

my $t = HTML::TreeBuilder->new_from_content(get('your url'));

Thanks for the suggestions - I'll take a look at them.
 
Mirco Wahab

Len said:
# unwrap this line
my @cachepage = \
get('http://www.geocaching.com/seek/cache_details.aspx?wp=GC115K4');
# line in question (in @cachepage) looks like :
# <p><span id="ShortDescription">Should be quick and easy.</span></p>
foreach my $line (@cachepage)
{
if($line =~ /Should be quick/)
{
print("$line");
}
}


Instead of printing only the line that contains "Should be quick", it
prints every line.

After reading all the really good advice
given to you by others here, I'd like
to point you in the direction mentioned
by Iain.

The minimum working solution for your
question "w/appropriate regex" would
therefore be:


...
my $cachepage = get 'http://www.geocaching.com/seek/cache_details.aspx?wp=GC115K4';
my $searchstr = 'Should be quick';

if( $cachepage =~ /^(.*?$searchstr.*?)$/m ) {
print "$1\n"
}
...


I read that you are/have been a C programmer (as I am).
I'd like to stress that you should *really* try
to get into the "regex metalanguage", because
knowing it would have enabled you to spit out a solution
as soon as you learned what "LWP::Simple::get" returns.

The regex modifier /m (http://www.perl.com/doc/manual/html/pod/perlre.html)
does exactly what you need here: it lets ^ and $ match at the start and
end of each line, so the expression in parentheses (.*?$searchstr.*?)
is 'anchored' between line start and line end.

The content of the (first and only) parentheses will then
be available in the pattern match variable $1.

Regards

Mirco
 
Len Philpot

The minimum working solution for your
question "w/appropriate regex" would
therefore be:

...
my $cachepage = get 'http://www.geocaching.com/seek/cache_details.aspx?wp=GC115K4';
my $searchstr = 'Should be quick';

if( $cachepage =~ /^(.*?$searchstr.*?)$/m ) {
print "$1\n"
}
...

I read you are/have been a C programmer (as I am),

Let me clarify - I find C fascinating and have played with it off and on
over the years. I hesitate to call myself a programmer in any language,
much less C (and it's been a while since I spent any serious time with
it), but I do find it very interesting. I'm not a programmer by
profession... although in the strictest sense of the term, I /have/ been
technically paid to write a couple of programs. :)

I'd like to stress the idea you should *really* try
to get somehow into the "regex metalanguage" because

Absolutely. I'm a Solaris admin by day, so I use regexes now and again,
although I need to make an effort to learn them beyond just what I use on
the job.

The content of the (first and only) parentheses will then
be available in the pattern match variable $1.

That's what I had in mind (and have done, temporarily): to use a back
reference to grab what I need. The string I used above was a test case.
Actually I look for a specific set of tags followed by a specific HTML
ID value, which are hardwired in the regex, followed by the back
referenced payload.
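
A rough sketch of that approach, not the actual script. The $page string is a hypothetical stand-in for the fetched page, and the tag and id are taken from the sample line in the first post:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the fetched page, with a line break in the payload.
my $page = qq{<p><span id="ShortDescription">Should be quick\nand easy.</span></p>\n};

my $id = 'ShortDescription';

# \Q...\E quotes the id; /s lets the lazy (.*?) cross the line break.
my ($payload) = $page =~ /<span\s+id="\Q$id\E"\s*>(.*?)<\/span>/s;

if (defined $payload) {
    $payload =~ tr/\n/ /;      # fold line breaks into spaces
    print "$payload\n";
}
```

Assigning the match in list context puts the captured group straight into $payload, which avoids reading $1 after a match that may have failed.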

Thanks.
 
anno4000

Len Philpot said:
At this point, I'm very low on the Perl learning cliff (oh, for the
simplicity and clarity of C! :),

As in chasing macros and typedefs through header files? As in
Duff's device? :)

Nah, C is a fine programming language. It is *smaller* than Perl,
in that Perl has more constructs and concepts to learn, but taken
individually, Perl's constructs and concepts are no more difficult
than C's.

Anno
 
Tad McClellan

Len Philpot said:
As in chasing macros and typedefs through header files? As in
Duff's device? :)

Nah, C is a fine programming language. It is *smaller* than Perl,
in that Perl has more constructs and concepts to learn, but taken
individually, Perl's constructs and concepts are no more difficult
than C's.


Except for the concept of scalar and list context. :)

Did Larry borrow that concept from somewhere, or did it first
show up in Perl?
 
anno4000

Tad McClellan said:
Except for the concept of scalar and list context. :)

Did Larry borrow that concept from somewhere, or did it first
show up in Perl?

I'm pretty sure Perl is the first major language to implement anything
similar. It's one of the few features that are original with Perl.

If anything, interpretation and propagation of context is Perl's answer
to the inflexible typing systems of other languages, but it goes far
beyond that.
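
For readers new to the idea, a minimal sketch of context in action. All names here are made up for the illustration:

```perl
use strict;
use warnings;

my @words = qw(one two three);

my $count   = @words;     # scalar context: the number of elements
my ($first) = @words;     # list context: the first element
my @copy    = @words;     # list context: all elements

# A function can ask which context it was called in via wantarray.
sub context { return wantarray ? 'list' : 'scalar' }

my @in_list   = context();
my $in_scalar = context();

print "$count $first @in_list $in_scalar\n";   # 3 one list scalar
```

This is the same mechanism behind the original problem: get() hands back one scalar, and assigning it to an array does not split it into lines.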

Anno
 
Dr.Ruud

gf said:

You can get closer to what the browser is doing by stripping all the
line-end characters from the document,

Better to replace them with a space, or some things will run together.
It can still do damage, like inside <pre> </pre>.
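
A quick sketch of the difference, using a made-up fragment:

```perl
use strict;
use warnings;

# Hypothetical fragment with a line break in the middle of a sentence.
my $html = "<p>Should be quick\nand easy.</p>\n";

(my $stripped = $html) =~ tr/\n//d;    # delete line ends: words run together
(my $spaced   = $html) =~ tr/\n/ /;    # replace line ends with spaces

print "$stripped\n";   # <p>Should be quickand easy.</p>
print "$spaced\n";     # <p>Should be quick and easy.</p> (plus a trailing space)
```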
 
