Q on regex of LWP::Simple data

Len Philpot · Mar 2, 2007

I've read the FAQs (unless proven otherwise!) and examples, etc. but
don't know why this doesn't work...

#!perl # use your shebang of choice, this was on Windows

use warnings;
use strict;
use LWP::Simple;

# unwrap this line
my @cachepage = \
get('http://www.geocaching.com/seek/cache_details.aspx?wp=GC115K4');

# line in question (in @cachepage) looks like :
# Should be quick and easy.

foreach my $line (@cachepage)
{
if($line =~ /Should be quick/)
{
print("$line");
}
}

Instead of printing only the line that contains "Should be quick", it
prints every line. Breaking it down to a minimum, I tried :

#!perl

use warnings;
use strict;

my @a = qw(one two three four five fiver);

foreach my $line (@a)
{
if($line =~ /five/)
{
print("$line\n");
}
}

Which, of course, prints :

five
fiver

.... as expected. What's different except maybe the input data? Are the
tags throwing a wrench in things?

My apologies in advance if this is a FAQ or simple logical error. I'm
very much in learning mode with Perl these days.

Thanks!

Len Philpot · Mar 2, 2007

I don't think @cachepage contains what you think it contains...

try adding:

use Data::Dumper;
print Dumper \@cachepage;

after that line.

So, it's one long string now... scalar(@cachepage) == 1

What's the best way to break it back up again? Maybe a pointer in the
right direction?

The get() example used a scalar instead of an array, but I wanted to
iterate through it to find a number of specific strings. Maybe I need to
come up with a regex to simply extract what I need all at once without
iterating.

Or am I looking at this wrong? My final objective, more or less, is to
retrieve a file from a website and extract two or three specific strings
from it, located via a couple of specific HTML tags and subsequently
extracted using back references, but I'm not there yet.

Perhaps I'm being dense... After all, it /has/ been a very long
DST-fix-infested day

Thanks.

Len Philpot · Mar 2, 2007

Yep. LWP::Simple::get doesn't return an array of lines no matter _how_
much you want it to.

Either split the scalar you get into an array of lines yourself

@cachepage = split(/\n/, $scalar_version_of_cachepage);

or throw the whole scalar at an appropriate regex.
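
A minimal sketch of both options. It uses a hard-coded string in place of the page that get() would return, so the sample text and variable names are made up for illustration:

```perl
use strict;
use warnings;

# Stand-in for the scalar returned by LWP::Simple::get (sample text is invented).
my $page = "<p>First line</p>\n<p>Should be quick and easy.</p>\n<p>Last line</p>\n";

# Putting a single scalar into an array gives a one-element array,
# so a foreach over it sees the whole page as one "line".
my @whole = ($page);
print scalar(@whole), " element\n";

# Option 1: split the scalar into lines, then test each line.
my @lines   = split /\n/, $page;
my @matches = grep { /Should be quick/ } @lines;
print "$_\n" for @matches;

# Option 2: match against the whole scalar.
# /m lets ^ and $ match at line boundaries inside the string.
my ($hit) = $page =~ /^(.*Should be quick.*)$/m;
print "$hit\n" if defined $hit;
```

Splitting is convenient when several different strings have to be found; the single regex avoids building the array at all.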

That's what I thought about after posting.

Unless the file you're getting is very well defined, the usual advice is
to parse HTML using an HTML parser. Regexes are not the right tool to
deal with arbitrary HTML (though your case might be far enough from
"arbitrary HTML" that regexes will work for you).

At this point, I'm very low on the Perl learning cliff (oh, for the
simplicity and clarity of C!), so I'll probably take an
incrementally-complex approach to parsing it. This whole exercise is for
my own use and edification, anyway.

Thanks.

gf · Mar 2, 2007

At this point, I'm very low on the Perl learning cliff (oh, for the
simplicity and clarity of C!), so I'll probably take an
incrementally-complex approach to parsing it. This whole exercise is for
my own use and edification, anyway.

Ok. I think you meant "curve" instead of "cliff"...

And "the simplicity and clarity of C"? Perl and C are so similar as
far as their allowing the programmer to write terse and cryptic code,
or very verbose code, and still maintain speed. It's the programmer's
choice and not something enforced by the language. That said...

The problem with finding strings or data in HTML pages is the
variability of the format of the pages. HTML is unstructured and relies
on the browser to turn the data into human-readable form. For our
purposes as programmers it makes our job more difficult because we
want to grab the easiest tool to do the job and regex seems to be the
tool to handle finding data in lines that change.

The problem is that HTML allows arbitrary line breaks in the file and
the browser will gobble them then parse the page then format it for
us. Perl doesn't do that. It's doing what you told it to (usually)
and, in this case, what you told it to do is not nearly as complex as
what the browser is doing.

You can get closer to what the browser is doing by stripping all the
line-end characters from the document, then applying your regex
pattern reiteratively to the resulting single line, OR you can tell
the regex engine to ignore line-ends for you. Check out the 'm' and
's' options to regex. Combined with 'g' you should be homing in on the
data you want. Usually.
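
As a rough illustration of what those options change, here is a small sketch against an invented snippet (the tags and text are assumptions, not taken from the real page):

```perl
use strict;
use warnings;

# Invented sample: the text of interest is wrapped across two lines.
my $html = "<b>Size:</b>\n<span>Small</span>\n<b>Hint:</b> <span>Should be\nquick</span>\n";

# Without /s, '.' stops at a newline, so a span that wraps is missed.
my @no_s = $html =~ /<span>(.*?)<\/span>/g;

# With /s, '.' also matches newlines; /g collects every match.
my @with_s = $html =~ /<span>(.*?)<\/span>/gs;

# /m changes only the anchors: ^ and $ match at each line start and end.
my @line_starts = $html =~ /^<(\w+)>/mg;

print "spans found without /s: ", scalar(@no_s), "\n";
print "spans found with /s:    ", scalar(@with_s), "\n";
print "tags at line starts:    @line_starts\n";
```

In short, 's' affects what '.' matches, 'm' affects where '^' and '$' match, and 'g' asks for all matches rather than the first.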

Sometimes those are still going to fail so you have to dig out the big
guns and parse the document like a browser. There's HTML::Parser and
various derived modules. Of those I like HTML::TreeBuilder. Pass it
HTML using

my $t = HTML::TreeBuilder->new_from_content(get('your url'));

and it will parse it and build a tree. It'll lock the tree and turn it
into an HTML::Element object which you can search and extract info
using the methods of that object. Of those I like the 'look_down()'
method because it's so flexible. Give it the right parameters and
it'll let you loop through the page and find whatever you want. Of
course, as always you have to tell it correctly, and that can be a
tough thing to determine, but that's a different subject for a
different time and probably a different group.
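
For what it's worth, a minimal sketch of look_down() on an invented page. The HTML and the id "ShortDescription" are hypothetical, and HTML::TreeBuilder has to be installed from CPAN since it is not a core module:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;   # CPAN module, not part of core Perl

# Invented sample page; the ids used here are hypothetical.
my $html = <<'HTML';
<html><body>
<span id="CacheName">Example cache</span>
<span id="ShortDescription">Should be quick
and easy.</span>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# In scalar context, look_down returns the first matching element (or undef).
my $span = $tree->look_down(_tag => 'span', id => 'ShortDescription');
my $text = defined $span ? $span->as_text : '';
print "$text\n";

# In list context, look_down returns every matching element.
my @spans = $tree->look_down(_tag => 'span');
print scalar(@spans), " span elements\n";

$tree->delete;   # free the tree when done
```

The criteria are attribute/value pairs, with the pseudo-attribute _tag standing for the element name, so the line break inside the span does not matter here.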

Another way to attack the same problem is to use the various xpath
implementations for HTML in Perl. Search on CPAN and you'll find some.
xpath is a cool way of looking at HTML but, at least for me, it's not
as intuitive as how TreeBuilder and the parsers do it.

Len Philpot · Mar 2, 2007

Ok. I think you meant "curve" instead of "cliff"...

And "the simplicity and clarity of C"? Perl and C are so similar as
far as their allowing the programmer to write terse and cryptic code,
or very verbose code, and still maintain speed. It's the programmer's
choice and not something enforced by the language. That said...

Actually, 'cliff' was intentional, as was the C reference - A weak
attempt at humor, I guess. I'm just trying to come to terms with the
looseness that Perl allows (although doesn't require). It's purely my
preference : I like algorithmic flexibility, but with a tighter
syntactic regimen, i.e., for me TIMTOWTDI gets in the way of learning
"the best/right way to do X". However, I'm sure it's very different for
others (as is obviously the case). I really like the way C is not as
abstracted - "the machine prints through" - but once again that's my
preference. Lots of very knowledgeable people feel differently.

The problem with finding strings or data in HTML pages is the
variability of the format of the pages. HTML is unstructured and relies
on the browser to turn the data into human-readable form. For our
purposes as programmers it makes our job more difficult because we
want to grab the easiest tool to do the job and regex seems to be the
tool to handle finding data in lines that change.

Fortunately in this case, what I'm looking for is (AFAICT) uniquely
labeled and fairly contained. However, newlines do occur and I'll have
to deal with that.

Sometimes those are still going to fail so you have to dig out the big
guns and parse the document like a browser. There's HTML::Parser and
various derived modules. Of those I like HTML::TreeBuilder. Pass it
HTML using

my $t = HTML::TreeBuilder->new_from_content(get('your url'));

Thanks for the suggestions - I'll take a look at them.

Mirco Wahab · Mar 2, 2007

Len said:
# unwrap this line
my @cachepage = \
get('http://www.geocaching.com/seek/cache_details.aspx?wp=GC115K4');
# line in question (in @cachepage) looks like :
# Should be quick and easy.
foreach my $line (@cachepage)
{
if($line =~ /Should be quick/)
{
print("$line");
}
}

Instead of printing only the line that contains "Should be quick", it
prints every line.

After reading all the really good advice
given to you by others here, I'd like
to point you in the direction mentioned
by Iain.

The minimum working solution for your
question "w/appropriate regex" would
therefore be:

...
my $cachepage = get 'http://www.geocaching.com/seek/cache_details.aspx?wp=GC115K4';
my $searchstr = 'Should be quick';

if( $cachepage =~ /^(.*?$searchstr.*?)$/m ) {
print "$1\n"
}
...

I read you are/have been a C programmer (as I am),
I'd like to stress the idea you should *really* try
to get somehow into the "regex metalanguage" because
knowing it would have enabled you to spit out a solution
after learning what "LWP::Simple::get" returns.

The regex modifier /m (http://www.perl.com/doc/manual/html/pod/perlre.html)
does exactly what you need here: it 'anchors' the expression
in parentheses (.*?$searchstr.*?) between line start and line end.

The content of the (first and only) parentheses will then
be available in the pattern match variable $1.

Regards

Mirco

Len Philpot · Mar 2, 2007

The minimum working solution for your
question "w/appropriate regex" would
therefore be:

...
my $cachepage = get 'http://www.geocaching.com/seek/cache_details.aspx?wp=GC115K4';
my $searchstr = 'Should be quick';

if( $cachepage =~ /^(.*?$searchstr.*?)$/m ) {
print "$1\n"
}
...

I read you are/have been a C programmer (as I am),

Let me clarify - I find C fascinating and have played with it off and on
over the years. I hesitate to call myself a programmer in any language,
much less C (and it's been a while since I spent any serious time with
it), but I do find it very interesting. I'm not a programmer by
profession... although in the strictest sense of the term, I /have/ been
technically paid to write a couple of programs.

I'd like to stress the idea you should *really* try
to get somehow into the "regex metalanguage" because

Absolutely. I'm a Solaris admin by day, so I use regexes now and again,
although I need to make an effort to learn them beyond just what I use on
the job.

The content of the (first and only) parentheses will then
be available in the pattern match variable $1.

That's what I had in mind (and have done, temporarily): to use a back
reference to grab what I need. The string I used above was a test case.
Actually I look for a specific set of tags followed by a specific HTML
ID value, which are hardwired in the regex, followed by the back
referenced payload.
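
Roughly along these lines, as a sketch only. The tag and the id "ShortDescription" are placeholders for illustration, not the ones on the real page:

```perl
use strict;
use warnings;

# Invented sample; the tag and id below are hypothetical.
my $page = qq{<td>Name</td>\n<span id="ShortDescription">\n  Should be quick and easy.\n</span>\n};

# Hardwired tag and id, then a non-greedy capture up to the closing tag.
# /s lets '.' cross the newlines inside the span;
# /x ignores the spaces in the pattern so it can be laid out readably.
my ($payload) = $page =~ m{
    <span \s+ id="ShortDescription" \s* >   # the opening tag being looked for
    \s* (.*?) \s*                           # captured payload, surrounding whitespace trimmed
    </span>
}sx;

print "payload: $payload\n" if defined $payload;
```

Assigning the match in list context puts the first capture straight into $payload, which avoids reading $1 after a match that may have failed.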

Thanks.

anno4000 · Mar 3, 2007

Len Philpot said:
At this point, I'm very low on the Perl learning cliff (oh, for the
simplicity and clarity of C!) ...

As in chasing macros and typedefs through header files? As in
Duff's device?

Nah, C is a fine programming language. It is *smaller* than Perl,
in that Perl has more constructs and concepts to learn, but taken
individually, Perl's constructs and concepts are no more difficult
than C's.

Anno

Tad McClellan · Mar 4, 2007

anno4000 said:

As in chasing macros and typedefs through header files? As in
Duff's device?

Nah, C is a fine programming language. It is *smaller* than Perl,
in that Perl has more constructs and concepts to learn, but taken
individually, Perl's constructs and concepts are no more difficult
than C's.

Except for the concept of scalar and list context.

Did Larry borrow that concept from somewhere, or did it first
show up in Perl?

anno4000 · Mar 4, 2007

Tad McClellan said:
Except for the concept of scalar and list context.

Did Larry borrow that concept from somewhere, or did it first
show up in Perl?

I'm pretty sure Perl is the first major language to implement anything
similar. It's one of the few features that are original with Perl.

If anything, interpretation and propagation of context is Perl's answer
to the inflexible typing systems of other languages, but it goes far
beyond that.
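
A small sketch of what context means in practice, reusing the word list from earlier in the thread (the other variable names are made up):

```perl
use strict;
use warnings;

my @words = qw(one two three four five fiver);

# Scalar context: an array evaluates to its element count.
my $count = @words;

# List context: the parentheses make this a list assignment,
# so $first receives the first element instead of the count.
my ($first) = @words;

# The match operator also behaves differently depending on context.
my $line     = "five fiver";
my $matched  = $line =~ /five/;           # scalar context: true or false
my @captures = $line =~ /(five\w*)/g;     # list context: every captured substring

print "count=$count first=$first captures=@captures\n";
```

The same expression on the right-hand side yields a count, an element, a truth value, or a list, depending only on what the left-hand side asks for.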

Anno

Dr.Ruud · Mar 11, 2007

gf wrote:

You can get closer to what the browser is doing by stripping all the
line-end characters from the document,

Better to replace them with a space, or some things will run together.
It can still do damage, like inside <pre> </pre>.
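
A tiny sketch of the difference, on an invented string:

```perl
use strict;
use warnings;

# Invented sample with a line break between two words.
my $html = "<p>Should be\nquick and easy.</p>";

# Deleting newlines glues the neighbouring words together.
(my $stripped = $html) =~ s/\n//g;

# Replacing each newline with a space keeps the words apart.
(my $spaced = $html) =~ tr/\n/ /;

print "$stripped\n$spaced\n";
```

Neither variant knows about <pre>, so text inside such a block would still lose its line structure.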
