Pattern Matching problem!

Francis Sylvester

Hi,

I'm a Perl newbie and am having a nightmare trying to get the code below
working. I'm trying to fetch a webpage and if a link within the page matches
the search criterion - return the text after the link. It doesn't seem to be
working and I'm wondering if it's because the pattern match is within the
while loop. If anybody can shed some light I'd be eternally grateful!

Cheers,
Francis

# --------------------------
use LWP::Simple;
use HTML::TokeParser;

my $document = get("http://www.anexamplesite.com");
my $mymatch = "searchstring";

my $parser = HTML::TokeParser->new(\$document);

while ($token = $parser->get_tag("a")) {
if ($token->[1]->{"href"} =~ /$mymatch/) {
# print $server.$token->[1]->{href}."\n";
$document =~ /$searchstring(.+?)someidentifier/;
print "$1";
}
}
 
A. Sinan Unur

I'm a Perl newbie and am having a nightmare trying to get the code
below working. I'm trying to fetch a webpage and if a link within the
page matches the search criterion - return the text after the link. It
doesn't seem to be working and I'm wondering

As it is, we have no idea what "doesn't seem to be working" means. Please
read the posting guidelines to find out how you can help yourself, and,
in the process, help others help you.

use strict;
use warnings;

are missing.

use LWP::Simple;
use HTML::TokeParser;

my $document = get("http://www.anexamplesite.com");
my $mymatch = "searchstring";

my $parser = HTML::TokeParser->new(\$document);

while ($token = $parser->get_tag("a")) {
if ($token->[1]->{"href"} =~ /$mymatch/) {
# print $server.$token->[1]->{href}."\n";
$document =~ /$searchstring(.+?)someidentifier/;

The exact contents of $mymatch, $searchstring and whatever
someidentifier is might have something to do with what's actually being
matched, no?

print "$1";

You are not capturing anything, why do you expect there to be anything
valid in $1?

Sinan
 
it_says_BALLS_on_your forehead

Francis said:
Hi,

I'm a Perl newbie and am having a nightmare trying to get the code below
working. I'm trying to fetch a webpage and if a link within the page matches
the search criterion - return the text after the link. It doesn't seem to be
working and I'm wondering if it's because the pattern match is within the
while loop. If anybody can shed some light I'd be eternally grateful!

Cheers,
Francis

# --------------------------
use LWP::Simple;
use HTML::TokeParser;

my $document = get("http://www.anexamplesite.com");
my $mymatch = "searchstring";

my $parser = HTML::TokeParser->new(\$document);

while ($token = $parser->get_tag("a")) {
if ($token->[1]->{"href"} =~ /$mymatch/) {

try:
if ( $token->[1]{href} =~ /$mymatch/o ) {
# print $server.$token->[1]->{href}."\n";
$document =~ /$searchstring(.+?)someidentifier/;
print "$1";
}
}
 
Gunnar Hjalmarsson

it_says_BALLS_on_your forehead said:
Francis said:
while ($token = $parser->get_tag("a")) {
if ($token->[1]->{"href"} =~ /$mymatch/) {

try:
if ( $token->[1]{href} =~ /$mymatch/o ) {

I fail to see why that would make a difference. Could you please explain
why you think it would?
 
it_says_BALLS_on_your forehead

Gunnar said:
it_says_BALLS_on_your forehead said:
Francis said:
while ($token = $parser->get_tag("a")) {
if ($token->[1]->{"href"} =~ /$mymatch/) {

try:
if ( $token->[1]{href} =~ /$mymatch/o ) {

I fail to see why that would make a difference. Could you please explain
why you think it would?

I looked up HTML::TokeParser on CPAN.

The first Example displayed illustrated that the way to get the href
was:

my $url = $token->[1]{href} || "-";

...I noticed that the OP did not use the same syntax, and I didn't know
if this was causing his problem. The 'o' at the end of the pattern was
just to optimize the pattern match, since it doesn't seem like the OP
needed to recompile the regex every time...
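
A minimal sketch of compiling the pattern once without relying on /o, assuming the search string is meant to be matched literally. The qr// operator builds the compiled regex a single time, and \Q...\E escapes any regex metacharacters in the interpolated string. The href values below are made up for illustration.

```perl
use strict;
use warnings;

my $mymatch = "searchstring";

# Compile once; \Q...\E quotes metacharacters so the string matches literally.
my $re = qr/\Q$mymatch\E/;

# Hypothetical href values standing in for the links on the page.
my @hrefs = (
    "http://www.anexamplesite.com/searchstring/page.html",
    "http://www.anexamplesite.com/other.html",
);

# Reuse the precompiled pattern for every link.
my @matches = grep { $_ =~ $re } @hrefs;
print "$_\n" for @matches;
```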
 
Gunnar Hjalmarsson

it_says_BALLS_on_your forehead said:
Gunnar said:
it_says_BALLS_on_your forehead said:
Francis Sylvester wrote:

while ($token = $parser->get_tag("a")) {
if ($token->[1]->{"href"} =~ /$mymatch/) {

try:
if ( $token->[1]{href} =~ /$mymatch/o ) {

I fail to see why that would make a difference. Could you please explain
why you think it would?

I looked up HTML::TokeParser on CPAN.

That's a good start, I suppose. :)

The first Example displayed illustrated that the way to get the href
was:

my $url = $token->[1]{href} || "-";

...i noticed that the OP did not use the same syntax. I didn't know if
this was causing his problem.

The reason why I asked is that I thought that

$token->[1]->{"href"}

is always the same as

$token->[1]{href}

following Perl's syntax for references and data structures.
 
it_says_BALLS_on_your forehead

Gunnar said:
it_says_BALLS_on_your forehead said:
Gunnar said:
it_says_BALLS_on_your forehead wrote:
Francis Sylvester wrote:

while ($token = $parser->get_tag("a")) {
if ($token->[1]->{"href"} =~ /$mymatch/) {

try:
if ( $token->[1]{href} =~ /$mymatch/o ) {

I fail to see why that would make a difference. Could you please explain
why you think it would?

I looked up HTML::TokeParser on CPAN.

That's a good start, I suppose. :)

The first Example displayed illustrated that the way to get the href
was:

my $url = $token->[1]{href} || "-";

...i noticed that the OP did not use the same syntax. I didn't know if
this was causing his problem.

The reason why I asked is that I thought that

$token->[1]->{"href"}

is always the same as

$token->[1]{href}

following Perl's syntax for references and data structures.

Ahh, I think you're right. See p. 254 of Programming Perl, 3rd ed.:

"The arrow is optional between brackets or braces, or between a closing
bracket or brace and a parenthesis for an indirect function call."
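
A quick check of that rule, as a sketch. The array reference below is hand-built and simplified to look like what get_tag returns for a start tag (tag name, then a hash of attributes); the URL is made up.

```perl
use strict;
use warnings;

# Simplified, hand-built stand-in for a start-tag token.
my $token = [ 'a', { href => 'http://www.anexamplesite.com/page.html' } ];

# Explicit arrow and quoted key, as in the original post.
my $with_arrow = $token->[1]->{"href"};

# Arrow omitted between subscripts, bareword key, as in the module docs.
my $without_arrow = $token->[1]{href};

print "same value\n" if $with_arrow eq $without_arrow;
```

Both forms reach the same hash element, so switching between them should not change what the match sees.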
 
A. Sinan Unur

A. Sinan Unur wrote on MMMMCDLVIII September MCMXCIII:
....

:) > $document =~ /$searchstring(.+?)someidentifier/;
:)
:) The exact contents of $mymatch, $searchstring and whatever
:) someidentifier might have something to do with what's actually
:) being matched, no?
:)
:) > print "$1";
:)
:) You are not capturing anything, why do you expect there to be
:) anything valid in $1?

Not capturing? I'd say the parens in
/$searchstring(.+?)someidentifier/ capture (if the match is
successful), or there's a bug in perl.

Arrgh! Thank you very much for catching that.

Sinan
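
To illustrate the correction, a small sketch showing that the parentheses do capture when the match succeeds. The sample text and the value of $searchstring are invented; note also that "." does not match a newline unless the /s modifier is used, which may matter for a multi-line web page.

```perl
use strict;
use warnings;

# Invented sample text; "someidentifier" appears twice on purpose.
my $document     = 'intro searchstring the wanted text someidentifier tail someidentifier';
my $searchstring = 'searchstring';

my $captured;
if ( $document =~ /\Q$searchstring\E(.+?)someidentifier/ ) {
    # The non-greedy (.+?) stops at the first "someidentifier".
    $captured = $1;
    print "captured: [$captured]\n";
}
```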
 
Gunnar Hjalmarsson

Francis said:
I'm a Perl newbie and am having a nightmare trying to get the code below
working. I'm trying to fetch a webpage and if a link within the page matches
the search criterion - return the text after the link.

use LWP::Simple;
use HTML::TokeParser;

Yes, using a module for parsing an HTML document is a good idea.

my $document = get("http://www.anexamplesite.com");
my $mymatch = "searchstring";

my $parser = HTML::TokeParser->new(\$document);

while ($token = $parser->get_tag("a")) {
if ($token->[1]->{"href"} =~ /$mymatch/) {
# print $server.$token->[1]->{href}."\n";
$document =~ /$searchstring(.+?)someidentifier/;

What's that? After you have possibly found your search string, you let
the program search the whole document using a simple regex. Doing so
makes no sense to me.

Either stick to a simple regex and skip the parsing module, or (better)
take advantage of the module you are using and do something like:

while ( my $token = $parser->get_tag('a') ) {
    if ($token->[1]{href} =~ /$mymatch/) {
        print $parser->get_text('a')."\n";
    }
}

(I'm not sure if that's what you're looking for, but hopefully you get
the idea.)
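
A self-contained variation on the same idea, as a sketch only. It assumes the goal is the text that follows a matching link, and it uses an inline HTML string (invented for this example) in place of the fetched page. As far as I can tell from the module documentation, get_text('a') collects text up to the next <a> start tag, so the sketch first skips to the closing </a> to leave out the link text itself.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Invented stand-in for the fetched page.
my $document = <<'HTML';
<p><a href="/docs/searchstring.html">First link</a> text after the first link</p>
<p><a href="/other.html">Second link</a> text after the second link</p>
HTML

my $mymatch = "searchstring";
my $parser  = HTML::TokeParser->new(\$document);

my @found;
while ( my $token = $parser->get_tag('a') ) {
    my $href = $token->[1]{href};
    next unless defined $href && $href =~ /\Q$mymatch\E/;

    $parser->get_tag('/a');                  # skip past the link text
    push @found, $parser->get_text('a');     # text up to the next <a> start tag
}

print "$_\n" for @found;
```

Since no regex is run against the whole document, there is no $1 to go stale.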
 
Francis Sylvester

Either you'd better stick to a simple regex, and skip the parsing module,
or (better) taking advantage of the module you are using, and doing
something like:

while ( my $token = $parser->get_tag('a') ) {
if ($token->[1]{href} =~ /$mymatch/) {
print $parser->get_text('a')."\n";
}
}

(I'm not sure if that's what you're looking for, but hopefully you get the
idea.)

Many thanks for all your replies. I'm sorry, I should have been clearer:
the code executes without error messages, but I sometimes get unwanted
results in $1. After closer inspection, I think it's because it sometimes
returns $1 from the earlier pattern match (if ($token->[1]->{"href"} =~
/$mymatch/)) rather than from the pattern match I wanted ($document =~
/$searchstring(.+?)someidentifier/;).
Is there a way to reset the value of $1?

Many thanks,
Francis
 
Tad McClellan

Francis Sylvester said:
if ($token->[1]{href} =~ /$mymatch/) {

I sometimes get unwanted
results in $1. After closer inspection, I think it's because sometimes it's
returning $1 from the earlier pattern match ( if ($token->[1]->{"href"} =~
/$mymatch/)


Note that that code ensures that the pattern match *succeeded*.

rather than the pattern match I wanted ($document =~
/$searchstring(.+?)someidentifier/;)


We don't really know, since you did not quote that part of the code,
but you should always ensure that the match succeeded before
using the dollar-digit variables, so:

Is _your_ pattern match being tested for success?

Is there a way to reset the value of $1?


Yes. They are reset on every _successful_ pattern match.
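
A minimal sketch of that behaviour, with invented strings: a failed match leaves $1 holding whatever the last successful match captured, so the capture should only be used once the match is known to have succeeded.

```perl
use strict;
use warnings;

# Invented sample data.
my $href     = "page-123.html";
my $document = "nothing useful in here";

$href     =~ /page-(\d+)/;                  # succeeds, so $1 is "123"
$document =~ /start(.+?)someidentifier/;    # fails, so $1 is left alone
my $stale = $1;                             # still "123", from the first match

# Safer: in list context a failed match returns an empty list,
# so $captured stays undefined instead of picking up an old value.
my ($captured) = $document =~ /start(.+?)someidentifier/;

print "stale: $stale\n";
print defined $captured ? "captured: $captured\n" : "no match\n";
```

Wrapping the match in an if and reading $1 only inside that block works just as well.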
 
Gunnar Hjalmarsson

Francis said:
Either you'd better stick to a simple regex, and skip the parsing module,
or (better) taking advantage of the module you are using, and doing
something like:

while ( my $token = $parser->get_tag('a') ) {
if ($token->[1]{href} =~ /$mymatch/) {
print $parser->get_text('a')."\n";
}
}

(I'm not sure if that's what you're looking for, but hopefully you get the
idea.)

the code executes without error messages but I sometimes get unwanted
results in $1.

And that may well be a result of the fact that you don't actually make
use of the module you are using for parsing HTML...

Didn't you understand my objection to your code?
http://groups.google.com/group/comp.lang.perl.misc/msg/60f72a205520c4b1
 
Francis Sylvester

(I'm not sure if that's what you're looking for, but hopefully you get
And that may well be a result of the fact that you don't actually make use
of the module you are using for parsing HTML...

Didn't you understand my objection to your code?
http://groups.google.com/group/comp.lang.perl.misc/msg/60f72a205520c4b1

--

Thanks Gunnar. I did understand your objection but thought I needed to
resort to pattern matching for a specific section of the text I'm retrieving
after the link. Having read your message and looking at the module docs
again now - I think I might be able to achieve the desired result without
the pattern match. I'm very grateful to you for the responses - you've
probably saved me hours!

Thanks again,
Francis
 
