Pattern Matching problem!

Francis Sylvester · Nov 14, 2005

Hi,

I'm a Perl newbie and am having a nightmare trying to get the code below
working. I'm trying to fetch a webpage and if a link within the page matches
the search criterion - return the text after the link. It doesn't seem to be
working and I'm wondering if it's because the pattern match is within the
while loop. If anybody can shed some light I'd be eternally grateful!

Cheers,
Francis

# --------------------------
use LWP::Simple;
use HTML::TokeParser;

my $document = get("http://www.anexamplesite.com");
my $mymatch = "searchstring";

my $parser = HTML::TokeParser->new(\$document);

while ($token = $parser->get_tag("a")) {
    if ($token->[1]->{"href"} =~ /$mymatch/) {
        # print $server.$token->[1]->{href}."\n";
        $document =~ /$searchstring(.+?)someidentifier/;
        print "$1";
    }
}

A. Sinan Unur · Nov 14, 2005

I'm a Perl newbie and am having a nightmare trying to get the code
below working. I'm trying to fetch a webpage and if a link within the
page matches the search criterion - return the text after the link. It
doesn't seem to be working and I'm wondering

As it is, we have no idea what "doesn't seem to be working" means. Please
read the posting guidelines to find out how you can help yourself, and,
in the process, help others help you.

use strict;
use warnings;

are missing.

use LWP::Simple;
use HTML::TokeParser;

my $document = get("http://www.anexamplesite.com");
my $mymatch = "searchstring";

my $parser = HTML::TokeParser->new(\$document);

while ($token = $parser->get_tag("a")) {
if ($token->[1]->{"href"} =~ /$mymatch/) {
# print $server.$token->[1]->{href}."\n";
$document =~ /$searchstring(.+?)someidentifier/;

The exact contents of $mymatch, $searchstring and whatever
someidentifier is might have something to do with what's actually being
matched, no?

print "$1";

You are not capturing anything, why do you expect there to be anything
valid in $1?

Sinan

it_says_BALLS_on_your forehead · Nov 14, 2005

Francis said:
Hi,

I'm a Perl newbie and am having a nightmare trying to get the code below
working. I'm trying to fetch a webpage and if a link within the page matches
the search criterion - return the text after the link. It doesn't seem to be
working and I'm wondering if it's because the pattern match is within the
while loop. If anybody can shed some light I'd be eternally grateful!

Cheers,
Francis

# --------------------------
use LWP::Simple;
use HTML::TokeParser;

my $document = get("http://www.anexamplesite.com");
my $mymatch = "searchstring";

my $parser = HTML::TokeParser->new(\$document);

while ($token = $parser->get_tag("a")) {
if ($token->[1]->{"href"} =~ /$mymatch/) {

try:
if ( $token->[1]{href} =~ /$mymatch/o ) {

# print $server.$token->[1]->{href}."\n";
$document =~ /$searchstring(.+?)someidentifier/;
print "$1";
}
}

Gunnar Hjalmarsson · Nov 15, 2005

it_says_BALLS_on_your forehead said:
Francis said:

while ($token = $parser->get_tag("a")) {
if ($token->[1]->{"href"} =~ /$mymatch/) {


try:
if ( $token->[1]{href} =~ /$mymatch/o ) {

I fail to see why that would make a difference. Could you please explain
why you think it would?

it_says_BALLS_on_your forehead · Nov 15, 2005

Gunnar said:
it_says_BALLS_on_your forehead said:

Francis said:

while ($token = $parser->get_tag("a")) {
if ($token->[1]->{"href"} =~ /$mymatch/) {


try:
if ( $token->[1]{href} =~ /$mymatch/o ) {


I fail to see why that would make a difference. Could you please explain
why you think it would?

I looked up HTML::TokeParser on CPAN.

The first Example displayed illustrated that the way to get the href
was:

my $url = $token->[1]{href} || "-";

...I noticed that the OP did not use the same syntax, and I didn't know
whether this was causing his problem. The 'o' at the end of the pattern was
just meant to optimize the pattern match, since it doesn't seem like the OP
needs to recompile the regex every time...
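As an illustration of the "compile the pattern only once" idea, here is a minimal sketch that uses qr// rather than the /o modifier. This is a swapped-in technique, not what was posted above. The value of $mymatch and the list of hrefs are made up for the example, and the \Q...\E quoting assumes the search string is meant to be matched literally.

```perl
use strict;
use warnings;

# Hypothetical search string containing a regex metacharacter (".").
my $mymatch = 'search.string';

# Compile the pattern once; \Q...\E makes the "." match only a literal dot.
my $re = qr/\Q$mymatch\E/;

# Hypothetical href values standing in for what get_tag("a") would yield.
my @hrefs = ( '/a/search.string/1', '/b/searchXstring/2', '/c/other' );

# Reuse the precompiled pattern for every href.
my @hits = grep { $_ =~ $re } @hrefs;

print "$_\n" for @hits;
```

Without the \Q...\E quoting, the second href would match as well, because an unquoted "." matches any character.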

Gunnar Hjalmarsson · Nov 15, 2005

it_says_BALLS_on_your forehead said:
Gunnar said:

it_says_BALLS_on_your forehead said:

Francis Sylvester wrote:

while ($token = $parser->get_tag("a")) {
if ($token->[1]->{"href"} =~ /$mymatch/) {

try:
if ( $token->[1]{href} =~ /$mymatch/o ) {


I fail to see why that would make a difference. Could you please explain
why you think it would?


I looked up HTML::TokeParser on CPAN.

That's a good start, I suppose.

The first Example displayed illustrated that the way to get the href
was:

my $url = $token->[1]{href} || "-";

...i noticed that the OP did not use the same syntax. I didn't know if
this was causing his problem.

The reason why I asked is that I thought that

$token->[1]->{"href"}

is always the same as

$token->[1]{href}

following Perl's syntax for references and data structures.

it_says_BALLS_on_your forehead · Nov 15, 2005

Gunnar said:
it_says_BALLS_on_your forehead said:

Gunnar said:

it_says_BALLS_on_your forehead wrote:
Francis Sylvester wrote:

while ($token = $parser->get_tag("a")) {
if ($token->[1]->{"href"} =~ /$mymatch/) {

try:
if ( $token->[1]{href} =~ /$mymatch/o ) {

I fail to see why that would make a difference. Could you please explain
why you think it would?


I looked up HTML::TokeParser on CPAN.


That's a good start, I suppose.

The first Example displayed illustrated that the way to get the href
was:

my $url = $token->[1]{href} || "-";

...i noticed that the OP did not use the same syntax. I didn't know if
this was causing his problem.


The reason why I asked is that I thought that

$token->[1]->{"href"}

is always the same as

$token->[1]{href}

following Perl's syntax for references and data structures.

Ahh, I think you're right. From p. 254 of Programming Perl, 3rd ed.:

"The arrow is optional between brackets or braces, or between a closing
bracket or brace and a parenthesis for an indirect function call."
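The equivalence discussed above can be checked directly. This is a small sketch; the $token structure is a hand-built stand-in shaped like the array reference that get_tag returns (tag name first, attribute hash second), and the URL in it is hypothetical.

```perl
use strict;
use warnings;

# Hand-built stand-in for a token: [ tag name, \%attributes ].
my $token = [ 'a', { href => 'http://www.example.com/' } ];

my $explicit = $token->[1]->{"href"};    # arrow between subscripts, quoted key
my $implicit = $token->[1]{href};        # arrow omitted, bareword key
my $deref    = ${ $token->[1] }{href};   # explicit block dereference

# All three expressions reach the same hash element.
print "same value\n" if $explicit eq $implicit && $implicit eq $deref;
```

So changing the subscript syntax alone cannot alter which links match.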

A. Sinan Unur · Nov 15, 2005

A. Sinan Unur wrote:

> $document =~ /$searchstring(.+?)someidentifier/;

The exact contents of $mymatch, $searchstring and whatever
someidentifier is might have something to do with what's actually
being matched, no?

> print "$1";

You are not capturing anything, why do you expect there to be
anything valid in $1?

Not capturing? I'd say the parens in
/$searchstring(.+?)someidentifier/ capture (if the match is
successful), or there's a bug in perl.

Arrgh! Thank you very much for catching that.

Sinan

Gunnar Hjalmarsson · Nov 15, 2005

Francis said:
I'm a Perl newbie and am having a nightmare trying to get the code below
working. I'm trying to fetch a webpage and if a link within the page matches
the search criterion - return the text after the link.

use LWP::Simple;
use HTML::TokeParser;

Yes, using a module for parsing an HTML document is a good idea.

my $document = get("http://www.anexamplesite.com");
my $mymatch = "searchstring";

my $parser = HTML::TokeParser->new(\$document);

while ($token = $parser->get_tag("a")) {
if ($token->[1]->{"href"} =~ /$mymatch/) {
# print $server.$token->[1]->{href}."\n";
$document =~ /$searchstring(.+?)someidentifier/;

What's that? After you have possibly found your search string, you let
the program search the whole document using a simple regex. Doing so
makes no sense to me.

You'd better either stick to a simple regex and skip the parsing
module, or (better) take advantage of the module you are using and do
something like:

while ( my $token = $parser->get_tag('a') ) {
    if ($token->[1]{href} =~ /$mymatch/) {
        print $parser->get_text('a')."\n";
    }
}

(I'm not sure if that's what you're looking for, but hopefully you get
the idea.)
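A self-contained variation on the loop above, as a sketch only: it parses an inline HTML string instead of fetching a page, assumes the HTML::Parser distribution (which provides HTML::TokeParser) is installed, and skips past the closing </a> so that only the text following the link is collected. The sample markup, the @found array and the \Q...\E quoting are assumptions made for the example.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample document standing in for the fetched page.
my $document = <<'HTML';
<p><a href="/news/searchstring/1">First</a> text after first link
<a href="/other">Second</a> unrelated text
<a href="/searchstring/2">Third</a> text after third link</p>
HTML

my $mymatch = 'searchstring';
my $parser  = HTML::TokeParser->new( \$document )
    or die "Cannot create parser";

my @found;
while ( my $token = $parser->get_tag('a') ) {
    my $href = $token->[1]{href};
    next unless defined $href && $href =~ /\Q$mymatch\E/;

    # Skip the link text itself by moving past the closing </a>.
    $parser->get_tag('/a');

    # Collect the text up to the next <a> start tag (or end of document),
    # with surrounding whitespace trimmed.
    push @found, $parser->get_trimmed_text('a');
}

print "$_\n" for @found;
```

Whether the link text itself should be included depends on what "the text after the link" means for the page in question; dropping the get_tag('/a') call keeps it.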

Francis Sylvester · Nov 15, 2005

Gunnar said:
You'd better either stick to a simple regex and skip the parsing module,
or (better) take advantage of the module you are using and do
something like:

while ( my $token = $parser->get_tag('a') ) {
    if ($token->[1]{href} =~ /$mymatch/) {
        print $parser->get_text('a')."\n";
    }
}

(I'm not sure if that's what you're looking for, but hopefully you get the
idea.)

Many thanks for all your replies. I'm sorry, I should have been clearer:
the code executes without error messages, but I sometimes get unwanted
results in $1. After closer inspection, I think it's because it sometimes
returns $1 from the earlier pattern match (if ($token->[1]->{"href"} =~
/$mymatch/)) rather than from the pattern match I wanted ($document =~
/$searchstring(.+?)someidentifier/).

Is there a way to reset the value of $1?

Many thanks,
Francis

Tad McClellan · Nov 15, 2005

Francis Sylvester said:
if ($token->[1]{href} =~ /$mymatch/) {


I sometimes get unwanted
results in $1. After closer inspection, I think it's because sometimes it's
returning $1 from the earlier pattern match ( if ($token->[1]->{"href"} =~
/$mymatch/)

Note that that code ensures that the pattern match *succeeded*.

rather than the pattern match I wanted ($document =~
/$searchstring(.+?)someidentifier/

We don't really know, since you did not quote that part of the code,
but you should always ensure that the match succeeded before
using the dollar-digit variables, so:

Is _your_ pattern match being tested for success?

Is there a way to reset the value of $1?

Yes. They are reset on every _successful_ pattern match.
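A short sketch of why the match has to be tested before $1 is read. The strings and patterns are made up for the example: the first match succeeds and sets $1, the second fails and leaves $1 as it was.

```perl
use strict;
use warnings;

# Hypothetical inputs: the second pattern will not match $document.
my $href     = 'page-17.html';
my $document = 'no marker in this text';

$href     =~ /page-(\d+)/;      # succeeds: $1 is now '17'
$document =~ /marker: (\w+)/;   # fails: $1 is left untouched
my $stale = $1;                 # still '17', from the earlier match

# Safer: read $1 only when this particular match succeeded.
my $wanted = $document =~ /marker: (\w+)/ ? $1 : undef;

print "stale: $stale\n";
print "wanted: ", ( defined $wanted ? $wanted : '(no match)' ), "\n";
```

Wrapping the second match in an if (or using the conditional form shown) avoids picking up a value left over from an earlier successful match.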

Gunnar Hjalmarsson · Nov 15, 2005

Francis said:
You'd better either stick to a simple regex and skip the parsing module,
or (better) take advantage of the module you are using and do
something like:

while ( my $token = $parser->get_tag('a') ) {
    if ($token->[1]{href} =~ /$mymatch/) {
        print $parser->get_text('a')."\n";
    }
}

(I'm not sure if that's what you're looking for, but hopefully you get the
idea.)


the code executes without error messages but I sometimes get unwanted
results in $1.

And that may well be a result of the fact that you don't actually make
use of the module you are using for parsing HTML...

Didn't you understand my objection to your code?
http://groups.google.com/group/comp.lang.perl.misc/msg/60f72a205520c4b1

Francis Sylvester · Nov 15, 2005

Gunnar said:
And that may well be a result of the fact that you don't actually make use
of the module you are using for parsing HTML...

Didn't you understand my objection to your code?
http://groups.google.com/group/comp.lang.perl.misc/msg/60f72a205520c4b1

Thanks Gunnar. I did understand your objection but thought I needed to
resort to pattern matching for a specific section of the text I'm retrieving
after the link. Having read your message and looking at the module docs
again now - I think I might be able to achieve the desired result without
the pattern match. I'm very grateful to you for the responses - you've
probably saved me hours!

Thanks again,
Francis
